Dearth of full-length HIV-1 sequences obscures the true HIV-1 genetic subtype distribution in sub-Saharan Africa

HIV infection is still a public health problem in sub-Saharan Africa. The broad diversity exhibited by HIV-1 may affect transmission, disease progression, drug resistance and vaccine development. Most analyses of HIV-1 subtype distribution have been based on partial HIV-1 gene sequences, which may not adequately reflect the circulating subtypes. The objective of this study was to estimate the HIV-1 subtype distribution in sub-Saharan Africa using only full-length genome sequences. Using available HIV-1 full-length genome sequences from sub-Saharan Africa, the HIV-1 subtype distribution in the region was analysed and compared with a previous global analysis that was not based entirely on full-length sequences. A total of 934 HIV-1 full-length genome sequences were available from 27 sub-Saharan countries. HIV-1 subtypes were unevenly distributed among countries, with Cameroon having all four HIV-1 groups. Subtype C was the most frequent among the available sequences, and there was also a large proportion of circulating and unique recombinant forms (CRFs/URFs), especially in Central and West African countries, with frequencies of 32.6 to 90%. Subtypes A and G were less well represented in regions where CRFs/URFs were common than in the previous analysis, which included partial sequences. More HIV-1 full-length genome sequences from sub-Saharan Africa are needed for the true distribution of HIV-1 subtypes to be known, as analysis of partial sequences is not truly representative of the circulating subtypes.


INTRODUCTION
Infection with the Human Immunodeficiency Virus (HIV) continues to be a global public health problem with devastating consequences in developing countries, especially in sub-Saharan Africa, even though antiretroviral therapy has improved the quality of life of those infected. HIV exists in two genetically distinct forms (HIV-1 and HIV-2), with HIV-2 being restricted to West Africa and HIV-1 having a global spread and being responsible for the HIV pandemic.
HIV-1 exhibits genetic diversity in the form of viral quasispecies (Meyerhans et al., 1989), described as a heterogeneous viral population of related genomes (Domingo et al., 1997). This genetic diversity of HIV-1 is believed to result from a high mutation rate due to the error-prone nature of reverse transcriptase during replication (Roberts et al., 1988; Boyer et al., 1992), a high replication rate of about 10^9 virions per day (Ho et al., 1995) and genomic recombination (Hu and Temin, 1990; Jetzt et al., 2000; Zhuang et al., 2002). HIV-1 is classified into three main genetic groups, each representing an independent cross-species transmission, although a fourth group (Group P) (Plantier et al., 2009) has been suggested. The three groups are: Group M (major), Group O (outlier) and Group N (non-M, non-O). Groups O and N are mainly restricted to Cameroon and the Democratic Republic of Congo. Group M has a global distribution and is further divided into nine subtypes and some sub-subtypes: subtypes A, B, C, D, F, G, H, J and K, with sub-subtypes A1 to A5 (Gao et al., 2001; Meloni et al., 2004; Vidal et al., 2006, 2009), and F1 and F2 (Triques et al., 1999). Combinations of two or more subtypes and/or sub-subtypes exist, and when these mosaic forms become widely spread and fixed in the population, they are known as circulating recombinant forms (CRFs). A CRF is therefore defined as "intersubtype recombination for which at least three epidemiologically unlinked variants are monophyletic and share identical genetic structure along their full genomes" (Yebra et al., 2012), while a unique recombinant form (URF) is a variant that has not been isolated from three or more individuals.
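The CRF/URF distinction above reduces to a counting rule once monophyly and a shared mosaic structure have been established. A minimal sketch in Python follows; the function name and the constant are hypothetical, and the count of epidemiologically unlinked individuals carrying the same recombinant structure is assumed to be known already.

```python
# Illustrative only: names are hypothetical and mirror the definition quoted
# from Yebra et al. (2012). Monophyly and identical genome structure are
# assumed to have been confirmed before this rule is applied.
MIN_UNLINKED_FOR_CRF = 3


def classify_recombinant(n_unlinked_individuals: int) -> str:
    """Label a recombinant structure as 'CRF' or 'URF'."""
    if n_unlinked_individuals < 1:
        raise ValueError("at least one infected individual is required")
    if n_unlinked_individuals >= MIN_UNLINKED_FOR_CRF:
        return "CRF"
    return "URF"
```

Under this rule, a mosaic genome found in two unlinked patients remains a URF until a third unlinked isolate with the same structure is sequenced.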
Presently, about 58 CRFs have been characterized and are in the public domain (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html). HIV-1 subtypes may exhibit phenotypic differences. Subtypes are believed to influence tropism, with some studies associating increased CXCR4 usage with subtype C infections (Johnston et al., 2003; Connell et al., 2008), and others finding decreased CXCR4 usage in subtype C infections (Bjorndal et al., 1997; Abebe et al., 1999; Peeters et al., 1999; Esbjornsson et al., 2010). Subtypes may have an important effect on transmission of HIV-1, as subtype B was associated with homosexual transmission and subtype C with heterosexual transmission (van Harmelen et al., 1997; van Harmelen et al., 2001), but a heterosexually driven subtype B epidemic has been observed in Trinidad and Tobago (Cleghorn et al., 2000). Also, subtype C is believed to be more likely to be transmitted vertically than subtypes A and D (Blackard et al., 2001; Renjifo et al., 2001; Renjifo et al., 2003), and infection with subtype D has been associated with faster CD4 T cell decline and a faster rate of disease progression (Alaeus et al., 1999; Kanki et al., 1999; Kaleebu et al., 2001; Vasan et al., 2006; Baeten et al., 2007; Kaleebu et al., 2007; Easterbrook et al., 2010). Subtypes could also be important in vaccine design and development (Hemelaar et al., 2011).
A comprehensive understanding of the disease process, and effective interventions, could arise from correlating the genomic profile of HIV with that of patients (Sampathkumar et al., 2012). With HIV subtypes playing important roles in transmission and disease outcome, it becomes imperative for the actual distribution of subtypes to be known. HIV subtype distribution has largely been determined using partial genome sequences. Since recombinant forms are relatively common, the actual subtype distribution may not be accurately represented by these partial genome sequences, and recombinants could be artificially scored as pure subtypes. This analysis therefore sought, firstly, to determine the HIV-1 subtype distribution in sub-Saharan Africa using only full-length or near full-length HIV-1 genome sequences available in the public domain. Secondly, it sought to document the available full-length HIV-1 sequences from the region.

MATERIALS AND METHODS
All available full-length or near full-length HIV-1 genome sequences of sub-Saharan African origin were obtained from GenBank (http://www.ncbi.nlm.nih.gov) and the Los Alamos HIV Sequence Database (http://www.hiv.lanl.gov). A full-length sequence is one that contains the entire protein-coding region as well as the non-coding regions, while a near full-length sequence contains almost all of the coding region.
Duplicates were removed so that only one sequence per patient was retained, except for patients infected with viruses of different subtypes. Information was transferred to an Excel spreadsheet and analysed. The countries in sub-Saharan Africa were grouped into four regions using the United Nations geoscheme for Africa: Central Africa, East Africa, Southern Africa and West Africa.
For convenience, Malawi, Zambia and Zimbabwe were grouped under Southern Africa.
Central Africa includes Angola, Cameroon, Central African Republic (CAR), Chad, Democratic Republic of Congo (DRC), Equatorial Guinea, Gabon, Republic of Congo and Sao Tome and Principe.
Southern Africa includes Botswana, Lesotho, Malawi, Namibia, South Africa, Swaziland, Zambia and Zimbabwe.
Results from this analysis were then compared with those obtained in an earlier analysis which did not discriminate between partial and full-length HIV genome sequences (Hemelaar et al., 2006).
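The de-duplication and regional grouping steps described above can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used (the analysis was carried out in an Excel spreadsheet); the record fields `patient_id`, `country` and `subtype`, the function names and the example records are all assumptions made for the sketch. Only the two regions whose member countries are listed in the text are encoded.

```python
from collections import Counter

# Region membership as listed in the text. East and West Africa are left out
# because their member countries are not enumerated there.
REGIONS = {
    "Central Africa": {
        "Angola", "Cameroon", "Central African Republic", "Chad",
        "Democratic Republic of Congo", "Equatorial Guinea", "Gabon",
        "Republic of Congo", "Sao Tome and Principe",
    },
    "Southern Africa": {
        "Botswana", "Lesotho", "Malawi", "Namibia", "South Africa",
        "Swaziland", "Zambia", "Zimbabwe",
    },
}


def region_of(country):
    """Return the region a country belongs to, or None if it is not listed."""
    for region, countries in REGIONS.items():
        if country in countries:
            return region
    return None


def deduplicate(records):
    """Keep one record per (patient, subtype) pair, preserving input order.

    A patient infected with viruses of different subtypes therefore keeps
    one record per subtype.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["patient_id"], rec["subtype"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def subtype_counts_by_region(records):
    """Count subtypes per region after de-duplication."""
    counts = {}
    for rec in deduplicate(records):
        region = region_of(rec["country"])
        counts.setdefault(region, Counter())[rec["subtype"]] += 1
    return counts


# Made-up records for illustration only; they are not data from this study.
example = [
    {"patient_id": "P1", "country": "South Africa", "subtype": "C"},
    {"patient_id": "P1", "country": "South Africa", "subtype": "C"},   # duplicate
    {"patient_id": "P2", "country": "Cameroon", "subtype": "CRF02_AG"},
    {"patient_id": "P2", "country": "Cameroon", "subtype": "G"},       # dual infection
]
```

Keying on the (patient, subtype) pair rather than on the patient alone is what preserves the stated exception for patients infected with more than one subtype.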

RESULTS
A total of 1084 full-length or near full-length HIV-1 genome sequences of sub-Saharan African origin were retrieved from GenBank (http://www.ncbi.nlm.nih.gov) and the Los Alamos HIV Sequence Database (http://www.hiv.lanl.gov). The sequences had been submitted to the databases between 1983 and 2011. In contrast, there were 127,798 partial HIV-1 sequences from the region.
After removal of duplicates, there were 934 unique full-length HIV-1 sequences from patients within the sub-Saharan Africa region. Sequences were available from 27 of the 45 sub-Saharan countries. An overwhelming majority of the sequences (96.7%) were of HIV-1 Group M, with Groups N and O accounting for 1.1 and 2.1% of the sequences, respectively (Table 1). Subtype C was the most common subtype, accounting for almost half of the Group M sequences, with recombinants responsible for 29% of the Group M sequences. A global HIV-1 subtype analysis was done in 2004 (Hemelaar et al., 2006), and some of its results are compared with the present analysis (Table 1).
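The counts reported above imply how many duplicate records were removed and how heavily partial sequences outnumber full-length ones. The short Python sketch below reproduces that arithmetic from the three figures given in the text; the variable names are chosen here for illustration.

```python
# Figures taken from the text above.
retrieved_full_length = 1084   # full-length or near full-length sequences retrieved
unique_full_length = 934       # sequences remaining after removal of duplicates
partial_sequences = 127_798    # partial HIV-1 sequences from the region

# Derived quantities.
duplicates_removed = retrieved_full_length - unique_full_length
retained_fraction = unique_full_length / retrieved_full_length
partial_per_full_length = partial_sequences / retrieved_full_length

print(f"Duplicates removed: {duplicates_removed}")
print(f"Retained after de-duplication: {retained_fraction:.1%}")
print(f"Partial sequences per full-length sequence: {partial_per_full_length:.0f}")
```

In other words, 150 records were dropped as duplicates, and there were roughly 118 partial sequences in the databases for every full-length or near full-length sequence retrieved.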

Central Africa
Sequences were available from five of the nine countries. Over three-quarters of the sequences were from Cameroon. Recombinant forms made up 64.2% of the sequences (Table 2). All the HIV-1 groups (M, N, O and P) were present among the Cameroonian sequences, as were six of the nine Group M subtypes. Recombinants made up 66.7% of the sequences from Cameroon. Group O was present among sequences from Gabon, while recombinants accounted for 66.7% of the sequences from the Democratic Republic of Congo.

East Africa
Sequences were retrieved for seven out of 13 countries in the region. Most of the sequences (94.4%) were from three countries: Kenya, Uganda and Tanzania. The sequences from East Africa were distributed among recombinants (37.2%), subtype A (32%), subtype D (18.4%) and subtype C (11.2%). About half of the sequences from Kenya were of subtype A, with recombinants making up 41.8%. Of the Ugandan sequences, 44.2 and 32.6% were subtype D and recombinants, respectively, while subtype A accounted for 23.2%. Recombinants and subtype C each accounted for 44% of the sequences from Tanzania.

Southern Africa
Sequences were available for four out of eight countries in the region. Most of the sequences were from South Africa (78.4%), and 97.6% of the sequences from South Africa were of subtype C, with recombinants contributing just 1.2% of the sequences. All the sequences from Malawi and 94.7% of those from Zambia were also of subtype C.

West Africa
Sequences were obtained for 10 out of 16 countries in the region. Over 75% of the sequences were from just three countries: Nigeria, Ghana and Senegal. In this region, 71.83% of the sequences were recombinants. Surprisingly, HIV-1 Group O was present among the sequences from Senegal (Table 2). Among the Nigerian sequences, subtype G and recombinants were each responsible for 47.6% of the sequences. Recombinants were responsible for 90% of the sequences from Ghana and 46.1% of those from Senegal.

Circulating and unique recombinant forms (CRFs/URFs)
Recombinants accounted for a substantial proportion of the available sequences from certain countries in the West, Central and East African regions (Table 3), ranging from 32.6 to 90%. Five countries had no recombinants (Chad, Djibouti, Ethiopia, Somalia and Malawi). Twenty-one of the 58 characterized CRFs were present among the sequences. Ghana had the highest proportion of recombinant sequences, but Cameroon had the largest absolute number of CRFs/URFs.
Cameroon also had the greatest diversity of recombinants, with nine different CRFs and 23 different URF types, followed by Kenya with 11 URF types and 2 CRFs, the Democratic Republic of Congo (4 CRFs, 5 URFs) and Ghana (3 CRFs, 5 URFs) (Table 4). CRF02_AG was the most prevalent recombinant form, accounting for 47.6% of all CRFs and 21.8% of all recombinants. It was common in Cameroon, Nigeria and Ghana. An intergroup recombinant, 02O, was among the Cameroonian sequences. No CRFs and only a few URFs were found in the Southern African region, while the countries in East Africa had more URFs than CRFs. In contrast, there were more CRFs than URFs in the countries in Central and West Africa (Table 4). Also, more recombinants were detected in our analysis than in the earlier analysis that included partial sequences.
Table 5 shows regional differences in the distribution of some subtypes when partial and full-length sequences were used in the analysis. Subtypes A and G were detected less frequently when full-length genome sequences were used than when partial sequences were used, especially in regions where recombinants are common.

DISCUSSION
Despite Africa bearing the brunt of HIV infection, there is limited information on the molecular epidemiology of HIV-1 due to the paucity and uneven availability of both partial and full-length genome sequences across the continent.
Results from a previous analysis (Hemelaar et al., 2006) had shown that subtype A accounted for 21% of HIV-1 infections in West Africa, but in our analysis it represented only 4.2% of sequences. Also, subtype G, previously observed to represent 35% of infections, accounted for only 16.9% of sequences. Though subtype A was projected to represent 29% of infections in Nigeria, there were no available full-length or near full-length subtype A sequences from Nigeria. The partial sequences analysed in that survey might in fact be parts of CRF02_AG recombinants that, on their own, appear as subtype A or G. The earlier analysis had used sequences irrespective of length, but weighted the distribution according to the number of HIV-infected people in each country. Disparities in subtype frequencies between our analysis and the Hemelaar study were also observed in some countries from the different sub-regions. This implies that the true HIV-1 subtype distribution might not have been captured using partial sequences.
Our analysis shows that recombinants (CRFs/URFs) constituted a substantial proportion of HIV-1 genotypes in sub-Saharan Africa. Our estimates of CRFs were higher than those obtained in a comprehensive audit of HIV distribution in 2004 (Hemelaar et al., 2006).
In that analysis, recombinants were projected to account for 42.6% of HIV-1 infections in West Africa, while we observed 71.8% for the same region. The differences in estimates might be due to the different approaches used, and also to timing, because the Hemelaar study evaluated 2003/2004, while we analysed all full-length sequences available at the time of analysis. Our analysis sought to present the subtype distribution based on the full-length HIV-1 sequences available in the Los Alamos database. Whilst not a perfect approach, it presents the genetic diversity as determined unambiguously by full-length genome sequences.
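The West African figures quoted above can be set side by side as shifts in percentage points. The Python sketch below does this for the three categories mentioned in the text; the dictionary layout and function name are illustrative choices. Because the earlier estimates were weighted by the number of HIV-infected people in each country whereas the present figures are raw proportions of available sequences, the shifts should be read as indicative only.

```python
# West Africa: (earlier estimate from Hemelaar et al., 2006, this analysis), in percent.
# Values are those quoted in the Discussion.
west_africa = {
    "subtype A": (21.0, 4.2),
    "subtype G": (35.0, 16.9),
    "recombinants": (42.6, 71.8),
}


def percentage_point_shift(estimates):
    """Return full-length minus earlier estimate, in percentage points."""
    return {
        name: round(full_length - earlier, 1)
        for name, (earlier, full_length) in estimates.items()
    }


shifts = percentage_point_shift(west_africa)
for name, delta in shifts.items():
    print(f"{name}: {delta:+.1f} percentage points")
```

The fall in subtypes A and G alongside the rise in recombinants is consistent with the suggestion that partial A or G sequences may belong to CRF02_AG genomes.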
There has been a consistent increase in the reporting of CRFs and URFs (Vidal et al., 2000; Nyombi et al., 2008). This is further buttressed by the fact that 21 new CRFs have been characterized between 2008 and 2013, implying that the already complex genetic diversity of HIV-1 is evolving further.
The increasing number of CRFs and their relative spread is also a reason for more full-length sequencing and analysis. This is important because the clinical implications of subtype variation with regard to recombinants are yet to be established. The spatial distribution of CRFs and URFs needs clarification, as CRFs were common in Central and West Africa, while URFs were common in East Africa.
Our analysis further reveals the dearth of HIV-1 sequence information from sub-Saharan Africa, as there were only 934 full-length sequences from 27 countries that are home to millions of people living with HIV. Hemelaar et al. (2011), in a later review, noted that the available sequences were not representative of the HIV-1 distribution in the countries of origin, and that some countries harbouring large numbers of infected individuals with high subtype diversity had a small amount of HIV data. This is particularly true of a country like Nigeria, which has a relatively high HIV-1 burden but only 21 available full-length HIV-1 sequences. In a review of selected studies that have documented HIV subtype diversity in East, West and Southern Africa, only two studies were observed to have used near full-length and full-length genome sequences (Lihana et al., 2012). There is thus a need for more full-length sequencing. Cost and a lack of requisite equipment and manpower are probably responsible for the gross underrepresentation of sub-Saharan HIV-1 full-length sequences. Worryingly, of over 2148 high-throughput sequencing machines in the world, only 17 are in Africa (3 in Kenya, 14 in South Africa) (http://omicsmaps.com/). Due to cost or other limitations, most studies in Africa are limited to sequencing of partial HIV genomes, and even these studies are identifying recombination within partial gene sequences (Kiwelu et al., 2013), so the full extent of genetic variability and recombination could be obscured unless the full-length genome is sequenced and analysed. Studies using partial sequences have shown an increase in the detection of URFs and drug-resistant viruses in sub-Saharan Africa (Ragupathy et al., 2011; Jacobs et al., 2014). Analysis of full-length sequences could possibly lead to greater identification of these recombinant forms.
Table footnote (key to CRF numbers): CRF02_AG; 6, CRF06_cpx; 9, CRF09_cpx; 10, CRF10_CD; 11, CRF11_cpx; 13, CRF13_cpx; 16, CRF16_A2D; 18, CRF18_cpx; 21, CRF21_A2D; 22, CRF22_01A1; 25, CRF25_cpx; 26, CRF26_AU; 27, CRF27_cpx; 30, CRF30_0206; 32, CRF32_06A1; 36, CRF36_cpx; 37, CRF37_cpx; 45, CRF45_cpx; 49, CRF49_cpx.
Analysis of full-length sequences could also help in the accurate identification of low-frequency viral variants (Henn et al., 2012), and the use of multiple genes rather than a single gene to identify HIV-1 subtypes can reduce the chances of false identification (Neogi et al., 2012). The fact that full-length subtype E and I isolates were never found, and that these have now been re-designated as circulating recombinant forms CRF01_AE and CRF04_cpx, respectively (Carr et al., 1996; Gao et al., 1998; Paraskevis et al., 2001), together with the suggestion that subtype G was actually a recombinant whose parental strains included CRF02_AG (Abecasis et al., 2007), justifies the calls for subtype classification to be based only on analysis of full-length or near full-length genomes.
It follows that analysis of partial HIV-1 sequences could be misinterpreted and may not reveal the true picture of HIV-1 biology and pathogenesis. Therefore, there is a need to know the current incidence/distribution of HIV-1 and to expand the subtype database, as these may impact on diagnosis, therapy and vaccine design. Full-length sequences are probably the most accurate representation of HIV genetic diversity.

Conclusion
This analysis brings to light the need for more full-length genome sequences from the sub-Saharan Africa region. This is a herculean task because even partial sequences are difficult to come by in most countries in the region. It will require an understanding of the importance of sequencing, commitment from governments within the region and continuous hard work from scientists to achieve this objective. The periodic monitoring of HIV variants could help determine the extent of virus evolution.

Table 1.
Frequency of the HIV-1 groups and subtypes using full-length genome sequences. Figures from an earlier study (Hemelaar et al., 2006) are indicated.
N/S, not stated.

Table 2.
Regional distribution of HIV-1 subtypes using full-length genome sequences.
* UNAIDS Report on the Global AIDS Epidemic 2013; DRC, Democratic Republic of Congo; CAR, Central African Republic; N/A, not available.

Table 3.
Distribution of circulating and unique recombinant forms.

Table 4.
Frequency of recombinant forms in some countries.

Table 5.
Recombinants. The proportions of regional sequences that are recombinants in this analysis and in the Hemelaar et al. (2006) study are indicated.