Analysis of factors affecting codon usage bias in human papillomavirus

Indices of the codon usage pattern of human papillomavirus (HPV) were analyzed to understand the key determinants of synonymous codon usage in the HPV genome. The complete sequences of 39 HPV genomes were downloaded from the website of the National Center for Biotechnology Information. The relative synonymous codon usage values, effective number of codons, GC content, percentage of GC at the third position of synonymous codons (GC3s), codon adaptation index, and the hydrophobicity and aromaticity of conceptually translated gene products were calculated using the CodonW 1.4.2 program. HPV preferentially used codons ending with A/U. A comparison of the relative synonymous codon usage of the HPV genome with that of the human genome showed that the codon usage of HPV was almost entirely different from that of humans. Principal component analysis showed a statistically significant separation on the first axis between codons ending with A/U and those ending with G/C. When plotted against GC3s, most of the effective number of codons values fell below the expected values. The effective number of codons values showed significant negative correlations with both aromaticity and hydrophobicity. These results suggest that compositional constraint is likely the key determinant of codon usage in the HPV genome.


INTRODUCTION
Analysis of codon usage of virus genomes enhances the understanding of virus evolution and virus-host interaction (Zhong et al., 2012). The genetic code comprises 64 codons: 61 encode the 20 amino acids and 3 are stop codons. The different codons that encode the same amino acid are termed synonymous codons, which, however, are not used at random (Karlin and Mrázek, 1996). The frequency of occurrence of synonymous codons differs between genes and between organisms (Grantham et al., 1980), a phenomenon called synonymous codon usage bias or codon usage bias.
Synonymous codon usage is related to DNA replication and transcription, open reading frame length, gene structure, protein secondary structure, mutation pressure, translational selection, natural selection, aromaticity and hydrophobicity of the corresponding polyprotein, and environmental conditions (Zhao et al., 2003; Bishal et al., 2013; Zhang et al., 2013). Natural selection and/or mutation pressure for efficiency and accuracy are the fundamental forces that influence synonymous codon usage (Jenkins and Holmes, 2003; Hu et al., 2014).
Synonymous codon usage differs significantly between virus genomes and those of their host species, which reflects different codon usage biases (Zhou et al., 1999; Chen, 2013; Cristina et al., 2016; Xu et al., 2017). Evolution of viruses involves changes in nucleotide composition, which ultimately create variation in the virus genome (Sablok et al., 2011; Zhang et al., 2011). Because viruses rely on the host's machinery for transcription, replication, protein synthesis, and transmission of their genomes, the interplay of codon usage between viruses and their hosts is expected to affect overall viral survival (Shackelton and Holmes, 2004). Therefore, measuring codon usage across the virus genome can improve understanding of the regulation of viral gene expression (Butt et al., 2014).
The human papillomavirus (HPV) is a non-enveloped, epitheliotropic, double-stranded DNA virus with a genome of approximately 8000 bp that infects stratified squamous epithelial cells. Gene expression of HPV in squamous epithelial cells is linked to the differentiation and function of the epithelial cells (Zheng and Baker, 2006; Zhao and Chen, 2011). The complete nucleotide sequence of the HPV genome was determined in 1985 (Seedorf et al., 1985). Although HPV causes certain cancers, such as oral cancer (Lee et al., 2010), uterine cervical cancer (Galloway and McDougall, 1989), and anal cancer (Daling et al., 2004), its pathogenicity remains unclear.
Although there are several studies on the HPV genome (Zhou et al., 1999; Zhao et al., 2003; Zhao and Chen, 2011), there is a dearth of information on synonymous codon usage in HPV and the factors that influence it. To obtain a better and more integrated understanding of synonymous codon usage in HPV, the codon usage patterns of 39 HPV genomes were analyzed. This study provides new insights into codon usage in the HPV genome.

Sequence collection
The complete sequences of the 39 HPV genomes used in this study were downloaded from the National Center for Biotechnology Information (NCBI) website (Table 1). The indices of codon preference in the HPV genome were calculated with the CodonW 1.4.2 program (Peden, 2005).

Relative synonymous codon usage
Relative synonymous codon usage (RSCU) values for all the codons of the 39 HPV genomes (excluding the codons for Met and Trp, each of which has only one codon triplet) were calculated to examine the features of synonymous codon usage without the influence of amino acid composition (Sharp and Li, 1986). The formula is as follows:

$$\mathrm{RSCU}_{ij} = \frac{G_{ij}}{\sum_{k=1}^{n_j} G_{kj}} \, n_j$$

where $G_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_j$ types of synonymous codons. Codons with an RSCU value higher than 1.0 have a positive codon usage bias, while codons with an RSCU value lower than 1.0 have a negative codon usage bias. Additionally, a comparative analysis of the RSCU values of HPV and humans (the latter downloaded from http://www.kazusa.or.jp/codon/) was performed.
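The RSCU definition above can be illustrated with a minimal Python sketch. It is not the CodonW implementation; the function name `rscu`, the table name `SYNONYMOUS_FAMILIES`, and the toy sequence are assumptions made only for this example. It assumes the standard genetic code and an in-frame coding sequence.

```python
from collections import Counter

# Synonymous codon families of the standard genetic code (RNA alphabet).
# Met (AUG), Trp (UGG) and the stop codons are left out, as in the text.
SYNONYMOUS_FAMILIES = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Asn": ["AAU", "AAC"],
    "Asp": ["GAU", "GAC"],
    "Cys": ["UGU", "UGC"],
    "Gln": ["CAA", "CAG"],
    "Glu": ["GAA", "GAG"],
    "Gly": ["GGU", "GGC", "GGA", "GGG"],
    "His": ["CAU", "CAC"],
    "Ile": ["AUU", "AUC", "AUA"],
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Lys": ["AAA", "AAG"],
    "Phe": ["UUU", "UUC"],
    "Pro": ["CCU", "CCC", "CCA", "CCG"],
    "Ser": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
    "Thr": ["ACU", "ACC", "ACA", "ACG"],
    "Tyr": ["UAU", "UAC"],
    "Val": ["GUU", "GUC", "GUA", "GUG"],
}


def rscu(coding_sequence):
    """Return {codon: RSCU} for every codon family that occurs in the sequence."""
    seq = coding_sequence.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for family in SYNONYMOUS_FAMILIES.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined, so skip the family
        for codon in family:
            # observed count divided by the mean count within the family
            values[codon] = counts[codon] * len(family) / total
    return values


# Hypothetical toy sequence: Ala encoded three times by GCU and once by GCA.
example = rscu("GCTGCTGCTGCA")
print(example["GCU"], example["GCA"], example["GCC"])
```

Within one family the RSCU values sum to the number of synonymous codons, so a value of 1.0 corresponds to a codon used exactly as often as expected under uniform usage.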

Principal component analysis
To investigate the major trends in codon usage variation among the 39 HPV genomes, principal component analysis was performed. By creating a series of orthogonal axes, the major trends present in the dataset were analyzed in multidimensional space. The codons were plotted on the first two axes because these two axes accounted for the highest fraction of the data variance (Gu et al., 2004).
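As a rough illustration of how an orthogonal axis capturing the largest share of variance can be extracted, the sketch below finds the first principal axis of a small matrix by power iteration on the covariance matrix. This is a swapped-in, simplified technique chosen to avoid external libraries; it is not the routine used in the study, and the function name `first_principal_axis` and the toy matrix are assumptions.

```python
import math


def first_principal_axis(rows, iterations=500):
    """Return (axis, scores, explained_fraction) for the first principal axis.

    rows: list of observations (e.g. genes), each a list of variables
    (e.g. RSCU values of the same codons in the same order).
    """
    n, m = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two observations")
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sample covariance matrix of the variables.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    total_variance = sum(cov[a][a] for a in range(m))
    if total_variance == 0:
        raise ValueError("data have no variance")
    # Power iteration: repeated multiplication converges to the dominant
    # eigenvector unless the start vector is orthogonal to it.
    axis = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        product = [sum(cov[a][b] * axis[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the data variance")
        axis = [x / norm for x in product]
    eigenvalue = sum(axis[a] * sum(cov[a][b] * axis[b] for b in range(m))
                     for a in range(m))
    scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
    return axis, scores, eigenvalue / total_variance


# Hypothetical toy data: three observations of two perfectly correlated variables.
axis, scores, explained = first_principal_axis([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(axis, scores, explained)
```

The entries of `axis` are the loadings of each variable (here, each codon) on the first axis, and `explained` is the fraction of total variance that the axis accounts for. A second axis would require deflating the covariance matrix, which this sketch omits.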

The index of GC3s
The GC3s index was calculated as the fraction of G and C nucleotides at the third position of synonymous codons, excluding the codons encoding Met or Trp and the termination codons (Epstein et al., 2000).
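A minimal sketch of the GC3s calculation described above follows. It assumes the standard genetic code and an in-frame coding sequence; the function name `gc3s` and the example sequence are hypothetical and serve only as an illustration.

```python
# Codons excluded from GC3s under the standard genetic code:
# Met (AUG), Trp (UGG) and the three termination codons.
EXCLUDED_CODONS = {"AUG", "UGG", "UAA", "UAG", "UGA"}


def gc3s(coding_sequence):
    """Fraction of synonymous codons that carry G or C at the third position."""
    seq = coding_sequence.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Keep only unambiguous codons that belong to a synonymous family.
    synonymous = [c for c in codons
                  if c not in EXCLUDED_CODONS and set(c) <= set("ACGU")]
    if not synonymous:
        raise ValueError("no synonymous codons in sequence")
    return sum(c[2] in "GC" for c in synonymous) / len(synonymous)


# Hypothetical toy sequence: AUG GCU GCC AAA UAA.
# Only GCU, GCC and AAA count, and one of them ends in G or C.
print(gc3s("ATGGCTGCCAAATAA"))
```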

The effective number of codon values
The effective number of codons (ENC), regarded as the best estimator of absolute synonymous codon usage bias (Comeron and Aguade, 1998), was used to quantify the codon usage bias of each open reading frame (Wright, 1990). The expected ENC values were calculated as:

$$\mathrm{ENC}_{\mathrm{expected}} = 2 + s + \frac{29}{s^{2} + (1 - s)^{2}}$$

where $s$ denotes the GC content at the third position of the synonymous codons (GC3s), expressed as a fraction between 0 and 1.
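The expected ENC curve can be evaluated directly from GC3s. The sketch below uses the expression from Wright (1990) for the expected ENC under purely compositional constraint, with GC3s given as a fraction; the function name `expected_enc` is an assumption for illustration.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage is shaped by GC3s alone (Wright, 1990)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("GC3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


# Sample the curve at a few GC3s values, as one would to draw the ENC-GC3s plot.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, round(expected_enc(s), 2))
```

An observed ENC value that lies on or slightly below `expected_enc(s)` for the same GC3s is consistent with compositional constraint, whereas a value far below the curve points to additional factors.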

Indices for measuring chemical properties of amino acids
Hydrophobicity (GRAVY) and aromaticity (AROMO) of conceptually translated gene products may be factors that influence codon usage bias patterns (Peden, 1999).
The hydrophobicity index (Peden, 1999) is calculated as:

$$\mathrm{GRAVY} = \frac{1}{N} \sum_{i=1}^{N} k_i$$

where $N$ is the number of amino acids and $k_i$ is the hydrophobic index of the ith amino acid.
The aromaticity index (Peden, 1999) is calculated as:

$$\mathrm{AROMO} = \frac{1}{N} \sum_{i=1}^{N} v_i$$

where $v_i$ is either 1 (for the aromatic amino acids Phe, Tyr, and Trp) or 0 (for a non-aromatic amino acid), and $N$ is the number of amino acids.
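Both indices are simple averages over the translated protein, as the following sketch shows. The text does not list the hydrophobic index values, so the Kyte-Doolittle hydropathy scale is assumed here; the function names `gravy` and `aromaticity` are likewise hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumption: the scale is not named in the text).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = {"F", "Y", "W"}  # Phe, Tyr, Trp


def _clean(protein):
    """Upper-case the sequence, drop a terminal stop symbol, validate residues."""
    residues = protein.upper().rstrip("*")
    unknown = set(residues) - set(HYDROPATHY)
    if not residues or unknown:
        raise ValueError("empty sequence or unknown residues: %s" % sorted(unknown))
    return residues


def gravy(protein):
    """Mean hydropathy of the residues (the hydrophobicity index)."""
    residues = _clean(protein)
    return sum(HYDROPATHY[r] for r in residues) / len(residues)


def aromaticity(protein):
    """Fraction of residues that are Phe, Tyr or Trp (the aromaticity index)."""
    residues = _clean(protein)
    return sum(r in AROMATIC for r in residues) / len(residues)


# Hypothetical toy peptide.
print(gravy("AFWY"), aromaticity("AFWY"))
```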

Codon adaptation index
The codon adaptation index (CAI) measures the relative adaptiveness of the codon usage of a gene to the codon usage of highly expressed genes.
For each genome sequence G and some set of coding sequences S in G, codon bias is measured with respect to synonymous codon usage in S. For a given amino acid j, its synonymous codons may have different frequencies in S; if $x_{i,j}$ is the number of times that codon i for amino acid j occurs in S, then codon i is assigned a weight $w_{i,j}$ relative to its sibling of maximal frequency $y_j$ in S:

$$w_{i,j} = \frac{x_{i,j}}{y_j}$$

A codon with maximal frequency in S is called preferred among its sibling codons. To each gene g in G, Sharp and Li associated a value in [0, 1], called the CAI, defined as:

$$\mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L}$$

where $L$ is the number of codons in the gene and $w_k$ is the weight of the kth codon in the gene sequence. Genes with a CAI value close to 1 are composed of highly frequent codons (Sharp and Li, 1987).
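The two steps of the CAI definition, computing the weights from a reference set and then taking the geometric mean over a gene, can be sketched as follows. The function names, the two-family toy reference set, and the choice to skip codons that never occur in the reference set are assumptions of this sketch, not details taken from the study.

```python
import math
from collections import Counter


def relative_adaptiveness(reference_codons, families):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference set S."""
    counts = Counter(reference_codons)
    weights = {}
    for family in families:
        y = max(counts[codon] for codon in family)
        if y == 0:
            continue  # amino acid absent from the reference set
        for codon in family:
            if counts[codon] > 0:  # zero counts are skipped to avoid log(0)
                weights[codon] = counts[codon] / y
    return weights


def cai(gene_codons, weights):
    """Geometric mean of the weights of the codons in a gene.

    Codons without a weight (Met, Trp, stops, or unseen in S) are ignored,
    which is a simplification of this sketch.
    """
    logs = [math.log(weights[c]) for c in gene_codons if c in weights]
    if not logs:
        raise ValueError("no weighted codons in gene")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set with two codon families only.
families = [("GCU", "GCC"), ("AAA", "AAG")]
reference = ["GCU"] * 4 + ["GCC"] + ["AAA"] * 2 + ["AAG"] * 2
weights = relative_adaptiveness(reference, families)
print(weights)
print(cai(["GCC", "AAA"], weights), cai(["GCU", "AAA"], weights))
```

Working with logarithms rather than a direct product avoids numerical underflow for long genes.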

Statistical analysis
Correlation analysis was performed using Spearman's rank correlation method in R (R Development Core Team, 2011).
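The study used R for this step. For readers who want to see what the statistic computes, the following sketch reproduces Spearman's rank correlation coefficient in plain Python as the Pearson correlation of average ranks. The function names and toy values are assumptions, and the sketch does not compute a p-value.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of tied values
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: a perfectly decreasing relationship.
print(spearman_rho([30.0, 40.0, 50.0, 60.0], [0.25, 0.20, 0.15, 0.10]))
```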

Overall relative synonymous codon usage of the HPV genome
The overall RSCU values of all the codons in the 39 HPV genomes are summarized in Table 2. A significantly nonrandom usage of degenerate codons encoding 18 amino acids was found. The amino acids Arg, Leu, and Ser had six-fold codon degeneracy. For Arg, the AGA, CGA, and CGU codons (RSCU of 1.13-1.37) were more frequently used than other codons (RSCU of 0.63-0.

Relationship between the codon usage patterns of HPV and the host
Comparison of the genomes of HPV and humans revealed that the codon usage pattern of the virus differed from that of the host (Figure 1). Only a few synonymous codon usage patterns were similar between HPV and humans; these similarities were found in Ala (GGU), Pro (CCA and CCU), Arg (AGA), and Ser (UCU) (Table 2).

Principal component analysis
Principal component analysis was performed for all the genes in the 39 HPV genomes. The analysis detected one major trend in the first axis and another major trend in the second axis. The plots of codons ending with A/U (Figure 2a) and G/C (Figure 2b) were scattered in different ways. Most of the codons ending with A/U were clustered around the origin (0, 0), while codons ending with G/C appeared on both sides of the first axis. The separation of these codons on the first axis was statistically significant by analysis of variance (p < 0.05). These results suggest that certain factors influence codon usage and produce the observed difference between the plots of codons ending with A/U and those ending with G/C.

Relationship of ENC values with GC3s
The ENC values for the 39 HPV genomes varied from 25.93 to 61.0, with a mean of 46.98 and a standard deviation of 6.11. The ENC-GC3s plot showed that most of the ENC values were just below the expected curve (Figure 3). Only 2.2% of the total genes had high codon bias (ENC < 35). About 28% of the total genes (76 genes) had high ENC values (above 50), indicating that codon usage in these genes was close to random.

Level of gene expression and codon bias
The level of gene expression of HPV was estimated from the codon adaptation index (CAI) values, which varied from 0.108 to 0.268, with a mean of 0.180 and a standard deviation of 0.0256. A significant negative correlation was observed between ENC and CAI (Figure 4a) (Spearman, r = -0.37991, p < 0.001).

Correlation between ENC and both AROMO and GRAVY
We also investigated whether other factors could explain the codon usage bias seen in the HPV genome. Significant negative correlations were observed between ENC and both aromaticity (Figure 4b) (Spearman, r = -0.71464, p < 0.001) and GRAVY (Figure 4c) (Spearman, r = -0.4391, p < 0.001).

DISCUSSION
The term codon usage bias refers to the unequal usage of synonymous codons for encoding amino acids, which may differ significantly between genomes, between genes, and within a single gene. For this reason, codon usage bias has received much attention, and various studies on it have been reported (Mazumder et al., 2014). RSCU results showed that all preferred codons ended in A/U, which accounted for the majority of the nucleotide composition of the HPV genome (Nasrullah et al., 2015). Principal component analysis showed that the distributions of codons ending in A/U and those ending in G/C were statistically different. These results imply that nucleotide composition is a key factor in determining the preferred codon usage in HPV.
Investigations of synonymous codon usage have not only provided insight into the molecular evolution of genes, but also identified potential modulation of gene expression as a result of codon selection that influences efficiency (Heitzer et al., 2007; RoyChoudhury and Mukherjee, 2010). In this study, we observed that the RSCU of the HPV genome showed a complementary trend when compared to the RSCU of the human genome. This might be beneficial for virus survival and persistence by eliminating competition with the host translation machinery (Zhong et al., 2012). Moreover, this pattern might also be induced by a process of selective evolution of the virus. It has been proposed that differential synonymous codon usage of a virus and its host strongly influences both viral replication and gene expression (Zhao et al., 2003).
The ENC value, which is one of the best overall estimators of absolute synonymous codon usage bias, provides an intuitively meaningful measure of the extent of codon preference in a gene (Wright, 1990; Ma et al., 2014). Compared to the ENC values of other DNA viruses, the ENC values of HPV suggest that the low codon bias may result from an increase in its replication efficiency in order to adapt to the replication system of the host (Liu et al., 2012).
The ENC-GC3s plot is used to investigate patterns of synonymous codon usage visually (Wright, 1990; Comeron and Aguade, 1998; Gupta and Ghosh, 2001). It has generally been used to determine whether the codon usage of given genes is influenced by mutation alone (the corresponding points lie around the expected curve) or also by other factors such as selection (the corresponding points lie markedly below the expected curve) (Chen et al., 2014). The data points follow a curvilinear trend if synonymous codon usage is determined only by the GC content at the third codon position (Gu et al., 2004). If synonymous codon usage depends on compositional constraints, the data points fall on or just below the expected curve; however, if synonymous codon usage is subject to natural selection, the points should lie considerably below the expected curve (Wright, 1990; Ma et al., 2014). In this study, most of the data points were immediately below the expected curve, suggesting that synonymous codon usage in these 39 HPV genomes was mainly influenced by compositional constraints. The results of the correlation analysis between ENC and GRAVY or AROMO indicate that these factors have little effect on the synonymous codon usage of HPV.
The CAI is used to characterize translationally optimal codons that are preferentially used in highly expressed genes (Xia, 2007; RoyChoudhury and Mukherjee, 2010; Mazumder et al., 2014). It is expressed as a value ranging from 0 to 1, where a higher value is likely to indicate stronger codon usage bias and a potentially higher expression level. This information is useful for identifying highly expressed genes in any organism (Sharp and Li, 1987). In this study, the CAI values indicate that most of the HPV genes are not highly expressed. Furthermore, a significant negative correlation was detected between ENC and CAI. These results suggest that the level of gene expression has little effect on the synonymous codon usage of HPV.
In summary, we hypothesize that HPV codon usage may influence its pathogenic mechanism by striking a balance with the codon usage of the host and ensuring competition-free survival. This finding would also be useful for understanding the virus-host interaction and the evolution of HPV.

Figure 1. Comparison of the codon usage patterns between the HPV genome and the human genome.

Figure 2. Distributions of codons ending with A/U (a) and with G/C (b) are shown along the first and second axes of the principal component analysis.

Figure 3. The relationship between the effective number of codons (ENC) and the GC content of the third codon position (GC3s) of the human papillomavirus. The solid curve represents the expected curve between ENC and GC3s under random codon usage. Black dots indicate all the genes.

Figure 4. Correlation analysis plots showing (a) the relationship between ENC and CAI, (b) the relationship between ENC and AROMO, and (c) the relationship between ENC and GRAVY.

Table 1. Accession number and genome size of the human papillomavirus.

Table 2. Relative synonymous codon usage values of the human papillomavirus and humans with the exception of those encoding Met or Trp. The preferentially used codons for each amino acid are displayed in bold.