Genomic composition factors affect codon usage in porcine genome

The objective of the study was to determine the codon usage bias in the porcine genome and decipher its determinants. To investigate the underlying mechanisms of codon bias, the coding sequence (CDS) from the swine reference sequence (ssc10.2) was extracted using Biomart. An in house built Perl script was used to derive various genomic traits and codon indices. Analysis was done using R statistical package, and correlations and multivariate regressions were performed. We report the existence of codon usage bias that might suggest existence of weak translational selection. The codon bias is feebly related to nucleotide composition (GC%, GC3, CDS length). This study can be explored for designing degenerate primers, necessitate selecting appropriate hosts expression systems to manipulate the expression of target genes in vivo or in vitro and improve the accuracy of gene prediction from genomic sequences thus maximizing the effectiveness of genetic manipulations in synthetic biology.


INTRODUCTION
The availability of nearly complete genome sequences from different taxa has enabled tremendous advances in evolutionary biology, providing insight to the actions of natural selection on genomes (Whittle et al., 2012).These biological breakthroughs revealed the importance of studying the degeneracy of genetic code, which enables most amino acids to be coded by more than one o called 'synonymous' codon (Wright, 1990).Synonymous codons usage (SCU) bias has been documented both within and between genomes, with huge interspecific and even intragenomic variation (Jia et al., 2009).
Several biological factors such as tRNA abundance (Kanaya et al., 2001), strand specific mutational bias, replicational, transcriptional and translational selection (Hershberg and Petrov, 2008), secondary structure of proteins, mRNA structure, GC composition (Knight et al., 2001) and environmental factors (Basak and Ghosh, *Corresponding author.E-mail: jkhobondo@gmail.com. Author(s) agree that this article remain permanently open access under the terms of the Creative Commons Attribution License 4.0 International License 2005; Behura et al., 2013a) have been reported to influence the synonymous codon usage in various organisms.The afore mentioned factors led to two hypotheses on the evolution of codon bias; mutation bias and natural selection for translation accuracy and efficiency (Sharp et al., 2005).
The mutational bias hypothesis predicted that genes in the GC-rich regions of the genome preferentially use Gand C-ending codons, while those in the AT-rich regions use A-and T-ending codons (Zhang et al., 2009) as observed in mammals.
In E. coli, Stoletzki and Eyre-Walker (2007) found strong support for the selection for translational accuracy hypothesis; they reported that highly conserved sites and genes have higher codon bias than less conserved sites and genes.In their report, codon bias was positively correlated to gene length and production costs, both indicating selection against missense and nonsense mutations.This was further corroborated in plants such as Arabidopsis thaliana, Oryza sativa and Zea mays where codon usage bias was correlated to the base composition of genes, gene expression level and CDS length (Morton and Wright, 2007).
In higher animals like humans there are reported codon bias which are thought to maximize the speed of elongation, minimize the costs of proofreading thus maximizing the accuracy of translation (Bulmer, 1991).Several studies have failed to disentangle between translation accuracy and efficiency, as both are believed to be intertwined.
However, there are reports correlating the translational efficiency with expression levels and use of codons that match common tRNAs.For instance, in eukaryotes, codon bias have been associated with translation efficiency (Qian et al., 2012), so that the most abundant tRNA can be recognized easily in highly expressed genes.In the study, Qian et al. (2012) hypothesized that different synonymous codons were translated at different speeds due to disparities in codon selection time, and that faster translation were important because it minimize ribosome sequestering and so help alleviate ribosome shortage.
On the contrary, some reports dismiss translation efficiency and accuracy from synonymous codon bias usage in mammals (Reis et al., 2004).Instead splice related biases are evident (Parmley and Hurst, 2007) and selection for the preservation of exonic splicing enhancers (ESEs) where there are high density of regulatory elements, explains low SNP density, low protein evolutionary rates, and low synonymous substitution near intron-exon boundaries (Parmley and Hurst, 2007).
Despite reports on codon usage in mammals, there is no such literature in porcine genome.The study of codon usage would shade light to the known disparity in gene expression levels and quantitative trait loci that may related to genome architectures.In this study, we tested these mutation and translation hypotheses using genomic traits at our disposal.Our results confirm evidence of codon usage bias that was affected by CDS length, GC content and GC3s in the CDS.The analysis of codon usage pattern in pigs might give insight for understanding the mechanism of biased usage of synonymous codons in silent sites.
This could necessitate selecting appropriate hosts expression systems to improve or decrease the expression of target genes in vivo or in vitro.The codon usage profiles can be explored for designing degenerate primers and improve the accuracy of gene prediction from genomic sequences and protein functional classification thus maximizing the effectiveness of genetic manipulations in synthetic biology (Qian et al., 2012).The study of gene expression traits in the porcine genome is relevant to many fundamental biological processes including species and breed diversity, gene expression and evolution, and adaptation to micro environment.The objective of the study was to determine determinants of codon usage bias in the pig genome to further decipher plasticity of genes.

Sequence data
Two complete genome sequences were used for analysis.A total of 23,269 coding sequences was extracted from the female Duroc pig breed as the reference genome, (Sus scrofa build 10.2) using BioMart (Ensembl v 68).Only 21,550 coding sequence (CDS) that were more than 50 amino acids (150 bp) were included for analysis.The short CDS were excluded due to large estimation errors for codon usage which are associated with short sequence length.The majority of the excluded genes were microRNAs which are averagely 22 bp in length.The second data was extracted from gene coordinates and Ssc 10.2 reference genome.

Relative synonymous codon usage (RSCU)
Relative synonymous codon usage (RSCU) is the proportion of the observed codon divided by its expected frequency at equilibrium.An RSCU value close to 1 indicates lack of bias, RSCU >1 indicates a codon used more frequently than expected, and RSCU < 1 indicates a codon used less frequently than expected (Sharp et al., 2005).
RSCU values are largely independent of amino acid composition and are particularly useful in comparing codon usage among genes, or sets of genes that differ in their size and amino acid composition.In this study we developed an in house Perl script to calculate RSCU as; RSCU = j Where, C denotes actual codon counts, j denotes number of synonyms and i denote the codon counts within the synonyms.We finally derived mean RSCU as; Where, RSCU is the values derived per gene, N is the total number of codon counts per gene.The value of RSCU_Mean ranged from one to infinity depending on the biasness of the gene.It was assumed to be a sister index to genomic Codon Adaptation Index (gCAI) that is explained below.
Another parameter genomic RSCU (RSCU') was calculated as; Where, n is the number of synonymous codons of an amino acid.This parameter measures and compares the usage of all 61 sense codons and is the proportion of use of a codon in all genes.

Genomic codon adaptation index (gCAI)
Classical Codon adaptation index (CAI) was first used to measure gene expression.This measure is species dependent and is the empirical measure for gene in studies investigating mutational and selectional components of codon usage (Goetz and Fuglsang, 2005).A CAI value is always between 0 and 1, and a higher value means a stronger codon usage bias and higher expression level and / or translation efficiency.In most research CAI of a coding sequence (CDS) is computed from the two parameters; the codon frequencies of the CDS and the codon frequencies of a set of known highly expressed genes (often referred to as the reference set).This computation leads to CAI which is used as a proxy for gene expression.In this case CAI values are normalized using codon frequencies in highly expressed gene sets.According to Xia (2007), CAI computation involves first derivation of a column of W values; Where, Fij.ref is the frequency of codon j in synonymous codon family i, and MAXfi.ref is the maximum codon frequency in synonymous codon family i from a set of highly expressed genes.
The codon adaptation index for a given gene is then given by: Where, L is the number of codons from synonymous families in the gene.
In our study, genomic codon adaptation index (gCAI) is calculated as the geometric mean RSCU divided by the highest possible geometric mean of RSCU given the same Amino Acid sequence.
This value (gCAI) is a proxy for codon bias but not gene expression.This is because the gCAI values are normalized using codon frequencies at equilibrium, thus there is no assumption of gene expression bias.

Analysis tools
An in house Perl script was used to derive codon indices, gene length, GC and GC3 (the frequency of G+C at the third position) for all the CDS.Statistical analysis was conducted using R (V 2.15.0).We used a Sperman's rank correlation to relate codon indices (gCAI, RSCU ' ) with different nucleotide composition variables (that is, GC, GC3 and CDS length).Multivariate regression model was used to predict the biasness and determine contribution of genomic factors to the biasness.

Variation in the CDS length, GC content and GC3s in the Coding sequence
The coding sequence GC content ranged from 0.285 to 0.812 with a mean of 0.531.For the CDS length, the shortest and the longest gene were 151 and 45618 bp respectively, with a mean of 1415 bp.The comparison of GC content between genes (intron and exons) and CDS were on average 47 and 53%, respectively.The mean CDS and gene lengths were 1415 and 27338 bp (Table 1).This confirms that the CDS are generally GC richer than the genes.

Codon usage bias analyses
The observed relative synonymous codon usage (RSCU) clearly indicated that there was a nonrandom usage of synonymous codons for individual genes (Table 2).To investigate if the observed biasness, favoring specific codons, were beyond specific genes, we performed an overall genome wise analysis by concatenating all the genes into one large sequence string.The rationale was to exclude factors specific for individual genes.Preference of certain synonymous codons was observed in Figure 1.Table 2 shows the variation in RSCU values across codons coding for the same amino acid.This table highlights the biasness for a representative gene.For example valine a four degenerate amino acid has more preference for GTG = 1.5688 and GTC = 1.0139 than GTT = 0.8546 and GTA = 0.565.For Aspartate, GAC = 1.0598 was more preferred than GAT = 0.9401.Figure 1 depicts the codon usage of serine (RSCU) and may act as a representative of all synonyms.In this figure, the codon AGC, TCC and TCA were the most preferred in that order.To compare the usage of the 61 codons, RSCU' was used.In this analysis, the codons from two degenerate amino acids had higher values (for example, TAC = 5.50 x 10 -6 and TAT = 4.96 x 10 -6 ) as compared to three fold degenerate amino acids (for example, ATC = 3.09 x 10 -6 , ATT = 3.09 x 10 -6 , ATA = 2.08 x 10 -6 ).Generally it was noticed that the usage of the codons reduced with the increase of the degeneracy with the six fold degenerate having the lowest usage codons or observed bias.
We further fitted a model to predict the codon usage bias (CUB) using nucleotide composition factors as independent variables.The fitted model; Where, y is either gCAI or genomic RSCU.
In this model all the factors/variables were highly significant.The model explained 29% of the observed gCAI (Table 4).Almost similar results were realized with genomic RSCU except the change in the sign of the coefficients.As can be seen in Table 5, there was negative association between genomic RSCU with GC content, CDS length and the intercept.However, this parameter was positively associated with GC3 and GC frequencies.It is worth noting that coefficients had minimal effect.This showed that nucleotide composition factors play minor but significant roles in shaping codon bias.

DISCUSSION
Previous analyses of codon usage in different taxa have suggested that there exists a huge interspecific variation and clear intragenomic variability (Gagnaire et al., 2012) and this study is no exception.Several biological factors such as tRNA abundance, strand-specific mutational bias, gene expression level, gene length, amino acid composition, protein structure, mRNA structure, nucleotide composition, intron splicing, recombination, gene conversion, DNA packaging, intron number (Qin et al., 2013) and selection for increased translational efficiency or accuracy have been demonstrated to relate to codon usage bias (CUB) (Ingvarsson, 2007).Despite abundance of these reports very few studies have focused on mammalian genomes.It is commonly accepted that both natural selection and genetic drift shape CUB across taxa (Zhao et al., 2007).Selection of codon bias is generally viewed as being weak; therefore, it is expected that selective forces, such as purifying selection against unfavored codons, should be more prominent in organisms with large effective population size (Ne) such as prokaryotes and unicellular eukaryotes or even fruit flies.Species with low Ne are expected to be more prone to genetic drift and therefore, should show relaxed selective pressure on codon usage.In order to examine synonymous codon usage in the pig genome we first deciphered the genomic composition of the genes (intron and exons) and the CDS as well, followed by analysis of the CUB.We hereby present evidence suggesting that the pattern of synonymous codon choices in the Sus scrofa is as a result of a complex equilibrium between different forces, namely the natural selection at the translational level, nucleotide compositional, mutation bias and the length of each gene.
We report conclusive evidence for codon usage bias in the pig genome.The CUB is evident in the nonrandom usage of synonymous codons as shown by the codon indices.This finding is consistent with other studies involving prokaryotes (Karlin et al., 1997) and eukaryotes (Waldman et al., 2011).The observed preference of some codons could be suggestive of a weak selection force acting on codon pool in the pig genome.The observed  CUB is further proved to be influenced significantly by nucleotide composition.However, in contrast to other papers (Rao et al., 2011), we report negative correlation between genomic codon adaptation index (gCAI) or CUB and the CDS length.The GC content and GC3s were consistent with their findings.In humans, the GC content and mutational biases were reported as major factors that influence codon usage.In plants several factors like nucleotide composition of genes, the levels of gene expression and length of the coding sequence contributed to the observed codon usage bias.
In B. pseudomallei genome, highly expressed genes had the highest GC content and it tended to use G or C at the third position of the codon (Hershberg and Petrov, 2010).The highly expressed genes in B. pseudomallei also had high GC content positively correlated with CAI value and GC3s.Their result purport that the highly expressed genes tend to use 'C' or 'G' at synonymous positions compared with lowly expressed genes.In this study our results points to preferred usage of both C or G and A or T at the synonyms sites as shown in Table 2, with the C or G ending codons being the majority.
However a negative correlation between gCAI and GC content or GC3s is unique.This might be due to the difference in the genome isochore structure, ambiguity (vary with space and time) of the gene expression in mammals, or due to difference in methodology of calculating CAI variants The negative correlation found between gCAI and gene length is consistent with other   reports in organism such as yeast, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Populus tremula and Silene latifolia (Qiu et al., 2011).

Figure1: The relative synonymous codon usage of serine showing codon AGC and TCG as th and the list preferred codons respectively
Previous studies have shown that metabolic systems prefer to express those genes that are less costly (Hahn and Kern, 2005).Moreover, there have been reports of longer genes having higher expression level, CAI values and higher codon usage bias in some unicellular genomes, specifically E. coli (Stoletzki and Eyre-Walker, 2007).These contradicting reports (positive and negative correlation) indicate there are no universal rules about gene length and codon bias.In this study, the longer genes had lower gCAI values or lower codon usage bias.The general consensus is that there should exist a positive correlation between gCAI values and gene length, which could be explained by selection of the preferred codons to avoid errors during translation.Since the cost of producing a protein is proportional to its length, selection in favor of codons which increase accuracy should be greater in longer genes, and long genes should therefore have higher synonymous codon bias.
In such genes, by using optimal codons, translation is faster whereby ribosomes move faster along the mRNA and are released quickly to be available to translate other mRNA (Zhao et al., 2007).The use of optimal codons increases the accuracy of translation by reducing translational errors that can occur.The errors include missense in which an incorrect amino acid is incorporated into the growing peptide chain and nonsense in which the peptide synthesis terminates prematurely by incorporating stop codons.It is believed that both missense and nonsense errors that produce non-and misfunctional proteins respectively, are costly to the cell because they consume amino acids and energy both in their production and during breakdown (Stoletzki and Eyre-Walker, 2007).Besides, missense errors may have other serious consequences, for example, a missense error in a DNA polymerase may temporally increase the mutation rate (Ninio, 1991).
Pig genome just like other mammals is found to vary greatly in base composition between different genomic regions.In vertebrates, such as mammalian and birds, one of the most striking features of their genomes is the difference in G+C contents isolated regions called isochore structures.In pig there exists heterogeneity in G+C content that results in variation in codon usage bias as was revealed elsewhere (Hershberg and Petrov, 2010).
Having a relatively high GC content, we expected the pig preferred codons to mirror the genome composition.GC rich organisms tend to have GC rich optimal codons, while AT rich organisms tend to have AT rich optimal codons.This observation is manifested in RSCU as most preferred codons end with G or C albeit with some ending with A or T. This phenomenon is dependent on the isochore structure of the pig genome that we confirmed by observed variation in GC content.The data analyzed provide evidence for the mutational bias hypothesis.In our view the codon bias is skewed towards the AT ending codons as was revealed by inverse correlation between gCAI and GC3s.Indeed, AT rich genes were shorter in length and could imply efficient protein translation to minimize energy consumption.We also suspect that other factors besides mutation bias may have contributed to codon usage.Amongst the other factors, we hypothesize that selection for preferred codons is affected by the abundance of tRNA in the cells or the ones that bind those tRNAs with optimal binding strength (Ikemura, 1985;Kanaya et al., 2001).We could not confirm this due to lack of information on tRNA .However, this hypothesis has been proven in other organisms like E. coli, B. subtilis and C. cerevisae.In that study cellular tRNAs correlates positively and closely to tRNA gene copy numbers; by extension this suggests that in these species there is correlation between optimal codon use and tRNAs abundance.However such correlation was not found in studies involving D. melanongaster and humans (Kanaya et al., 2001).
The positive correlation observed between GC content, GC3 and gene length explains the computed low codon bias.This is because long genes tend to have more G and C, abundant G or C at the third codons which are negatively related to gCAI.The nucleotide composition factors only play significant but minor roles in shaping the codon usage in the pig genome as revealed in low R 2 value and statistical interpretation exhibited in multivariate regression analysis.These statistical inferences are clear indication that the pig genome is so complex and molecular functions are controlled by several factors.

Conclusions
We confirm the existence of codon usage bias in the porcine genome which might suggest there is weak selection of preferred codons for translation accuracy.The codon usage bias is influenced slightly by nucleotide composition factors among others.

Figure 1 .
Figure 1.The relative synonymous codon usage of serine showing codon AGC and TCG as the most and the list preferred codons respectively.

Table 1 .
Comparison of GC and length of both CDS and gene.The GC content of CDS is higher than the content of genes.

Table 2 .
Codon usage table of a representative gene (ENSSSCG0000000015) showing biased synonyms of 20 amino acids.The numbers in bold are the subtotal synonyms per amino acid.

Table 4 .
The coefficients of the multivariate regression analysis explaining the genomic composition factors affecting genomic codon adaptation index.

Table 5 .
The coefficients of the multivariate regression analysis explaining the genomic composition factors affecting the genomic RSCU (RSCU').