Codon usage bias analysis for the coding sequences of Camellia sinensis and Brassica campestris

Codon usage bias plays an important role in the regulation of gene expression. Several measures are widely used to quantify codon usage in genes. However, no quantitative attempt has been made to compare the pattern of codon usage diversity within and between different genes of Camellia sinensis and Brassica campestris. Nucleotide composition and its relationship with codon usage bias were analyzed. Additionally, the rare codons were identified by computing the frequency of occurrence of all codons in the coding sequences of C. sinensis and B. campestris. Escherichia coli, the universally used host cell, fails to express many eukaryotic genes efficiently. For this reason, the authors predicted the codons showing the highest and the lowest expressivity for the coding sequences of C. sinensis and B. campestris in the E. coli K12 strain, in order to improve the expression level of the genes.


INTRODUCTION
Gene expression is a fundamental cellular process by which proteins are synthesized in a cell based on the information encoded in the genes. Most amino acids can be encoded by more than one codon; such codons are described as synonymous, and they mostly differ by one nucleotide in the third position. Synonymous codons are not used uniformly: their usage varies across species and among genes within the same genome, a phenomenon called codon usage bias (CUB) (Akashi, 1994; Behura and Severson, 2013). Molecular evolutionary investigations of codon bias suggest that the frequency of codon use varies between genes from the same genome and also between genomes (Hooper and Berg, 2000). Highly expressed genes are more biased in their codon usage than lowly expressed genes, and this bias provides differential efficiency as well as accuracy in the translation of genes (Rocha, 2004; Hershberg and Petrov, 2008). The selection associated with translational efficiency and accuracy is often termed 'translational selection'. During the last two decades, numerous lines of evidence have suggested that codon usage bias is driven by selection, particularly in species of fungi (Bennetzen and Hall, 1982; Ikemura, 1985), bacteria (Ikemura, 1981; Sharp and Li, 1987a) and insects (Akashi, 1997; Moriyama and Powell, 1997).
Soon, after the discovery of whole genome sequencing

MATERIALS AND METHODS
The complete coding sequences (cds) of thirty genes from C. sinensis and forty-seven genes from B. campestris were retrieved from the National Center for Biotechnology Information (NCBI) nucleotide database, accessible at www.ncbi.nlm.nih.gov. Each of these cds was free of unknown bases (N) and internal stop codons, and possessed both a start and a stop codon.
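The screening criteria above can be expressed as a small filter. The following Python sketch is illustrative only: the function name `is_complete_cds` and the choice of ATG as the only accepted start codon are assumptions, not details reported in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq):
    """Return True if seq looks like a complete, unambiguous cds."""
    seq = seq.upper().replace("U", "T")
    # The length must be a whole number of codons (at least start + stop).
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    # Reject any unknown or ambiguous base such as N.
    if set(seq) - set("ACGT"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Assumption of this sketch: only ATG counts as a start codon.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # No stop codon may occur before the final codon.
    return not any(c in STOP_CODONS for c in codons[1:-1])
```

A sequence that fails any one of the checks would simply be left out of the data set.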
Relative codon usage bias (RCB) and the codon adaptation index (CAI) were used to study the overall codon usage variation among the genes. RCBS is the overall score of a gene, indicating the influence of the RCB of each codon in that gene. RCB reflects the level of gene expression. RCBS was calculated as described by Roymondal et al. (2009). Gene expressivity was also measured by calculating the CAI as per Sharp and Li (1986). It essentially measures the distance from a given gene to a reference gene with respect to their amino-acid codon usages. CAI defines translationally optimal codons as those that appear frequently in highly expressed genes, that is:

$$\mathrm{CAI}(g) = \left( \prod_{l=1}^{L} w_{c(l)} \right)^{1/L}$$

where L is the length of gene g in codons and w_{c(l)} is the relative adaptiveness, in the reference genes (not g), of the codon c at position l.
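As an illustration of the geometric-mean definition of CAI, the Python sketch below computes the index from a precomputed table of relative adaptiveness values. It is a minimal sketch, not the authors' Perl implementation: the function name `cai`, the `weights` argument, and the decision to skip codons that have no weight (for example stop codons) are all assumptions.

```python
import math


def cai(seq, weights):
    """Geometric mean of the relative adaptiveness of the codons in seq.

    weights maps codon -> w (0 < w <= 1), derived beforehand from a
    reference set of highly expressed genes. Codons absent from weights
    are skipped, which is an assumption of this sketch.
    """
    seq = seq.upper().replace("U", "T")
    # Split into codons, ignoring any incomplete trailing codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    # exp(mean(log w)) equals (product of w) ** (1 / L).
    return math.exp(sum(logs) / len(logs))
```

With the made-up weights `{"GCT": 1.0, "GCC": 0.5, "AAG": 0.25}`, the sequence `GCTGCCAAG` scores (1.0 × 0.5 × 0.25)^(1/3) = 0.5. Working in log space avoids numerical underflow for long genes.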
Certain codons will appear multiple times in the gene. Hence, we can rewrite the equation to run over codons rather than over positions, and use counts rather than frequencies. This makes the dependence on the actual gene clearer. The more usual form is:

$$\mathrm{CAI}(g) = \left( \prod_{c} w_c^{\,o_c(g)} \right)^{1/L}$$

where o_c(g) is the number of occurrences of codon c in gene g, so that the counts sum to L.

The effective number of codons (ENc) estimates the number of different codons effectively used in a sequence. For the standard genetic code, ENc ranges from 20 (where only one codon is used per amino acid) to 61 (where all possible synonymous codons are used with equal frequency). ENc measures bias toward the use of a smaller subset of codons, away from equal use of synonymous codons. For example, as mentioned above, highly expressed genes tend to use fewer codons because of selection. The underlying idea of ENc is similar to the concept of zygosity in population genetics, which refers to the similarity of a gene between two organisms. ENc was calculated as per Wright (1990). The synonymous codon usage order (SCUO) of each gene was computed as per Wan et al. (2004). GC3s is the frequency of (G+C) at the third codon position, and A3s, T3s, G3s and C3s are the frequencies of A, T, G and C at that position (Gupta and Ghosh, 2001).

A series of scripts were written in Perl and run under Windows to estimate the above-mentioned genetic parameters.
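To make the ENc range of 20 to 61 concrete, the following Python sketch follows the general scheme of Wright (1990): a homozygosity value F is estimated for each degenerate amino acid, averaged within each degeneracy class, and combined as 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. It is a hedged illustration rather than the scripts used in the study. The names `enc`, `CODE` and `SYNONYMS`, the handling of amino acids observed fewer than twice, and the cap at 61 are assumptions of this sketch.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = dict(zip(CODONS, AMINO))
SYNONYMS = defaultdict(list)
for codon, aa in CODE.items():
    SYNONYMS[aa].append(codon)


def enc(seq):
    """Effective number of codons in the style of Wright (1990)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3]
                     for i in range(0, len(seq) - len(seq) % 3, 3))
    f_by_class = defaultdict(list)  # degeneracy -> homozygosity values
    for aa, codons in SYNONYMS.items():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        # Skip stops, Met/Trp, and amino acids seen fewer than twice.
        if aa == "*" or k == 1 or n < 2:
            continue
        s = sum((counts[c] / n) ** 2 for c in codons)
        f_by_class[k].append((n * s - 1) / (n - 1))
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    # If Ile (the only 3-fold amino acid) is absent, average F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    total = 2.0  # Met and Trp contribute one codon each
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        f = mean_f.get(k)
        if f is None:
            raise ValueError("sequence too short to estimate ENc")
        if f <= 0:
            return 61.0  # no detectable bias in this class
        total += n_amino_acids / f
    return min(total, 61.0)
```

A sequence that always uses a single codon per amino acid gives F = 1 in every class and therefore ENc = 20, the lower bound quoted above.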
The correlations between gene expressivity and all the above-mentioned parameters were measured to identify the genetic factors playing a major role in the genes of C. sinensis and B. campestris. All codon quantifications were performed using the Anaconda software (Moura et al., 2007). The residual values of each codon pair were also quantified from the coding sequences of each plant species by the Anaconda program. The occurrence frequency of each codon for a particular amino acid was also calculated and compared with the expressivity values to identify

RESULTS AND DISCUSSION
The present study was carried out to assess the codon usage pattern and gene expressivity for the genes of C. sinensis and B. campestris. In numerous microorganisms, intragenomic diversity in codon usage among genes has been reported. The genes selected for the present study from the two plants, with their accession numbers and the overall AT and GC%, RCBS, CAI, ENc, SCUO, GC1, GC2 and GC3, are given in the supplementary file. It was found that the codons of C. sinensis and B. campestris are rich in A and/or T. In contrast, in Homo sapiens, codons ending in G and/or C have been shown to predominate across the whole coding region.
Owing to differences in mutational bias, the GC percentage varies greatly among species, even among species within the same order. To determine whether GC bias in C. sinensis and B. campestris is associated with codon bias, the nondirectional codon bias measure, the effective number of codons (ENc), was used. The ENc of each gene and the GC% at the three codon positions (GC1s, GC2s and GC3s) were used to study the codon usage variation among the genes of B. campestris and C. sinensis (Figure 1). To quantify the level of diversity in synonymous codon usage among all the selected cds within the genomes of B. campestris and C. sinensis, the mean distance between pairs of cds was estimated. The mean distance was 0.07 with a median of 0.06 for C. sinensis, and 0.09 with a median of 0.07 for B. campestris. In previously studied genomes (Lafay et al., 2000; Grocock and Sharp, 2002; Wu et al., 2005), the mean values for Bacillus subtilis 168 (0.60), E. coli K12 MG1655 (0.47), Helicobacter pylori 26695 (0.38) and Haemophilus influenzae Rd KW20 (0.37) indicated that the mean values vary widely among species.

ENc is a widely accepted measure of codon usage bias that quantifies the degree of deviation from equal use of synonymous codons. It has been suggested that ENc may depend on the strength of the codon bias discrepancy (Fuglsang, 2004). The coefficient of determination (denoted R²) indicates how well the data points fit a straight line or curve. In the present analysis, the coefficient of determination between ENc and CAI was 0.37 for the genes of C. sinensis and 0.15 for those of B. campestris (Figure 2). This reveals that 37% of the variation in expressivity for the cds of C. sinensis and 15% for the cds of B. campestris could be explained by the ENc. The remaining variation in expressivity could be attributed to unknown factors, that is, genetic variation and/or other external factors.
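The interpretation of R² as the share of variation explained can be illustrated with a short calculation. The Python sketch below computes R² for a straight-line least-squares fit, which for a single predictor equals the squared Pearson correlation. The function name `r_squared` is an assumption of this sketch, and any values passed to it in examples are toy numbers, not the ENc or CAI data of this study.

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # For simple linear regression, R^2 is the squared correlation.
    return sxy ** 2 / (sxx * syy)
```

An R² of 0.37 therefore means that a straight line in one variable accounts for 37% of the variance of the other, leaving 63% unexplained.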
The synonymous codon usage order (SCUO) of the genes of each species was further analyzed. SCUO is simpler to compute than relative synonymous codon usage (RSCU) and is considered more robust for comparative analysis of codon usage. The SCUO analysis shows that a large share of the genes selected for the present study have high codon usage bias (43% of the cds in C. sinensis and 68% in B. campestris have SCUO≥0.5). This result suggests that these genes are associated with specific functions such as translational processes, ribosomes (mostly ribosomal protein genes), intracellular activities, transport, oxidation-reduction processes and others (Supplementary Tables 1 and 2).
The Anaconda software was used to determine, in a genome-wide manner, the adjusted residual values for the association of each codon pair in the two plant species. The residual values express the strength of the Chi-square test association between the two codons of each context (Moura et al., 2007). Furthermore, based on the average cluster patterns of the adjusted residual values of codon pair frequencies in C. sinensis and B. campestris, it was found that specific contexts were represented more often than others. The cluster patterns revealed distinctions as well as commonalities of codon context in each order. The maps constructed for the two plants were then compared in a single display to allow detection of overall patterns of codon context. A differential display map (DDM) was constructed by subtracting the two maps cell by cell and taking the absolute value of each difference (Figure 4).
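The construction of the differential display map can be sketched in a few lines. The Python function below is an illustration under stated assumptions, not part of the Anaconda software: each residual map is assumed to be a list of rows of equal shape, and the name `differential_display_map` is hypothetical.

```python
def differential_display_map(map_a, map_b):
    """Cell-by-cell absolute difference of two equally shaped residual maps."""
    same_shape = len(map_a) == len(map_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(map_a, map_b))
    if not same_shape:
        raise ValueError("maps must have identical dimensions")
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]
```

Cells close to zero mark codon pairs with similar residuals in the two species, whereas large values mark the contexts that differ most.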
It has been proposed that codons that are rarely used throughout the genome are rate-limiting factors for exogenous gene expression, a proposal supported by experimental verification (Garcia et al., 1986; Zhang et al., 2004). In C. sinensis and B. campestris, a 'rare codon' was defined by calculating the frequency of occurrence of all codons in the coding sequences (threshold selected: 10/1000) (Figure 5). Our analysis also showed that many of the rare codon pairs contain termination codons (Table 1). Based on the hypothesis that gene expressivity and codon composition are strongly correlated, the codon adaptation index has been defined to provide an intuitively meaningful measure of the extent of codon preference in a gene. We estimated the CAI and RCBS for each cds as measures of gene expressivity (supplementary material). CAI and RCBS were compared, and both showed a similar pattern. In agreement with previous studies (Ikemura, 1981, 1982; Moriyama and Powell, 1997), it was observed that RCBS decreased with the length of the encoded cds. Since the RCBS value depends on cds length, CAI was used as the central measure for expressivity analysis.
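The rare-codon criterion (fewer than 10 occurrences per 1000 codons) can be sketched as follows. This Python example is a hedged illustration: pooling the counts over all cds, using a strict "less than" comparison, writing codons with T rather than U, and excluding stop codons from the result are assumptions, and the function name `rare_codons` is hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = [a + b + c for a in "TCAG" for b in "TCAG" for c in "TCAG"]


def rare_codons(cds_list, threshold_per_thousand=10.0):
    """Sense codons whose pooled frequency is below the threshold."""
    counts = Counter()
    for seq in cds_list:
        seq = seq.upper().replace("U", "T")
        counts.update(seq[i:i + 3]
                      for i in range(0, len(seq) - len(seq) % 3, 3))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codons found")
    return sorted(
        c for c in ALL_CODONS
        if c not in STOP_CODONS
        and 1000.0 * counts[c] / total < threshold_per_thousand)
```

Because the frequencies are expressed per thousand codons, the result depends on the size and composition of the pooled gene set.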
Gene expression studies are essential for predicting the expression potential of a particular gene of interest. This will help in the discovery of new coding sequences for the highest protein expression in a cell, so that these man-made proteins can be synthesized and used for therapeutic purposes worldwide. Along these lines, it is important to find the codons that dictate the highest and the lowest expressivity of a cds within a particular expression system. The DDM analysis suggested that both plants show a similar codon context pattern to some extent. For confirmation, the patterns of synonymous codon usage of the two plants were compared. In agreement with our previous study on cereals (Chakraborty and Paul, 2015), both plants selected for the present study maintained a more or less similar pattern of synonymous codon usage (Figure 6). These results indicate that, throughout evolution, both plants have maintained a precise pattern of codon usage, possibly owing to natural selection, mutation or other external factors. The role of each codon in terms of expressivity within the two plants was then analyzed. The occurrence frequencies of the 59 codons (excluding stop codons and the codons for Met and Trp) were calculated for each cds of C. sinensis and B. campestris, and their expression level in the E. coli K12 strain was predicted. The occurrence frequency of each codon in the cds was then compared with the expressivity values. Using the criterion derived from statistical analysis (positive and negative codon bias relating to the gene expression level), the codons showing the highest and lowest expressivity in E. coli K12 were obtained (Table 2). The E. coli genome tRNA copy number data sets available in the genomic tRNA database (http://gtrnadb.ucsc.edu/) also support the results for the highest and lowest productive codons.

Table 1 data. Rare codons by species:
Brassica campestris: ACG, CCG, CGA, CGG, CUA, CCC, CGC, CGU, GCG, GUA, UCG, UGU
Camellia sinensis: ACG, CCG, CGA, CGG, CGC, CGU, GCG, GUA, UCG, UUA, UGU, UGC
To confirm the results of this analysis, we converted each original cds downloaded from the database into a highest expressive and a lowest expressive cds by replacing its codons with the highest and lowest expressive codons, respectively. The expressivity values for all three versions of each cds (original, highest and lowest) were calculated using CodonW. The results revealed that the highest as well as the lowest expressive coding sequences differed significantly in expression level from the original cds downloaded from the NCBI database.
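The codon replacement step can be sketched as a synonymous recoding of the cds. The Python example below is a minimal sketch and not the authors' Perl program: the function name `recode` and the `preferred` mapping (amino acid to chosen codon) are hypothetical, and the codons used in any example are placeholders rather than the codons reported in Table 2.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                AMINO))


def recode(seq, preferred):
    """Swap each codon for the preferred synonymous codon of its amino acid.

    preferred maps a one-letter amino acid to the codon to use for it.
    Amino acids missing from the mapping keep their original codon, and
    an incomplete trailing codon is dropped.
    """
    seq = seq.upper().replace("U", "T")
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        aa = CODE.get(codon)
        new_codon = preferred.get(aa, codon)
        # Guard: the replacement must encode the same amino acid.
        if CODE.get(new_codon) != aa:
            raise ValueError("non-synonymous replacement for " + codon)
        out.append(new_codon)
    return "".join(out)
```

For example, with the placeholder mapping `{"A": "GCG", "K": "AAA"}`, the sequence `ATGGCTAAGTAA` becomes `ATGGCGAAATAA`, which encodes the same protein. Running the function once with the highest-expressivity codons and once with the lowest would give the two redesigned versions of a cds.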

CONCLUSION
A novel method for identifying the codons showing the highest and the lowest expressivity was introduced, based on their frequency of occurrence. The occurrence frequency of every codon in each cds was compared with the expressivity values. Using the criterion derived from statistical analysis, the codons showing the highest and the lowest expressivity in E. coli K12 were obtained. The natural codons present in each cds were replaced by the codons predicted in this study to show the lowest and the highest expressivity, using a Perl program developed by the authors of this study. By comparing the expressivity values of the redesigned cds with those of the original cds downloaded from NCBI, we have established that our method is a general one, not connected with changes in gene length and overall nucleotide composition, with a little noise in measurements. In designing the highest and lowest expressive cds of the genes of C. sinensis and B. campestris for the E. coli K12 strain, the restriction sites in the bacterium were not considered.

Supplementary file. For all the cds selected for the present study from the two plants, this file provides the gene names and accession numbers along with the overall AT and GC percentages, RCBS, CAI, ENc, SCUO, GC1, GC2 and GC3. These data allow for the reconstruction of all the analyses.

Figure 1. Effective number of codons (ENc) distribution for the genes of B. campestris and C. sinensis. Among all the codon positions, GC% at the third codon position for C. sinensis and at the first codon position for B. campestris showed the strongest correlation with ENc (0.3 and 0.4, respectively).

Figure 2. ENc values plotted against the CAI for the cds of Camellia sinensis and Brassica campestris. The coefficient of determination (denoted R²) is 0.37 and 0.15 for the genes of Camellia sinensis and Brassica campestris, respectively, suggesting that 37% of the variation in expressivity for the cds of Camellia sinensis and 15% for the cds of Brassica campestris could be explained by the ENc.

Figure 3. Patterns of codon context variation in C. sinensis and B. campestris. The green colour represents the highest number of contexts and the red colour represents the lowest number of contexts. The 5′ codons are in rows and the 3′ codons in columns. The colour intensity corresponds to the residual value present in each cell of the contingency table.

Figure 4. Comparison of codon context pattern between Camellia sinensis and Brassica campestris. The differential display map was obtained by calculating the absolute value of the difference between the residuals of each map. The yellow cells indicate the highest context difference and the black cells represent pairs of codons that have similar residual values in the two species.

Figure 5. Rare codons for the cds of B. campestris and C. sinensis. A 'rare codon' was defined by calculating the frequency of occurrence of all codons in the coding sequences (threshold selected: 10/1000).

Figure 6. Comparison of the pattern of synonymous codon usage for C. sinensis and B. campestris. Synonymous codons are placed on the x-axis and their usage frequency on the y-axis. Both plants showed an almost identical pattern of synonymous codon usage, with little variation in usage frequency.

Table 1. Rare codons for the cds of Brassica campestris and Camellia sinensis.

Table 2. Codons for highest and lowest expressivity for the genes of C. sinensis and B. campestris.
Column headers: Amino acids; Codons showing lowest expressivity (Camellia sinensis, Brassica campestris); Codons showing highest expressivity (Camellia sinensis, Brassica campestris).