SARS-CoV-2 Molecular Clock and Zoonosis

SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), the agent of Coronavirus Disease 2019 (COVID-19) as named by the World Health Organization, is a single-stranded RNA (ssRNA) virus in the genus Betacoronavirus. Understanding the virus' molecular dynamics is necessary given global human-to-human transmission and rising mortality, which have prompted the race for a vaccine. As the viral genome expresses more human-biased mutations, COVID-19 continues to infect people in their millions, with the available detection kits limiting the number of cases detected in the population. Understanding the molecular basis of the virus through bioinformatics would speed up viral diagnostics, management and vaccine development. Currently, the scientific community offers varied perspectives on what is known of the virus at the cellular level. This knowledge is scattered and needs to be consolidated thematically to support further progress towards curbing the disease. The structure and function of the virus, its genome and the mutations revealed so far are critical to understanding SARS-CoV-2 and the disease. Here, we analyze and review published knowledge on the molecular characteristics and evolutionary relationships of the virus.


INTRODUCTION
Coronaviruses fall under the family Coronaviridae, which has four genera: Alpha (α), Beta (β), Delta and Gamma (Woo et al., 2007; Lefkowitz et al., 2018). They have a positive-sense single-stranded RNA genome 26.4-31.7 kb in length (the largest among RNA viruses). SARS-CoV-2 causes a severe respiratory illness, COVID-19, and has spread from human to human worldwide at an alarming rate with high fatalities (WHO, 2020).
Besides its high rate of transmission, SARS-CoV-2 has a longer incubation period before patients become symptomatic, while some patients remain asymptomatic; the disease leads to organ failure in some patients and a mortality rate of up to 3% (Huang et al., 2019; Chen et al., 2020; Guan et al., 2019).

Whole SARS-CoV-2 genome structure
Initial genome sequence data placed SARS-CoV-2 in the genus Betacoronavirus, subgenus Sarbecovirus, together with SARS-CoV, whereas MERS-CoV falls under the subgenus Merbecovirus of the same genus (Lu et al., 2020; Wu et al., 2020; Zhou et al., 2020; Zhu et al., 2020). Prior comparisons reveal 79% similarity between SARS-CoV-2 and SARS-CoV, though variation is greater at the gene level, with 72% base sequence homology for the spike (S) protein, which binds to host receptors. SARS-CoV-2 therefore has a structure similar to that of other betacoronaviruses, whose gene order is 5′-replicase ORF1ab-S-envelope (E)-membrane (M)-N-3′. The longest gene is the replicase gene ORF1ab, at 21 kb, with 16 non-structural proteins and further open reading frames (ORFs) downstream. This comparative genomic analysis was enhanced by a related virus sampled from a Rhinolophus affinis bat in Yunnan, China, in 2013.
That bat virus, RaTG13, has 96% similarity to SARS-CoV-2, with key genomic differences: SARS-CoV-2 has a furin (polybasic) cleavage site insertion between the S1 and S2 subunits of the S protein, which is similar to insertions in HCoV-HKU1 and avian influenza virus but is absent in other betacoronaviruses (Coutard et al., 2020). There is 85% similarity between the receptor binding domains (RBD) of RaTG13 and SARS-CoV-2, which share one sixth of the critical amino acid residues. Additionally, structural and nucleotide comparisons show that the RBD of SARS-CoV-2 is adapted to bind human angiotensin-converting enzyme 2 (ACE2) receptors, a phenomenon that is also true for SARS-CoV (Wrapp et al., 2020).
A full-length genome of SARS-CoV-2 was first generated by researchers led by Yongzhen Zhang in China. The genome has the following order: 5′-untranslated region (UTR) > replicase complex (ORF1ab) > structural proteins (Spike (S) > Envelope (E) > Membrane (M) > Nucleocapsid (N)) > 3′-UTR and nonstructural ORFs. The virus has 6 accessory proteins, whose genes are encoded as ORF3a, ORF6, ORF7a, ORF7b, and ORF8 (Oostra et al., 2007). Different coronaviruses have variable numbers of extra ORFs between the Nucleocapsid and Spike genes. The virus relies on the transcription regulatory sequence (TRS) motif situated at the 3′ end for RNA replication and recombination (Asim et al., 2020).
Further, Lu et al. (2020) report that SARS-CoV-2 has a complete genome size of about 29 kb (29825-29903 bp), which falls within the 26.4-31.7 kb range given by Woo et al. (2007) and Lefkowitz et al. (2018). The largest gene segment (the ORF1ab gene) is made up of two open reading frames (ORF1a and ORF1b). The ORF1ab locus in SARS-CoV-2 (251-21541 nt) differs slightly in start codon position from those in MERS-CoV (279-21514 nt) and SARS-CoV (265-21486 nt). Between the two ORFs there are putative pseudoknot structures preceded by a slippery sequence (UUUAAAC). Papain-like protease (PLpro) and 3C-like protease (3CLpro) cleave the ORF1ab gene product (Lu et al., 2020). This is consistent with reports by Zhou et al. (2020) and Lu et al. (2020) that the ORF1ab gene of the SARS-CoV-2 genome has 15-16 non-structural proteins (nsp) encoded at the consensus cleavage sites. Lu et al. (2020) further identify nsp12 and nsp13 as encoding the RNA-dependent RNA polymerase (RdRP) and the helicase protein, respectively. A single point mutation has additionally been detected in the slippery sequence of SARS-CoV-2. Moreover, downstream of ORF1ab (upstream of the S gene) lies the haemagglutinin esterase (HE) gene in human-infecting coronaviruses. Coronaviruses have three surface glycoproteins: spike (S), envelope (E) and membrane (M) (Lu et al., 2020).
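The coordinates and the slippery sequence quoted above lend themselves to a quick check. The following is a minimal sketch, assuming 1-based inclusive coordinates; the function names and the toy sequence are illustrative assumptions, not part of any cited pipeline.

```python
# ORF1ab coordinates as quoted in the text (assumed 1-based, inclusive).
ORF1AB_COORDS = {
    "SARS-CoV-2": (251, 21541),
    "MERS-CoV": (279, 21514),
    "SARS-CoV": (265, 21486),
}

# Slippery sequence reported upstream of the putative pseudoknot.
SLIPPERY = "UUUAAAC"


def span_length(start, end):
    """Length in nucleotides of a 1-based inclusive interval."""
    return end - start + 1


def find_slippery_sites(seq, motif=SLIPPERY):
    """Return 1-based start positions of the motif in an RNA or cDNA sequence."""
    rna = seq.upper().replace("T", "U")  # accept cDNA input as well as RNA
    hits = []
    pos = rna.find(motif)
    while pos != -1:
        hits.append(pos + 1)              # convert 0-based index to 1-based
        pos = rna.find(motif, pos + 1)    # allow overlapping matches
    return hits


if __name__ == "__main__":
    for virus, (start, end) in ORF1AB_COORDS.items():
        print(virus, span_length(start, end), "nt")
    # Toy sequence (hypothetical, not a real genome fragment).
    print(find_slippery_sites("ACGTTTAAACGGGTTTAAACG"))
```

On a real genome the same scan would be run on the sequence read from a FASTA file; a motif hit alone does not establish a functional frameshift site.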
Coronaviruses have type 1 membrane glycoproteins (S proteins) that form the pronounced "spikes". The S protein is cleaved into two domains: the receptor-binding S1 domain (high identity with bat coronaviruses) and the cell membrane fusion S2 domain (low identity with bat coronaviruses). There is high similarity between the S-protein domains of SARS-CoV-2 and SARS-CoV, both of which utilize the angiotensin-converting enzyme 2 (ACE2) cellular receptor (Lu et al., 2020). However, key amino acid substitutions (at positions 501, 493, 439, 485 and 486) were detected in the SARS-CoV-2 S-protein domain at sites believed to be equally important in SARS-CoV (Lu et al., 2020). The E and M genes (for transmembrane proteins) are conserved in coronaviruses.
The nucleocapsid (N) protein dictates the structure. It interacts with the viral RNA genome for packaging within the virus. This assembly signal is well established in SARS-CoV. The 3a protein of SARS-CoV controls virus release through a transmembrane complex of channel proteins, apoptosis, Golgi fragmentation, and accumulation of intracellular vesicles. Similar to SARS-CoV and MERS-CoV, SARS-CoV-2 has other small open reading frames (ORF9, ORF13, ORF14, and ORF10) situated downstream of the N gene (Marra et al., 2003). The function of the N gene and small ORFs of SARS-CoV-2 is not yet known (Asim et al., 2020). A reference sequence for SARS-CoV-2 has been derived from the sequences of 63 isolated strains in order to promote viral detection, vaccine design, epidemic investigation, functional analysis, and drug efficacy evaluation. The size of the reference genome was 29870 bp (Changtai et al., 2020). Probe and PCR primer design for SARS-CoV-2 should be based on a reference sequence rather than on individual sample sequences, to avoid the false positives associated with the latter (Changtai et al., 2020).
ORF10 encodes a short peptide, 38 residues in length. Koyama et al. (2020) showed that ORF10 in SARS-CoV-2 has no comparable proteins in the NCBI repository. The study proposes adopting this unique protein in PCR diagnostics for rapid disease distinction (Koyama et al., 2020). Although similar studies report new annotations of ORF1ab deposited in NCBI, NSP6 is the main contrasting putative protein, hence the NSP proteins were withheld (Koyama et al., 2020). A further 12 variations were recorded in the NSP3 protein of ORF1ab, suggesting a link between NSP3 and coronavirus inception (Hurst et al., 2013), and the papain-like protease in NSP3 is considered important in COVID-19 infection (Niemeyer et al., 2018). Considering the high rate of mutation in RNA viruses, more mutations are expected to appear in the viral genome, and they will be important in tracking the spread of SARS-CoV-2 (Grubaugh et al., 2020).

Phylogenesis of SARS-CoV-2 virus
Earlier studies used the Pol or N gene as standard practice for phylogenetic tree construction, and on that basis SARS-CoV was suggested to be a gammacoronavirus (Marra et al., 2003; Rota et al., 2003). However, further analysis of the amino-terminal domain of the SARS-CoV spike protein found 19 of 20 cysteine residues spatially conserved with betacoronaviruses (Rota et al., 2003), whereas only five residues were conserved with the alpha- and gammacoronaviruses (Rota et al., 2003). Subsequent phylogenesis based on whole-genome analysis supported the conclusion that SARS-CoV is a betacoronavirus; consequently, whole-genome phylogenesis of SARS-CoV-2 based on the RdRP and spike genes revealed that it too falls under the genus Betacoronavirus (Eickmann, 2003). SARS-CoV-2 and Yunnan/RaTG13 share a common ancestry, though their spike genes differ in size (SARS-CoV-2: 3822 nt; RaTG13: 3809 nt). Similar studies report that SARS-CoV and MERS-CoV used intermediate hosts (civets and camels, respectively) to infect humans (Guan et al., 2003; Alagaili et al., 2014).
Forster et al. (2020) used phylogenetic networks to document how SARS-CoV-2 spread across the globe, based on three central variants distinguished by amino acid changes and named A, B, and C. Types A and C were found in Europe and America, while type B was found in East Asia (Figure 1). Typically, the coronavirus evolution rate is close to 10⁻⁴ substitutions per site per year, with mutations possible in each replication cycle (Su et al., 2016). Homology at the amino acid and nucleotide levels in comparison to the reference sequence is 99.99% (Lu et al., 2019; Chan et al., 2019). The estimated time to the most recent common ancestor (TMRCA) and the evolution rate for SARS-CoV-2 show high similarity across various clock models and coalescent trees before being subjected to the tip-dating and constrained-dating techniques, but are highly distinct across coalescent tree priors. Bayesian analyses with the tip-dating technique, a strict clock and a constant-size coalescent tree prior demonstrated that the SARS-CoV-2 evolution rate is 1.24 × 10⁻³ substitutions per site per year (Li et al., 2020a, b, c), confirming the evolution rate in similar studies (Lu et al., 2020; Chan et al., 2019).
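To make the quoted rate concrete, the sketch below applies the standard strict-clock relation, in which expected substitutions equal rate × sites × time. It assumes a genome length of 29903 nt (the upper value quoted from Lu et al., 2020); the function names are illustrative, and the divergence formula assumes two lineages evolving independently at the same constant rate.

```python
def expected_substitutions(rate_per_site_per_year, genome_length, years):
    """Expected substitutions along one lineage under a strict clock."""
    return rate_per_site_per_year * genome_length * years


def years_since_divergence(n_differences, rate_per_site_per_year, genome_length):
    """Rough divergence time for two sampled genomes.

    Differences accumulate along both lineages, hence the factor of 2.
    Assumes a strict clock and ignores multiple hits at the same site.
    """
    return n_differences / (2 * rate_per_site_per_year * genome_length)


if __name__ == "__main__":
    RATE = 1.24e-3   # substitutions per site per year (Li et al., 2020a, b, c)
    LENGTH = 29903   # nt; assumed genome length taken from the quoted range
    per_year = expected_substitutions(RATE, LENGTH, 1.0)
    print(f"Expected substitutions per genome per year: {per_year:.2f}")
```

With these inputs the expectation is roughly 37 substitutions per genome per year; using the 10⁻⁴ rate instead would give about a tenth of that, which shows how sensitive dating is to the assumed rate.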

METHODOLOGY
We downloaded 55 spike protein sequences from the NCBI repository (https://www.ncbi.nlm.nih.gov). The sequences were aligned in BioEdit (Hall, 1999) using MUSCLE. The aligned file was exported to MEGA X (Kumar et al., 2018) and converted to MEGA file format, and we constructed a phylogenetic tree using the Neighbor-Joining algorithm (Saitou and Nei, 1987).
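For readers unfamiliar with the tree-building step, the following is a minimal pure-Python sketch of the Neighbor-Joining procedure of Saitou and Nei (1987). It is not the MEGA X implementation used in this study; the function name, the Newick formatting and the five-taxon distance matrix in the usage example are illustrative assumptions.

```python
def neighbor_joining(names, matrix):
    """Build an unrooted Neighbor-Joining tree and return it as a Newick string.

    names  : list of at least 3 taxon labels
    matrix : symmetric square list of lists of pairwise distances
    """
    if len(names) < 3:
        raise ValueError("need at least three taxa")
    # Distance lookup keyed by node label; internal nodes use their Newick text.
    d = {a: {b: matrix[i][j] for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c in (a, b):
                continue
            dist = 0.5 * (d[a][c] + d[b][c] - d[a][b])
            d[new][c] = dist
            d[c][new] = dist
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Resolve the final three nodes around a central trifurcation.
    x, y, z = nodes
    len_x = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    len_y = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    len_z = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{len_x:.4f},{y}:{len_y:.4f},{z}:{len_z:.4f});"


if __name__ == "__main__":
    # Hypothetical additive distances for five taxa, for illustration only.
    taxa = ["A", "B", "C", "D", "E"]
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(neighbor_joining(taxa, dist))
```

The resulting Newick string can be opened in any tree viewer. Ties in the Q criterion are broken here by iteration order, which may differ from the choice made by other software without changing the unrooted topology for additive distances.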

RESULTS
The evolutionary history was inferred using the Neighbor-Joining method (Saitou and Nei, 1987). The optimal tree with the sum of branch length = 17.18547576 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (50 replicates) is shown next to the branches (Felsenstein, 1985). The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Poisson correction method (Zuckerkandl and Pauling, 1965) and are in the units of the number of amino acid substitutions per site. This analysis involved 55 amino acid sequences. All ambiguous positions were removed for each sequence pair (pairwise deletion option). There were a total of 1358 positions in the final dataset. Evolutionary analyses were conducted in MEGA X (Kumar et al., 2018).
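The Poisson-corrected distance with pairwise deletion described above can be sketched as follows. This is a simplified illustration, not MEGA X's code: the correction d = -ln(1 - p) is the standard Poisson formula, while treating '-', '?' and 'X' as the ambiguous symbols is an assumption made here, and the function name is hypothetical.

```python
import math

# Symbols treated as gaps or ambiguous residues (an assumption of this sketch).
AMBIGUOUS = set("-?X")


def poisson_distance(seq1, seq2):
    """Poisson-corrected amino acid distance between two aligned sequences.

    Sites where either sequence carries a gap or ambiguous symbol are
    skipped for this pair only (pairwise deletion).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in AMBIGUOUS or b in AMBIGUOUS:
            continue                      # pairwise deletion
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = diffs / compared                  # proportion of differing sites
    if p >= 1.0:
        return math.inf                   # correction is undefined at p = 1
    return -math.log(1.0 - p)             # substitutions per site


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical), one substitution in 8 usable sites.
    print(poisson_distance("MFVF-VLLPL", "MFIFLVLLPX"))
```

Computing this distance for every pair of aligned sequences yields the matrix that a distance-based method such as Neighbor-Joining takes as input.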
The tree had five clusters. Yunnan, Wuhan and USA SARS-CoV-2 coronaviruses clustered together in the first cluster together with other SARS-CoV viruses from the regions. BetaCoV viruses clustered in the second cluster and closer to other BetaCoV identities in the third and fourth clusters. MERS-CoV viruses clustered in the last group together with members from England, Egypt, UAE and Moscow. The SARS-CoV-2 cluster suggests a common ancestry with SARS-CoV, with some even closer to the Rousettus bat coronavirus in the first cluster (Figure 2).

DISCUSSION
As more partial and whole genome sequences of SARS-CoV-2 continue to be generated and deposited in the repository, analyses of this RNA virus will continue to encounter more mutations, as is evident in the studies of Forster et al. (2020) and Grubaugh et al. (2020). Analyses of the virus' evolution using varied tools agree on a common ancestry regardless of the type of sequence or the gene locus from which the parentage is drawn. Most importantly, the structure of the virus is now well defined, as recent studies agree on the arrangement of genes on the viral genome, their functions and the rate of viral evolution. However, more studies are needed on gene clusters with unknown functions (the N gene and small ORFs) to clarify their structural functions further. Unlike this study, which is limited to partial genomes, comparative studies using whole genomes and larger sample sizes are proposed.

CONCLUSION
As detection kits continue to fill the market, it is important that their suitability be judged on the most robust scientific criteria, so that they give consistent positive or negative results. Some studies suggest the use of a tailor-made reference genome (Changtai et al., 2020), while others base diagnosis on the initial genomes of prior viruses. Others, such as Koyama et al. (2020), suggest the use of unique proteins (ORF10) for PCR diagnosis. This comes as more countries express concern over flawed tests, where some kits give inconsistent results that stir up mistrust. As work proceeds on vaccine development and on tracking infection patterns, the use of modern technology for both diagnosis and treatment should not be an afterthought. Protocol and due process should be followed in these developments as we look forward to flattening the COVID-19 curve.