Journal of Virology, June 2005, p. 7570-7596, Vol. 79, No. 12
0022-538X/05/$08.00+0 doi:10.1128/JVI.79.12.7570-7596.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Samuel Karlin,1 and
Edward S. Mocarski2*
Department of Mathematics,1 Department of Microbiology & Immunology, Stanford University, Stanford, California 943052
Received 9 December 2004/ Accepted 25 February 2005
The annotated human CMV (HCMV) genome sequence has formed a basis for comparisons to other betaherpesviruses. Murine CMV (MCMV) (40) and rat CMV (RCMV) (45) retain obvious sequence homologs of about 80 HCMV ORFs, or roughly 50% of the annotated genes in these viruses. Non-CMV betaherpesviruses infecting humans, such as herpesvirus 6 (19) and herpesvirus 7 (36), as well as those infecting lower primates, such as herpesvirus tupaia (2), retain similar core sets of ORFs. Approximately 40 of these 80 betaherpesvirus-specific ORFs are shared with all mammalian and avian herpesviruses (13) and are considered to be herpesvirus common. Despite obvious levels of divergence in the betaherpesviruses, common biological characteristics have emerged from studies of viruses infecting laboratory animals, and these have helped us to define immune control by the host and immune escape by the virus and to accumulate a myriad of additional basic information on replication, pathogenesis, and latency (25, 28-30, 41).
Prediction of the protein-coding potential of genomes is by nature provisional. In particular, herpesviruses and other eukaryotic viruses have been difficult to annotate accurately using conventional criteria, as evidenced by the recognition of additional genes as well as the elimination of ORFs found to be spurious based on additional investigation. For example, evidence suggests that the commonly employed limitations of ORF length (≥100 codons) and maximum ORF overlap (<60%) lead to the exclusion of known CMV gene products, such as the multiply spliced immunomodulatory function, viral interleukin 10 (24, 26), and the 73-amino-acid herpesvirus-conserved smallest capsid protein (18). Similarly, recent efforts to identify structural proteins in MCMV have also resulted in several revisions to genome annotation (23). In addition to the small sizes of ORFs, biologically relevant events that may confound conventional annotation methods include posttranscriptional modification, mRNA splicing, alternate translation initiation sites, and stop codon suppression. Finally, automated annotation procedures may also be confounded due to unrecognized errors in underlying sequencing. Current limitations of analysis might be overcome by new approaches that are less restrictive and provide an extended list of candidate genes for experimental verification.
In the present study, we investigated the protein-coding potential of the MCMV and RCMV genomes, taking into account the conservation of ORFs and genome-specific sequence features. Analogously to the human and chimpanzee CMV genomes (14), MCMV and RCMV retain a remarkable level of evolutionary relatedness and similarity in both functional organization and arrangement of genes (40, 45). Our analysis of genome-specific sequence features will focus on translational "frame analysis" (5), exploiting the differential G+C distribution among codon base positions in genomes of high G+C content (see Materials and Methods and Fig. S1 in the supplemental material). To provide an objective means to evaluate the extent to which G+C content influences the translational frames and to reveal a potential coding region in any sequence, we also defined a new measure of gene compositional bias and a related measure of coding potential. Our approach makes no assumptions about the minimum length of coding sequences, although we focused on ORFs of ≥20 codons, and does not impose restrictions on the degree of overlap between putative protein-coding regions. This procedure represents a marked modification of standard methods and produces a substantial revision of the current annotations for the MCMV and RCMV genomes. Our analysis suggests that CMV genomes likely encode a greater number of overlapping genes than previously thought.
Homologies.
Similarity between ORF products of MCMV and RCMV was evaluated by the significant segment pair alignment (SSPA) program, and regions of similarity were identified by the multiple alignment program ITERALIGN (7). The alignment of viral genomes employed ORF products with lengths of ≥20 codons. The predicted products of all ORFs with lengths of ≥20 codons (60 nucleotides [nt]) were queried against a large nonredundant database of protein sequences using the BLASTP program (1).
Frame-specific G+C profiles (S-profiles). We characterized the G+C contents and distribution of genomic sequences of MCMV and RCMV by three measures of frame-specific G+C content (5). The G+C content of the genome was evaluated within a moving window of fixed length (201 nt or 102 nt) with respect to every third nucleotide of the genome. First, genome positions 1, 4, 7, and so on, up to the end of the genome sequence, were scanned, and then genome positions 2, 5, 8, and so on, were scanned, followed by genome positions 3, 6, 9, and so on (Fig. S1). With this procedure, variations in G+C contents along the genome were represented by three profiles, each representing a frame, referred to as "S-profiles." The relationships among S-profiles were used to assess the presence of protein-coding genes in genome regions of high G+C content (5), qualitatively by visual examination and quantitatively through the definitions of a bias in a frame-specific G+C distribution (S-bias) and of a related measure of coding potential (see below).
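The scanning procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function name `s_profiles`, the 0-based coordinates, and the choice to report the window center are all hypothetical, and the window is required to be a multiple of 3 so that each frame contributes the same number of positions (67 for a 201-nt window, 34 for a 102-nt window).

```python
def s_profiles(seq, window=201):
    """Return three lists of (center, gc_fraction), one list per frame.

    Frame k (0, 1, 2) covers 0-based genome positions k, k+3, k+6, ...
    Each value is the G+C fraction among the frame-k positions that fall
    inside a stretch of `window` nucleotides. `center` is the (approximately
    central) frame-k position of that stretch. Names and coordinate
    conventions are assumptions made for this sketch.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    seq = seq.upper()
    per_frame = window // 3      # frame-k positions inside one window
    half = per_frame // 2
    profiles = []
    for k in range(3):
        # 1 where the frame-k position is G or C, else 0
        flags = [1 if seq[i] in "GC" else 0 for i in range(k, len(seq), 3)]
        prof = []
        total = sum(flags[:per_frame])
        for j in range(len(flags) - per_frame + 1):
            if j > 0:
                # slide the window by one codon: add the new flag, drop the old one
                total += flags[j + per_frame - 1] - flags[j - 1]
            center = k + 3 * (j + half)
            prof.append((center, total / per_frame))
        profiles.append(prof)
    return profiles
```

Plotting the three lists against genome position would give curves analogous to the S-profiles, with large separations between the curves expected inside G+C-rich coding regions.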
S-bias.
For a potential coding region of G+C content $S$, we defined a measure of how the G+C contents at codon base positions 1, 2, and 3 ($S_1$, $S_2$, and $S_3$) compared to expectations (S-bias). Expectations $\hat{S}_1(S)$, $\hat{S}_2(S)$, and $\hat{S}_3(S)$ of $S_1$, $S_2$, and $S_3$ for a potential coding sequence of G+C content $S$ were defined by the linear regressions of $S_1$, $S_2$, and $S_3$ over $S$, measured in a set of 2,813 published herpesvirus genes (Fig. 1B to D). $S_1$, $S_2$, and $S_3$ values were normalized to these expectations by the differences $S_1^* = S_1 - \hat{S}_1(S)$, $S_2^* = S_2 - \hat{S}_2(S)$, and $S_3^* = S_3 - \hat{S}_3(S)$. The obvious relation $S_1 + S_2 + S_3 = 3S$ holds for each gene. Since also $\hat{S}_1(S) + \hat{S}_2(S) + \hat{S}_3(S) = 3S$, the normalized G+C content values project onto the plane $S_1^* + S_2^* + S_3^* = 0$, which can be represented in the two orthogonal dimensions $T_1$ and $T_2$, defined with a scaling factor $K$. The scaling factor $K$ was specified so that $\mathrm{Var}(T_1)$ was equal to $\mathrm{Var}(T_2)$ for the set of 2,813 published herpesvirus genes (see Results). The S-bias of a putative coding region of G+C contents $S$, $S_1$, $S_2$, and $S_3$ was defined as the magnitude of the corresponding vector $(T_1, T_2)$: $\text{S-bias}(S_1, S_2, S_3 \mid S) = (T_1^2 + T_2^2)^{1/2}$. With this definition, a sequence with a distribution of G+C nucleotides among codon base positions corresponding to expectations will have an S-bias of 0.0, independently of its overall G+C content. As the S-bias increases, the likelihood that an ORF codes for a protein decreases.
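The normalization and projection steps can be sketched as follows. Several things here are assumptions made purely for illustration: the regression coefficients passed in are hypothetical inputs (the fitted values behind Fig. 1B to D are not given in the text), the orthonormal basis chosen for the plane is one of many possible, and applying the scaling factor to the second coordinate is a guess at where $K$ enters. G+C contents are expressed in percent.

```python
import math

def codon_position_gc(orf):
    """Percent G+C overall and at codon positions 1, 2, 3 of an in-frame sequence."""
    orf = orf.upper()
    n_codons = len(orf) // 3          # a trailing partial codon is ignored
    if n_codons == 0:
        raise ValueError("need at least one full codon")
    pos = [100.0 * sum(orf[i] in "GC" for i in range(k, 3 * n_codons, 3)) / n_codons
           for k in range(3)]
    return sum(pos) / 3, pos[0], pos[1], pos[2]

def s_bias(s, s1, s2, s3, regressions, k=1.0):
    """Magnitude of the (T1, T2) vector for one candidate coding region.

    regressions: three (slope, intercept) pairs giving the expected S1, S2, S3
    as linear functions of S. These are hypothetical inputs for the sketch.
    """
    # Differences between observed and expected codon-position G+C contents
    d = [obs - (a * s + b) for obs, (a, b) in zip((s1, s2, s3), regressions)]
    # ASSUMED orthonormal basis of the plane d1 + d2 + d3 = 0, with the
    # scaling factor k applied to the second coordinate (illustrative choice)
    t1 = (d[0] - d[1]) / math.sqrt(2)
    t2 = k * (d[0] + d[1] - 2 * d[2]) / math.sqrt(6)
    return math.hypot(t1, t2)

# Made-up regression lines whose expectations sum to 3S (slopes sum to 3, intercepts to 0)
EXAMPLE_REGRESSIONS = [(0.8, 15.0), (0.5, 15.0), (1.7, -30.0)]
```

With these placeholder lines, a region whose codon-position G+C contents match the expectations exactly gets a bias of zero, and the bias grows as the three values move away from the regression lines.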
FIG. 1. Percent G+C contents in the first, second, and third codon positions of genes across herpesvirus species in relation to genome and gene G+C content. (A) Average G+C contents at the first (red circles), second (green circles), and third (blue circles) codon base positions in coding regions from 28 herpesvirus genomes in relation to the genome G+C content. The point corresponding to all coding regions from MCMV is labeled M, whereas M1 and M2 correspond to genes from the region of high G+C content of MCMV and from the remaining part of the genome, respectively (Fig. 2). Similarly, R, R1, and R2 indicate coding regions from the complete RCMV genome or from corresponding regions of high and low G+C contents (Fig. 2). Linear regressions through these points are shown as solid lines that are color coded to indicate the codon position. A similar set of regression analyses was carried out with 84 prokaryotic genomes for comparison, and results are shown as dashed lines (note that the dashed blue line almost precisely overlaps the solid blue line). (B-D) The G+C contents of 2,813 herpesvirus genes at the first (B), second (C), and third (D) codon base positions in relation to the overall G+C content of each gene. Genes annotated in the MCMV and RCMV genomes are shown in green and red, respectively.
The amino acid bias (aa-bias) of gene group G relative to gene group F was defined as
![]() |
Coding potentials.
Local coding potentials were evaluated from nucleotide composition as follows. For each of the six coding frames (three on the direct strand and three on the complementary strand), an S-bias was evaluated within a window of 102 nt. The probability distribution of S-biases in coding regions was determined from similar windows extracted from all annotated coding regions of 28 herpesvirus genomes. Corresponding probabilities were obtained for random distributions of $S_1$, $S_2$, and $S_3$ given G+C content $S$. In the case of random distributions, $S_1$, $S_2$, and $S_3$ values have the same expectation ($S$) and same distribution. The S-bias for random distribution ($\text{S-bias}_{\mathrm{rand}}$) was calculated by normalizing $S_1$, $S_2$, and $S_3$ values as follows: $S_1^* = S_1 - S$, $S_2^* = S_2 - S$, and $S_3^* = S_3 - S$, and biases were calculated directly from these normalized values. The conditional probability (coding potential) $P(\mathrm{COD}_i \mid F)$ of a sequence, $F$, to be coding in frame $i$ was evaluated as follows:
![]() |
GeneMark coding potentials (6) were also evaluated based on predictions obtained with the program GeneMarkS (4) as implemented at the website http://opal.biology.gatech.edu/GeneMark/genemarks.cgi.
The high overall G+C contents of the MCMV and RCMV genomes (58.7% and 61.0%, respectively) and the corresponding high contrasts in G+C usage at different codon positions (Fig. 1A) were expected to provide a means to reliably identify protein-coding regions. However, C and G bases were not evenly distributed across the two genomes (Fig. 2). In both MCMV and RCMV, the G+C contents were greatest (61.7% and 69.2%, respectively) in the large genomic segment containing the betaherpesvirus-conserved protein-coding regions. G+C contents were more varied and generally lower (54.3% and 47.7%, respectively) in the remaining genomic segments. These differences in G+C contents resulted in varied contrasts of G+C usage at each codon position for genes expressed in different genomic regions.
FIG. 2. Sequence G+C contents in the MCMV and RCMV genomes measured within moving windows with a size of 201 nt. The regions of high G+C contents conserved between the MCMV and RCMV genomes are shaded gray.
ORFs of ≥20 codons from the MCMV and RCMV genomes.
We identified a total of 5,541 MCMV and 4,741 RCMV ORFs with lengths of ≥20 codons (defined without regard to AUG codons, from stop codon to stop codon). All ORFs were analyzed in terms of S-bias, C-bias, and aa-bias (see Materials and Methods). The compositional biases of ORFs corresponding to previously annotated coding sequences were determined over the previously reported length (40, 45), and newly annotated ORFs were evaluated over the entire stop codon-to-stop codon distance as well as beginning at AUG codons when these were at least 60 nt upstream of a stop codon.
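A stop codon-to-stop codon ORF scan of this kind can be sketched as below. This is an illustrative sketch rather than the procedure actually used: the function names are hypothetical, the standard stop codons TAA, TAG, and TGA are assumed, coordinates are 0-based and half-open on the strand being scanned, and stretches that run into either end of the sequence without a flanking stop codon are reported as well, which is a choice made here and not something the text specifies.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def stop_to_stop_orfs(seq, min_codons=20):
    """Yield (strand, frame, start, end) for stop-free stretches of >= min_codons codons.

    No AUG is required. The closing stop codon is not included in the span.
    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i
                    start = i + 3
            # Stretch that reaches the end of the sequence without a closing stop
            end = start + (len(s) - start) // 3 * 3
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end
```

Because overlap between candidates is not restricted, the same region can yield ORFs in several frames and on both strands, which is why such a scan returns thousands of candidates for a genome of this size.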
S-biases.
We devised a scoring system to quantify biases in G+C contents at codon positions across a putative coding region (see Materials and Methods). The expected G+C contents at codon positions 1, 2, and 3 were determined from the regression lines over 2,813 annotated ORFs from 28 herpesvirus genomes (Fig. 1B to D). Normalized G+C values were transformed into the coordinate system T1 and T2 with a scaling factor (K) equal to 1.804 (see Materials and Methods). The distribution of S-biases of all MCMV and RCMV ORFs of ≥20 codons is shown by the black lines in Fig. 3A and B, where we chose the starting position associated with the lowest bias for each newly annotated ORF. The distribution of all such ORFs with G+C contents of >50% (4,348 ORFs from MCMV and 3,095 ORFs from RCMV) is shown by the gray lines. The distribution of biases among previously annotated ORFs is also shown for comparison. A large proportion (46% in MCMV and 51% in RCMV) of all previously nonannotated ORFs had a low S-bias typical of coding regions (lower than the threshold corresponding to 95% of the annotated genes). Virtually all ORFs with high S-biases (>40) had high G+C contents (>50%), as expected from the great asymmetries in frame-specific G+C usage that distinguish G+C-rich coding from noncoding sequences. However, 34% of the ORFs with high G+C contents showed low S-biases. We also evaluated the biases in codon usage and amino acid usage for all ORFs (see Materials and Methods). We computed all biases relative to the average frequencies observed either among previously annotated genes encoded in the region of high G+C content or among all other annotated genes of the respective genome. For each ORF we selected the smaller of the two biases. The distributions of C-biases and aa-biases among ORFs of MCMV and RCMV are shown in Fig. 3C to F. The C-biases and aa-biases of previously annotated ORFs were low compared to those of the ORF sets analyzed here, although, as for S-biases, a large number of these ORFs have C-biases (18% in MCMV and 32% in RCMV) and aa-biases (19% in MCMV and 33% in RCMV) within the 95th percentile range of the corresponding annotated genes.
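The comparison against "the threshold corresponding to 95% of the annotated genes" amounts to a percentile cutoff. A minimal sketch, assuming the cutoff is the smallest observed bias that is at least as large as 95% of the annotated-gene biases (the exact percentile convention is not stated in the text, and both function names are hypothetical):

```python
def percentile_threshold(annotated_biases, percent=95):
    """Smallest observed value v such that at least `percent`% of the values are <= v."""
    ordered = sorted(annotated_biases)
    # Ceiling of percent * n / 100, computed with integers to avoid rounding surprises
    rank = -(-percent * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]

def fraction_within(candidate_biases, threshold):
    """Fraction of candidate ORFs whose bias does not exceed the threshold."""
    return sum(v <= threshold for v in candidate_biases) / len(candidate_biases)
```

Applied to S-biases, C-biases, or aa-biases in turn, `fraction_within` would give the kind of proportions quoted above for nonannotated ORFs that fall in the range typical of annotated genes.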
FIG. 3. Distribution of compositional biases of ORFs of ≥20 codons from the MCMV and RCMV genomes. The distributions among all ORFs are shown as black lines, and the distributions among ORFs of ≥20 codons and with G+C contents of >50% are shown as gray lines. The distributions among published genes are shaded black. See the text for definitions of S-bias, C-bias, and aa-bias.
Conservation between MCMV and RCMV genomes.
We searched for similarities between proteins potentially encoded by ORFs with lengths of ≥20 codons from MCMV and RCMV using the computer protocol SSPA (7). Pairwise comparisons between 5,541 MCMV and 4,734 RCMV ORFs resulted in 73,330 pairs (0.28%) that exhibited statistically significant similarity. As expected, extended similarities (≥50% SSPA similarity) were distributed along the two viral genomes in a collinear fashion. We then applied the ITERALIGN multiple sequence alignment program (7) to identify all ungapped blocks of aligned positions with lengths of >10 codons. To select the most-reliable regions of homology among all blocks, we constructed a pairwise alignment of the MCMV and RCMV genomes, starting from the longest blocks and progressively adding shorter blocks. Blocks that were not collinear with the partial alignment obtained from the longer blocks were excluded. The resulting genome alignment, shown in Fig. 4, involved 107,739 positions, covering about 47% of each genome sequence. Of these, 94,261 (87.5%) coincided with alignments between amino acids of annotated proteins and were plotted in black. A remarkable 13,478 positions (12.5%), plotted in red, corresponded to alignments between ORF pairs involving at least one ORF not appearing in the original genome annotations.
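The longest-first construction of a collinear alignment can be sketched as a simple greedy filter. This is only one plausible reading of the procedure: the block representation `(pos_a, pos_b, length)` is hypothetical, blocks are assumed to be ungapped and in the same orientation in both genomes, and a block is treated as collinear only if it lies entirely before or entirely after every block already kept, in both genomes.

```python
def collinear_blocks(blocks):
    """Greedy longest-first selection of mutually collinear ungapped blocks.

    blocks: iterable of (pos_a, pos_b, length) with start positions in
    genome A and genome B (assumed representation for this sketch).
    Returns the kept blocks sorted by position in genome A.
    """
    kept = []
    for a, b, n in sorted(blocks, key=lambda blk: -blk[2]):
        consistent = True
        for ka, kb, kn in kept:
            before = a + n <= ka and b + n <= kb      # upstream in both genomes
            after = a >= ka + kn and b >= kb + kn     # downstream in both genomes
            if not (before or after):
                consistent = False
                break
        if consistent:
            kept.append((a, b, n))
    return sorted(kept)

def aligned_positions(kept):
    """Total number of aligned positions covered by the selected blocks."""
    return sum(n for _, _, n in kept)
```

Summing block lengths over the kept set corresponds to the count of aligned positions reported for the genome alignment, and the share attributable to nonannotated ORFs follows from partitioning that sum (for example, 13,478 of 107,739 positions is about 12.5%).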
FIG. 4. Homologous protein-coding positions of the MCMV and RCMV genomes as derived from significant similarities determined by SSPA (7) analysis of all potential protein products of ORFs at least 20 codons long. Red bars highlight segments corresponding to the 13,478 conserved positions that were not previously recognized in published genome annotations. K, thousand. Numbers on the x and y axes indicate positions.
TABLE 1. Newly identified ORF pairs conserved between MCMV and RCMV
BLASTP analyses.
We queried the products of all ORFs with lengths of ≥20 codons against the NCBI nonredundant database of 644,068 proteins from coding sequence translations of sequences in GenBank, the Protein Data Bank, Swiss-Prot, and PIR using BLASTP (1). Significant results (E-value, <0.001) from this analysis for ORFs in the MCMV and RCMV genomes not previously annotated are reported in supplemental Tables S3 and S4, respectively. Eleven of these matches (see the footnotes of Tables S3 and S4) involved sequences of low complexity and are likely spurious. Other matches confirmed ORFs M31b, M73.5e2, m120.1, r48.2, R73.5e2, and R102b, newly annotated based on SSPA analysis.
BLASTP analysis identified four additional candidate genes in the MCMV genome, one with similarity to a region of RCMV r5, one overlapping M57 and similar to the single-stranded DNA-binding protein of primate CMVs, one similar to a hypothetical protein of the rhesus macaque CMV, and one similar to RCMV r95.1. In RCMV, 16 ORFs showed interesting BLASTP matches (boldfaced in supplemental Table S4). Three ORFs showed respective similarities to the arabinogalactan protein of maize, to the regulatory protein E2 from human papillomavirus, and to BHLF1 from EBV. ORF r169.1 (overlapping r169) showed extensive similarity to ORF r171, located immediately downstream in the RCMV genome (see the supplemental material). A notable feature in the RCMV genome evidenced by the BLASTP analysis was the existence of multiple similarities between ORFs overlapping in different frames the published genes r121.1, r121.2, and r125. These similarities corresponded to multiple exact repetitions of long DNA elements (supplemental Tables S5 and S6) duplicated in different frames within the same overlapping ORF. The lack of relatedness of these DNA structures to any coding frame suggests that these genome regions may not code for proteins at all.
Three ORFs of RCMV, newly named r153e2, r153e3, and r153e4, showed significant similarity to a lectin-like glycoprotein first identified in the English isolate of RCMV (46), where the protein is encoded by five exons. Similarity analysis of this protein against our collection of peptides suggested that a homologous lectin-like protein is also encoded in the RCMV Maastricht genome, within the region including positions 217034 to 217816. By the alignment of the putative products of these ORFs to the protein identified in the English isolate (Fig. 5) and the identification of putative donor and acceptor sites in the RCMV genome, we suggest that this protein is encoded in RCMV Maastricht but employs four exons and has a total length of 186 aa.
FIG. 5. Alignment of a lectin-like glycoprotein identified in RCMV English (46) with ORFs from RCMV Maastricht (45). ORFs from RCMV Maastricht are translated and numbered from stop codon to stop codon. Aligned positions are represented in capital letters. Nonaligned positions are represented in lowercase letters. The alignment suggests that in RCMV Maastricht, this protein is encoded by four exons (indicated in red) within ORF C217640-217816 (exon 1, C217651-217783), ORF C217366-217680 (exon 2, C217402-217580), ORF C217214-217411 (exon 3, C217221-217327), and ORF C217004-217162 (exon 4, C217004-217142). The N-terminal conservation of a similar highly hydrophobic region within the ORFs including exons 1 and 3 suggests possible alternative splicing.
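As a quick consistency check on the proposed four-exon structure, the exon coordinates given in the legend can be summed and converted to codons. The sketch assumes that the coordinates are inclusive at both ends; the variable names are illustrative only.

```python
# Exon coordinates (complementary strand) as listed in the legend of Fig. 5
exons = {
    "exon 1": (217651, 217783),
    "exon 2": (217402, 217580),
    "exon 3": (217221, 217327),
    "exon 4": (217004, 217142),
}

def span_length(start, end):
    """Length in nucleotides of an inclusive coordinate span (assumed convention)."""
    return end - start + 1

total_nt = sum(span_length(s, e) for s, e in exons.values())
codons, remainder = divmod(total_nt, 3)
print(total_nt, codons, remainder)  # 558 186 0
```

The four spans add up to 558 nt, an exact multiple of 3 that corresponds to 186 codons, in agreement with the total length of 186 aa suggested for the spliced product.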
20 codons from the two genomes (see the legend of Fig. 6).
FIG. 6. Frame-specific G+C profiles (S-profiles [5]) along the MCMV genome (horizontal axis) are represented by red, green, and blue curves within windows with a size of 201 nt (intense red, green, and blue curves, respectively) and with a size of 102 nt (light red, light green, and light blue curves). The overall G+C contents, measured within windows with a size of 201 nt, are represented by the black curve. All positions in the MCMV and RCMV genomes showing significant (P ≤ 0.01) S-profile contrasts that are not consistent with previously annotated genes are identified by shaded areas across the S-profile plot. All ORFs are represented as arrows pointing from the 5' to the 3' end and are colored according to the frame of the third position of their codons. ORFs previously annotated in published literature (40) are plotted as "annotated genes," with filled regions denoting conservation between MCMV and RCMV. "Conserved" indicates other regions conserved between the MCMV and RCMV genomes, evaluated by comparing similarities of all ORFs of ≥20 codons (see Materials and Methods and Results) from the direct strand of the genome (upper line) or from the complementary strand (lower line) and colored according to the frame of the third codon position. All conserved nonannotated ORFs with ungapped blocks of similarity longer than 10 aa that are consistent with the collinear arrangement of the two genomes (shown in Table 1) and ORFs with significant BLASTP (1) hits (from Table S3) are indicated by thin arrows. "Coding potential" indicates genome positions with a coding potential of >0.5, evaluated by S-biases (S) or by the GeneMark procedure (G) (6). The frames of the regions of high coding potential are color coded as for genes, and the genome strand is distinguished by representation on the upper (direct) or lower (complementary) lines. "Newly annotated genes" indicates coding regions newly predicted by our methods. Coding regions predicted with highest confidence are depicted with thick lines and shown in full color.
FIG. 7. Genome and published annotation of RCMV Maastricht (45). Explanatory material can be found in the legend to Fig. 6.
To provide an objective means to identify regions in the MCMV and RCMV genomes where S-profiles would predict the presence of protein-coding sequences, we first identified all positions (centered in windows of 102 nt) where frame-specific G+C contents differed by more than 35% (corresponding to a random probability of ≤0.01). We then excluded all regions where these contrasts could be explained by the presence of previously annotated ORFs. The remaining regions of high frame-specific G+C contrasts, shown as shaded blocks in Fig. 6 and 7, suggest the existence of expressed genes.
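The first step, flagging positions with a large frame-specific G+C contrast, can be sketched as follows. The sketch assumes that "differed by more than 35%" means that the highest and lowest of the three frame-specific G+C contents in a window differ by more than 35 percentage points; the function names and the 0-based window-center convention are hypothetical. The subsequent exclusion of regions explained by annotated ORFs is not shown.

```python
def frame_gc_percent(window_seq):
    """Percent G+C at every third position of a window, for the three frames."""
    window_seq = window_seq.upper()
    values = []
    for k in range(3):
        bases = window_seq[k::3]
        values.append(100.0 * sum(b in "GC" for b in bases) / len(bases))
    return values

def high_contrast_centers(seq, window=102, threshold=35.0):
    """0-based centers of windows whose frame-specific G+C contents differ by > threshold.

    The contrast is taken as max minus min over the three frames
    (an assumed reading of the criterion in the text).
    """
    hits = []
    for start in range(len(seq) - window + 1):
        gc = frame_gc_percent(seq[start:start + window])
        if max(gc) - min(gc) > threshold:
            hits.append(start + window // 2)
    return hits
```

Recomputing every window from scratch keeps the sketch short; a running count per frame would be the natural optimization for a genome of roughly 230 kb.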
Newly annotated ORFs. All ORFs whose expression was consistent with the observed high contrasts in S-profiles (>35%) were shown among the newly annotated ORFs in Fig. 6 and 7 and were listed in Table 2 (MCMV) and Table 3 (RCMV). In these tables we have indicated the genome positions of the annotated coding sequences, their G+C contents, and the newly assigned name for each ORF, and we have also indicated the published genes that overlapped each newly annotated ORF. For each ORF we have also identified the presence of a putative translation initiation codon, low compositional biases, conservation, and overlap to sequence regions of high coding potential measured by S-biases or measured by the GeneMark (6) procedure, as well as weak conservation or overlap only to short regions of high coding potential. The presence of an AUG codon did not apply when ORFs were interpreted as internal or last exons or as 3' frameshift extensions of a sequence in a different frame. Evidence from S-profiles was distinguished as extending through the full length of the ORF or as partially covering the length of the ORF. We included 33 ORFs in Tables 2 and 3 that were identified by SSPA and/or BLASTP similarity and had been retained after scrutiny through S-profile analysis. ORFs that were most reliably predicted as coding sequences (by the extension and strength of the S-profile signal and/or by strong conservation) are indicated. ORFs supported by conservation (through SSPA and/or BLASTP analysis) and by S-profiles are also indicated.
TABLE 2. Characteristics of newly annotated ORFs in MCMV
TABLE 3. Characteristics of newly annotated ORFs in RCMV
S-profiles of ORFs identified by SSPA similarity. All annotated ORFs identified by SSPA similarity retained in Tables 2 and 3 are marked. Among the potential coding regions identified by SSPA similarity analysis, 2 newly annotated ORFs from MCMV (M31b and M73.5e2) and 10 newly annotated ORFs from RCMV (R23a, r25.3b, r38.5, r48.2, R73.5e2, R98a, R102b, r115.1, r153e3, and r153e1) were also confirmed by strong contrasts in S-profiles. S-profiles also distinguished the most likely reading frame within some of the groups of overlapping MCMV and RCMV ORFs with similarity to one another (Table 1). Within two of these groups, S-profiles favored as coding regions RCMV ORF r48.2, homologous to published ORF m48.2, and ORF r124.1, homologous to published ORF m124. In a third group of ORFs overlapping M116 and R116, S-profiles identified in MCMV an ORF borne on the direct strand (m116.1), whereas RCMV conservation and S-profiles favored an ORF on the complementary strand (r115.1). Seven ORFs from MCMV (m44.1, m44.3, m45.2, m106.1, m106.3, m120.1, and m123.1) and seven ORFs from RCMV (r44.1, r44.3, r45.2, r106.1, r108.1, r124.2, and r133e2), identified by extended SSPA similarity, were not recognized by S-profiles. Of these, ORF m120.1 from MCMV and ORF r132e2 from RCMV showed particularly strong conservation. All 14 ORFs are listed in Tables 2 and 3 as potential protein-coding sequences.
S-profiles of ORFs identified by BLASTP similarity. All annotated ORFs identified by BLASTP analysis and retained in Tables 2 and 3 are indicated. Among four ORFs in MCMV and eight ORFs in RCMV that were identified as candidate genes by BLASTP analysis (Tables S3 and S4 in the supplemental material), three ORFs from RCMV, corresponding to two exons (r153e1 and r153e3) of the lectin-like gene and to a paralog (r169.1) of published ORF r171, were also supported by S-profiles. Exons 2 and 4 of r153 could not be confirmed by their S-profiles due to low G+C contents. Another ORF from RCMV (r58.1) was identified by BLASTP for its similarity to the regulatory protein E2 of human papillomavirus. Although S-profiles did not support the expression of this ORF over its entire length, the expression of the C-terminal portion (corresponding to the conserved region) was supported by extended GeneMark coding potentials and by a weak S-profile signal. Other BLASTP-identified coding regions were not supported by S-profiles, which strongly supported the authenticity of previously annotated ORFs in the same regions. These findings suggest that a reevaluation of other published proteins matching these ORFs (mostly hypothetical proteins from various herpesviruses [see supplemental Tables S3 and S4]) would be valuable.
Nonconserved ORFs identified by S-profile analysis. Eight ORFs from MCMV (m20b, m116.1, m122.5, m122.6, m143b, m154.1, m154.2, and m163.1) and three ORFs from RCMV (r2.2, R27a, and r41.1), although not or poorly conserved, corresponded to strong contrasts in S-profiles and to extended regions of high coding potential. Among these, ORF m20b has been experimentally verified as a frameshift 3' extension of m20 (23). We also interpreted ORF m143b as a frameshift 3' extension (or possibly a second exon) of m143, consistent with the lack of an AUG codon. ORF R27a was interpreted as a frameshift 5' extension of R27 and terminated at the corresponding approximate position (see the supplemental material). ORF r41.1 was similar to m41.1 mostly in a region coincident with a corresponding region of conservation with the overlapping published genes r41 and m41. However, strong contrasts in S-profiles and the presence of a conserved initiation codon (AUG) strongly suggest that this ORF (and its MCMV homolog, m41.1) is expressed.
S-profiles and overlapping ORFs. S-profiles yielded useful verification of the position of most previously annotated ORFs of high (>50%) G+C content (see above). However, among these ORFs we identified 99 sequences that were only partially matched by S-profiles. The S-profile evidence for these sequences was classified as "partial" in the "evidence" column of Tables S7 and S8, where it was also diagrammatically represented (e.g., for ORF m25.2 "++" indicates that over approximately the first two thirds of the annotated sequence, S-profiles conform to the expression of this ORF but not over the last third). Many partial S-profile inconsistencies observed in previously annotated genes coincided with the overlap of newly annotated sequences. In MCMV, 35 previously annotated genes overlapped 58 newly identified ORFs (Table S7), and in RCMV, 24 previously annotated genes overlapped 35 newly identified ORFs (Table S8). Irregular S-profiles were observed in these regions of overlap. In 36 of these situations, the identification of a new ORF fully explained the irregularity (supplemental Tables S7 and S8). Irregular regions could be partly explained in 18 other cases.
Alternative start of translation of previously annotated ORFs. The use of an alternative translation start site was suspected when consistent S-profiles failed to coincide with the most 5'-end-proximal AUG in annotated genes. Alternative initiation sites have already been characterized for some genes, such as MCMV m131, a short first exon of the mck gene, where the fourth AUG codon in the full-length ORF is where translation starts (27).
Using S-profile analysis, we identified 25 ORF candidates in the MCMV genome that may employ alternative translation start sites located upstream of the previously annotated site (Table S7). A different initiation codon downstream of the previous annotation was predicted in 19 cases (m9, M25, m25.1, M31, M34, M43, M51, M53, M55, M69, M71, M72, M73, M77, M102, m119.1, m131, and m139) (Table S7). An upstream start site was suggested by S-profiles for ORF m16, although no AUG codon was found in this region.
In RCMV we found 11 ORF candidates for alternative start sites (Table S8), 9 of these apparently starting downstream and 2 (r4 and r70.1) apparently starting upstream of previously suggested start sites. Six of the nine ORFs for which S-profiles suggested a downstream start of translation (R31, r41, r74, R91, R115, and r171.1) also encoded a putative initiation codon (AUG) in corresponding positions. For three ORFs (R77, R122e5, and r166) in which an alternative start site could not be predicted, overlap to other coding sequences was found to explain the observed S-profiles. In the case of R122e5, S-profiles were also consistent with an alternative exon structure. In the cases of r4 and r70.1, S-profiles strongly confirmed evidence from sequence conservation that the coding regions of these genes should be extended 5' of the original annotation (see also the section on SSPA similarity analysis and the supplemental material).
Other contributing evidence from S-profiles. S-profile inconsistencies were found in published ORFs of high G+C content from MCMV (M24, m25.2, M46, m48.2, M50, M69, M71, M87, M93, M112e1, m129, m131, m144, m159, m163, m165, and m170) and from RCMV (R43, r70.4, R77, r133, and r171) that could not be explained by overlapping sequences or alternative translation initiation. The most striking examples of these arrangements were found in genes M69, M87, and R77. Although the nature of these anomalous regions was unclear, in specific cases they might correspond to proteins of peculiar amino acid compositions or to the presence of introns.
Annotated genes not evidenced by S-profiles. Fourteen previously annotated genes from MCMV and 42 from RCMV could not be confirmed by S-profiles due to low G+C content (indicated in Tables S7 and S8 as not applicable). Virtually all of these genes belonged to the regions of low G+C content of the corresponding genomes. The only exception was gene m74, which had uncharacteristically low G+C content despite its location in the high-G+C region of MCMV. Among annotated genes of high G+C content, 39 genes from MCMV and 28 genes from RCMV did not show the expected S-profile contrasts even though their G+C contents were often >60%. In the case of m19, m48.1, m108, M116, m119.5, and m134 from MCMV and r2.1, r4.1, r25.2, r95.1, and r167 from RCMV (indicated as "contradicted" in Tables S7 and S8), S-profiles largely contradicted their expression, providing evidence for expression of overlapped ORFs in different frames. Fifty-six other annotated genes (classified by "no evidence" in Tables S7 and S8) had high G+C contents in all three codon positions. The atypical S-profiles underlying these genes may be a consequence of corresponding gene products of atypical amino acid composition. It must be noted that most of these genes were not conserved between MCMV and RCMV and that their expression and functionality have not been characterized as yet in any direct investigation.
We have avoided conventional criteria of minimum ORF length (100 codons) and maximum ORF overlap (<60%) to prevent the a priori exclusion of a class of genes that has previously been found only by direct experimental investigation. We also have not required an AUG codon to be present in potential coding regions. This allowed us to uncover small and overlapping ORFs, mRNA splicing, use of alternative translation initiation sequences in the coding complement of herpesvirus genomes, and several frameshifts within coding sequences. In fact, two of the frameshift extensions identified by our analysis in the MCMV genome, m20b and M31b, have been experimentally verified as correct 3'-terminal sequences of genes m20 and M31 (23). Alternative translation initiation signals are apparently used in a bona fide complete gene (R71, renamed r70.1) and in ORF r4, whose conservation and S-profile signals clearly extend 5' of their first AUG codon. Other examples of herpesvirus coding regions not initiated by an AUG codon have been reported (46). A coding sequence lacking an initiation codon may also relate to a potential multiexonic structure of the corresponding genes or appear as a consequence of sequencing errors.
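As a rough illustration of what relaxing these criteria means in practice, the following sketch enumerates every open stretch between stop codons in all six reading frames, with no AUG requirement, no overlap filter, and an optional length cutoff that defaults to a single codon. It is a simplified stand-in and not the procedure used in this study; the function names and coordinate convention are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}            # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_codons=1):
    """List open stretches between stop codons on both strands.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand being read. No start codon
    is required and overlapping stretches are all reported.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            open_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - open_start) // 3 >= min_codons:
                        results.append((strand, frame, open_start, i))
                    open_start = i + 3
            # Stretch that runs to the end of the sequence without a stop.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - open_start) // 3 >= min_codons:
                results.append((strand, frame, open_start, last))
    return results
```

Setting `min_codons=100` would approximate the conventional length cutoff, which makes it easy to compare how many candidates each setting retains before any further filtering by conservation or compositional evidence.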
Our analyses revealed intriguing anomalies in the G+C distribution (S-profiles) within annotated genes. Within annotated genes showing N-terminal anomalies in S-profiles, AUG codons were often located near the ends of these regions. This suggested the possibility of an alternative start of translation. An interesting example is M25 from MCMV, encoding tegument protein pM25. This protein presents extensive low-complexity regions 5' of several possible alternative starts of translation identified by our analysis. Intriguingly, in viral preparations, pM25 is found in forms of different molecular masses, identified as a true late 130-kDa peptide (included in the tegument) and two early 105-kDa and 95-kDa peptides (47) and later also as a 200-kDa, 52-kDa, or 48-kDa peptide (23). Peptides translated from the AUG codons corresponding to the region of conservation between MCMV and RCMV and to consistent S-profiles have predicted molecular masses of 57.2 kDa and 45.2 kDa. The sizes of these peptides are consistent with the smaller peptides isolated from viral preparations before replication (the annotated gene has a predicted molecular mass of 103 kDa). We suggest that some peptides from M25 may result from alternative transcription and translation start sites rather than from posttranslational proteolysis.
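The reasoning about alternative starts can be sketched as a simple scan for in-frame AUG codons within an annotated coding sequence, followed by a size estimate for each resulting product. The sketch below assumes a DNA-alphabet coding sequence that ends with a stop codon and uses a rule-of-thumb average of about 110 Da per residue; both the helper names and that average are assumptions for illustration, and the predicted masses quoted in the text were not derived this way.

```python
AVG_RESIDUE_MASS_DA = 110.0   # rough average residue mass; an assumption


def in_frame_starts(cds):
    """Return 0-based nucleotide offsets of ATG codons in frame with `cds`."""
    cds = cds.upper()
    return [i for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == "ATG"]


def approx_mass_kda(n_residues):
    """Approximate protein mass in kDa from its length in residues."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000.0


def alternative_products(cds):
    """For each in-frame ATG, return (offset, residues, approx_kDa).

    Assumes `cds` is in frame from its first base and ends with a stop
    codon, which is excluded from the residue count.
    """
    total_codons = len(cds) // 3 - 1
    products = []
    for offset in in_frame_starts(cds):
        residues = total_codons - offset // 3
        products.append((offset, residues, approx_mass_kda(residues)))
    return products
```

Comparing such estimates with observed band sizes can only suggest candidates, since posttranslational modification and proteolysis also change apparent molecular mass.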
Interesting anomalies in G+C distribution were evident also in gene M55 (glycoprotein B) from MCMV. This gene shows striking differences in S-profiles between its 5'-terminal, central, and 3'-terminal parts (Fig. 6). The functional form of this protein is generated by cleavage in the central part of the protein (38, 44). The 3' part of the gene, corresponding to the region of highest S-profile contrasts, is preceded by an AUG codon and two putative TATA box sequences (see the supplemental material). This suggests that the C-terminal part of glycoprotein B may also be alternatively translated from a shortened transcript.
Many anomalies in S-profiles involving the central or C-terminal parts of annotated sequences cannot be explained by alternative start codons. In many cases these coincide with parts of the protein that are not conserved and often include low-complexity sequences, as, for example, in the pairs of homologs M34/R34, M56/R56, M69/R69, M83/R83, and M105/R105. The hydrophilic amino acid composition and lack of sequence conservation of these regions suggest that they may function as flexible linkers between separate functional domains of a protein or that they may correspond to loops or, for terminal elements, to nonfunctional tails. In the case of MCMV m45.1, the entire sequence has an anomalous composition (see the supplemental material). It is possible that m45.1 evolved from a seemingly nonfunctional N-terminal sequence of M45, still present in the homologous sequence R45 from RCMV.
The herpesvirus capsid limits the size of the genome that can be packaged. From this perspective, it seems unlikely that nonfunctional regions of DNA can be retained in a genome where genes tend to be densely packed. While noncoding regions may be involved as control elements in transcription or DNA replication, we speculate that the presence of regions of weak selection in herpesvirus proteins may allow these viruses to encode overlapped genes to a greater extent than presently described. Frame analysis of G+C content suggests that the MCMV and RCMV genomes contain ORFs of high coding potential that overlap.
Annotation is a process of prediction and confirmation by methods that provide a working set of data for additional empirical experimental studies. We believe that there is a need to relax the criteria used in conventional annotation methods in the study of eukaryotic viruses, where overlapping genes and posttranscriptional regulation, such as mRNA splicing and the use of nonconventional translation signals, are relevant biological processes. The increased ease of current experimental techniques in verifying the expression of coding sequences makes striving for coverage, perhaps with a reduction in specificity, a reasonable approach to gene prediction. Our application of different methods of sequence analysis identified a plethora of candidate genes that are excluded by more conventional criteria of annotation, providing a more comprehensive picture of the coding potential of these genomes for experimental verification.
Supplemental material for this article may be found at http://jvi.asm.org/.
Present address: Clinical Research Unit #136, HS Hvidovre Hospital, 2650 Hvidovre, Denmark.