Journal of Virology, April 2003, p. 4326-4344, Vol. 77, No. 7
0022-538X/03/$08.00+0     DOI: 10.1128/JVI.77.7.4326-4344.2003
Copyright © 2003, American Society for Microbiology. All Rights Reserved.

In Silico Pattern-Based Analysis of the Human Cytomegalovirus Genome

Isidore Rigoutsos,1* Jiri Novotny,2 Tien Huynh,1 Stephen T. Chin-Bow,1 Laxmi Parida,1 Daniel Platt,1 David Coleman,3 and Thomas Shenk3

Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, New York 10598,1 Victor Chang Cardiac Research Institute, Darlinghurst, New South Wales 2010, Australia,2 Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544,3

Received 10 July 2002/ Accepted 23 December 2002


ABSTRACT
 
More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/).


INTRODUCTION
 
The advent of DNA sequencing technology is generating vast amounts of sequences that are deposited in public databases. The rate at which genomes can be sequenced has now outpaced the rate at which a sequence's function can be determined through wet-lab experimentation, thus leading to increasing demand for automated (in silico) approaches to the elucidation of protein function. As more and more protein sequences and complete genomes become available in the public domain, in silico protein annotation is emerging as an inexpensive and effective approach for dealing with the flood of genomic data.

Of the numerous approaches that have been proposed over the years, the determination of regions of similarity between a novel protein of unknown function and one or more database proteins with known annotation has been the method of choice. Such a determination allows one to predict the common region in the protein of unknown function as exhibiting the functional characteristics of the respective region from the annotated database protein through what is frequently called a "guilty-by-association" approach. These methods are also known as homology-based methods, and they have led to significant advances in protein annotation (2, 22, 36).

During the latter half of the 1990s, pattern-based approaches steadily gained ground as the methods of choice for solving various computational problems in molecular biology (28). One such algorithm is MUSCA, a multiple sequence alignment algorithm, which we described in an earlier study (23). MUSCA begins by using the Teiresias pattern discovery algorithm (25, 26) to identify patterns that are shared by k or more input sequences. During its second phase, MUSCA exploits the location of the discovered patterns to anchor and induce alignments of increasingly larger input fragments. Because of the manner in which it operates, MUSCA is uniquely suited to handling inputs in which one or more domains are shared among the sequences to process. In a parallel effort, we also described a pattern-based approach to the problem of protein annotation (30). The approach is centered on the Bio-Dictionary, an exhaustive collection of amino acid patterns (hereafter referred to as seqlets) that completely covers the natural sequence space of proteins defined by the currently available sequences. The Bio-Dictionary is computed by carrying out pattern discovery with the Teiresias algorithm (25, 26) on very large databases of biological sequences such as SwissProt/TrEMBL (4). The seqlets contained in the Bio-Dictionary can capture functional and structural signals that have been reused during evolution both within and across families of related proteins (27, 28, 29). This new method uses the seqlets contained in the Bio-Dictionary to exhaustively annotate a query protein by using the information that is available in a well-maintained database, such as SwissProt/TrEMBL, and employs a weighted, position-specific scoring scheme that is not affected by the overrepresentation of well-conserved proteins and protein fragments that exist in the public databases. As we showed elsewhere (30) and for several published genomes, this Bio-Dictionary-based approach matched the quality and sensitivity of the annotations that were obtained with semiautomated approaches while requiring only a very small investment of computational resources.

In an earlier study (21), we examined the annotation of human cytomegalovirus (HCMV; also known as human herpesvirus 5 [HHV-5]) by using ProCeryon (a program for fold recognition and protein structure analysis; ProCeryon Biosciences), a structure prediction program that is based on threading (16, 34). Each of the HHV-5 ORFs was threaded with the ProCeryon algorithm, and a structural and functional hypothesis was generated. As anticipated and due to the large number of membrane proteins coded for by this genome, the threading approach provided hypotheses for a little less than 50% of the coding regions. The desire to further push the annotation envelope for HHV-5 led us to the sequence-based work that we discuss here.

HHV-5 is a member of the betaherpesviruses, a subgroup of herpesviruses with common growth characteristics (1). Considered the prototypical betaherpesvirus, HHV-5 spreads to the majority of the population at an early age, causing asymptomatic infections in healthy individuals. However, it can produce life-threatening disease in immunosuppressed individuals and as a result of congenital infections (24).

The HHV-5 virion contains a linear, double-stranded DNA genome (~230 kbp) encased in an icosahedral capsid (8). The capsid is surrounded by a protein matrix and a lipid envelope with integral glycoproteins. Two major unique regions, denoted as long (UL) and short (US), can be identified in the viral genome and are bracketed by repeated domains. The AD169 strain of HHV-5 was sequenced in 1990 (8), and 208 ORFs were predicted as coding for proteins ≥100 amino acids in length. More recently, an insertion that modifies ORFs 42 and 43 was identified in the AD169 strain (10, 20), and analysis of a cDNA sequence has revealed that UL101 does not exist, whereas UL102 is modified (35). Finally, several additional ORFs were found in the Towne and Toledo strains of HHV-5 (7). The repeated ORFs include J1L/J1I/J1S, which are partially related; TRL1 through TRL13, which at a second location are labeled IRL1 through IRL13; TRL14, which shares an N-terminal region with its IRL14 counterpart; and IRS1/TRS1, which are half repeated and half unique. The unique ORFs are UL1 to UL154 and US1 to US36, with some ORFs receiving fractional designations such as UL21.5, UL48.5, and UL80.5.

We describe below the application of the Bio-Dictionary to the in silico annotation of the HHV-5 genome. We have generated and processed a "composite" genome that is the union of the originally reported genes from AD169, the three modified ORFs in AD169, as well as the genes from the Towne and Toledo strains. The method we used to annotate the composite genome is described in detail elsewhere (30), and an implementation of it is available online (http://cbcsrv.watson.ibm.com/Tpa.html). The functional hypotheses, as well as numerous features that we identified in these proteins are also available online (http://cbcsrv.watson.ibm.com/virus/); at this website, one can find summaries of each protein's functional annotation and information about the nature and location of posttranslational modifications, active sites, identifiable domains (e.g., transmembrane), and alignments with other proteins from the public databases, as well as detailed information on the similarity of each annotated amino acid sequence to archaeal, bacterial, eukaryotic, and viral sequences. For completeness, we have also included the results from our earlier, threading-based annotation of HHV-5 (21). Finally, we used MUSCA to reanalyze the originally proposed HHV-5 ORF families and we present our conclusions here.


MATERIALS AND METHODS
 
For the analysis outlined in the present study, we used two computational approaches, both of which have previously been described. We briefly outline each of these two approaches below.

Bio-Dictionary-based automated protein annotation. The computational tool that we used for the annotation component of this work relies on the Bio-Dictionary: the latter was originally created by using the Teiresias pattern discovery algorithm (25, 26) to process the GenPept database as a whole (29); this computation has since been repeated at regular intervals on the increasingly larger installments of the SwissProt/TrEMBL database (4). The Bio-Dictionary is a very large collection of sequence patterns, referred to as seqlets. Seqlets are strings of literals interspersed with zero or more wild cards: a unique amino acid, or a small set of permitted amino acids, can occupy the location of each literal; the positions corresponding to the wild cards indicate locations that can be occupied by any amino acid. For example, the seqlet [KR].K[ILMV][AG]L describes all hexapeptides that begin with either a lysine or an arginine; followed by any one of the 20 amino acids; followed by a lysine; followed by an isoleucine, leucine, methionine, or valine; followed by alanine or glycine; and finally ending with a leucine. The Bio-Dictionary seqlets capture functional and structural signals that extend beyond protein family boundaries, which is not an unexpected result considering the manner in which the collection is produced.
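
The seqlet notation in the example above can be illustrated with a short Python sketch. It is not part of the Bio-Dictionary software; the function names and the test peptide are hypothetical. The sketch relies on the observation that a seqlet written with bracketed residue sets and "." wild cards is already a valid regular expression.

```python
import re

def seqlet_to_regex(seqlet):
    """Validate a seqlet and compile it as a regular expression.

    Assumed notation: an uppercase letter is a literal residue, '[..]' is a
    set of permitted residues, and '.' is a wild card (any amino acid).
    """
    if not re.fullmatch(r"(?:[A-Z.]|\[[A-Z]+\])+", seqlet):
        raise ValueError(f"not a well-formed seqlet: {seqlet!r}")
    return re.compile(seqlet)

def find_instances(seqlet, protein):
    """Return (0-based start, matched peptide) for every instance of the seqlet."""
    pattern = seqlet_to_regex(seqlet)
    # A zero-width lookahead reports overlapping instances as well.
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(protein)]

# Hypothetical peptide used only to exercise the seqlet from the text.
hits = find_instances("[KR].K[ILMV][AG]L", "MAKQKLGLTTRSKVALKDKIAL")
print(hits)  # [(2, 'KQKLGL'), (10, 'RSKVAL'), (16, 'KDKIAL')]
```

Each reported hexapeptide satisfies the six position constraints listed in the text: K or R, any residue, K, one of I/L/M/V, A or G, and L.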

An additional property of this collection is that it nearly completely covers the currently known sequence space of natural proteins and can thus be used in lieu of the original processed sequence database to solve a gamut of problems, including gene finding (33) and protein annotation (30). For the purposes of protein annotation, each seqlet is augmented with additional information pertaining to functional, structural, or other properties of the seqlet's known instances in proteins that have been studied computationally and experimentally.

To annotate a previously uncharacterized protein, instances of all of the seqlets in the Bio-Dictionary are sought in the sequence under consideration: for seqlets that are present in the sequence, their respective meanings are used to label the part of the sequence corresponding to the seqlet's instance in a straightforward "guilty-by-association" approach. The meanings of overlapping seqlets are subsequently accumulated and coalesced into hypotheses about the function of the processed protein, the presence of various domains and active sites, the nature and location of posttranslational modifications, etc. Details on the computational aspects of the Bio-Dictionary-based protein annotation are given elsewhere by Rigoutsos et al. (30).
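
The labeling step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published method: the dictionary entries and labels are hypothetical, and the weighted, position-specific scoring of the actual approach (30) is omitted. The sketch only shows how instances of seqlets that carry the same meaning can be located and coalesced into labeled regions.

```python
import re
from collections import defaultdict

# Hypothetical toy dictionary (seqlet -> meaning); real Bio-Dictionary
# entries and their annotations are not reproduced here.
TOY_DICTIONARY = {
    "[KR].K[ILMV][AG]L": "hypothetical-domain",
    "G[ILV]..S": "hypothetical-domain",
    "S.V": "hypothetical-site",
}

def annotate(protein, dictionary):
    """Map each meaning to merged, 0-based, half-open intervals of the protein."""
    raw = defaultdict(list)
    for seqlet, meaning in dictionary.items():
        # Lookahead so that overlapping instances of one seqlet are all found.
        for m in re.finditer(f"(?=({seqlet}))", protein):
            raw[meaning].append((m.start(), m.start() + len(m.group(1))))
    merged = {}
    for meaning, spans in raw.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or adjacent: coalesce
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[meaning] = out
    return merged

print(annotate("MAKQKLGLTTSKVAL", TOY_DICTIONARY))
# {'hypothetical-domain': [(2, 11)], 'hypothetical-site': [(10, 13)]}
```

In this toy input, the first two seqlets match at residues 2-7 and 6-10 (0-based); because they share a meaning and overlap, they are reported as a single labeled region.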

MUSCA (pattern-based multiple sequence alignment). MUSCA is a two-phase algorithm for computing the multiple sequence alignment of a set of N sequences (23). During the first phase, MUSCA uses Teiresias to discover patterns (motifs) that are common among K or more of the input's N sequences. These motifs are used in the second phase to generate and report the multiple sequence alignment. In particular, the motifs are first mapped to vertices of a directed graph. If two motifs p_i and p_j do not occur simultaneously in any sequence, then there is no edge connecting the corresponding vertices of the graph. The vertices corresponding to p_i and p_j are connected by an edge with direction from p_i to p_j if p_i occurs before p_j in all of the sequences where they both appear. The label of an edge takes one of three values, depending on whether p_i and p_j are pairwise incompatible, have overlapping instances, or are pairwise compatible but do not overlap. Vertices that are joined by incompatible edges or participate in inconsistent cycles form the basic nonfeasible sets. After the vertices of the reduced graph are labeled with the help of a simple cost function, a greedy algorithm is used to obtain a solution to a weighted set-cover problem that essentially identifies the minimum number of motifs/vertices to be removed. The resulting graph is used to determine the blocks that involve overlapping feasible motifs. The final alignment is obtained by properly aligning the blocks and padding the existing gaps.
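
The graph construction of the second phase can be sketched in Python as follows. This is a minimal illustration, not the MUSCA implementation: the function name, the input layout, and the exact criteria for the three edge labels are assumptions, each motif is assumed to occur at most once per sequence, and the cost function and greedy set-cover step are not shown.

```python
from itertools import combinations

def build_motif_graph(occurrences, lengths):
    """Build labeled edges between motifs.

    occurrences: motif -> {sequence id: 0-based start of its instance}
    lengths:     motif -> motif length
    Returns {(source, target): label}, where label is "compatible",
    "overlapping", or "incompatible". Motifs that never co-occur get no edge.
    """
    edges = {}
    for a, b in combinations(sorted(occurrences), 2):
        shared = occurrences[a].keys() & occurrences[b].keys()
        if not shared:
            continue  # never present in the same sequence: no edge
        a_first = all(occurrences[a][s] < occurrences[b][s] for s in shared)
        b_first = all(occurrences[b][s] < occurrences[a][s] for s in shared)
        if not (a_first or b_first):
            # Assumed criterion: the relative order differs between sequences.
            edges[(a, b)] = "incompatible"
            continue
        src, dst = (a, b) if a_first else (b, a)
        overlaps = any(
            occurrences[src][s] + lengths[src] > occurrences[dst][s] for s in shared
        )
        edges[(src, dst)] = "overlapping" if overlaps else "compatible"
    return edges

# Hypothetical motif positions in three sequences s1, s2, and s3.
occ = {
    "p1": {"s1": 0, "s2": 3},
    "p2": {"s1": 5, "s2": 9},
    "p3": {"s1": 2, "s2": 20},
    "p4": {"s3": 1},
}
print(build_motif_graph(occ, {"p1": 4, "p2": 3, "p3": 4, "p4": 2}))
```

In this toy input, p1 precedes p2 everywhere without overlap, p1 precedes p3 but overlaps it in s1, p2 and p3 appear in opposite orders in s1 and s2, and p4 shares no sequence with any other motif.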

The alignments that MUSCA generates are independent of the order in which the input sequences are given. The algorithm is uniquely suited to processing inputs in which the sequences share domains that are present in only some of the sequences, or inputs that comprise two or more subsets with high conservation within each subset but low conservation across subsets.


RESULTS
 
In this section, we present a summary of the results that we obtained from processing the composite HHV-5 genome. Additional information and details for each individual sequence can be found at http://cbcsrv.watson.ibm.com/virus/. From this site, the user has the option of using either a graphical or a textual interface for accessing the annotation that is available for each annotated ORF. In each case, we generated a file with statements that succinctly describe our findings. Corroborating local and/or global alignments are given when relevant and/or available; also given for each annotated ORF are plots that show in graphical form the nature and location of phylogenetic-domain-specific fragments, discovered local or global similarities, domains of interest, sites of interest, etc. Only a small subset of the information that is available through the website is included and discussed below.

Benchmarking our approach. In a recent study (30), we discussed and tested the Bio-Dictionary-based protein annotation method in detail and with the help of many diverse input sequences. We demonstrated that this method provides substantial benefits versus traditional approaches in terms of objective annotation capability, and the reader is referred to that discussion for more information. It is important to stress that our approach makes use of a weighted, position-specific scoring scheme that is not affected by the overrepresentation of well-conserved proteins and protein fragments that exist in the public databases. A given feature that has been associated with a region of the query is assigned a normalized score between 0 and 100; the resulting figure is an estimate of the percentage of the total number of the region's distinct instances that have also been annotated as sharing the feature. Empirically, we determined that scores between 90 and 100 correspond to good, conservative results: for the HHV-5 annotation presented below, we only considered features whose confidence estimates fell in this interval of values.
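
The confidence filter can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the score is computed as a plain percentage of a region's distinct instances that carry the feature, whereas the actual method uses a weighted, position-specific scheme (30). The function names and the counts in the example are hypothetical.

```python
def feature_score(annotated, total):
    """Percentage (0-100) of a region's distinct instances that carry the feature."""
    if total <= 0 or not 0 <= annotated <= total:
        raise ValueError("need total > 0 and 0 <= annotated <= total")
    return 100.0 * annotated / total

def confident_features(counts, low=90.0, high=100.0):
    """Keep features whose score falls in [low, high].

    counts: feature name -> (annotated instances, total distinct instances)
    """
    scores = {name: feature_score(k, n) for name, (k, n) in counts.items()}
    return {name: s for name, s in scores.items() if low <= s <= high}

# Hypothetical counts, for illustration only.
print(confident_features({"feature_a": (47, 50), "feature_b": (30, 50)}))
# {'feature_a': 94.0}
```

With the 90-to-100 interval used for the HHV-5 annotation, only the first hypothetical feature would be retained.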

General findings. Table 1 summarizes our findings for each of the annotated ORFs of the composite genome. The ORFs are listed in the order in which they appear on the composite HHV-5 genome. In the first three columns, and for each ORF, we list the ORF's name, its accession number, and a functional hypothesis. The fourth column lists features of each predicted protein such as binding sites, posttranslationally modified sites, etc.


TABLE 1. List of brief annotations for the composite genome of HCMV

Several general observations can be made readily. As pointed out by Chee et al. (8), the importance of glycoproteins as surface antigens has generated interest in the identification and characterization of the members of this group. Chee et al. (8) predicted the presence of one or more glycosylation sites in proteins encoded by a total of 53 ORFs. Mocarski and Courcelle (19) revised this number to ca. 60. As seen in Table 1, our analysis predicts the presence of glycosylation sites in 125 of the annotated ORFs. Four of these ORFs (TRL12, IRL12, UL32, and UL132) contain O-linked glycosylation sites, and the remainder contain N-linked modifications. Of the 125 ORFs predicted to encode glycosylated proteins, the following have 10 or more potential glycosylation sites: TRL12/IRL12 (7 O-type and 15 N-type), UL1 (10 N-type, possibly 13), UL7 (11 N-type), UL18 (13 N-type), UL20 (12 N-type), UL37 (18 N-type), UL55 (19 N-type), UL74 (18 N-type), UL116 (14 N-type), and UL120 (10 N-type). UL32 has been proposed to contain a single O-type glycosylation site (19); however, our analysis indicates the presence of two such sites, at amino acid locations 921 and 952, respectively.

Our analysis predicts that as many as 144 proteins, approximately half of them glycoproteins, are probably integral membrane proteins. Also, for at least 49 of the analyzed proteins we find evidence for the presence of a signal peptide. These proteins are TRL2/IRL2, TRL10/IRL10, TRL11/IRL11, UL4, UL11, UL12, UL13, UL14, UL16, UL18, UL20, UL21, UL21.5, UL31, UL37, UL40, UL41, UL50, UL55, UL56, UL73, UL75, UL91, UL111.5, UL115, UL117, UL118, UL119, UL121, UL124, UL130, UL132, UL132/Toledo, UL139/Toledo, UL144/Toledo, UL147/Toledo, UL148/Toledo, UL149/Toledo, UL152/Towne, US3, US6, US7, US8, US9, US10, US11, US25, US30, and US31. These proteins would be expected to represent a collection of plasma membrane proteins, proteins that reside within intracellular membranous compartments, and secreted proteins.

We found evidence for one or more phosphorylation sites for the proteins encoded by the following nine ORFs: TRL5/IRL5, UL4, UL34, UL59, UL67, UL83, UL109, IRS1, and US36. Also, at least 17 of the coded proteins contain one or more hydroxylation sites: J1L, TRL9/IRL9, UL15, UL31, UL44, UL52, UL57, UL61, UL62, UL104, UL141/Toledo, UL150/Toledo, IRS1, US8, US32, TRS1, and J1S. Of these, UL61 has the largest number (i.e., nine) of such sites.

Thirty-one ORFs seem to be virus specific in the sense that they contain no notable, distinguishing features and no identifiable similarities to anything except for other viral sequences. In particular, at least 15 ORFs (J1L, TRL8/IRL8, UL60, UL62, UL64, UL66, UL81, UL90, UL106, UL110, UL137/Toledo, UL145/Toledo, UL148/Toledo, UL149/Toledo, US5, US33, and J1S) appear to be specific to the HCMV, at least 7 ORFs (UL25, UL29, UL36, UL47, UL88, UL91, and UL96) are specific to the herpesviruses, and at least 3 ORFs (UL53, UL85, and UL115) are virus specific, i.e., homologies can be found outside the herpesvirus family. The presence of many ORFs that are specific to HHV-5 but not other organisms is consistent with the view that HHV-5 is an ancient virus whose genome has evolved separately from its host for an extended period.

G-protein-coupled receptor homologues. Our analysis revealed, among the annotated ORFs, the presence of a group of 15 sequences that are likely to code for GPCR-like proteins. In particular, six ORFs, namely, UL33, UL78, US12, US14, US27, and US28, show strong homologies to members from well-understood GPCR families.

For an additional nine ORFs, our analysis clearly shows the presence of exactly seven transmembrane regions, thus implying a membership in the GPCR family: these nine ORFs are UL100, US13, US15, US16, US17, US18, US19, US20, and US21. It is notable that a substantial part of the US region contains ORFs that appear to be coding for proteins that have a GPCR-like sequence composition and GPCR-like characteristics. An interesting observation can be made with respect to the nine ORFs that our analysis places in this category: although there is support that each of these ORFs contains seven transmembrane regions, none of these ORFs show any notable global sequence similarity to members of known GPCR families. However, each of these ORFs is composed of transmembrane helices that appear to have been "borrowed" from distinct GPCR families and placed in an order that has not been previously encountered; in other words, the transmembrane helices of each of these nine ORFs have not appeared as a group in a single GPCR (sub)family before. We computationally verified the situation with the ORFs encoding newly predicted GPCR-like proteins (UL100, US13, US15, US16, US17, US18, US19, US20, and US21) as follows. By using each of these sequences as a query and employing standard similarity searching tools, we compared it with other sequences and in particular with those contained in the GPCRDB database (15); only weak conservation spanning part of the ORF sequences could be identified. Subsequently, we extracted the amino acid subsequences that our analysis indicated as corresponding to the seven transmembrane helices of the ORF under consideration and used these shorter regions as queries for a search of the GPCRDB database: in each case, we discovered notable similarities of these queries/helices to annotated transmembrane helices of known GPCRs, but each one of these "hits" came from a different functional subdivision of the GPCR superfamily, thus supporting our statement above.
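
The helix-by-helix search described above starts by cutting the predicted transmembrane segments out of each ORF. A minimal Python sketch of that extraction step follows; the function name, the record naming scheme, and the toy sequence and coordinates are hypothetical, and the subsequent GPCRDB search itself is not shown.

```python
def helix_queries(name, sequence, helices):
    """Return FASTA-formatted query records, one per predicted helix.

    helices: list of (start, end) positions, 1-based and inclusive, as
    residue coordinates are usually reported.
    """
    records = []
    for i, (start, end) in enumerate(helices, 1):
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"helix {i} is out of range: {start}-{end}")
        records.append(f">{name}_TM{i}_{start}-{end}\n{sequence[start - 1:end]}")
    return "\n".join(records)

# Toy sequence and coordinates, for illustration only.
print(helix_queries("toy", "ACDEFGHIKLMNPQRSTVWY", [(2, 5), (10, 14)]))
# >toy_TM1_2-5
# CDEF
# >toy_TM2_10-14
# LMNPQ
```

Each record can then be submitted independently to a similarity search, so that hits are attributed to individual helices rather than to the full-length ORF.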

The case of one ORF in particular, UL78, is discussed in detail elsewhere (30). There, we show local alignments between the transmembrane helices of UL78 and transmembrane helices of well-characterized transmembrane proteins. A PSI-BLAST (3) search of the public databases based on UL78 could only identify an ~70-amino-acid region of UL78 as weakly similar to the rhodopsin family but could not determine the local similarities mentioned above, a direct consequence of their short length. In our analysis (see data available online at the companion website), we also give an alignment of UL78 with P2Y7_HUMAN, a human purinoreceptor.

Revisiting the previously defined HHV-5 gene families. In the original study of the sequence of HHV-5, Chee et al. (8) defined and described several families that comprised subsets of the reported putative genes. This categorization into the families was based on the use of heuristics-based similarity searching approaches. As we have described in previous work (29, 30), this approach to family determination can lead to incorrect conclusions in a manner analogous to incorrectly annotating proteins by assuming that the "transitive closure" property applies: the fact that sequence A is similar to sequence B and that sequence B is similar to sequence C should not be used to imply that sequence A is similar to sequence C. An annotator will frequently exploit either the first or the best "hit" in the output of a database search carried out by using the FASTA (22), BLAST (2), or Smith-Waterman (36) tool: in the presence of small but well-conserved regions or of domains that are shared by distinct proteins (11) this choice is sometimes not optimal. In fact, the multidomain organization of proteins can lead to incorrectly annotated database entries and, by extension, to incorrect definitions of protein families.
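
The pitfall of assuming transitive closure can be made concrete with a toy Python sketch. The proteins, domain names, and functions below are hypothetical; similarity is reduced to "shares at least one domain" purely for illustration.

```python
def similar(x, y, domains):
    """Two proteins count as similar if they share at least one domain."""
    return bool(domains[x] & domains[y])

def single_linkage_families(domains):
    """Group proteins by chaining pairwise similarity (transitive closure)."""
    families = []
    for p in domains:
        linked = [f for f in families if any(similar(p, q, domains) for q in f)]
        merged = {p}.union(*linked)
        families = [f for f in families if f not in linked] + [merged]
    return families

# Hypothetical multidomain layout: B shares one domain with A and another with C.
toy = {"A": {"d1"}, "B": {"d1", "d2"}, "C": {"d2"}}
print(single_linkage_families(toy))  # one family containing A, B, and C
print(similar("A", "C", toy))        # False
```

Chaining places A and C in the same family even though they have no domain in common, which is the kind of family definition that a pattern-based multiple alignment is meant to expose.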

Chee et al. (8) defined eight gene families, namely, UL25, UL82, RL11, US1, US2, US6, US12, and US22. We analyzed these HCMV families with the help of MUSCA (23), a pattern discovery-based multiple sequence alignment algorithm, and reevaluated their definitions. The MUSCA algorithm is described in some detail above in Materials and Methods. The use of patterns in inducing multiple sequence alignments is particularly appropriate in the presence of shared domains. In the alignments that are described below, amino acids that participated in the patterns that induced the respective alignment are capitalized and are also colored based on their hydropathy. For some of these cases, the rather involved relationship of the considered sequences necessitated the manual selection of the regions to be aligned.

US1 family. The first family we examined was the US1 family comprising ORFs US1, US31, and US32. The multiple sequence alignment for these three sequences is shown in Fig. 1A and supports the original definition of this family.



FIG. 1. Pattern-based alignments of the US1 family (A), the US2 family (B), and of UL25 and UL35 (C). In all cases, only the amino acids which participated in the patterns that induced the alignment are shown in color; the different colors represent different hydropathies.

RL11 family. We next analyzed the RL11 family, which included 14 ORFs: IRL11, TRL11, IRL12, TRL12, TRL14, UL1, and UL4 through UL11. Each of these 14 sequences also was used as a query in a FASTA search against the remaining 13 members of the original family. In all searches the determined local similarities scored below the moderately conservative threshold value of 200, essentially indicating the presence of only local, weak similarities among these 14 sequences and questioning the original family definition.

US2 family. The US2 family, as originally defined, comprised the proteins US2 and US3 whose alignment is shown in Fig. 1B and supports the original family definition.

UL25 family and UL82 family. The original definition of the UL25 family comprised the ORFs UL25 and UL35. We again used MUSCA to align the members of this family (Fig. 1C). As is evident, the similarity of these two sequences is rather weak, thus putting the original definition of the UL25 family into question. A similar situation exists in the case of the UL82 family, which consists of UL82 and UL83: the sequence similarity is also weak (the alignment is not shown).

US6 family. The originally defined US6 family consists of the ORFs US6 through US11. Manual analysis of this sequence group indicates that US7, US8, US9, and US11 form a cluster; a multiple sequence alignment is shown in Fig. 2A. Fairly long regions are reasonably well conserved, thus supporting the hypothesis that these four sequences form a family. For the remaining two sequences, US6 and US10, small regions appear to be shared among the pairs US6-US8 and US8-US10, but no unifying similarities are evident; see Fig. 2B and C. To recapitulate, our analysis suggests that the US6 family definition should include only US7, US8, US9, and US11.



FIG. 2. Pattern-based alignments for four of the six members in the original US6 family (A); US6 with US8 from the original US6 family (B); US10 with US8 from the original US6 family (C); and US12, US13, US14, and US20 from the original US12 family (D). In all cases, only the amino acids which participated in the patterns that induced the alignment are shown in color based on their hydropathy.

US12 family. The original US12 family definition included the 10 ORFs US12 through US21. As in the case of the US6 family, manual analysis indicates that US12, US13, US14, and US20 form a cluster, and a multiple sequence alignment for them is shown in Fig. 2D. Similarly, US15 and US16 form a separate cluster, whereas US19 and US21 ought to be treated as singletons. The remaining two sequences, US17 and US18, form a weak cluster with US20 (cf. Fig. 2D), and an alignment of these three sequences, shown in Fig. 3A, indicates a rather low degree of sequence conservation. It is interesting that with the exception of the first conserved block, the regions shared among the group US17/US18/US20 are not the same as those shared by the group US12/US13/US14/US20. In summary, the original family definition ought to be revised and split into two groups: US12/US13/US14/US20 and US15/US16. MUSCA provides no support for inclusion of the remaining sequences: US17, US18, US19, and US21 should be removed from the family definition altogether.



FIG. 3. Pattern-based alignments for US17, US18, and US20 from the original US12 family (A) and of US22, US23, US24, and US26 (B) from the original US22 family.

US22 family. Finally, we examined the US22 family, whose original definition comprised the ORFs UL23, UL24, UL28, UL29, UL36, UL43, US22, US23, US24, US26, IRS1, and TRS1. As above, we prescreened these sequences by using manual analysis and determined the presence of two sequence clusters: IRS1/TRS1/US22/US23 and US22/US23/US24/US26. US22, US23, US24, and US26 form a cluster that exhibits a much better degree of local conservation: indeed, as can be seen from the multiple sequence alignment for this group in Fig. 3B, the conserved region spans their N-terminal regions but does not extend to their C termini. On the other hand, IRS1, TRS1, US22, and US23 form a cluster for which a multiple sequence alignment is shown in Fig. 4: the degree of observed local conservation is moderate. It is difficult to say what the best recommendation is for this family; a conservative approach would reduce the members of this family to only the group US22/US23/US24/US26.



FIG. 4. Pattern-based alignment of IRS1, TRS1, US22, and US23 of the original US22 family. Only the amino acids which participated in the patterns that induced the alignment are shown in color based on their hydropathy.

Table 2 summarizes the results of the analysis described above. For each of the originally defined families, we indicate how our results revise the family memberships of the respective ORFs.


TABLE 2. Revised groupings of the originally defined HCMV gene families


DISCUSSION
 
Using pattern-based approaches, we analyzed the HHV-5 genome (strains AD169, Towne, and Toledo). A brief summary of the annotations that we generated is presented in Table 1, and the complete set of our findings can be explored at http://cbcsrv.watson.ibm.com/virus/. Several enhancements to the annotation of the HHV-5 genome have been made possible by this approach. In particular, we have revised the number of ORFs that code for transmembrane proteins, we have revised the number of ORFs that code for proteins containing glycosylation sites, we have found evidence for phosphorylation sites in at least 9 ORFs and for hydroxylation sites in at least 17 ORFs, we have found support for the existence of 15 ORFs that code for proteins with GPCR-like sequence composition and characteristics and, finally, 31 of the ORFs appear to be virus specific, adding support to the view that HHV-5 is an ancient virus that evolved separately from its host for an extended period.

One of the more intriguing predictions from this annotation exercise is that HHV-5 encodes a larger number of GPCRs than previously thought. UL33, UL78, US27, and US28 have been identified previously as encoding GPCRs (for a review, see reference 32). Our analysis strongly predicts that US12 and US14 encode GPCRs, and this is corroborated by good-quality alignments with known GPCRs. An additional nine sequences, namely, the ORFs UL100, US13, and US15 to US21, each contain seven recognizable transmembrane domains; similarity searches indicate that the sequence fragments corresponding to each of the seven transmembrane regions of these nine ORFs match annotated transmembrane helices of known GPCRs, but of GPCRs with distinct functional behavior. In other words, the transmembrane helices in each of these nine ORFs have not appeared as a group in a single GPCR (sub)family before. If these proteins prove to have GPCR activity, analysis of their physiological effects might provide important new insights into the function of this class of proteins, since they are only distantly related to known GPCRs in terms of their seqlet content.
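To make the notion of "seven recognizable transmembrane domains" concrete, the following minimal sketch counts candidate membrane-spanning segments in a protein sequence. It is not the Bio-Dictionary annotation method used in this study; it substitutes the classical Kyte-Doolittle hydropathy scale with a sliding window, and the function names, the window length of 19, and the threshold of 1.6 are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full-length window in seq."""
    # Unknown residue codes are scored as 0.0 (an illustrative choice).
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def count_candidate_tm_segments(seq, window=19, threshold=1.6):
    """Count maximal runs of windows whose mean hydropathy >= threshold.

    Each run is treated as one candidate transmembrane segment.
    Hypothetical helper; the window and threshold are assumptions.
    """
    count = 0
    in_segment = False
    for mean in hydropathy_profile(seq, window):
        if mean >= threshold:
            if not in_segment:
                count += 1
                in_segment = True
        else:
            in_segment = False
    return count


if __name__ == "__main__":
    # Synthetic test protein: seven hydrophobic stretches separated by
    # charged loops. This is made-up data, not an HHV-5 sequence.
    loop = "DKRNE" * 3
    synthetic = loop + ("L" * 21 + loop) * 7
    print(count_candidate_tm_segments(synthetic))
```

A count of seven from such a scan would only flag a sequence as a candidate; as described above, the GPCR predictions rest on similarity of the individual transmembrane regions to annotated helices of known GPCRs.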

The possible hydroxylation of a set of HHV-5 proteins deserves further comment. Hydroxylation of proline residues influences protein assembly into a triple-stranded helix in the case of collagen (17) and directs the proteasome-dependent degradation of other proteins (6). It is possible, then, that the structure or half-life of one or more HHV-5 proteins is influenced by hydroxylation. Hydroxylation might influence the structure of UL61, which contains a domain that, like collagen, is predicted to form a triple helix and includes two possible hydroxylation sites. Further, since hydroxylation of hypoxia-inducible transcription factors controls the response of cells to changes in oxygen availability (6), one might speculate that hydroxylation of a viral protein might allow HHV-5 to sense and adapt to the oxygen environment of its host cell. For example, IRS1 and TRS1, two viral proteins with putative hydroxylation motifs, exhibit transcriptional activity in transfected cells (31, 38). Conceivably, their activity is modulated by hydroxylation.

Our analysis suggests numerous functional hypotheses based on similarities of domains within HHV-5 proteins to domains within other proteins of known function. For example, TRL4 is predicted to encode a transmembrane glycoprotein with similarity to the fas antigen ligand. Very little is known about this protein, although its mRNA is expressed in large quantities with early kinetics (14). There also is a report that an epitope-tagged (C-terminal) version of the protein is localized to the nucleolus within transfected cells (5). Our annotation does not anticipate a nucleolar localization, and it is possible that the reported localization was perturbed by the epitope tag. HHV-5 infection has been reported to induce the expression of the cellular fas ligand in certain cell types (9), and the expression of fas ligand has been proposed to assist in virus escape from immune surveillance (37). Consequently, the TRL4-encoded protein might prove to antagonize the host antiviral response by serving as a viral mimic of the cellular fas ligand. This hypothesis, as well as additional hypotheses generated by our annotation, can be tested experimentally.

In addition to reannotating the complete HCMV genome, we also reevaluated the original HHV-5 gene family definitions with the help of a pattern-based multiple sequence alignment algorithm and proposed new groupings for the original family members; we also identified cases in which the detectable sequence similarities observed in our analyses were not strong enough to support family membership. Our analysis with MUSCA of protein families was limited to a reevaluation of the families proposed by Chee et al. (8). Additional families might exist that could be identified by using pattern-based approaches.

The present analysis focused on the original annotated set of HHV-5 ORFs (8). It is possible that the genome contains substantially more coding ORFs than we have analyzed here. We employed the MacVector genome analysis program (version 7.0; Oxford Molecular Group), utilizing a human codon bias, to identify all ORFs within the HHV-5 AD169 genome that encode polypeptides that are at least 50 amino acids long. We also analyzed the genome by using the TESTCODE algorithm (12). These two filters identified ca. 700 ORFs encoding polypeptides >50 amino acids in size and starting with an AUG, as well as many more when the requirement for an N-terminal AUG was dropped (E. Murphy, I. Rigoutsos, and T. Shenk, unpublished data). We are currently utilizing the BDGF algorithm (33), a Bio-Dictionary-based gene finder, to validate our expectation that the HHV-5 genome contains substantially more coding ORFs than predicted in the original annotation.
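The kind of length-based ORF scan described above can be sketched as follows. This is an illustration only and does not reproduce MacVector, TESTCODE, or BDGF: it applies no codon-bias or coding-potential filter, and the function names, the coordinate convention, and the `require_atg` switch are assumptions introduced here. Sequences are written as DNA, so the AUG start codon appears as ATG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=50, require_atg=True):
    """Yield (strand, frame, start, end) for ORFs in all six frames.

    start and end are 0-based offsets on the scanned strand; end is the
    offset of the stop codon, so (end - start) // 3 is the number of
    encoded amino acids. With require_atg=False an ORF begins at the
    first codon after the previous stop (or at the start of the frame).
    ORFs that run off the end without a stop codon are not reported.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        yield (strand, frame, start, i)
                    start = None
                elif start is None and (codon == "ATG" or not require_atg):
                    start = i


if __name__ == "__main__":
    # Synthetic sequence (not HHV-5 data): ATG, 60 alanine codons, stop.
    toy = "ATG" + "GCT" * 60 + "TAA"
    for orf in find_orfs(toy):
        print(orf)
```

Calling `find_orfs(genome, require_atg=False)` corresponds to dropping the requirement for an N-terminal AUG, which, as noted above, yields many more candidates.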


ACKNOWLEDGMENTS
 
We thank Edward Mocarski for insightful comments on the manuscript.

This work was supported in part by grants from the National Institutes of Health to T.S. (CA82396, CA85786, and CA87661).


FOOTNOTES
 
* Corresponding author. Mailing address: Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, PO Box 218, Yorktown Heights, NY 10598. Phone: (914) 945-1384. Fax: (914) 945-4104. E-mail: rigoutso@us.ibm.com.


REFERENCES
 
1. Alford, C. A., and W. J. Britt. 1996. Cytomegalovirus, p. 1981-2010. In B. N. Fields, D. M. Knipe, and P. M. Howley (ed.), Fields virology, 3rd ed. Lippincott-Raven Publishers, Philadelphia, Pa.
2. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
3. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
4. Bairoch, A., and R. Apweiler. 2000. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48.
5. Bergamini, G., M. Reschke, M. C. Battista, M. C. Boccuni, F. Campanini, A. Ripalti, and M. P. Landini. 1998. The major open reading frame of the β2.7 transcript of human cytomegalovirus: in vitro expression of a protein posttranscriptionally regulated by the 5' region. J. Virol. 72:8425-8429.
6. Bruick, R. K., and S. L. McKnight. 2002. Oxygen sensing gets a second wind. Science 295:807-808.
7. Cha, T. A., E. Tom, G. W. Kemble, G. M. Duke, E. S. Mocarski, and R. R. Spaete. 1996. Human cytomegalovirus clinical isolates carry at least 19 genes not found in laboratory strains. J. Virol. 70:78-83.
8. Chee, M., A. Bankier, S. Beck, R. Bohni, C. Brown, R. Cerny, T. Horsnell, C. Hutchinson III, T. Kouzarides, J. Martignetti, et al. 1990. Analysis of the protein-coding content of the sequence of human cytomegalovirus strain AD169. Curr. Top. Microbiol. Immunol. 154:125-169.
9. Cinatl, J., Jr., R. Blaheta, M. Bittoova, M. Scholz, S. Margraf, J.-U. Vogel, J. Cinatl, and H. W. Doerr. 2000. Decreased neutrophil adhesion to human cytomegalovirus-infected retinal pigment epithelial cells is mediated by virus-induced upregulation of Fas ligand independent of neutrophil apoptosis. J. Immunol. 165:4405-4413.
10. Dargan, D. J., F. E. Jamieson, J. MacLean, A. Dolan, C. Addison, and D. J. McGeoch. 1997. The published DNA sequence of human cytomegalovirus strain AD169 lacks 929 base pairs affecting genes UL42 and UL43. J. Virol. 71:9833-9836.
11. Doolittle, R. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287-314.
12. Fickett, J. W. 1982. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10:5303-5318.
13. Floratos, A., I. Rigoutsos, L. Parida, G. Stolovitzky, and Y. Gao. 1999. Sequence homology detection through large-scale pattern discovery, p. 164-173. In Proceedings of the Third Annual ACM International Conference on Computational Molecular Biology (RECOMB '99). ACM, Lyon, France.
14. Greenaway, P. J., and G. W. Wilkinson. 1987. Nucleotide sequence of the most abundantly transcribed early gene of human cytomegalovirus strain AD169. Virus Res. 7:17-31.
15. Horn, F., J. Weare, M. W. Beukers, S. Hörsch, A. Bairoch, W. Chen, Ø. Edvardsen, F. Campagne, and G. Vriend. 1998. GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res. 26:277-281.
16. Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. A new approach to protein fold recognition. Nature 358:86-89.
17. Kivirikko, K. I., and T. Pihlajaniemi. 1998. Collagen hydroxylases and the protein disulfide isomerase subunit of prolyl 4-hydroxylases. Adv. Enzymol. Relat. Areas Mol. Biol. 72:325-400.
18. Mocarski, E. S. 1996. Cytomegaloviruses and their replication, p. 2447-2492. In B. N. Fields, D. M. Knipe, and P. M. Howley (ed.), Fields virology, 3rd ed. Lippincott-Raven Publishers, Philadelphia, Pa.
19. Mocarski, E. S., and C. T. Courcelle. 2001. Cytomegaloviruses and their replication, p. 2629-2673. In D. M. Knipe, P. M. Howley, D. E. Griffin, R. A. Lamb, M. A. Martin, B. Roizman, and S. E. Straus (ed.), Fields virology, 4th ed., vol. 2. Lippincott-Raven Publishers, Philadelphia, Pa.
20. Mocarski, E. S., M. N. Prichard, C. S. Tan, and J. M. Brown. 1997. Reassessing the organization of the UL42-UL43 region of the human cytomegalovirus strain AD169 genome. Virology 239:169-175.
21. Novotny, J., I. Rigoutsos, D. Coleman, and T. Shenk. 2001. In silico structural and functional analysis of the human cytomegalovirus (HHV5) genome. J. Mol. Biol. 310:1151-1166.
22. Pearson, W. R., and D. J. Lipman. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
23. Parida, L., A. Floratos, and I. Rigoutsos. 1999. An approximation algorithm for alignment of multiple sequences using motif discovery. J. Comb. Optim. 3:247-275.
24. Pass, R. F. 2001. Cytomegalovirus, p. 2675-2705. In D. M. Knipe, P. M. Howley, D. E. Griffin, R. A. Lamb, M. A. Martin, B. Roizman, and S. E. Straus (ed.), Fields virology, 4th ed., vol. 2. Lippincott-Raven Publishers, Philadelphia, Pa.
25. Rigoutsos, I., and A. Floratos. 1998. Combinatorial pattern discovery in biological sequences: the Teiresias algorithm. Bioinformatics 14:55-67.
26. Rigoutsos, I., and A. Floratos. 1998. Motif discovery without alignment or enumeration, p. 221-227. In Proceedings of the Second Annual ACM International Conference on Computational Molecular Biology (RECOMB '98). ACM, New York, N.Y.
27. Rigoutsos, I., Y. Gao, A. Floratos, and L. Parida. 1999. Building dictionaries of 1D and 3D motifs by mining the unaligned 1D sequences of 17 archaeal and bacterial genomes, p. 223-233. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB '99). AAAI Press, Heidelberg, Germany.
28. Rigoutsos, I., A. Floratos, L. Parida, Y. Gao, and D. Platt. 2000. The emergence of pattern discovery techniques in computational biology. Metabolic Eng. 2:159-177.
29. Rigoutsos, I., A. Floratos, C. Ouzounis, Y. Gao, and L. Parida. 1999. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins Struct. Funct. Genet. 37:264-277.
30. Rigoutsos, I., T. Huynh, A. Floratos, L. Parida, and D. Platt. 2002. Dictionary-driven protein annotation. Nucleic Acids Res. 30:3901-3916.
31. Romanowski, M. J., and T. Shenk. 1997. Characterization of the human cytomegalovirus IRS1 and TRS1 genes: a second immediate early transcription unit within IRS1 whose product antagonizes transcriptional activation. J. Virol. 71:1485-1496.
32. Rosenkilde, M. M., M. Waldhoer, H. R. Luttichau, and T. W. Schwartz. 2001. Virally encoded 7TM receptors. Oncogene 20:1582-1593.
33. Shibuya, T., and I. Rigoutsos. 2002. Dictionary-driven microbial gene finding. Nucleic Acids Res. 30:2710-2725.
34. Sippl, M., and S. Weitkus. 1992. Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins Struct. Funct. Genet. 13:258-271.
35. Smith, J. A., and G. S. Pari. 1995. Human cytomegalovirus UL102 gene. J. Virol. 69:1734-1740.
36. Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
37. Smyth, M. J., and J. A. Trapani. 1998. The relative role of lymphocyte granule exocytosis versus death receptor-mediated cytotoxicity in viral pathophysiology. J. Virol. 72:1-9.
38. Stasiak, P. C., and E. S. Mocarski. 1992. Transcription of the cytomegalovirus ICP36 gene requires the alpha gene product TRS1 in addition to IE1 and IE2. J. Virol. 66:1050-1058.






