Journal of Virology, April 2003, p. 4326-4344, Vol. 77, No. 7
0022-538X/03/$08.00+0 DOI: 10.1128/JVI.77.7.4326-4344.2003
Copyright © 2003, American Society for Microbiology. All Rights Reserved.
Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, New York 10598 (1); Victor Chang Cardiac Research Institute, Darlinghurst, New South Wales 2010, Australia (2); Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544 (3)
Received 10 July 2002/ Accepted 23 December 2002
Of the numerous approaches that have been proposed over the years, the determination of regions of similarity between a novel protein of unknown function and one or more database proteins with known annotation has been the method of choice. Such a determination allows one to predict the common region in the protein of unknown function as exhibiting the functional characteristics of the respective region from the annotated database protein through what is frequently called a "guilty-by-association" approach. These methods are also known as homology-based methods, and they have led to significant advances in protein annotation (2, 22, 36).
During the latter half of the 1990s, pattern-based approaches steadily gained ground as the methods of choice for solving various computational problems in molecular biology (28). One such algorithm is MUSCA, a multiple sequence alignment algorithm, which we described in an earlier study (23). MUSCA begins by using the Teiresias pattern discovery algorithm (25, 26) to identify patterns that are shared by k or more input sequences. During its second phase, MUSCA exploits the location of the discovered patterns to anchor and induce alignments of increasingly larger input fragments. Because of the manner in which it operates, MUSCA is uniquely suited to handling inputs in which one or more domains are shared among the sequences to be processed. In a parallel effort, we also described a pattern-based approach to the problem of protein annotation (30). The approach is centered on the Bio-Dictionary, an exhaustive collection of amino acid patterns (hereafter referred to as seqlets) that completely covers the natural sequence space of proteins defined by the currently available sequences. The Bio-Dictionary is computed by carrying out pattern discovery with the Teiresias algorithm (25, 26) on very large databases of biological sequences such as SwissProt/TrEMBL (4). The seqlets contained in the Bio-Dictionary can capture functional and structural signals that have been reused during evolution both within and across families of related proteins (27, 28, 29). This new method uses the seqlets contained in the Bio-Dictionary to exhaustively annotate a query protein by using the information that is available in a well-maintained database, such as SwissProt/TrEMBL, and employs a weighted, position-specific scoring scheme that is not affected by the overrepresentation of well-conserved proteins and protein fragments in the public databases.
As we showed elsewhere (30) and for several published genomes, this Bio-Dictionary-based approach matched the quality and sensitivity of the annotations that were obtained with semiautomated approaches while requiring only a very small investment of computational resources.
In an earlier study (21), we examined the annotation of human cytomegalovirus (HCMV; also known as human herpesvirus 5 [HHV-5]) by using ProCeryon (a program for fold recognition and protein structure analysis; ProCeryon Biosciences), a structure prediction program that is based on threading (16, 34). Each of the HHV-5 ORFs was threaded with the ProCeryon algorithm, and a structural and functional hypothesis was generated. As anticipated and due to the large number of membrane proteins coded for by this genome, the threading approach provided hypotheses for a little less than 50% of the coding regions. The desire to further push the annotation envelope for HHV-5 led us to the sequence-based work that we discuss here.
HHV-5 is a member of the betaherpesviruses, a subgroup of herpesviruses with common growth characteristics (1). Considered the prototypical betaherpesvirus, HHV-5 spreads to the majority of the population at an early age, causing asymptomatic infections in healthy individuals. However, it can produce life-threatening disease in immunosuppressed individuals and as a result of congenital infections (24).
The HHV-5 virion contains a linear, double-stranded DNA genome (approximately 230 kbp) encased in an icosahedral capsid (8). The capsid is surrounded by a protein matrix and a lipid envelope with integral glycoproteins. Two major unique regions, denoted as long (UL) and short (US), can be identified in the viral genome and are bracketed by repeated domains. The AD169 strain of HHV-5 was sequenced in 1990 (8), and 208 ORFs were predicted as coding for proteins at least 100 amino acids in length. More recently, an insertion that modifies ORFs 42 and 43 was identified in the AD169 strain (10, 20), and analysis of a cDNA sequence has revealed that UL101 does not exist, whereas UL102 is modified (35). Finally, several additional ORFs were found in the Towne and Toledo strains of HHV-5 (7). The repeated ORFs include J1L/J1I/J1S, which are partially related; TRL1 through TRL13, which at a second location are labeled IRL1 through IRL13; TRL14, which shares an N-terminal region with its IRL14 counterpart; and IRS1/TRS1, which are half repeated and half unique. The unique ORFs are UL1 to UL154 and US1 to US36, with some ORFs receiving fractional designations such as UL21.5, UL48.5, and UL80.5.
We describe below the application of the Bio-Dictionary to the in silico annotation of the HHV-5 genome. We have generated and processed a "composite" genome that is the union of the originally reported genes from AD169, the three modified ORFs in AD169, and the genes from the Towne and Toledo strains. The method we used to annotate the composite genome is described in detail elsewhere (30), and an implementation of it is available online (http://cbcsrv.watson.ibm.com/Tpa.html). The functional hypotheses, as well as numerous features that we identified in these proteins, are also available online (http://cbcsrv.watson.ibm.com/virus/); at this website, one can find summaries of each protein's functional annotation and information about the nature and location of posttranslational modifications, active sites, identifiable domains (e.g., transmembrane), and alignments with other proteins from the public databases, as well as detailed information on the similarity of each annotated amino acid sequence to archaeal, bacterial, eukaryotic, and viral sequences. For completeness, we have also included the results from our earlier, threading-based annotation of HHV-5 (21). Finally, we used MUSCA to reanalyze the originally proposed HHV-5 ORF families and we present our conclusions here.
Bio-Dictionary-based automated protein annotation. The computational tool that we used for the annotation component of this work relies on the Bio-Dictionary: the latter was originally created by using the Teiresias pattern discovery algorithm (25, 26) to process the GenPept database as a whole (29); this computation has since been repeated at regular intervals on the increasingly larger installments of the SwissProt/TrEMBL database (4). The Bio-Dictionary is a very large collection of sequence patterns, referred to as seqlets. Seqlets are strings of literals interspersed with zero or more wild cards: a unique amino acid, or a small set of permitted amino acids, can occupy the location of each literal, whereas the positions corresponding to the wild cards can be occupied by any amino acid. For example, the seqlet [KR].K[ILMV][AG]L describes all hexapeptides that begin with either a lysine or an arginine; followed by any one of the 20 amino acids; followed by a lysine; followed by an isoleucine, leucine, methionine, or valine; followed by alanine or glycine; and finally ending with a leucine. The Bio-Dictionary seqlets capture functional and structural signals that extend beyond protein family boundaries, which is not an unexpected result considering the manner in which the collection is produced.
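The seqlet notation used in the example coincides with regular-expression character classes and the "." wild card, so instances of a seqlet can be located with an ordinary regex engine. The following is a minimal sketch under that assumption; the function name `find_instances` and the test sequence are hypothetical and are not part of the Bio-Dictionary software.

```python
import re

def find_instances(seqlet, sequence):
    """Return (start, matched_text) for every instance of a seqlet.

    Assumes the seqlet uses only bracketed residue sets, single-letter
    literals, and '.' wild cards, which regex syntax interprets the same way.
    A lookahead is used so that overlapping instances are also reported.
    Positions are 0-based.
    """
    pattern = re.compile(f"(?=({seqlet}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Hypothetical toy sequence, chosen only to contain two instances.
toy_sequence = "MARKKVALTKEKLGLQ"
hits = find_instances("[KR].K[ILMV][AG]L", toy_sequence)
print(hits)
```

Both reported hexapeptides begin with K or R, carry a lysine in the third position, and end with a leucine, as the verbal description of the seqlet requires.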
An additional property of this collection is that it nearly completely covers the currently known sequence space of natural proteins and can thus be used in lieu of the original processed sequence database to solve a gamut of problems, including gene finding (33) and protein annotation (30). For the purposes of protein annotation, each seqlet is augmented with additional information pertaining to functional, structural, or other properties of the seqlet's known instances in proteins that have been studied computationally and experimentally.
To annotate a previously uncharacterized protein, instances of all of the seqlets in the Bio-Dictionary are sought in the sequence under consideration: for seqlets that are present in the sequence, their respective meanings are used to label the part of the sequence corresponding to the seqlet's instance in a straightforward "guilty-by-association" approach. The meanings of overlapping seqlets are subsequently accumulated and coalesced into hypotheses about the function of the processed protein, the presence of various domains and active sites, the nature and location of posttranslational modifications, etc. Details on the computational aspects of the Bio-Dictionary-based protein annotation are given elsewhere by Rigoutsos et al. (30).
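The "guilty-by-association" labeling step can be sketched as follows. This is a simplified illustration, not the published implementation (30): the toy dictionary, its labels, and the function `annotate` are hypothetical, and the coalescing shown here merely merges overlapping or adjacent labeled positions rather than applying the weighted scoring scheme described in the text.

```python
import re
from collections import defaultdict

def annotate(sequence, dictionary):
    """Label regions of a sequence with the meanings of matching seqlets.

    dictionary maps a seqlet (regex-compatible string) to a list of labels.
    Returns label -> list of (start, end) regions, 0-based and inclusive.
    """
    covered = defaultdict(set)  # label -> set of covered positions
    for seqlet, labels in dictionary.items():
        for m in re.finditer(f"(?=({seqlet}))", sequence):
            span = range(m.start(), m.start() + len(m.group(1)))
            for label in labels:
                covered[label].update(span)

    # Coalesce contiguous positions that carry the same label into regions.
    regions = {}
    for label, positions in covered.items():
        merged = []
        for p in sorted(positions):
            if merged and p == merged[-1][1] + 1:
                merged[-1][1] = p
            else:
                merged.append([p, p])
        regions[label] = [tuple(r) for r in merged]
    return regions

# Hypothetical seqlets and labels, for illustration only.
toy_dictionary = {
    "[KR].K[ILMV][AG]L": ["toy-domain-A"],
    "VAL.L": ["toy-domain-A"],
    "M.R": ["toy-signal-B"],
}
regions = annotate("MARKKVALGLQ", toy_dictionary)
print(regions)
```

In this toy run, two overlapping seqlets that share a label are merged into a single labeled region, which mirrors how overlapping seqlet meanings are accumulated into one hypothesis.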
MUSCA (pattern-based multiple sequence alignment). MUSCA is a two-phase algorithm for computing the multiple sequence alignment of a set of N sequences (23). During the first phase, MUSCA uses Teiresias to discover patterns (motifs) that are common among K or more of the input's N sequences. These motifs are used in the second phase to generate and report the multiple sequence alignment. In particular, the motifs are first mapped to vertices of a directed graph. If two motifs pi and pj do not occur simultaneously in any sequence, then there is no edge connecting the corresponding vertices of the graph. The vertices corresponding to pi and pj are connected by an edge with direction from pi to pj if pi occurs before pj in all of the sequences where they both appear. The label of an edge depends on whether pi and pj are pairwise incompatible, whether they have overlapping instances, or whether they are pairwise compatible but do not overlap. Vertices that are joined by incompatible edges or participate in inconsistent cycles form the basic nonfeasible sets. After the vertices of the reduced graph are labeled with the help of a simple cost function, a greedy algorithm is used to obtain a solution to a weighted set-cover problem that essentially identifies the minimum number of motifs/vertices to be removed. The resulting graph is used to determine the blocks that involve overlapping feasible motifs. The final alignment is obtained by properly aligning the blocks and padding the remaining gaps.
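The graph-construction step can be illustrated with a short sketch. It is not the MUSCA implementation (23). It assumes, for simplicity, that each motif has at most one instance per sequence, and it treats two motifs as "incompatible" when their relative order differs between sequences; both are assumptions made for this illustration, as are the function name and the toy data.

```python
from itertools import combinations

def build_motif_graph(occurrences):
    """Build labeled edges between motifs.

    occurrences maps motif -> {sequence_id: (start, length)}.
    Returns {(source, target): label}, where label is one of
    'compatible', 'overlapping', or 'incompatible'.
    """
    edges = {}
    for p, q in combinations(sorted(occurrences), 2):
        shared = occurrences[p].keys() & occurrences[q].keys()
        if not shared:
            continue  # motifs never co-occur: no edge
        p_first = [occurrences[p][s][0] < occurrences[q][s][0] for s in shared]
        if all(p_first):
            src, dst = p, q
        elif not any(p_first):
            src, dst = q, p
        else:
            # Order differs between sequences (assumed meaning of incompatible).
            edges[(p, q)] = "incompatible"
            continue
        # Instances overlap if the earlier motif extends past the later start.
        overlap = any(
            occurrences[src][s][0] + occurrences[src][s][1] > occurrences[dst][s][0]
            for s in shared
        )
        edges[(src, dst)] = "overlapping" if overlap else "compatible"
    return edges

# Hypothetical motif instances in three toy sequences.
toy_occurrences = {
    "p1": {"s1": (0, 4), "s2": (2, 4)},
    "p2": {"s1": (3, 5), "s2": (10, 5)},
    "p3": {"s1": (20, 3), "s2": (30, 3)},
    "p4": {"s1": (25, 3), "s2": (1, 3)},
    "p5": {"s3": (0, 3)},
}
edges = build_motif_graph(toy_occurrences)
for pair, label in sorted(edges.items()):
    print(pair, label)
```

In the toy data, p4 precedes the other motifs in one sequence and follows them in another, so every edge touching p4 is labeled incompatible; such vertices would be candidates for removal in the set-cover step.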
The alignments that MUSCA generates are independent of the order in which the input sequences are given. The algorithm is uniquely suited to processing inputs in which shared domains are present in only some of the sequences, as well as inputs that comprise two or more subsets with high conservation within each subset but low conservation across subsets.
Benchmarking our approach. In a recent study (30), we discussed and tested the Bio-Dictionary-based protein annotation method in detail and with the help of many diverse input sequences. We demonstrated that this method provides substantial benefits versus traditional approaches in terms of objective annotation capability, and the reader is referred to that discussion for more information. It is important to stress that our approach makes use of a weighted, position-specific scoring scheme that is not affected by the overrepresentation of well-conserved proteins and protein fragments that exist in the public databases. A given feature that has been associated with a region of the query is assigned a normalized score between 0 and 100; the resulting figure is an estimate of the percentage of the total number of the region's distinct instances that have also been annotated as sharing the feature. Empirically, we determined that scores between 90 and 100 correspond to good, conservative results: for the HHV-5 annotation presented below, we only considered features whose confidence estimates fell in this interval of values.
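The interpretation of the normalized score as a percentage of annotated instances can be made concrete with a small sketch. This deliberately omits the weighted, position-specific part of the scheme, whose details are given in reference 30; the function names, the feature names, and the instance counts below are hypothetical.

```python
def feature_score(instance_features, feature):
    """Percentage (0 to 100) of distinct instances annotated with a feature.

    instance_features is a list with one set of feature names per distinct
    instance of the query region.
    """
    if not instance_features:
        return 0.0
    hits = sum(1 for feats in instance_features if feature in feats)
    return 100.0 * hits / len(instance_features)

def confident_features(instance_features, threshold=90.0):
    """Keep only the features whose score reaches the chosen threshold."""
    all_features = set()
    for feats in instance_features:
        all_features |= feats
    scores = {f: feature_score(instance_features, f) for f in all_features}
    return {f: s for f, s in scores.items() if s >= threshold}

# Hypothetical region with 20 distinct instances in the database.
toy_instances = (
    [{"transmembrane", "glycosylation"}] * 10
    + [{"transmembrane"}] * 9
    + [set()]
)
print(feature_score(toy_instances, "transmembrane"))
print(confident_features(toy_instances))
```

With the 90-to-100 interval used for the HHV-5 annotation, only the feature shared by 19 of the 20 toy instances would be retained.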
General findings. Table 1 summarizes our findings for each of the annotated ORFs of the composite genome. The ORFs are listed in the order in which they appear on the composite HHV-5 genome. In the first three columns, and for each ORF, we list the ORF's name, its accession number, and a functional hypothesis. The fourth column lists features of each predicted protein such as binding sites, posttranslationally modified sites, etc.
TABLE 1. List of brief annotations for the composite genome of HCMV
Our analysis predicts that as many as 144 proteins, approximately half of them glycoproteins, are probably integral membrane proteins. Also, for at least 49 of the analyzed proteins we find evidence for the presence of a signal peptide. These proteins are TRL2/IRL2, TRL10/IRL10, TRL11/IRL11, UL4, UL11, UL12, UL13, UL14, UL16, UL18, UL20, UL21, UL21.5, UL31, UL37, UL40, UL41, UL50, UL55, UL56, UL73, UL75, UL91, UL111.5, UL115, UL117, UL118, UL119, UL121, UL124, UL130, UL132, UL132/Toledo, UL139/Toledo, UL144/Toledo, UL147/Toledo, UL148/Toledo, UL149/Toledo, UL152/Towne, US3, US6, US7, US8, US9, US10, US11, US25, US30, and US31. These proteins would be expected to represent a collection of plasma membrane proteins, proteins that reside within intracellular membranous compartments, and secreted proteins.
We found evidence for one or more phosphorylation sites for the proteins encoded by the following nine ORFs: TRL5/IRL5, UL4, UL34, UL59, UL67, UL83, UL109, IRS1, and US36. Also, at least 17 of the coded proteins contain one or more hydroxylation sites: J1L, TRL9/IRL9, UL15, UL31, UL44, UL52, UL57, UL61, UL62, UL104, UL141/Toledo, UL150/Toledo, IRS1, US8, US32, TRS1, and J1S. Of these, UL61 has the largest number (i.e., nine) of such sites.
Thirty-one ORFs seem to be virus specific in the sense that they contain no notable, distinguishing features and no identifiable similarities to anything except for other viral sequences. In particular, at least 15 ORFs (J1L, TRL8/IRL8, UL60, UL62, UL64, UL66, UL81, UL90, UL106, UL110, UL137/Toledo, UL145/Toledo, UL148/Toledo, UL149/Toledo, US5, US33, and J1S) appear to be specific to HCMV, at least 7 ORFs (UL25, UL29, UL36, UL47, UL88, UL91, and UL96) are specific to the herpesviruses, and at least 3 ORFs (UL53, UL85, and UL115) are virus specific, i.e., homologies can be found outside the herpesvirus family. The presence of many ORFs that are specific to HHV-5 but not other organisms is consistent with the view that HHV-5 is an ancient virus whose genome has evolved separately from its host for an extended period.
G-protein-coupled receptor homologues. Our analysis revealed, among the annotated ORFs, the presence of a group of 15 sequences that are likely to code for GPCR-like proteins. In particular, six ORFs, namely, UL33, UL78, US12, US14, US27, and US28, show strong homologies to members from well-understood GPCR families.
For an additional nine ORFs, our analysis clearly shows the presence of exactly seven transmembrane regions, thus implying a membership in the GPCR family: these nine ORFs are UL100, US13, US15, US16, US17, US18, US19, US20, and US21. It is notable that a substantial part of the US region contains ORFs that appear to be coding for proteins that have a GPCR-like sequence composition and GPCR-like characteristics. An interesting observation can be made with respect to the nine ORFs that our analysis places in this category: although there is support that each of these ORFs contains seven transmembrane regions, none of these ORFs show any notable global sequence similarity to members of known GPCR families. However, each of these ORFs is composed of transmembrane helices that appear to have been "borrowed" from distinct GPCR families and placed in an order that has not been previously encountered; in other words, the transmembrane helices of each of these nine ORFs have not appeared as a group in a single GPCR (sub)family before. We computationally verified the situation with the ORFs encoding newly predicted GPCR-like proteins (UL100, US13, US15, US16, US17, US18, US19, US20, and US21) as follows. By using each of these sequences as a query and employing standard similarity searching tools, we compared it with other sequences and in particular with those contained in the GPCRDB database (15); only weak conservation spanning part of the ORF sequences could be identified. Subsequently, we extracted the amino acid subsequences that our analysis indicated as corresponding to the seven transmembrane helices of the ORF under consideration and used these shorter regions as queries for a search of the GPCRDB database: in each case, we discovered notable similarities of these queries/helices to annotated transmembrane helices of known GPCRs, but each one of these "hits" came from a different functional subdivision of the GPCR superfamily, thus supporting our statement above.
The case of one ORF in particular, UL78, is discussed in detail elsewhere (30). Therein, we show local alignments between the transmembrane helices of UL78 and transmembrane helices of well-characterized transmembrane proteins. A PSI-BLAST (3) search of the public databases based on UL78 could only identify an approximately 70-amino-acid region of UL78 as weakly similar to the rhodopsin family but could not determine the local similarities mentioned above, a direct consequence of their short length. In our analysis (see data available online at the companion website), we also give an alignment of UL78 with P2Y7_HUMAN, a human purinoreceptor.
Revisiting the previously defined HHV-5 gene families. In the original study of the sequence of HHV-5, Chee et al. (8) defined and described several families that comprised subsets of the reported putative genes. This categorization into the families was based on the use of heuristics-based similarity searching approaches. As we have described in previous work (29, 30), this approach to family determination can lead to incorrect conclusions in a manner analogous to incorrectly annotating proteins by assuming that the "transitive closure" property applies: the fact that sequence A is similar to sequence B and that sequence B is similar to sequence C should not be used to imply that sequence A is similar to sequence C. An annotator will frequently exploit either the first or the best "hit" in the output of a database search carried out by using the FASTA (22), BLAST (2), or Smith-Waterman (36) tool: in the presence of small but well-conserved regions or of domains that are shared by distinct proteins (11) this choice is sometimes not optimal. In fact, the multidomain organization of proteins can lead to incorrectly annotated database entries and, by extension, to incorrect definitions of protein families.
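The failure of "transitive closure" in the presence of shared domains can be shown with a toy example. The similarity test below (counting shared 5-mers) is a stand-in chosen for brevity and is not FASTA, BLAST, or Smith-Waterman; the two "domains" are arbitrary hypothetical strings.

```python
def kmers(sequence, k=5):
    """Return the set of all k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def similar(a, b, k=5, min_shared=3):
    """Toy similarity test: the two sequences share at least min_shared k-mers."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared

# Two unrelated hypothetical domains.
domain_x = "MKTAYIAKQR"
domain_y = "GLSDGEWQLV"

seq_a = domain_x             # single-domain protein
seq_b = domain_x + domain_y  # two-domain protein
seq_c = domain_y             # single-domain protein

print(similar(seq_a, seq_b))  # A and B share domain X
print(similar(seq_b, seq_c))  # B and C share domain Y
print(similar(seq_a, seq_c))  # A and C share nothing
```

The two-domain sequence B is similar to both A and C, yet A and C have nothing in common, so grouping all three into one family on the basis of the pairwise hits would be unjustified.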
Chee et al. (8) defined eight gene families, namely, UL25, UL82, RL11, US1, US2, US6, US12, and US22. We analyzed these HCMV families with the help of MUSCA (23), a pattern discovery-based multiple sequence alignment algorithm, and reevaluated their definitions. The MUSCA algorithm is described in some detail above in Materials and Methods. The use of patterns in inducing multiple sequence alignments is particularly appropriate in the presence of shared domains. In the alignments that are described below, amino acids that participated in the patterns that induced the respective alignment are capitalized and are also colored based on their hydropathy. For some of these cases, the rather involved relationship of the considered sequences necessitated the manual selection of the regions to be aligned.
US1 family. The first family we examined was the US1 family comprising ORFs US1, US31, and US32. The multiple sequence alignment for these three sequences is shown in Fig. 1A and supports the original definition of this family.
FIG. 1. Pattern-based alignments of the US1 family (A), the US2 family (B), and of UL25 and UL35 (C). In all cases, only the amino acids which participated in the patterns that induced the alignment are shown in color; the different colors represent different hydropathies.
US2 family. The US2 family, as originally defined, comprised the proteins US2 and US3; their alignment, shown in Fig. 1B, supports the original family definition.
UL25 family and UL82 family. The original definition of the UL25 family comprised the ORFs UL25 and UL35. We again used MUSCA to align the members of this family (Fig. 1C). As is evident, the similarity of these two sequences is rather weak, thus putting the original definition of the UL25 family into question. A similar situation exists in the case of the UL82 family that consists of UL82 and UL83: the remaining sequence similarity is also weak (the alignment is not shown).
US6 family. The originally defined US6 family consists of the ORFs US6 through US11. Manual analysis of this sequence group indicates that US7, US8, US9, and US11 form a cluster; a multiple sequence alignment of these four sequences is shown in Fig. 2A. Fairly long regions are reasonably well conserved, thus supporting the hypothesis that these four sequences form a family. For the remaining two sequences, US6 and US10, small regions appear to be shared among the pairs US6-US8 and US8-US10, but no unifying similarities are evident; see Fig. 2B and C. To recapitulate, our analysis suggests that the US6 family definition should include only US7, US8, US9, and US11.
FIG. 2. Pattern-based alignments for four of the six members in the original US6 family (A); US6 with US8 from the original US6 family (B); US10 with US8 from the original US6 family (C); and US12, US13, US14, and US20 from the original US12 family (D). In all cases, only the amino acids which participated in the patterns that induced the alignment are shown in color based on their hydropathy.
FIG. 3. Pattern-based alignments for US17, US18, and US20 from the original US12 family (A) and of US22, US23, US24, and US26 (B) from the original US22 family.
FIG. 4. Pattern-based alignment of IRS1, TRS1, US22, and US23 of the original US22 family. Only the amino acids which participated in the patterns that induced the alignment are shown in color based on their hydropathy.
TABLE 2. Revised groupings of the originally defined HCMV gene families
One of the more intriguing predictions from this annotation exercise is the expectation that HHV-5 encodes a larger number of GPCRs than previously thought. UL33, UL78, US27, and US28 have been identified previously as encoding GPCRs (for a review, see reference 32). Our analysis strongly predicts that US12 and US14 encode GPCRs, and this is corroborated by good quality alignments with known GPCRs. An additional nine sequences, namely, the ORFs UL100, US13, and US15 to US21, each contain seven recognizable transmembrane domains; similarity searches indicate that the sequence fragments corresponding to each of the seven transmembrane regions of these nine ORFs match annotated transmembrane helices of known GPCRs but of distinct functional behavior. In other words, the transmembrane helices in each of these nine ORFs have not appeared as a group in a single GPCR (sub)family before. If these proteins prove to have GPCR activity, analysis of their physiological effects might provide important new insights to the function of this class of proteins since they are only distantly related in terms of their seqlet content to known GPCRs.
The possible hydroxylation of a set of HHV-5 proteins deserves further comment. Hydroxylation of proline residues influences protein assembly into a triple-stranded helix in the case of collagen (17) and directs the proteasome-dependent degradation of other proteins (6). It is possible, then, that the structure or half-life of one or more HHV-5 proteins is influenced by hydroxylation. Hydroxylation might influence the structure of UL61, which contains a domain that, like collagen, is predicted to form a triple helix and includes two possible hydroxylation sites. Further, since hydroxylation of hypoxia-inducible transcription factors controls the response of cells to changes in oxygen availability (6), one might speculate that hydroxylation of a viral protein might allow HHV-5 to sense and adapt to the oxygen environment of its host cell. For example, IRS1 and TRS1, two viral proteins with putative hydroxylation motifs, exhibit transcriptional activity in transfected cells (31, 38). Conceivably, their activity is modulated by hydroxylation.
Our analysis suggests numerous functional hypotheses based on similarities of domains within HHV-5 proteins to domains within other proteins of known function. For example, TRL4 is predicted to encode a transmembrane glycoprotein with similarity to the fas antigen ligand. Very little is known about this protein, although its mRNA is expressed in large quantities with early kinetics (14). There also is a report that an epitope-tagged (C-terminal) version of the protein is localized to the nucleolus within transfected cells (5). Our annotation does not anticipate a nucleolar localization, and it is possible that the reported localization was perturbed by the epitope tag. HHV-5 infection has been reported to induce the expression of the cellular fas ligand in certain cell types (9), and the expression of fas ligand has been proposed to assist in virus escape from immune surveillance (37). Consequently, the TRL4-encoded protein might prove to antagonize the host antiviral response by serving as a viral mimic of the cellular fas ligand. This hypothesis, as well as additional hypotheses generated by our annotation, can be tested experimentally.
In addition to reannotating the complete HCMV genome, we also reevaluated the original HHV-5 gene family definitions with the help of a pattern-based multiple sequence alignment algorithm and proposed new groupings for the original family members; we also identified cases in which the detectable sequence similarities observed in our analyses were not strong enough to support family membership. Our analysis with MUSCA of protein families was limited to a reevaluation of the families proposed by Chee et al. (8). Additional families might exist that could be identified by using pattern-based approaches.
The present analysis focused on the original annotated set of HHV-5 ORFs (8). It is possible that the genome contains substantially more coding ORFs than we have analyzed here. We employed the MacVector genome analysis program (version 7.0; Oxford Molecular Group), utilizing a human codon bias, to identify all ORFs within the HHV-5 AD169 genome that encode polypeptides that are at least 50 amino acids long. We also analyzed the genome by using the TESTCODE algorithm (12). These two filters identified ca. 700 ORFs encoding polypeptides >50 amino acids in size and starting with an AUG, as well as many more when the requirement for an N-terminal AUG was dropped (E. Murphy, I. Rigoutsos, and T. Shenk, unpublished data). We are currently utilizing the BDGF algorithm (33), a Bio-Dictionary-based gene finder, to validate our expectation that the HHV-5 genome contains substantially more coding ORFs than predicted in the original annotation.
This work was supported in part by grants from the National Institutes of Health to T.S. (CA82396, CA85786, and CA87661).