Journal of Virology, July 2004, p. 7291-7298, Vol. 78, No. 14
0022-538X/04/$08.00+0 DOI: 10.1128/JVI.78.14.7291-7298.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
As the number of viral records in the public sequence databases (GenBank, EMBL, and DDBJ) grows, retrieving a viral genomic sequence of interest with associated information is becoming increasingly complex. High redundancy in the databases is a common problem for all organisms; in the case of viruses, however, the large number of available strains, isolates, and mutants further exacerbates the problem. For example, a search of Entrez Nucleotide currently retrieves more than 95,500 records for Human immunodeficiency virus 1 (HIV-1) and more than 22,500 records for Hepatitis C virus (HCV) alone; the total number of viral nucleotide records exceeds 220,000. Among these are both partial and complete genomic sequences, including partial sequences marked as a complete genome by submitters. Historically, sequence databases were merely archives of sequences directly submitted by users. Although a stricter submission procedure has been applied in recent years and therefore the quality of sequence records has greatly improved, a significant number of records are still underannotated, and the information in the old sequence records is often outdated. Furthermore, viral genomes are remarkably variable, consisting of either single-stranded or double-stranded DNA or RNA in either linear or circular form and comprising one or more segments. This variability makes viral records especially prone to inaccuracies in molecular information annotation.
To cope with these problems, NCBI has created the Viral Genomes Project as a part of the NCBI Genomes Project (19). Only complete or, occasionally, nearly complete viral genomic sequences missing only nontranslated portions (usually the ends of a genomic molecule) are being collected for this project, thereby greatly reducing redundancy. All available complete viral genomic sequences are being collected in order to faithfully represent the great genome variability found in many viruses. For example, 314 complete genome sequences of HIV-1 from various strains and isolates are included in the Entrez Genome collection, but only one sequence (NC_001802) has been selected as a reference (RefSeq) to serve as a molecular standard.
RefSeq records are manually curated to correct and update content in the original sequence records, which often involves consultations with the original submitters and/or other outside experts. The collection of preselected reference sequences greatly facilitates comparison of the genomes of different viruses. As of December 2003, the Viral Genomes Project contained 1,677 viral reference genomic sequences representing 1,223 virus species, which make a significant contribution to the NCBI RefSeq collection (13). Figure 1 shows the growth of the viral RefSeq collection during the past 3 years.
FIG. 1. The growth of NCBI's Viral Genomes Project. The bars represent the numbers of new and all viral genome reference sequences in each quarter.
Some complete viral genome sequences are not detected by this automatic procedure because the source record either did not correctly indicate a circular topology or did not include the key words that NCBI curators use for automatic screening. To overcome this problem, viral sequences undergo an additional screening based on sequence length. In this procedure, all viral sequences are retrieved from GenBank and grouped by virus genus. If a genus has viral reference sequences, sequences longer than 90% of the length of the shortest reference sequence in the genus are selected. If a genus has no viral reference sequences, sequences longer than 90% of the length of the longest sequence in the genus are selected. NCBI curators then check the selected sequences for completeness.
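The length-based screening rule can be illustrated with a minimal Python sketch. It is not NCBI's pipeline: the `SeqRecord` type, the `screen_by_length` function, and the accessions in the usage note are hypothetical, and the sketch assumes each record already carries its genus, its length, and a flag marking reference sequences.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqRecord:
    """Hypothetical minimal view of a sequence record."""
    accession: str
    genus: str
    length: int
    is_reference: bool = False


def screen_by_length(records, fraction=0.9):
    """Return non-reference records long enough to be candidate complete genomes."""
    # Group all records by genus, as described in the text.
    by_genus = defaultdict(list)
    for rec in records:
        by_genus[rec.genus].append(rec)

    candidates = []
    for recs in by_genus.values():
        ref_lengths = [r.length for r in recs if r.is_reference]
        if ref_lengths:
            # Genus with references: compare against the shortest reference.
            threshold = fraction * min(ref_lengths)
        else:
            # Genus without references: compare against the longest sequence.
            threshold = fraction * max(r.length for r in recs)
        candidates.extend(
            r for r in recs if not r.is_reference and r.length > threshold
        )
    return candidates
```

For instance, in a genus whose shortest reference is 10,000 bases long, a 9,500-base record would be passed on to the curators and an 8,000-base record would not. Selection is only a prefilter; completeness is still checked by hand.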
Additionally, complete viral genome sequences are identified with the aid of external scientific advisors, experts on particular families or groups of viruses, who also assist in the curatorial process. The list of advisors and their contact information is available at http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viradvisors.html.
More than 200 viral sequences have been identified and added to the viral genomes collection with the advisors' help. We invite all virologists to contribute to the Viral Genomes public resource by sending us their suggestions concerning any virus or virus group (including those for which we already have advisors). To help the NCBI identify complete viral genomes, it is strongly recommended that submitters use the term "complete genome" in the definition of their sequences.
Taxonomic classification. The Viral Genomes Project is tightly linked with the taxonomy database. In this database, each organism or taxonomy node has its own unique identification number, a tax_id. Viral Genomes and GenBank sequence records contain tax_id information as well as organism names and taxonomic lineages. The Viral Genomes Project makes use of the tax_ids to build a taxonomical hierarchy in its tools and views (see Visualization).
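As an illustration of how tax_ids can drive a hierarchy, the following Python sketch walks parent links to recover a lineage and inverts them to list child nodes. The `NODES` table, `lineage`, and `children_index` are hypothetical names for this example, the table is a toy excerpt with intermediate ranks omitted, and nothing here reflects the actual implementation behind the Viral Genomes pages.

```python
from collections import defaultdict

# Toy excerpt: tax_id -> (parent tax_id, name). Intermediate ranks are omitted
# for brevity. The ids for Geminiviridae and BGYMV are the numbers that appear
# in the Fig. 3 URLs; 10239 is the NCBI Taxonomy root node for Viruses.
NODES = {
    10239: (None, "Viruses"),
    10811: (10239, "Geminiviridae"),
    220340: (10811, "Bean golden yellow mosaic virus"),
}


def lineage(tax_id, nodes):
    """Walk parent links from tax_id up to the root; return names root-first."""
    path = []
    seen = set()
    current = tax_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at tax_id {current}")
        seen.add(current)
        parent, name = nodes[current]
        path.append(name)
        current = parent
    return list(reversed(path))


def children_index(nodes):
    """Invert the parent links so each node lists its direct children."""
    index = defaultdict(list)
    for tax_id, (parent, _name) in nodes.items():
        if parent is not None:
            index[parent].append(tax_id)
    return index
```

Calling `lineage(220340, NODES)` on this toy table yields the path from the root to the species, and `children_index(NODES)` gives the per-node child lists needed to render a browsable tree.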
The NCBI taxonomy database contains names and classifications of more than 100,000 organisms for which sequence data are available. As of December 2003, it included the names of about 6,700 viral species and a total of about 9,300 individual viral names for individual strains, serotypes, isolates, and genotypes.
The names and classifications of viruses in the taxonomy database follow, to a large extent, the most recent report of the International Committee on the Taxonomy of Viruses, ICTV7 (18). As the ICTV reports appear infrequently, the NCBI taxonomy database attempts to stay current by also accepting new names and classification schemes on a case-by-case basis as provided in the reports of the ICTV executive meetings (9) and based on the advice of outside experts.
However, many sequence submissions are for viruses that are not listed in the ICTV report and sometimes not even described in the published literature. In spite of this, the taxonomy database can index these organisms and associated records. For example, the taxonomy database lists the TT virus (11) and bacteriophage Mx8 (unpublished), although they are not present in the ICTV report. Finding an appropriate taxonomic position for a virus usually involves comparative sequence analysis. The complete inventory of viruses and their classifications can be explored at the NCBI taxonomy web site at http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/.
RefSeq records are continually reviewed. Any existing RefSeq can be replaced with a better-studied, more practically important, and/or better-annotated genomic sequence. A recent review of the Flaviviridae family revealed that an old RefSeq record for HCV, based on the 1993 GenBank record D90208, was missing an important 82-base sequence at the 3' end, which was first detected in HCV in 1996 by Kolykhalov et al. (5). Therefore, a new RefSeq record (NC_004102) was created, based on GenBank sequence AF009606, which represents the first cDNA clone to produce infectious and pathogenic HCV RNA.
Only about half of provisional records indicate strain or isolate data; in other cases, NCBI gathers the information from the current literature or the ICTV resource.
A large part of the curatorial process involves improvement of genome annotation, which includes searches for missing genes, assignment of functional roles to protein products, correction of annotations for proteins expressed by frame-shifting or read-through, restoration of proteins disrupted by sequencing errors, and addition of information on the processing of viral polyproteins.
In collaboration with Mark Borodovsky, the GeneMark program (http://opal.biology.gatech.edu/GeneMark/VIOLIN) was used to predict open reading frames (ORFs) in all viral RefSeq genomes and to compare them with the original annotations. To date, almost 100 records have been manually annotated with additional GeneMark-predicted ORFs (10). Whenever possible, putative protein functions were inferred from the results of BLAST searches against Viral Clusters of Related Proteins (VOG; see next section) or the NCBI nonredundant protein database (nr). The new annotations were usually confirmed by additional information retrieved from the current literature. Examples include the complete genomic sequences of the large double-stranded DNA (dsDNA) viruses Lymphocystis disease virus 1 and Sheeppox virus. The original GenBank records, L63545 and AY077832, respectively, contained no annotation, with the exception of a coat protein in the former record. Subsequently, 157 and 147 protein coding genes were predicted by the GeneMark program in these genomes, respectively, and added to the corresponding reference sequences (records NC_001824 and NC_004002). The annotation of the Lymphocystis disease virus 1 RefSeq record (NC_001824) was further reviewed manually and compared with author-supplied annotation kindly provided by C. A. Tidona (17).
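The comparison of predicted ORFs with existing annotation can be sketched as follows. The matching rule used here (same strand and same stop-codon coordinate) is an assumption chosen for illustration, not a criterion stated by the authors, and `Feature`, `stop_key`, and `novel_predictions` are hypothetical names. Coordinates are 1-based and inclusive, with `start` less than `end` on both strands.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical coding feature; start < end regardless of strand."""
    start: int
    end: int
    strand: str  # "+" or "-"


def stop_key(feature):
    """Return (stop-side coordinate, strand) for a coding feature.

    On the plus strand the stop codon lies at the high coordinate;
    on the minus strand it lies at the low coordinate.
    """
    coord = feature.end if feature.strand == "+" else feature.start
    return (coord, feature.strand)


def novel_predictions(predicted, annotated):
    """Predicted ORFs whose stop position matches no annotated CDS."""
    known = {stop_key(f) for f in annotated}
    return [f for f in predicted if stop_key(f) not in known]
```

A prediction that shares its stop codon with an annotated CDS but starts at a different codon is treated as the same gene, so only predictions with no annotated counterpart are flagged for manual review.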
A significant number of problematic annotations existed in source records where viral proteins were expressed by frame-shifting or stop codon read-through mechanisms. For example, the correct precursor polyproteins were missing from many records of Retroid viruses (including the HIV-1 RefSeq NC_001802) and from a few families of positive-strand single-stranded RNA (ssRNA) viruses, such as Arteriviridae, Coronaviridae, and Astroviridae. Subsequently, corrections were made to the records, allowing for the mature peptides to be curated.
The annotation of viral polyproteins is a satellite project that will be described elsewhere. Briefly, it involves the following: (i) the production of alignments of related viral polyproteins from both RefSeq and Genome neighbor records, usually grouped by species or genera; (ii) the incorporation of cleavage sites available from the sequence databases or from the current literature into the alignment; (iii) the analysis of the alignment for (potential) cleavage sites in the reference sequences; and (iv) the annotation of the missing (predicted) cleavage sites in corresponding RefSeq records. For example, for the genus Flavivirus, 62 nonredundant polyproteins were aligned and 438 previously annotated cleavage sites were indicated in the alignment, which allowed for the prediction of an additional 368 putative cleavage sites. As many as 17 (of 20) flavivirus RefSeq records have been updated accordingly and provided with appropriate comments. Many RefSeq records for other viruses that exploit the strategy of polyprotein processing have been processed this way. The poliovirus (PV) RefSeq NC_002058 now has all 11 mature peptides (polyprotein processing products), whereas its source record, V01149, has only 3 mature peptides. Similarly, no mature peptides are present in the original record of Equine arteritis virus, X53459, the best-studied (in terms of molecular biology) representative of the order Nidovirales, while the corresponding RefSeq entry NC_002532 now contains 12 mature peptides. More data on alternative polyprotein processing will be added in the future.
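Steps (ii) to (iv) amount to projecting a known cleavage site from one aligned polyprotein onto another. The Python sketch below shows one simplified way to do that for a pair of gapped alignment rows. The function names and the toy sequences in the usage note are hypothetical, and the rule that both flanking residues must be aligned and adjacent in the target is an assumption made for this example; the curated procedure also relies on literature and manual review.

```python
def residue_to_column(aligned_row, residue_index):
    """Alignment column (0-based) of the residue_index-th (1-based) residue."""
    count = 0
    for col, ch in enumerate(aligned_row):
        if ch != "-":
            count += 1
            if count == residue_index:
                return col
    raise ValueError("residue index beyond end of sequence")


def column_to_residue(aligned_row, col):
    """1-based residue index at an alignment column, or None at a gap."""
    if aligned_row[col] == "-":
        return None
    return sum(1 for ch in aligned_row[: col + 1] if ch != "-")


def project_cleavage_site(annotated_row, target_row, p1_index):
    """Project a cleavage site (after residue p1_index) onto target_row.

    Returns the 1-based index of the residue preceding the predicted
    cleavage in the target, or None when either flanking residue falls
    in a gap or an insertion separates the two flanking residues.
    """
    col_p1 = residue_to_column(annotated_row, p1_index)
    col_p1_prime = residue_to_column(annotated_row, p1_index + 1)
    target_p1 = column_to_residue(target_row, col_p1)
    target_p1_prime = column_to_residue(target_row, col_p1_prime)
    if target_p1 is None or target_p1_prime is None:
        return None
    if target_p1_prime != target_p1 + 1:
        return None
    return target_p1
```

With the toy rows `"MKT-AGRSLL"` (annotated) and `"MRTQAG-SLL"` (target), a site after annotated residue 7 projects to target residue 7, whereas sites after residues 3 and 5 are rejected because of the insertion and the gap in the target.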
FIG. 2. NCBI Viral Genomes Project main page (URL, http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html). The general introduction section containing the Influenza A virus replication scheme is followed by information on the number of available reference sequences and complete viral genomes. The latter is also a hyperlink to the complete list of available viral genomes; this link is duplicated in the lower right side as "Complete alphabetical list." The numbers in the colored squares are the links to the lists of viral genomes grouped by major taxonomy divisions, where each number is the number of reference sequences in a particular division. Below is the virus group or family search box that allows one to retrieve the listing of all genomes belonging to a particular virus division, family, or floating genus. The lists for all of the taxonomy groups are available via the link "Genomes grouped by taxonomy" in the lower left-hand corner. Links from the blue sidebar menu include a general overview of viruses, virus reference sequences, statistics, "FAQ," "Advisors," "Help," nucleotide and protein sequence retrieval tools, and alphabetical and taxonomic lists of available viral genomes. There are also links to external sites dedicated to virus biology, taxonomy, and nomenclature, as well as to other sequence databases.
ssDNA viruses → Geminiviridae) and within the same taxonomy level (e.g., from one family to another). In addition, it allows one to study the genome of interest at different levels of detail, from an entire genome → [genome segment] → gene → translation product → protein domains (or mature peptides). The VOG pages, which provide access to curated clusters of related viral proteins, are cross-linked with the virus family and group pages and the genomic view pages, thereby showing the functional or evolutionary relationships of the viral genomes covered. To analyze a genome or to compare several genomes, one can select the group of interest and apply the tools and precomputed results provided by the Viral Genomes resource. For example, start with the ssDNA virus group that consists of six families. Figure 3 shows the steps taken in such an analysis for the family Geminiviridae.
FIG. 3. The Viral Genomes web pages for the family Geminiviridae. Arrows indicate the direction of links between the web pages. (A) List of complete viral genomes for the family (taxonomy listing) (http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/10811.html). Viruses are listed in alphabetical order under each genus. (B) The genome summary page for Bean golden yellow mosaic virus (BGYMV) (http://www.ncbi.nlm.nih.gov/genomes/taxg.cgi?tax=220340). (C) Graphical view (circular) for the DNA A RefSeq sequence of BGYMV (http://www.ncbi.nlm.nih.gov/genomes/framik.cgi?db=Genome&gi=10163 and http://www.ncbi.nlm.nih.gov/genomes/framik.cgi?db=Genome&gi=10162). (D) VOG page for the family (http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/shagog.cgi?data=ssdna.defl&clust=SSDNA&fam=G). (E) Graphical representation of BLAST results for one of the protein clusters (http://www.ncbi.nlm.nih.gov/genomes/blu_all.cgi?data=ssdna.defl&cog=VOGs0100&phage=2&9626471).
Graphical and textual alignments of BLAST results are available for each cluster (Fig. 3E), allowing for quick and in-depth analysis of sequence similarities among the members of the cluster.
The Graphic View page of a viral RefSeq is a good starting point for exploring a viral genome. The graphical view can be accessed directly by typing the name of a virus into one of the two search boxes (Fig. 2). Figure 4A shows part of such a page for Enterobacteria phage P2. This page displays a graphical view of genome features such as CDSs, RNA genes, protein coding genes, signals, and more. From the graphical view, one can go to the "protein view," which summarizes the coding region information for each DNA or RNA strand (Fig. 4B). Each protein is shown on this page as a rectangle colored according to the nucleic acid strand and hyperlinked to a corresponding BLINK page (Fig. 4C) that displays precomputed BLAST neighbors for this protein. A complementary list of VOG affiliations (if any) for all of the proteins encoded by this phage genome (Fig. 4D) allows one to get an instant impression of the functions of these proteins and to review each of the relevant VOGs in order to learn more about the functions and evolution of these and related proteins. Nucleotide and protein records in FASTA format or a protein table can be downloaded from a "Coding Regions" view (link shown on the left tool bar, Fig. 4A and B). To continue the comparison with other related genomes, one can return to the family page in one click.
FIG. 4. A closer look at a viral genome: Enterobacteria phage P2. (A) Graphical view for the RefSeq of Enterobacteria phage P2 (http://www.ncbi.nlm.nih.gov/genomes/framik.cgi?db=genome&gi=13510). The organism name on this page is linked to the corresponding page of the NCBI Taxonomy. The RefSeq accession number that starts with the letters "NC_" is linked to an Entrez record flat file view. There are also links to the following information: publications associated with the sequence (if any), the status (reviewed, validated, or provisional) of the genome record, and the hyperlinked accession number of the source sequence record. (B) A graphical view of proteins encoded in the phage genome (http://www.ncbi.nlm.nih.gov/genomes/altvik.cgi?gi=13510&from=0&to=33592&db=genome&x=15&y=17). (C) BLINK results for the Int protein (http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=9630357&cut=95). (D) List of Enterobacteria phage P2 proteins and their associated VOGs (http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/shagog.cgi?data=phogs.defl&clust=PHOG&org=Y).
The Viral Genomes Project is regularly updated as new data and more tools become available. Proteins from all viral reference sequences are to be subjected to VOG analysis.
The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.