Journal of Virology, January 2004, p. 424-440, Vol. 78, No. 1
0022-538X/04/$08.00+0 DOI: 10.1128/JVI.78.1.424-440.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Christoph J. Hengartner,2,
Thomas C. Mettenleiter,1 and Lynn W. Enquist2*
Institute of Molecular Biology, Friedrich-Loeffler-Institutes, Federal Research Centre for Virus Diseases of Animals, D-17493 Greifswald-Insel Riems, Germany,1 Department of Molecular Biology, Princeton University, Princeton, New Jersey 085442
Received 14 August 2003/ Accepted 20 September 2003
Besides its economic importance, PRV has proven to be an excellent model system for alphaherpesvirus biology (reviewed in references 26 and 47 to 49). In particular, the mechanisms involved in initiation of infection, virion morphogenesis and egress, and neuroinvasion and transneuronal spread are under intense examination. In this respect, studies of the molecular biology of PRV continue to provide insight into the mechanisms of alphaherpesvirus infection in vitro and in vivo.
The PRV genome is similar in arrangement to the genomes of EHV-1, BHV-1, and VZV, encompassing a unique long segment (UL) and a unique short region (US). The US region is bracketed by inverted repeat sequences, resulting in the formation of two possible PRV genome isomers with oppositely oriented US regions. Although this arrangement was detected some time ago (5, 19), its biological significance remains unclear. The genomes of PRV and the related HSV-1 are largely colinear, with the exception of an inversion of a portion of the UL region in PRV compared to HSV-1 (5). Again, the biological relevance of this inversion is not known.
Despite progress over the years, studies on PRV gene function and comparative virology have been hampered by the lack of a complete genome sequence. Although the first PRV DNA sequences were published in the mid-1980s (33, 55, 58, 61), the high G+C content of the PRV genome, averaging 74%, made reliable sequencing extremely difficult. Therefore, sequence determinations remained limited to fragments encompassing one or a few genes. In contrast, complete sequences have been determined for numerous other herpesviruses, including the alphaherpesviruses VZV (18, 29), simian varicella virus (30), EHV-1 (70) and EHV-4 (71), HSV-1 (46), HSV-2 (25), herpes B virus (cercopithecine herpesvirus 1) (54), MDV1 (72), and MDV3 (1). Of these, only the recently sequenced herpes B virus has a higher G+C content than PRV, averaging 74.5%. Given the difficulties in sequencing DNA with a high G+C content, we assembled a complete PRV genome sequence by compiling the available sequence information and sequencing the remaining gaps in the linear genome. The new sequences obtained included the left end of the UL region, the coding sequences of UL16 and UL17, and the coding sequence for the first exon of UL15. The complete annotated PRV sequence presented in this report is composed of sequences derived primarily from PRV strain Kaplan (38), but it also includes sequences from other strains, such as Becker (UL27/28, UL43/44, US7/US8/US9), Indiana-Funkhauser (EP0), NIA-3 (UL23, UL13/14, US2), Rice (US4), and TNL (UL29). Where available, multiple sequences for a given genome region originating from different strains were compared, and the variability among PRV strains was determined. In general, PRV sequences obtained from diverse strains from around the world are remarkably similar, providing confidence that the composite sequence will have utility.
In addition, a genome-wide search for transcriptional control elements yielded a striking picture of gene organization with important consequences for gene array analyses of alphaherpesviruses.
TABLE 2. Assembly of a complete PRV genome sequence
ORF search and analysis. All but a few PRV open reading frames (ORFs) with homology to other herpesvirus genes were already identified or proposed and named according to the gene nomenclature used for HSV-1 (47). Sequences comprising the PRV homologs of UL16 and UL17 as well as the first exon of UL15 are described here for the first time. To search for novel PRV-specific ORFs, the complete DNA sequence was analyzed with the program codonpreference (GCG software package, Wisconsin Package version 10.2; Genetics Computer Group [GCG], Madison, Wis.) and screened for ORFs with a high G+C content on the third nucleotide position of codons. All of the known functional ORFs of PRV are characterized by this high G+C bias (data not shown).
In addition, the complete genome was translated using the program Translate (GCG software package), and all ORFs with a minimum length of 60 codons and a methionine as start codon were analyzed for homology to known proteins using a FastA search (GCG software package) against the PIR protein database (release 68.0). As a third approach to identify new genes in the PRV genome, the sequence was submitted to GeneMarkS, a self-training program for prediction of gene starts (Georgia Institute of Technology [http://opal.biology.gatech.edu/GeneMark/genemarks.cgi]) (6).
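The six-frame ORF scan described above can be illustrated with a minimal sketch. This is not the GCG Translate program; the function names, the coordinate convention (0-based, reported on the scanned strand), and the decision to ignore ORFs that run off the end of the sequence are assumptions made for illustration. Only the two criteria stated in the text, an ATG start and a minimum of 60 codons, are taken from the source.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """List ATG-initiated ORFs in all six reading frames.

    Returns tuples (strand, start, end, n_codons). Coordinates are 0-based,
    end-exclusive, and given on the strand that was scanned (for "-", that is
    the reverse complement). n_codons excludes the stop codon. ORFs lacking a
    stop codon before the end of the sequence are not reported (assumption).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS:
                    if start is not None:
                        n_codons = (i - start) // 3
                        if n_codons >= min_codons:
                            orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs
```

Each reported ORF would then be translated and compared against a protein database, which is outside the scope of this sketch.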
Search for polyadenylation signals. The PRV genome sequence was submitted to PolyADQ, a eukaryotic (human) polyadenylation [poly(A)] signal search engine (Cold Spring Harbor Laboratory [http://argon.cshl.org/tabaska/polyadq_form.html]) (69). All cutoff parameters were initially set at zero to return the location of all AATAAA and ATTAAA consensus signals, along with an associated score between 0 and 1. For each potential poly(A) signal, all upstream genes were noted. The putative location of the actual site of poly(A) addition was presumed to be 20 bp downstream of the poly(A) signal. Experimental data for the poly(A) sites were collected from published reports. In the case of S1 nuclease mapping, the site was calculated from the reported DNA size, and the error on this measurement was assigned an arbitrary value of 5%. All predicted and experimental poly(A) sites were used to calculate the length of the 3' untranslated transcript region (UTR) of each gene.
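A minimal sketch of the consensus-signal step follows. It does not reproduce PolyADQ or its scoring; it only locates AATAAA and ATTAAA hexamers on both strands and places the presumed poly(A) addition site 20 bp downstream. Measuring the 20 bp from the end of the hexamer, the function names, and the strand-local 0-based coordinates are assumptions.

```python
import re

# Zero-width lookahead so that every occurrence is reported.
SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def polya_signals(seq, offset=20):
    """Return (strand, signal_start, motif, predicted_site) for each signal.

    Coordinates are 0-based on the scanned strand. predicted_site is placed
    `offset` bases after the last base of the hexamer (assumption).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for match in SIGNAL.finditer(s):
            start = match.start()
            hits.append((strand, start, match.group(1), start + 6 + offset))
    return hits


def utr3_length(orf_end, polya_site):
    """Length of the 3' UTR, from the end of the ORF to the poly(A) site.

    Both coordinates must be on the same strand, with orf_end end-exclusive.
    """
    return polya_site - orf_end
```

Assigning each signal to its upstream genes, as done in the text, additionally requires the ORF coordinates on the same strand.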
Promoter search. The PRV genome sequence was submitted to the Berkeley Drosophila Genome Project Neural Network Promoter Prediction program, a eukaryotic (human) core promoter search engine (http://www.fruitfly.org/seq_tools/promoter.html) (59). The initial search was performed at very high stringency (cutoff score of 0.99 out of 1.00). The program returned high-scoring core promoters (50-bp-long fragments) along with a predicted transcription start site (TSS). The core promoters found in this search and all later searches were examined for the presence of a TATA box consensus using the TRANSFACFind search engine (http://motif.genome.ad.jp/) (34). The stringency for the TATA box searches was relatively low, with a cutoff score of 65 (out of 100). Of 98 high-scoring core promoters, 52 predicted transcripts able to encode 46 of the 72 known PRV ORFs and 1 predicted the large latency transcript (LLT). To find promoters for the remaining 26 ORFs, a medium-stringency promoter search (cutoff, 0.80 out of 1.00) was performed on the 350-bp DNA fragments upstream of the ORFs, followed again by a search for a TATA box consensus. This medium-stringency search yielded promoter predictions for 21 more ORFs, but four of these promoters contained no TATA box and were discarded. Of the remaining nine ORFs without assigned promoters (ORF1.2, UL33, UL36, UL23, UL11, UL8.5, UL6, and the major and minor forms of US3), UL6 and the two US3 isoforms had well-mapped TSS (51, 74). Successful low-stringency searches (cutoff, 0.40 out of 1.00) for promoters matching these TSS left six ORFs without assigned promoters.
For each promoter, the predicted TSS location was noted and compared to experimentally determined TSS from published reports, if available. In the case of S1 nuclease mapping, the TSS was calculated from the reported DNA size, and the error on this measurement was assigned an arbitrary value of 5%. The minimal mRNA size, excluding the poly(A) tail, was calculated from the predicted TSS and poly(A) site of each gene.
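The comparison of predicted and S1-mapped start sites can be sketched as below. The text states only that a 5% error was assigned to the S1 measurement; applying that 5% to the reported fragment size, anchoring the fragment at a known probe end, and all function and parameter names are assumptions made for illustration.

```python
def s1_tss_window(probe_end, fragment_size, rel_error=0.05, strand="+"):
    """Return (low, high) genome coordinates bracketing an S1-mapped TSS.

    probe_end is the coordinate of the labeled 3' boundary of the protected
    fragment (hypothetical input); the TSS lies fragment_size bases upstream
    of it on the given strand, give or take rel_error * fragment_size.
    """
    shift = fragment_size if strand == "+" else -fragment_size
    center = probe_end - shift
    tolerance = rel_error * fragment_size
    return center - tolerance, center + tolerance


def tss_matches(predicted_tss, window):
    """True when a predicted TSS falls inside the experimental window."""
    low, high = window
    return low <= predicted_tss <= high


def minimal_mrna_size(tss, polya_site):
    """mRNA length from TSS to poly(A) site, excluding the poly(A) tail."""
    return abs(polya_site - tss) + 1
```

For a 200-nt protected fragment, for example, this convention gives a tolerance of 10 nt on either side of the calculated start site.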
The level of DNA identity between the Kozak consensus sequence (GCCGCCRCCATGG [44]) and the 13 nucleotides around the initiator ATG of each ORF was measured. The predicted TSS for each gene was used to calculate the expected length of the 5' UTR.
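The Kozak comparison amounts to counting identical positions over a 13-nt window. A minimal sketch, assuming the window is taken as 9 nt upstream of the ATG, the ATG itself, and 1 nt downstream (which is how the consensus GCCGCCRCCATGG is laid out), and treating R as A or G; the function name is hypothetical:

```python
KOZAK = "GCCGCCRCCATGG"
# Bases accepted at each consensus symbol; R is the IUPAC code for a purine.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def kozak_identity(seq, atg_index):
    """Count positions (0 to 13) matching the Kozak consensus around an ATG.

    atg_index is the 0-based position of the A of the initiator ATG in seq.
    """
    start = atg_index - 9
    if start < 0 or start + 13 > len(seq):
        raise ValueError("need 9 nt upstream and 1 nt downstream of the ATG")
    window = seq[start:start + 13].upper()
    if window[9:12] != "ATG":
        raise ValueError("atg_index does not point at an ATG")
    return sum(base in IUPAC[code] for code, base in zip(KOZAK, window))
```

Because the ATG itself always contributes 3 matches, the informative range of the score is 3 to 13.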
Search for splice sites and repeated elements. The PRV genome sequence was submitted to the Berkeley Drosophila Genome Project Splice Site Prediction by Neural Network, a eukaryotic (human) search engine for donor and acceptor splice sites (http://www.fruitfly.org/seq_tools/splice.html) (59). A search was performed at high stringency (cutoff score of 0.95 out of 1.00), and all consecutive donor and acceptor sites were noted and examined. Only one donor-acceptor pair was found within the predicted transcripts.
A search for repeated DNA regions was performed visually by comparing the genomic sequence to itself, using the two-dimensional plot output from a Pustell DNA matrix analysis. A DNA identity scoring matrix was used with the following search parameters: window size of 30 nucleotides, 90% identity, hash value of 6, and jump value of 1, both-strands comparison. Repeated DNA regions were recognized by their characteristic diagonally hatched box shape.
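The window-based self-comparison behind such a dot plot can be sketched as follows. This is not the Pustell implementation: the hash (seed) step and the reverse-strand comparison are omitted, only direct repeats are reported, and the function name is hypothetical. The window size of 30 and the 90% identity threshold are the values given in the text.

```python
def repeat_dots(seq, window=30, min_identity_pct=90):
    """Return (i, j) pairs, i < j, whose windows are sufficiently identical.

    A pair is reported when seq[i:i+window] and seq[j:j+window] agree at
    min_identity_pct percent of positions or more. Each diagonal (fixed
    offset j - i) is scanned with a running match count.
    """
    seq = seq.upper()
    needed = (window * min_identity_pct + 99) // 100  # integer ceiling
    dots = []
    for offset in range(1, len(seq) - window + 1):
        # match[k] is True when seq[k] equals seq[k + offset]
        match = [a == b for a, b in zip(seq, seq[offset:])]
        count = sum(match[:window])
        for i in range(len(match) - window + 1):
            if count >= needed:
                dots.append((i, i + offset))
            if i + window < len(match):
                # slide the window one base along the diagonal
                count += match[i + window] - match[i]
    return dots
```

Plotting the returned pairs as points gives the two-dimensional picture in which tandem repeats appear as blocks of short parallel diagonals. The cost grows with the square of the sequence length, so a genome-sized input would need the seeding step that this sketch leaves out.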
Nucleotide sequence accession number. The complete, annotated DNA sequence is available from GenBank under the accession number BK001744. An annotated PRV genome, containing a detailed referenced description for each gene, is also available at the Los Alamos sequence database for sexually transmitted diseases (http://www.stdgen.lanl.gov). The latter genome database will also be linked to a future PRV gene expression database at Los Alamos National Laboratories (http://www.herpes.lanl.gov/).
TABLE 1. Pairwise protein-coding DNA sequence comparison of PRV strains
FIG. 2. Predicted PRV transcript and gene organization. The linear form of the PRV genome is constituted of the larger unique long (UL) sequence to the left, and the smaller unique short (US) sequence flanked by the inverted repeats, IRS and TRS. The predicted locations of PRV ORFs (Table 3), 5' and 3' UTRs (Tables 4 and 6), DNA repeats (Table 5), splice sites (Fig. 1), and origin of replication are shown.
A few alphaherpesviruses, including HSV-1, HSV-2, BHV-1, herpes B virus, and PRV, have evolved genomes with a relatively high G+C content (68 to 74%). In these genomes, there is a pronounced periodicity in triplet base composition in the protein-coding sequences. The third codon position is particularly biased towards G or C, while the second position has the lowest G+C incidence. Since the third position is the most flexible concerning the amino acid encoded, the third-position nucleotides have evolved to contribute the most to the high G+C content of these genomes. The second position, on the other hand, is the most critical for specifying the amino acid, and as such the second-position nucleotides maintained a more moderate G+C content. The PRV genome sequence was analyzed with the codonpreference program, a frame-specific gene finder that can recognize protein-coding sequences by virtue of the G+C composition in the third position of each codon (31). All known functional ORFs were easily identified by this method, and no additional, hitherto unknown, ORFs were found. However, this method cannot detect smaller ORFs located completely within a larger ORF, whether on the sense or antisense strand. Therefore, the genome DNA sequence was translated in all six reading frames for further analysis. More than 380 ORFs with a coding capacity of more than 60 amino acids were identified: 194 were found on the top strand and 189 were on the bottom strand. A search for cellular or viral homologs of these ORFs failed to find any significant match, and none of these ORFs was considered a strong candidate for a new gene.
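The triplet periodicity described above can be measured directly on any in-frame coding sequence. The sketch below is not the codonpreference program; it only computes the G+C fraction at each of the three codon positions, and the function name is hypothetical.

```python
def gc_by_codon_position(orf):
    """Return the G+C fraction at codon positions 1, 2 and 3 of an ORF.

    The sequence must start in frame; a trailing partial codon is ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    fractions = []
    for position in range(3):
        bases = orf[position:usable:3]
        fractions.append(sum(base in "GC" for base in bases) / len(bases))
    return tuple(fractions)
```

For a genuine coding region in a high-G+C genome such as PRV, the third value would be expected to be the highest and the second the lowest, while a noncoding or out-of-frame stretch should show no such ordering.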
To confirm our analysis, the PRV genome sequence was submitted to GeneMarkS, an ORF prediction program whose algorithm combines models of protein-coding and noncoding regions with models of regulatory sites near gene starts (6). The PRV genes predicted by GeneMarkS matched those described in Table 3 very closely, with the following exceptions. UL26.5 and UL8.5 were not identified, since the two ORFs are located completely within another gene. The UL15 gene was not predicted to be spliced, probably due to the low conservation of the splice site (see details in "Search for splice sites," below). Genes coding for UL50, UL37, UL11, UL3, and US7 were predicted to be marginally shorter, starting at an internal ATG, while no prediction at all existed for the UL2 ORF. Finally, four new ORFs (data not included) were predicted, but further analysis failed to provide much support for their existence: no significant protein homologs were found for any of them, and a search for possible upstream promoters turned out negative as well (see details in "Search for promoters," below).
TABLE 3. PRV ORFs
As concerns ORF-1, experimental data indicate that there is an upstream in-frame extension, designated ORF1.2, with probable start codons at positions 1252 or 1375 (unpublished data). All but three PRV genes (ORF-1, ORF1.2, and UL3.5) have homologs in HSV-1. ORF-1 and ORF1.2 are located at the left terminus of the PRV UL region and show only homology to the first ORF of EHV-1 strain Ab4 (3). UL3.5 is conserved in many alphaherpesviruses, including BHV-1, EHV-1, VZV, ILTV, and MDV, but not HSV-1 or HSV-2. In marked contrast, a number of HSV-1 genes do not seem to have a PRV counterpart: US5 (gJ), US8.5, US10, US11, US12, γ134.5, ORF P, ORF O, UL9.5, UL10.5, UL20.5, UL27.5, UL43.5, UL45, UL55, and UL56 (63).
Systematic search for core elements of gene expression control. Initially, all available DNA sequences were examined for their annotated information. While this approach yielded a complete and consistent annotation of ORFs, it failed to provide a complete picture of transcriptional elements and DNA repeats. We therefore took a systematic approach to search for these elements. Most, if not all, genes in the HSV-1 genome are transcribed as capped and polyadenylated mRNAs by host RNA polymerase II (64). It is widely assumed that the homologous genes in PRV are similarly transcribed. Computer prediction programs were used to identify RNA polymerase II transcriptional control elements, including core promoters, splice sites, and polyadenylation sites. A visual search for short repeat elements was also performed.
Search for transcription polyadenylation signals. Two sequence elements make up the core of mammalian 3' mRNA processing signals directing mRNA cleavage and polyadenylation. The first element, located 10 to 30 bases upstream of the cleavage site, is the conserved poly(A) signal AAUAAA and is found in 90% of all sequenced polyadenylation sites. In the remaining 10%, the sequence found differs only by a single substitution, with AUUAAA the most common variant. The second element is the downstream element (DE), a U- or GU-rich sequence located 20 to 40 bases after the cleavage site (reviewed in reference 16).
The PolyADQ program was used to search for all potential polyadenylation signals in the PRV genome. This program was designed to detect and evaluate potential poly(A) signals in human DNA sequences using weight matrices for base composition and position in the DE (69). Table 4 lists the results by gene along with an associated score between 0 and 1 that primarily reflects the presence of a consensus DE. The table lists the genes directly upstream of the poly(A) signals and the length of the predicted 3' UTR.
TABLE 4. PRV polyadenylation signals predicted by PolyADQ
TABLE 6. PRV core promoters predicted by neural network
Search for splice sites. Splicing of mRNA involves the recognition of acceptor and donor sequences by the spliceosome. We searched for splice donor and acceptor sites in PRV genes by using a neural network splice site prediction program conditioned for human splice site recognition. Sequences from cDNA had established the existence of three introns in PRV so far: two in the 5' UTR of US1 (27) and one in the LLT (13). A stringent search of the entire PRV genome found only one splice donor-acceptor pair in all the predicted PRV transcripts, matching the coordinates of the second intron in the 5' UTR of US1. The search failed to accurately predict the other two known introns and a putative PRV intron in UL15, a homolog of the spliced UL15 gene of HSV-1. UL15 is made up of two exons and is well conserved among herpesviruses. PRV and HSV-1 UL15 possess similar exon lengths, strong protein sequence homology, and a good DNA sequence homology at the donor and acceptor sites. The DNA sequences of splice donors and acceptors for PRV UL15, US1, and LLT compare favorably to the eukaryotic consensus (Fig. 1). Remarkably, the predicted UL15 splice donor site (PRV Ka) does not contain the invariant GT dinucleotide at the start of the intron. Whether this predicted donor site is really functional remains to be determined, but it is worth noting that identical splice sequences were found for UL15 in the Ea strain (GenBank accession no. AY189899), a recent PRV isolate from Wuhan (China) (12).
FIG. 1. Comparison of the PRV splice donor and acceptor site sequences and the mammalian consensus. Intron sequences are underlined, and the locations of the sites in the PRV genome are indicated. For US1, the site locations given are for the IRS gene copy. For the TRS copy of US1, the sites are located as follows: US1 intron 1 donor (129209 to 129201) and acceptor (129103 to 129085); US1 intron 2 donor (129035 to 129027) and acceptor (128895 to 128877). M = A or C; R = A or G; Y = C or T.
TABLE 5. DNA repeats
Search for promoters. A human core promoter prediction program was used for an initial high-stringency search of the entire PRV genome, finding core promoters for 47 of the 73 genes. A search for the nearest consensus to a TATA box in these promoters was performed, and it found them all located 34 to 29 bp upstream of the predicted TSS. To find promoters for the remaining 26 genes, the search parameters were relaxed and the upstream 350 bp of all ORFs were analyzed, yielding promoters and TATA box predictions for all but 6 of the 73 genes. The genes regulated by a given promoter were defined by examining the translation product of the predicted transcripts. Table 6 lists the results by genes, along with the TATA and TSS locations and the associated promoter score (between 0 and 1). Unless performed at very high stringency, the searches often identified more than one putative promoter for a given ORF, in close proximity to each other. As such, the experimentally measured mRNA sizes, with their low precision and only rough estimation of the size of poly(A) tails, were of no help in validating our particular promoter predictions. In contrast, S1 nuclease transcript mapping and primer extension data can accurately assess the 5' end of transcripts (TSS location), and they provided a useful test for the validity of our promoter predictions. The predicted and experimentally determined TSS locations and mRNA sizes are indicated in Table 6. Predicted mRNA sizes relied on the data in Tables 4 and 6, while the 5' UTR length was calculated using the predicted location of the TSS. Table 6 describes the experimental evidence that located the TSS for 23 PRV genes, with our predicted TSS locations matching 19 of the 23. The degree of DNA identity between the sequences surrounding each ORF's start and the Kozak consensus is also indicated.
Overall genome structure and control of gene expression. Figure 2 is a visual summary of data contained in Tables 3, 4, 5, and 6, depicting the arrangement of the 73 genes (72 ORFs and the LLT) and their predicted transcripts in the PRV genome. The genome is organized in a UL region of 101.1 kb and a US region of 8.7 kb. The US region is bracketed by the IRS and TRS, two large inverted repeats 16.8 kb in length. Since the UL region is not flanked by inverted repeats, the PRV genome exhibits the typical D class herpesvirus genome structure also found in VZV, BHV-1, EHV-1, EHV-4, and ILTV (62). The gene content and arrangement in the PRV genome are similar to those of HSV-1 and the other alphaherpesviruses. Indeed, the PRV genome is colinear with these viruses except for an internal inversion of 39 kb extending from UL27 (gB) to UL44 (gC) (5, 7, 23). A similar inversion is also present in the genome of ILTV, extending from UL22 to UL44 (79).
A large portion of the genome (over 83%) serves as template for transcripts. The abundance of coterminal transcripts (48 of 73 genes) was readily apparent. Seven of the 11 repeat regions could be seen separating convergent transcripts, while one set of convergent transcripts was predicted to overlap (194 bases) at their 3' end (UL30/UL31). Divergent transcripts in close proximity to each other were observed 13 times. Divergent transcripts were predicted to have short overlaps at their 5' end in five cases (82 to 282 bases), raising the possibility of mutual negative regulation: an increase in the transcription of one gene would reduce the transcription of the other. Nonoverlapping divergent transcripts occurred in the eight other cases, with six cases sharing the same TATA box (bidirectional TATA, noted in Table 6). In two other cases, the TATA elements were within 100 bp of each other and the genes may be coregulated by the same regulatory factors bound in proximity (noted in Table 6). All six bifunctional TATA-poly(A) sites (Table 6) resulted in the appearance of transcripts arranged in a head-to-tail fashion. Finally, completely overlapping genes transcribed in opposite orientations were seen in four cases (IE180/LLT, EP0/LLT, UL15/UL16, and UL15/UL17). Simultaneous transcription of both strands seems unlikely in some cases, as the genes are predicted to be expressed at different times and in different tissues. LLT is only expressed in latently infected neurons, while IE180 and EP0 are expressed early during productive infection. The timing of UL15 gene expression may well overlap with that of UL16 and UL17, since all three homologous HSV-1 proteins are believed to be involved in the same process of capsid maturation and assembly later in infection.
Origins of replication. Figure 2 shows the three well-defined origins of replication found in PRV: OriL, located between UL21 and UL22 (76), and OriS, located in the IRS and TRS upstream of US1 (27). OriL and OriS contain the same sequence features: two inverted copies of the UL9 (OBP) binding sequence (GTTCGCAC) separated by a 43-bp AT-rich spacer sequence (76% A+T) (27, 41). This basic arrangement was present once in OriL and was found as three imperfect repeats in OriS, and it is very similar to the palindromic arrangement described for HSV-1 OriL and OriS (63).
An additional origin of replication had previously been proposed to be located in the BamHI-14' fragment, the 1.3-kb terminal end of the PRV UL region (76). However, our sequence analysis found only one UL9 consensus binding sequence in this region, at position 1243 to 1250. The PRV genome contains two more single UL9 protein recognition sequences, at positions 25580 to 25587 and 34847 to 34854. None of the three is adjacent to an AT-rich stretch of DNA. Therefore, it is questionable whether any of these has the potential to function as an origin of replication.
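The distinction drawn here between a full origin and an isolated UL9 site can be expressed as a simple pattern search. The sketch below looks for two UL9 binding sequences (GTTCGCAC) in opposite orientations flanking a short A+T-rich spacer. The spacer length limit and the A+T threshold are illustrative assumptions loosely based on the 43-bp, 76% A+T spacer described in the text, and the function names are hypothetical.

```python
import re

UL9_SITE = "GTTCGCAC"
UL9_SITE_RC = "GTGCGAAC"  # reverse complement of the binding sequence


def find_ul9_sites(seq):
    """Return sorted (position, orientation) pairs for every UL9 site."""
    seq = seq.upper()
    hits = []
    for orientation, motif in (("+", UL9_SITE), ("-", UL9_SITE_RC)):
        for match in re.finditer("(?=" + motif + ")", seq):
            hits.append((match.start(), orientation))
    return sorted(hits)


def candidate_origins(seq, max_spacer=60, min_at=0.7):
    """Return (start, end, spacer_length, spacer_at_fraction) tuples.

    A candidate is a pair of neighboring UL9 sites in opposite orientations
    separated by a spacer of at most max_spacer bases whose A+T fraction is
    at least min_at (both thresholds are assumptions).
    """
    seq = seq.upper()
    hits = find_ul9_sites(seq)
    candidates = []
    for (pos1, ori1), (pos2, ori2) in zip(hits, hits[1:]):
        spacer = seq[pos1 + len(UL9_SITE):pos2]
        if ori1 != ori2 and 0 < len(spacer) <= max_spacer:
            at_fraction = sum(base in "AT" for base in spacer) / len(spacer)
            if at_fraction >= min_at:
                candidates.append(
                    (pos1, pos2 + len(UL9_SITE), len(spacer), at_fraction)
                )
    return candidates
```

Under these criteria a lone UL9 site with no inverted partner and no adjacent AT-rich stretch, like the three single sites discussed above, yields no candidate.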
Survey of PRV genome sequence and gene content. The genome sequence data were assembled from the sequence fragments available in the GenBank database and completed by sequencing of the remaining gaps. While the completed sequence was derived from more than one strain source (Table 2), a DNA sequence analysis showed the PRV strains to be closely related (Table 1). An evaluation of the gene content of PRV found ORF1.2 as an additional ORF to those described in reference 47, though the complete coding sequences for UL15, UL16, and UL17 were unavailable at that time. The PRV genome is thus proposed to encode one LLT and 72 genes that encode 70 different proteins (Table 3). The genes encoding the US1 and IE180 proteins are present twice, once in the IRS and once in the TRS. The major and minor forms of US3 are treated as separate genes with distinct functions (73).
While the search for new PRV protein-coding genes found no convincing candidates, it is possible that PRV contains additional genes. We found 10 poly(A) orphan signals and discarded 3 of them because of extremely low scores. A significant number of promoters not assigned to known PRV genes were also found, even at the highest-stringency search. However, the predicted translation products of these putative transcripts tended to be small or preceded by an uncharacteristically long 5' UTR (data not shown). Thus, it is conceivable that several of these small ORFs are expressed or that non-protein-coding transcripts exist.
Computer searches of transcriptional control elements. We searched for transcriptional control elements in the PRV genome, including core promoters and TATA boxes, splice sites, and polyadenylation sites, using computerized prediction tools. The use of these programs relied on two assumptions: (i) that the core transcriptional elements between pigs and humans would be conserved, and (ii) that the core transcriptional elements of virus and host would be very similar.
poly(A) signals. Our poly(A) signal assignment to upstream genes implied that most or all poly(A) signals had been found and that the poly(A) signals found were all functional. We further assumed that focusing on the common consensus signals AAUAAA and AUUAAA would be sufficient, even though other variations, while rare, are known to exist (16). The experimental data in Table 4 supported these assumptions with the following two exceptions: (i) the UL19 cDNA sequence and mRNA size strongly suggest that a UL19 transcript uses an uncommon poly(A) signal, ATATAAA (77). While this sequence motif was found three more times in the PRV genome, it never affected any of our transcript predictions. (ii) The mRNA size of UL5 and the evidence that UL5 and UL4 transcripts are coterminal invalidate a functional poly(A) signal immediately downstream of the UL5 coding sequence (top strand, nucleotides [nt] 91895 to 91900) (21). It was also noted that had this poly(A) signal been functional, it would have prevented the transcription of the full-length UL4 from our predicted promoter (Table 6), as the signal is actually located in the UL4 ORF. Dean and Cheung (21) have hypothesized that transcription from the UL4 promoter might preclude the efficient use of this poly(A) signal. This is, most likely, a unique case, since all remaining experimental data agree with our predictions.
Experimental data on transcript size or 3' transcript location exist for 58 of the 73 PRV genes, and 56 agree with our poly(A) predictions (96% accuracy). Similar to what had been observed with HSV-1 (45), the 3' end of mRNA predicted from the location of poly(A) consensus sequences was much more reliable than the predictions of promoters and mRNA splice sites. PRV is proposed to have 44 poly(A) sites for 73 genes, while a previous analysis in HSV-1 proposed 46 poly(A) sites for 70 genes (45). The same analysis also predicted that HSV-1 transcripts were organized as 24 singlet transcripts and 19 coterminal families, highly similar to PRV's predicted 26 singlet transcripts and 18 coterminal families.
The PolyADQ scores for the various poly(A) signals were found to have very limited predictive value, and we offer three potential explanations. First, the PolyADQ weight matrices examine the first 100 bases downstream of the poly(A) signal to gauge the presence of a consensus DE. Sequences outside this window may play an important role. Second, the weight matrices were established with a limited set of false and true poly(A) signals: 81 true and 258 false AATAAA signals and 17 true and 204 false ATTAAA signals. Finally, the weight matrices were derived from human cDNA sequences, while we are examining a genome of a porcine virus.
Promoters, TATA elements, and splice sites. Our promoter prediction approach found 72 possible promoters for 67 of the 73 genes in PRV (Table 6). In five cases (UL49.5, UL42, UL39, UL37, and UL4), two good scoring promoters were found for each ORF. It is possible that regulatory transcription factors, DNA accessibility, or competition between the two promoters favors one over the other. Alternatively, both promoters may be used and even differentially regulated: each promoter could be used at different times during infection or function in specific cell types. The experimental evidence derived from the analysis of the 5' end of the major (M) and minor (m) UL37 transcripts agrees with our prediction of two distinct promoters, though we could not predict their different relative strengths. In the absence of better predictive tools that take into account more than just the basic core of the promoter sequences, we are unable to resolve how these dual promoters are used.
The promoter assignment to each gene assumed (i) that the first ATG after the TSS would be used, (ii) that there would be no splicing in the 5' UTR, with the exception of the reported case in US1 (27), (iii) that all promoters would contain a TATA-like element, and (iv) in the lower-stringency promoter search that the 5' UTR would be smaller than 310 nt.
Except for US9, none of the promoters found contained an intervening ATG before the predicted ORFs (Table 6). A direct comparison of the DNA sequences around the first ATG and the 13 nt of the Kozak consensus showed seven or more bases to be identical at most genes. In the few cases where the identity was lower, an in-frame ATG closer to the consensus was invariably found in the next 200 nt, which may indicate an additional or the true translation start site. This is the case for US9 (9): a downstream ATG close to the Kozak consensus (11 of 13) is used instead of a more divergent ATG (7 of 13) 24 nt upstream. The predictive value of these sequence comparisons is limited by two factors. The nucleotides adjacent to the ATG are known to be more important (purine at position -3 and G at position +4; CCA/GCCATGG) for efficient translation than the rest of the Kozak consensus (44), and the secondary structure of the RNA can affect the efficiency of translation (52).
Only a few genes have been found to be spliced in alphaherpesviruses. They are usually immediate-early or latency genes, as splicing is generally inhibited late in productive infections (67). A notable exception to this general rule seems to be UL15, whose spliced mRNA can be detected late (6 h postinfection) during HSV-1 infection (17).
All but three promoters (US1, UL12, and UL32) predict 5' UTR lengths under 300 nt. Furthermore, herpesvirus genes are generally reported to contain a 5' UTR 30 to 300 nt long (63).
Recent database analyses of Drosophila and human core promoters had found that only 30 to 40% of the promoters contain a TATAAA consensus or a sequence with one mismatch from the consensus (reviewed in reference 68). While it is possible that some TATA-less promoters exist in PRV, we have found a TATA-like consensus in almost all core promoters by using relaxed search parameters. These TATA-like elements were invariably located 34 to 28 nt upstream of the TSS. The finding of TATA-like elements at the predicted position is biologically significant and not the result of any preprogrammed bias for TATA elements in the promoter prediction program, as the neural network was trained with a set of naturally occurring core promoter sequences 51 bp long (-40 to +11 relative to the TSS).
The six genes without a predicted core promoter may either contain a long or spliced 5' UTR, a TATA-less promoter, and/or a poorly scoring promoter. Because the human core promoter sequences used to train the program included little or no sequences downstream of the TSS, the program did not consider any contributions from the DPE. In addition to the core promoter elements, a number of highly variable sequence elements are located upstream of core promoters and serve to regulate transcription. Clearly, the prediction scores of the various promoters do not take into account the absence or presence of such variable elements or of the DPE. Still, the predictions derived from our approach have already been useful in building a near-complete map of transcripts in the PRV genome (Fig. 2).
Our predicted start sites matched the experimental data fairly well, though fewer data were available for the location of the 5' end than for the 3' end of transcripts. Three types of experimental TSS data were available: (i) primer extension data yielding a precise 5' location but very dependent on probe location and specificity and often subject to differing interpretations of data; (ii) primer extension data with two primers, the second primer increasing data reliability; and (iii) S1 analysis, which mapped the 5' end of transcripts with less precision but with excellent reliability. All predicted TSS matched those mapped by primer extension analysis with two primers or by S1 analysis (10 matches). The predicted TSS matched only 7 of the 11 TSS mapped by primer extension using a single primer. The discrepancy between predicted and experimental results could not be resolved: the primers used were often located too close to or too far from the TSS to pick up our predicted TSS. Moreover, the longest extended products were always chosen as representative of the transcript start to map the TSS despite the presence of abundant extension products of smaller sizes. While the total predicted and experimental TSS locations matched 17 times out of 21, the true accuracy rate of our predictions is likely to be closer to 80% (16 of 20), since the TSS match for the two copies of IE180 was counted twice. Because the promoters for UL6 and the minor and major forms of US3 were found based on the experimental TSS locations, they are not counted as positive matches.
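The tally behind these figures can be reproduced with a few lines of Python, using only the counts stated above. The variable names are illustrative, and the adjustment simply removes one match and one comparison to account for the duplicated IE180 copy.

```python
# (matches, comparisons) for each class of experimental TSS data.
two_primer_or_s1 = (10, 10)  # primer extension with two primers, or S1 analysis
single_primer = (7, 11)      # primer extension with a single primer

hits = two_primer_or_s1[0] + single_primer[0]
total = two_primer_or_s1[1] + single_primer[1]
print(f"raw: {hits}/{total} = {hits / total:.1%}")  # raw: 17/21 = 81.0%

# IE180 is present in two copies, so its match is counted only once here.
adj_hits, adj_total = hits - 1, total - 1
print(f"adjusted: {adj_hits}/{adj_total} = {adj_hits / adj_total:.1%}")  # adjusted: 16/20 = 80.0%
```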
It had been noted in HSV-1 that TATA boxes or other promoter elements were, by themselves, of little predictive value in identifying mRNA start sites (45). Our promoter predictions were more successful, largely due to advances of the last decade: highly improved predictive core promoter programs, along with more extensive and detailed databases of known core promoters. The neural network promoter prediction is particularly useful when used in conjunction with defined parameters, such as known ORF locations and mapped TSS.
Two new core element features have been discovered by our analysis: (i) the bidirectional TATA box (occurring six times), predicted to be shared by two overlapping promoters of oppositely transcribed genes, and (ii) the bifunctional TATA-poly(A) signal, a TATA box that also serves as the polyadenylation signal for a gene upstream (occurring six times). The available experimental evidence suggests that both features exist. The mapped TSS for UL37 (M) and UL38 are located 50 bp apart and closely agree with our predicted promoters and our bidirectional TATA box (Table 6). Likewise, the mapped TSS for UL5 and UL6 also agree with our predicted promoter and bidirectional TATA box (Table 6). The mapped TSS for US2 (Table 6) is 6 bp from the sequenced end of the US9 and US8 transcripts (Table 4), providing support for the predicted bifunctional TATA(US2)-poly(A) signal (US8/US9). Finally, two cases of divergent transcripts with TATA boxes within 100 bp of each other were also noted, which may indicate shared regulatory elements.
In bidirectional TATA boxes, the TBP may bind in either orientation to the same sequence. The binding orientation of TBP then determines at which start site the transcription preinitiation complex assembles and which of the two genes will be transcribed. In support of the idea of bidirectional TATA boxes, TBP itself has been found to bind a TATA box consensus in both orientations in solution, with only a small preference for the correct orientation. Furthermore, recent studies suggest that the dominant mechanism in determining the direction of transcription may be the activator-enhanced polarity of TBP binding (reviewed in reference 68). Bidirectional TATA boxes also suggest a simple regulatory mechanism whereby increased expression from one gene lowers the expression of the other gene.
Features of transcript architecture conserved in alphaherpesviruses. The gene architecture is well conserved among alphaherpesviruses and can be defined by conserved blocks of genes that show homology in their protein-coding sequence and their position relative to each other. This conservation extends to details of the transcriptional architecture itself. BHV-1 is the closest known relative of PRV, and the transcription termination sites found in the annotated BHV-1 genome predict virtually the same arrangement of singlets and coterminal families that we predict for their homologs in PRV, with few exceptions. Indeed, the largest two coterminal transcript families have clearly been demonstrated to occur in both BHV-1 and PRV: UL1, UL2, UL3, and UL3.5 (20, 39) and UL24, UL25, UL26, and UL26.5 (23, 32). Similarly, the predicted HSV-1 transcript arrangement is highly homologous to the one predicted for PRV (46). As more alphaherpesviruses are examined, a picture of conserved transcriptional features is likely to emerge, including which genes are spliced, arranged in coterminal clusters or transcribed in overlapping and opposite directions. The conservation of many of these features among several viruses suggests that the transcript arrangement is critical for the viral life cycle, probably by properly regulating viral gene expression.
Significance of the transcriptional architecture for microarray analysis. Coterminal genes and oppositely transcribed regions present a first challenge for the microarray analysis of gene expression, not just in PRV but in related alphaherpesviruses as well. The array probes commonly used are often complementary to ORFs and many are guaranteed to hybridize to different overlapping transcripts, precluding the simple assignment of signal intensity from one array spot to one gene. Mapped transcript boundaries will not only help in understanding the proper source of array spot signals but will also allow the judicious positioning of probes to regions unique to one or just a few transcripts. The high G+C content of PRV and the long 3' UTR of many mRNAs present a second challenge, as these two factors can hinder the synthesis of labeled cDNA strands of sufficient length to encompass the ORF-based probes when oligo(dT) primers are used. The low signals for such genes are likely to be misinterpreted as indicating low expression levels. Again, knowledge of the transcription boundaries will lead to a more careful and accurate analysis.
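The probe assignment problem can be made concrete with a minimal Python sketch that lists which transcripts overlap a given probe once transcript boundaries are known. All names and coordinates below are hypothetical and serve only to mimic a coterminal family with one oppositely transcribed neighbor; they are not PRV map positions.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int   # inclusive genome coordinate
    end: int     # inclusive genome coordinate
    strand: str  # "+" or "-"


def transcripts_hit(probe, transcripts, strand_specific=True):
    """Names of transcripts whose span overlaps the probe.

    With strand_specific=False the probe is treated as double stranded,
    so transcripts from either strand can contribute signal.
    """
    return [t.name for t in transcripts
            if t.start <= probe.end and probe.start <= t.end
            and (not strand_specific or t.strand == probe.strand)]


# Hypothetical coterminal family (shared 3' end at 1000) plus an opposing transcript.
transcripts = [
    Region("tx_A", 100, 1000, "+"),
    Region("tx_B", 400, 1000, "+"),
    Region("tx_C", 700, 1000, "+"),
    Region("tx_D", 50, 350, "-"),
]
probe_a = Region("probe_A", 150, 250, "+")  # within the most upstream ORF
probe_c = Region("probe_C", 800, 900, "+")  # within the most downstream ORF

print(transcripts_hit(probe_a, transcripts))         # ['tx_A']
print(transcripts_hit(probe_a, transcripts, False))  # ['tx_A', 'tx_D']
print(transcripts_hit(probe_c, transcripts))         # ['tx_A', 'tx_B', 'tx_C']
```

In this toy layout only the probe placed in the region unique to the longest transcript reports a single mRNA, and only when the probe is strand specific, which is the situation that mapped transcript boundaries are meant to make visible.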
B.G.K. and C.J.H. contributed equally to the manuscript.