| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Molecular Virology Program, University of Pittsburgh Cancer Institute,1 Departments of Computational Biology and Human Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania 15213,2 Epidemiology Unit, Radcliffe Infirmary, Cancer Research UK, Gibson Building, Oxford OX2 6HE, United Kingdom,3 Ruharo Eye Hospital, Mbarara,4 Uganda Virus Research Institute, Entebbe, Uganda5
Received 24 April 2007/ Accepted 30 July 2007
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
This problem is illustrated by the search for human tumor viruses. Tumorigenic infectious agents have been classified into two broad categories (27): indirect carcinogens, which lead to tumorigenesis through chronic infection and nonspecific inflammation, and direct carcinogens, agents that express intracellular oncogenes that directly contribute to cell transformation. Indirect agents, such as Helicobacter pylori and possibly the hepatitis viruses, can ultimately be lost from the tumor mass. Direct carcinogens, such as papillomaviruses, Epstein-Barr virus, and Kaposi's sarcoma-associated virus (KSHV), are present in the tumor mass with at least one genome copy per cell and express a foreign oncogene transcript.
Indirect and direct infectious carcinogens have distinct epidemiologic properties. For example, immunosuppression may lessen the risk of indirect carcinogenesis by reducing chronic inflammation. In contrast, immunosuppression dramatically increases the risk of direct carcinogenic tumors by reducing host immune surveillance. Thus, patterns of disease occurrence may give clues as to which types of tumors are most likely to have a direct or indirect infectious carcinogenic trigger.
A tumor that has long been suspected of having a direct infectious origin is squamous cell conjunctival carcinoma (SCCC). SCCC is typically a low-grade malignancy with little potential for invasion and metastasis (34). This cancer is uncommon in the United States and is primarily a disease of the elderly (20). There has been, however, a sharp increase in the incidence of SCCC among patients infected with human immunodeficiency virus (HIV) (4, 14, 28, 40), as well as others with immunodeficiency, such as transplant recipients (17, 33). SCCC has emerged along with the AIDS epidemic as a very common cancer in Uganda and parts of sub-Saharan Africa over the past two decades (18, 21, 41, 44). The striking association of SCCC with HIV infection and its geographic localization in regions of Africa are consistent with a direct carcinogenic infectious etiology. Papillomaviruses are a leading candidate for causing this tumor, but current evidence for their role in this disease is contradictory. Some investigators have found an excess occurrence of cutaneous papillomavirus types in SCCCs by PCR (3, 35). Others have failed to find consistent evidence of papillomavirus infection by serology (24, 42) and by papillomavirus consensus PCR (Y. Chang, unpublished results) using primers HVP2/B5, CN1F/R, CP62/69, and EN1F/R (16) and GP+b-Gp5/6 (19). We also failed to find evidence for the presence of adenovirus, herpesvirus, or polyomavirus DNA in SCCCs by consensus PCR.
Molecular methods to discover new pathogens can also be divided into two broad categories (13, 29). One approach uses sequence information from known pathogens to identify related but undiscovered agents through cross amplification (consensus PCR) (25, 37) or cross hybridization (microarray hybridization) (36). Other approaches do not make assumptions about sequence homology for the agent. These techniques include cDNA library panning, used previously to discover hepatitis C virus (8), and representational difference analysis, used to discover KSHV (7).
Although each of these methods has been successful in identifying human viruses, current pathogen discovery approaches suffer from several shortcomings. None are quantitative, so if no candidate sequence is found, it is not possible to estimate how likely it is that an agent is present but missed in the search. Some approaches, such as representational difference analysis, rely on isogenic control samples that differ only in the presence or absence of agent nucleic acids. Since tumors may arise from rare cell populations, identifying an appropriate comparison control tissue may be difficult. Finally, there are no simple ways to scale up pathogen discovery with these methods short of analyzing additional samples.
It is unlikely that any technique can universally identify all human viral pathogens. But with the near completion of the human genome project, high-throughput sequencing may be exploited to examine specific classes of disease, such as direct infectious tumors. In theory, it should be possible to distinguish viral from human transcripts if sufficient sequence length is available. Longer sequences will increase the specificity of screening but at the cost of reduced transcriptome sampling for any given sequencing project. Further, sequences should be vetted for accuracy since sequencing errors will prevent alignment with deposited human sequences and a missequenced human nucleic acid may be wrongly described as nonhuman.
We have investigated long serial analysis of gene expression (L-SAGE) as a means to sample cellular transcriptomes for the presence of viral transcripts. L-SAGE quantitatively concatenates
21-bp cDNA tags from the 3' ends of mRNA transcripts, allowing the measurement of gene expression via high-throughput sequencing (32). If sufficiently stringent conditions are placed on L-SAGE sequencing accuracy, this should be a valuable method for rapidly searching for exogenous viral mRNAs. For a typical L-SAGE tag, there is a high level of confidence for the first 20 of 21 bp. Variations at the 21st bp position result from lax type II enzyme site cutting, leading to uncertainty in base pair assignment during ditag generation (47).
We have evaluated a novel approach to search for exogenous viral transcripts by using L-SAGE libraries, digital transcript subtraction (DTS). Physical transcriptome subtraction to identify differential or novel transcripts has been described previously (23). In addition, the direct examination of expressed-sequence-tag (EST) libraries has been proposed as a means to identify human pathogens (45, 46). Our approach differs in that we performed in silico subtraction by the alignment of short sequence tags with deposited human sequence data. Allowing a 1-in-20-base misalignment, we achieved complete in silico human transcriptome subtraction from a KSHV-infected cell line library, leaving candidate sequences belonging to KSHV. Performing the same analysis on a large L-SAGE library from three SCCCs, we identified 21 candidate sequences, all of which have experimental evidence for being polymorphic human sequences. These results show that DTS can screen human expression sequence data to identify sequences most likely to be of viral origin. DTS shows promise for rapidly identifying some types of new human tumor viruses and, when no viral transcripts are found, can establish quantitative limits on the presence or absence of exogenous mRNA.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Clinical samples and RNA preparation. SCCC tissues were obtained from a large case-control study (involving more than 800 cases) by the Uganda Eye Project (43). Tissues were made anonymous with respect to patient identifiers and assigned accession numbers prior to analysis. After surgical extraction, specimens were stored under liquid nitrogen or in RNAlater solution (Ambion Inc., Austin, TX) until processed for RNA extraction. Prior to L-SAGE, cryostat-processed frozen sections of tissues were examined to identify uniform tumor tissues to be used for RNA extraction. Three SCCCs were included in this study: (i) tumor tissue from an HIV-negative female (patient 448), (ii) tumor tissue from an HIV-positive male (patient 1811), and (iii) tumor tissue from an HIV-positive female (patient 1795). The tumor from patient 448 (tumor 448) was snap-frozen after surgery, whereas tumors 1811 and 1795 were placed in RNAlater (Ambion) immediately after surgery. These patients were negative for human papillomavirus types 16 and 18 in previous serological studies (24).
Total RNA from unstimulated BCBL-1 cells was extracted by TRIzol reagent (Invitrogen) according to the manufacturer's instructions. Total RNA from SCCC tissues was extracted after homogenization with the RNeasy midi kit (QIAGEN, Alameda, CA). RNAs were assayed for integrity using the Agilent 2100 bioanalyzer (Quantum Analytics, Foster City, CA) with the RNA 6000 Nano reagent kit. Tumor 448 showed marked degradation (an RNA integrity number of 2), while tumors 1811 and 1795 had high integrity scores (RNA integrity numbers of 8.2 and 7.5, respectively).
L-SAGE. Four L-SAGE libraries were generated from RNAs from BCBL-1 cells and from the three SCCCs by using the I-SAGE kit according to the instructions of the manufacturer (Invitrogen, Carlsbad, CA). Library 1811 was spiked with BCBL-1 RNA prior to L-SAGE to serve as an internal control for our ability to detect viral sequences in the tumor library. In brief, poly(A) RNA was captured onto magnetic beads and converted into double-stranded cDNA. The cDNA was cleaved with NlaIII and then ligated to oligonucleotide adapters containing recognition sites for MmeI. The linked cDNA was then released from magnetic beads by digestion with MmeI. Released tags were ligated to each other to create ditags and amplified by PCR. Amplified ditags were subsequently concatenated and cloned into the SphI site of pZero-1. Agar lawns were sent for sequencing of the inserts with the M13 forward primer to either the University of Pittsburgh Core Sequencing Facility (Pittsburgh, PA) or Agencourt Biosciences Corporation (Beverly, MA).
DTS.
All sequences were trimmed at a phred score of 20 (12), a stringent measure of base-calling accuracy for automated sequencer traces. Tags were extracted from 32- to
38-bp ditags using the SAGE2000 4.5 analysis software (Invitrogen) (39) and analyzed with UNIX Perl scripts (available on request) on a Mac Pro desktop computer. For SCCC DTS analysis, L-SAGE data from the three separate tumors were combined into a single data set. SAGE tag data files used in this analysis are available at http://www.kshv.pitt.edu/.
To reduce sequencing errors, the following criteria were used to generate high-fidelity (HiFi) sequences: tags with ambiguous base calls were eliminated, and only tags identified independently two or more times were included. After performing DTS, tags that were not subtracted were examined for linker sequences and missequencing errors on electropherograms. These tags were eliminated post hoc from the candidate sequences after initial alignment with sequences from reference databases. To generate the HiFi sequences, each unique sequence from HiFi tags was placed into a text file regardless of the frequency of occurrence. Since DTS does not rely on gene expression levels once a tag has been verified, highly abundant and infrequent tags have equal weights in the analysis.
HiFi sequences were examined using stand-alone BLAST (BLAST-2.2.14-univeral-macosx at ftp://ftp.ncbi.nih.gov/BLAST/excutables) against multiple nucleotide databases. The detailed information for BLAST parameters can be viewed at the NCBI website (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node23.html). The following parameter set was used to identify exact matches among sequences from human RefSeq RNA and genome databases: r = 1 (matched reward), q = –3 (mismatch penalty), F = F (no filtering of input sequence), W = 20 (word size), and e = 10,000 (expectation cutoff). For the lower-level-homology alignment, the same databases were searched for short, near-exact matches with the following parameters for cross-species sequence exploration (22): (r = 1, q = –1, G = 1 [gap opening cost], E = 2 [gap extension cost], F = F, W = 7, and e = 10,000) and (r = 5, q = –4, G = 25, E = 10, F = F, W = 7, and e = 10,000). Comparison of candidate sequences to sequences in viral nonredundant (NR), human NR, and human EST databases was performed manually using the NCBI website (http://www.ncbi.nlm.nih.gov/BLAST/) and the following parameter set for short, near-exact matches: r = 5, q = –4, G = 25, E = 10, F = F, W = 7, and e = 10,000. To ensure that NR and EST sequences aligning to candidate tags were not nonhuman sequences inadvertently deposited into the databases, MegaBLAST was used to track aligning sequences back to the human genome (45).
Sequential database comparisons of the HiFi sequences were performed using the human RefSeq RNA database first and then the human RefSeq genome database (http://www.ncbi.nlm.nih.gov/RefSeq/). The latter database includes human mitochondrial sequences. Further analysis was performed by comparing sequences to viral NR, human NR, and human EST database entries. All databases were downloaded between September and November 2006 from NIH NCBI (ftp://ftp.ncbi.nih.gov/BLAST/db).
Virtual herpesvirus expression sequence tag analysis. Genomic sequences of the human herpesviruses (HHV), including HHV-1 (NC_001806), HHV-2 (NC_001798), HHV-3 (NC_001348), HHV-4 (NC_007605), HHV-5 (NC_001347), HHV-6 (NC_001664), HHV-6B (NC_000898), HHV-7 (NC_001716), and HHV-8 (NC_003409), were downloaded from GenBank (NCBI). Virtual SAGE tags with the NlaIII CATG recognition sequence were extracted with Perl scripts. The pool of virtual herpesvirus tags included all sites with CATG sequences in both coding and noncoding regions. To examine KSHV gene expression in BCBL-1 experiments, 80 open reading frames (ORFs) from the deposited KSHV sequence (KSU75698) (31) were analyzed for NlaIII sites by using MacVector (Accelrys Inc.) to retrieve the 20-mer tag most proximal to the 3' end of the transcript.
Reverse transcription-PCR. Candidate sequences from BCBL-1 and SCCC SAGE libraries were used with rapid amplification of cDNA ends (RACE) or reverse SAGE (rSAGE) to obtain 3' sequences. Conventional RACE was performed with BCBL-1 cells and two SCCCs, those from patients 1811 and 1795, by using the 3' RACE system according to the instructions of the manufacturer (Invitrogen). rSAGE followed the protocol of Jian Yu (University of Pittsburgh; http://www.sagenet.org/protocol) to isolate cDNA fragments corresponding to novel SAGE tags. Briefly, poly(A) RNA from an SCCC from patient 1795 was used to synthesize double-stranded cDNA with a specifically designed biotin-labeled primer, BRS1, containing an AscI restriction enzyme site. The cDNA was digested with NlaIII and then bound to streptavidin beads. This complex was then bound to linker 2A/B and digested with AscI to release the 3' cDNA fragments from the beads. These 3' cDNA fragments were then enriched by PCRs to obtain a sufficient quantity of DNA (amplified rSAGE library). A second PCR amplification was performed with L-SAGE tag-specific primers and a common M13 forward primer to obtain specific products. PCR products from both RACE and rSAGE were TA cloned into pCR2.1 vector (Invitrogen), and the sequencing of the PCR products (at least two colonies) was used to verify the authenticity of a given L-SAGE tag and to obtain additional nucleotide sequences that could be used in database searches.
| RESULTS |
|---|
|
|
|---|
|
DTS analysis of PEL. We next sought to determine if a known tumor virus can be identified de novo using in silico subtraction and to determine the abundance of viral transcripts. KSHV was discovered in 1993 (7) using a physical subtraction process, representational difference analysis, that first isolated two small viral DNA fragments but allowed subsequent full characterization of the virus (31). To determine if DTS can achieve similar results, we performed L-SAGE with the PEL cell line BCBL-1 (30), which is latently infected with approximately 30 viral genome copies per cell. A total of 9,026 tags from the BCBL-1 L-SAGE library were sequenced (Table 2).
|
Sequential DTS database comparisons (Fig. 1) of the 994 HiFi sequences eliminated all but 34 sequences as having exact 20-of-20-bp alignment with deposited human sequences. An additional 31 sequences diverged from the human database sequences at 1 of 20 bp, leaving 3 candidate sequences represented by 19 tags (0.21%) for analysis. Two of these three sequences exactly map to transcripts of the KSHV genome (gi:2065526) T1.1 locus (nucleotides 29640 to 29659; 11 tags) and K12 locus (nucleotides 117460 to 117441; 6 tags), previously described as being present in BCBL-1 cells (48). These viral transcripts are highly abundant, accounting for 1,219 and 665 transcripts per million (TPM), respectively, in the BCBL-1 library. The viral origin of these sequences was confirmed by experimental 3' RACE with product sequencing (data not shown). The third candidate sequence precisely matches an unannotated human EST (gi:78486988; two tags). Reanalysis of the entire 8,824-tag library for KSHV-specific transcript tags identified five additional KSHV tags sequenced once, and therefore not included in our HiFi data set, for KSHV genes ORFK2 (encoding viral interleukin-6), ORFK13 (encoding v-FLIP), ORF58, ORF64, and ORF38. Taken together, KSHV transcripts constituted 0.24% [(11 + 6 + 1 + 1 + 1 + 1 + 1)/9,026] of the total transcriptome from this uniformly virus-infected cancer cell line. These results indicate that complete in silico subtraction of HiFi human L-SAGE tags using DTS is feasible.
|
The SCCC HiFi sequences were subtracted as shown in Fig. 2. Exact alignments to human RefSeq RNA and RefSeq genome sequences were found for all but 1,120 SCCC HiFi sequences. By allowing one mismatch, an additional 974 HiFi sequences aligned with RefSeq database entries, leaving 146 candidate sequences for analysis. To reduce the candidate pool to a high-likelihood data set, the 146 candidate sequences were compared to the sequences in the human NR and EST databases. This step subtracts unannotated and polymorphic tag sequences but also may subtract viral sequences that have been inadvertently deposited into these databases. To avoid this problem, all NR and EST sequences matching the candidate sequences were compared with sequences in the RefSeq databases by using MegaBLAST to allow mapping back to human chromosomes. Of the 146 candidate sequences, 96 aligned with NR and EST sequences, and the majority of these contained 3' polyadenylation tracts that were not included in RefSeq database sequences. The remaining group of 50 20-base sequences did not align with human sequences at 19 or more bases. Of these 50 sequences, manual analysis revealed that 24 arose artifactually from duplicated ditags and 2 contained sequencing errors found by the inspection of electropherograms. These 26 sequences were eliminated from the HiFi data set, leaving 24 candidate sequences.
|
Analysis of SCCC DTS candidate sequences. The 21 nonaligning sequences (Table 3) remaining after DTS screening were used for reverse transcription. rSAGE and 3' RACE products from all 21 sequences were cloned and sequenced, and all products corresponded to human transcript sequences, evidence that no distinguishable viral transcript sequences were present in our HiFi data set. Due to potential mispriming with these short sequences, however, each sequence was examined in greater detail. Four of the 21 sequences aligned with immunoglobulin variable-region gene sequences, and two aligned with highly polymorphic mitochondrial DNA sequences. Two additional sequences were deposited from multiple libraries into the NIH SAGEmap database (http://www.ncbi.nlm.nih.gov/projects/SAGE/). These eight sequences were considered likely to be human and excluded from further analysis (Table 4).
|
|
Although these 21 DTS candidates should be considered for future studies, present evidence suggests that all 21 sequences are of human origin. If none of the candidate tags belong to viral transcripts, and assuming a Poisson distribution for sequencing a viral transcript, the upper 95% confidence limit that a viral transcript exceeding our matching criteria is present in SCCC is 20 TPM. It is estimated that there are approximately 200,000 mRNA transcripts in a cell, corresponding to 5 TPM for a single mRNA copy per cell (5, 38). Using this value as a comparison, we sequenced to a level of two to three transcripts per cell (11 TPM) and can conclude with a 95% confidence level that DTS-identifiable viral transcripts are not present in conjunctival carcinoma at four transcripts per cell or higher (20 TPM).
| DISCUSSION |
|---|
|
|
|---|
Our analysis suggests that 20 bases is the minimum useful length for DTS. Early attempts to perform DTS with 14-base SAGE tags failed to discriminate between human and viral sequences (Y. Chang and P. S. Moore, unpublished results). At shallow sequencing depths as used for the BCBL-1 library, 20-bp DTS readily identified viral sequences and allowed complete transcriptome subtraction. When more extensive sequencing is performed, as seen with our SCCC library, the numbers of ambiguous and rare tag sequences increase, generating a large list of candidates (146) that require further subtraction using NR and EST library databases and manual inspection. We anticipate that even small increases in tag lengths should be able to markedly reduce the candidate pool size, as well as increase the discrimination between viral and human sequences.
Another important advantage in using SAGE or other quantitative expression profiling techniques for DTS is that a direct measurement of transcript levels can also be performed. This capacity is shown by our analysis of viral transcripts in a tumor virus-infected cell line (BCBL-1). KSHV transcripts constitute a high percentage (0.24%) of the total mRNA from BCBL-1 cells. Transcripts identified in PEL cells are identical to those previously found by Cornelissen and colleagues using short SAGE of Kaposi's sarcoma (KS) tumors (9). T1.1 is commonly referred to as a lytic gene product, but multiple studies have shown that it is also readily detected in resting PEL cells. Whether this is due to transcriptional control outside of the latent-lytic replication cycle or to a fraction of cells in culture undergoing lytic replication cannot be addressed by our study. An examination of unpublished 293 cell line SAGE data from the Cancer Genome Anatomy Project (293 human embryonic kidney cells) (6) for 41,955 tags shows similar levels of viral transcript abundance (0.31%). In contrast to BCBL-1 cells, 293 cells are artificially transformed with adenovirus DNA rather than naturally infected with a virus (15).
Tumor tissues are likely to have a far lower level of viral gene expression than cell lines. This is due to lower viral copy numbers and infiltration by uninfected, nontumor cells. Cornelissen et al. examined two KS tumors from HIV-infected patients by SAGE and found 24 of 45,913 tags (522 TPM, or 104 transcripts per cell) and 3 of 47,316 tags (63 TPM, or 12 transcripts per cell) belonging to KSHV (9). These levels are comparable to the transcript abundance found for KSHV (127 TPM) in our library 1811 spiking experiments. We did not identify a viral transcript after sequencing to a depth of two to three transcripts per cell (11 TPM), and we can exclude with 95% confidence the possibility that distinguishable viral transcripts are present in SCCC at approximately four transcripts per cell (20 TPM) or higher—a level far below that found empirically for KSHV in KS tumors. To confirm these results, we performed an entirely new long-EST analysis of library 1811, analyzing an additional 150,000 50- to 200-bp ESTs, but did not identify a candidate viral sequence (H. Feng, Y. Chang, and P. S. Moore, unpublished results). For technical reasons, these data cannot be directly added to our current DTS analysis, but they provide additional confidence in the validity of the analysis and substantially extend the lower limit for the detection of a unique viral transcript in SCCCs.
Despite DTS evidence against a direct viral carcinogen's being present in SCCC, there are important caveats to concluding that SCCC does not have an infectious etiology. Viral NlaIII SAGE tags indistinguishable from human transcripts will be missed by DTS. Our virtual herpesvirus tag analysis suggests that this would occur for approximately 25% of viral tags. This uncertainty is reduced if a virus expresses multiple transcripts. Similarly, if a viral transcript does not contain a suitable NlaIII site, it will not be processed into a SAGE tag. Subtraction using unannotated EST transcripts also increases the probability that a viral sequence previously identified during EST screening will falsely be subtracted. We have minimized this risk by mapping all unannotated EST sequences back to the human genome, but it remains a possible pitfall. While most RNA viruses and all DNA viruses express mRNA transcripts, those viruses that do not have polyadenylated transcripts will be missed. Finally, DTS is unlikely to identify endogenous retroviral sequences that have been annotated as human or retroviruses integrating into the host genome and acting through insertional mutagenesis (11). These and other caveats to DTS analysis are listed in Fig. 3 and should be kept in mind when interpreting DTS results. For these reasons, it remains possible that SCCC has an infectious etiology, but it is highly unlikely that this tumor is caused by a direct viral carcinogen (e.g., human papillomavirus, Epstein-Barr virus, or KSHV).
|
| ACKNOWLEDGMENTS |
|---|
This project was supported in part by funds from the Public Health Service, CDC (EIN 1250965591), and a National Cancer Institute, NIH, grant (R01 CA67391) and in part under a grant from the Pennsylvania Department of Health.
The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions.
| FOOTNOTES |
|---|
Published ahead of print on 8 August 2007. ![]()
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| J. Bacteriol. | Mol. Cell. Biol. | Microbiol. Mol. Biol. Rev. |
|---|
| Clin. Vaccine Immunol. | ALL ASM JOURNALS |
|---|