ABSTRACT
Epstein-Barr virus (EBV) is a causative agent of a variety of lymphomas, nasopharyngeal carcinoma (NPC), and ∼9% of gastric carcinomas (GCs). An important question is whether particular EBV variants are more oncogenic than others, but conclusions are currently hampered by the lack of sequenced EBV genomes. Here, we contribute to this question by mining whole-genome sequences of 201 GCs to identify 13 EBV-positive GCs and by assembling 13 new EBV genome sequences, almost doubling the number of available GC-derived EBV genome sequences and providing the first non-Asian EBV genome sequences from GC. Whole-genome sequence comparisons of all EBV isolates sequenced to date (85 from tumors and 57 from healthy individuals) showed that most GC and NPC EBV isolates were closely related although American Caucasian GC samples were more distant, suggesting a geographical component. However, EBV GC isolates were found to contain some consistent changes in protein sequences regardless of geographical origin. In addition, transcriptome data available for eight of the EBV-positive GCs were analyzed to determine which EBV genes are expressed in GC. In addition to the expected latency proteins (EBNA1, LMP1, and LMP2A), specific subsets of lytic genes were consistently expressed that did not reflect a typical lytic or abortive lytic infection, suggesting a novel mechanism of EBV gene regulation in the context of GC. These results are consistent with a model in which a combination of specific latent and lytic EBV proteins promotes tumorigenesis.
IMPORTANCE Epstein-Barr virus (EBV) is a widespread virus that causes cancer, including gastric carcinoma (GC), in a small subset of individuals. An important question is whether particular EBV variants are more cancer associated than others, but more EBV sequences are required to address this question. Here, we have generated 13 new EBV genome sequences from GC, almost doubling the number of EBV sequences from GC isolates and providing the first EBV sequences from non-Asian GC. We further identify sequence changes in some EBV proteins common to GC isolates. In addition, gene expression analysis of eight of the EBV-positive GCs showed consistent expression of both the expected latency proteins and a subset of lytic proteins that was not consistent with typical lytic or abortive lytic expression. These results suggest that novel mechanisms activate expression of some EBV lytic proteins and that their expression may contribute to oncogenesis.
INTRODUCTION
Epstein-Barr virus (EBV) is a common gammaherpesvirus that is the causative agent of a variety of lymphomas as well as nasopharyngeal carcinoma (NPC) and gastric carcinoma (GC) (1). Gastric carcinoma comprises 2% of cancers in Western countries, and 9% of these are EBV infected. EBV-positive GCs have distinct molecular profiles and clinical features that distinguish them from other GCs, reflecting unique mechanisms by which EBV induces cancer (2). However, the mechanisms by which EBV infection promotes GC and other EBV-associated cancers are unclear.
EBV adopts both latent and lytic forms of infection. In latent infection, a small proportion of the EBV genome is expressed, and cells become immortalized. These cells can switch to a lytic infection that involves sequential expression of immediate early, early, and late EBV proteins (∼80 proteins in total), amplification of the viral genomes, and virion production. EBV-induced tumors are monoclonal expansions of latently infected cells, and many studies have focused on the contributions of the few EBV latency proteins expressed in these cells. For example, EBV-positive GCs have been reported to consistently express EBV nuclear antigen 1 (EBNA1) and can also express latent membrane protein 1 (LMP1) and LMP2A proteins (3). However, there have been many reports of some EBV lytic cycle proteins being expressed in EBV-induced tumors (4–10). In addition, mutant viruses that are unable to switch to the lytic cycle have been found to be impaired in their ability to induce EBV-mediated lymphoproliferative disease in mice (11). Together, these reports have led to the suggestion that abortive lytic EBV infection contributes to tumorigenesis (12) although a comprehensive analysis of lytic protein expression in EBV-induced tumors is currently lacking. The importance of expression of specific lytic proteins for cancer induction is consistent with studies on Kaposi's sarcoma-associated herpesvirus (KSHV), the other human gammaherpesvirus, in which specific lytic proteins appear to contribute to oncogenesis (13).
Another outstanding question concerning how EBV infection induces cancer is whether specific EBV variants are more associated with cancer than others. Precedence for this scenario exists for cancer induction by human papillomavirus (HPV), in which there are specific variants (high-risk strains) that promote cancer while most variants (low-risk strains) do not. This difference is due to the differing abilities of the encoded HPV proteins to bind and interfere with the functions of cellular tumor suppressors (14–17). For EBV, it is unclear whether there are high-risk and low-risk variants, but evidence suggests that EBV that is isolated from tumors can differ in sequence and properties from the widely studied blood EBV isolate (18). Efforts to generate EBV genome sequences for comparison are under way and have resulted in EBV genome sequences from NPC, Burkitt's lymphoma (BL), and Hodgkin lymphoma (HL) as well as from the blood or saliva of healthy individuals (19–24). Comparisons of 10 EBV genomes isolated from NPC tumors to those of the few previously sequenced EBV blood isolates suggested that NPC isolates are most similar to each other and identified nonsynonymous mutations of potential biological significance in genes encoding both latent and lytic proteins (22). However, the small number of EBV genome sequences analyzed has limited any conclusions on the association of specific EBV mutations with cancer.
In order to increase the number of EBV genome sequences and examine the relationship of GC EBV isolates to other EBV genomes, we analyzed whole-genome sequencing (WGS) data of GC samples, resulting in the assembly of 13 new EBV genome sequences. These were combined with 15 existing GC isolates and analyzed against 114 EBV genomes from other tumors, blood, and saliva to identify GC-associated EBV nonsynonymous mutations that could potentially impact oncogenesis. In addition, we analyzed whole-transcriptome data for these GC samples to determine EBV expression profiles in GC, identifying distinct sets of EBV lytic proteins that are consistently expressed in GC.
RESULTS
Characterization of newly assembled GC EBV genomes.We analyzed whole-genome sequencing (WGS) data of 201 gastric adenocarcinoma samples for their EBV content, using 122 WGS samples available from The Cancer Genome Atlas ([TCGA] Stomach Adenocarcinoma, project code STAD-US) (25) and 79 from the new Pan-Cancer Analysis of Whole Genomes ([PCAWG] Stomach Adenocarcinoma, project codes STAD-US and GACA-CN [gastric cancer-China]) from the International Cancer Genome Consortium (ICGC) (26). We identified EBV genomes in 13 of these samples with sufficient read depth (average read depth of 125×) to assemble the EBV genome, which was done using a reference-based approach.
To further assess the quality of our newly assembled GC EBV sequences, we compared them to 15 GC EBV whole-genome sequences downloaded from the NCBI for their average sequence lengths (calculated as the length of the GC-EBV query sequence that aligned to the EBV NC_007605 reference sequence in NCBI), sequence identities, number of single nucleotide variants (SNVs), number of gaps, and number of ambiguous bases. Comparisons of five sequence metrics in Table 1 show that the 13 newly assembled GC EBV genome sequences are of similar quality to those previously reported, with the exception of the number of ambiguous bases, which is significantly lower in our newly assembled sequences.
Comparison of the newly assembled EBV genomes to previous GC-derived EBV genomes
Relationship of GC-derived EBV to other EBV isolates.We next investigated the relationship between GC-derived EBV and all other currently available EBV genomes. To generate the comparison group, we downloaded 126 EBV genome sequences from NCBI: 72 from tumors and 54 derived from the blood or saliva of people without cancer, as indicated in Table 2. The EBV blood isolates had been used to generate lymphoblastoid cell lines (LCLs), and EBV sequences from three additional LCLs were obtained from the study described in Santpere et al. (27) (cell lines NA19114, NA19315 and NA19384). We then combined our 13 new GC-derived EBV sequences with the 15 GC EBV sequences available in NCBI and compared these to the B95.8-Raji reference EBV sequence commonly used for comparison (NCBI accession number NC_007605) (19). This comparison identified a total of 4,172 SNVs and 112 insertions and 103 deletions (indels).
Summary of EBV isolates analyzed
We conducted a phylogenetic analysis on 142 whole EBV genomes, combining our new GC EBV sequences with those from public databases (as indicated above and in Table 2). Figure 1 shows that the majority of GC EBV isolates cocluster with the majority of NPC EBV isolates. The similarity between GC and NPC EBV isolates is also evident in the similar patterns of sequence changes within these isolates, as shown in Fig. 2. One of the issues regarding the clustering and analysis of EBV sequences isolated from specific tumor types is the potential confounding effect due to their geographical localizations. This is particularly an issue for NPC and BL since all NPC-derived EBV genome sequences are from Asia while all BL-derived sequences are from Africa. Prior to this study, all GC-derived EBV whole-genome sequences were also Asian. However, 10 of the EBV sequences we assembled were from American GC samples, and 5 of these were from Caucasians. The phylogenetic tree shown in Fig. 1 shows that all of the Asian GC isolates and a subset of the American GC isolates clustered with the NPC isolates, indicating that they are most closely related. However, four of the five Caucasian American GC isolates were not part of this cluster, suggesting that geography/and or ethnicity is a factor in the close relationship between NPC and GC EBV sequences.
Phylogenetic analysis on whole EBV genomes. All available EBV genome sequences were compared based on SNVs (disregarding repeat regions). EBV isolates from GC (red), NPC (blue), and the B95.8-Raji reference strain (green) are shown. In black are EBV isolates from LCLs, Burkitt's lymphoma (BL), and Hodgkin's lymphoma (HL). The geographical locations of the EBV isolates are also indicated. U.S. GC isolates from Caucasians are indicated with a “C,” and the remaining samples are from Asians. The phylogenetic tree was rooted using a midpoint rooting.
Comparison of polymorphisms across GC and NPC EBV isolates. Amino acid changes across GC and NPC EBV sequences (compared to all other EBV isolates) as shown in Tables 3 and 4 (columns headed GC + NPC vs all others) were subjected to row-wise (across isolates) and column-wise (across mutation profiles) hierarchical clustering analysis; the Manhattan method was used for obtaining the distance matrix, and the complete-linkage method was used for agglomerating the clusters. Using row-wise clustering, three main clusters were identified; the clusters in red and green indicate amino acid changes that occur in most of the GC and NPC EBV sequences, and the cluster in blue indicates changes that occur in a subset of GC and NPC sequences. The sizes of the cluster in red and green indicate that the majority of amino acid changes observed in GC also occur in NPC isolates. U.S. GC isolates from Caucasians are indicated with a “C,” and the remaining samples are from Asians. Nonconservative amino acid changes are indicated with a plus symbol.
To gain more information about the GC/NPC cluster and how these EBV sequences differ from other EBV sequences, we examined indels and nonsynonymous SNVs that would result in amino acid changes in EBV proteins. One deletion was consistently found: an in-frame deletion of 30 bases in the third exon of the LMP1 gene, resulting in the deletion of 10 amino acids (from Gly343 to His352). This deletion was found in 82% of GC, 61% of NPC, 75% of HL, 48% of BL, and 32% of normal EBV isolates. This deletion has been previously reported in some GC and NPC EBV isolates and was shown not to affect the transforming potential of LMP1 (28–30). This LMP1 deletion appears to be strongly influenced by geography and/or ethnicity as we found it in 100% of the Asian GC isolates but in only one of the five American Caucasian GC isolates.
Nonsynonymous SNVs that are common in both GC (≥50%) and NPC (≥50%) isolates and uncommon in all other EBV isolates (occurring in <25% and significantly different, with an adjusted P value of <0.05) are shown in Tables 3 and 4 (columns headed GC + NPC vs all others) for latent and lytic EBV proteins, respectively. The majority of changes map to the latency proteins, including EBNA1 and LMP1 which are expressed in both NPC and GC. Nonconservative amino acid changes are of particular interest as these are the most likely to affect protein function (shown in boldface in Tables 3 and 4). A graphical representation of the nonconservative amino acids changes common in GC and NPC is shown for individual EBV genomes in Fig. 2, showing that most of these changes occur together. Similar to the phylogenetic tree, these results show coclustering of the Asian GC samples with the NPC samples but less so with American GC samples.
EBV latent protein sequence changes occurring in GC-derived EBV isolates
EBV lytic protein sequence changes occurring in GC-derived EBV isolates
Ten nonsynonymous mutations associated with GC and NPC were identified in LMP1, all resulting in nonconservative amino acid changes within the first 189 amino acids (Table 3). Six nonsynonymous mutations were identified in EBNA1, including three nonconservative changes (Thr85Ala, His418Leu, and Ala439Thr) (Table 3). These changes occur, on average, in 71% of GC and 99% of NPC isolates but are much less common in lymphoma and normal EBV isolates (10% of BL, 0% of HL, and 7% of normal EBV isolates; P value of <2e−10). Five of the nonsynonymous changes have been previously reported to be associated with EBV isolated from NPC (30). Our analysis shows that these changes are also significantly associated (P value of 0.05) with Asian GC samples (19/23) but not with U.S. Caucasian GC samples (1/5), suggesting that these changes may reflect strain prevalence in Asia. In addition, our analysis identified a previously unreported nonconservative change in Thr85Ala in EBNA1 that is found in 91% of NPC and 68% of GC isolates but in only 8% of other EBV isolates. Interestingly, this change is found with similar frequencies in Asian (16/23) and American Caucasian (3/5) GC isolates.
GC-associated EBV genome alterations.In order to identify EBV protein sequence changes that might be involved in induction of GC, we investigated EBV protein sequence changes that are common in GC and uncommon in LCLs. Since few of the LCLs are from Asia whereas most of GC samples are Asian, there is the potential for a strong geographical bias. To address this issue, we looked for EBV sequence changes that were common in both Asian and American Caucasian GC isolates (≥60% for each group) and uncommon (<25%) in LCL isolates. This identified 11 significant nonsynonymous SNV changes (adjusted P value of <0.05) as shown in Fig. 3. As shown in Tables 3 and 4 (columns headed GC vs LCLs), these SNVs corresponded to amino acid changes in one EBV lytic protein (BPLF1) as well as eight nonconservative changes in LMP1 and one nonconservative change in EBNA1. The EBNA1 sequence change (Thr85Ala) and seven of the LMP changes are the same as those reported above to be altered in the GC/NPC cluster compared to sequences of other EBV isolates. This suggests that there are a small number of EBV sequence alterations in GC EBV isolates that are independent of the geographical origin or ethnicity of the GC.
Genome-wide analysis of EBV sequences and gene expression associated with gastric adenocarcinoma. An annotated Circos plot depicting EBV amino acid changes common (≥60%) in both Asian and Caucasian GC isolates relative to the sequence of B95.8 (blue track) or common (≥60%) in both Asian and Caucasian GC samples but uncommon in LCLs (green track) and RNA-Seq read coverage across the EBV (NCBI accession number NC_007605) reference genome (gray track). For the EBV genome changes, nonconservative and conservative amino acid changes are indicated in red and blue, respectively; the BOLF1 insertion is shown in yellow, and the LMP1 deletion is marked in green and labeled. In the outer track, the positions of EBV genes that are expressed (red) or not expressed (blue), as well as the repeat regions excluded from the analysis (violet), are indicated.
We also asked whether there are changes in GC isolates that do not occur in NPC isolates. To avoid a bias due to geography and/or ethnicity, we included only Asian GC samples in this comparison since all NPC samples were Asian. We looked for amino acid changes found in >75% of Asian GC isolates but in <25% of NPC (Tables 3 and 4, GC vs NPC). The results showed only two conservative changes in the latency proteins (in LMP2A and LMP2B) and several changes in 10 lytic proteins. This is consistent with the above conclusions that Asian NPC and GC EBV isolates are very similar.
Finally, since the B95.8 EBV isolate is the most commonly studied and since protein expression clones are typically based on this variant, it was of interest to determine what changes in EBV proteins are common in GC isolates relative to the B95.8 sequence. Again, we looked for changes that occurred in ≥60% of Asian and in ≥60% of Caucasian GC isolates to counter any geographical/ethnicity effects. Figure 3 shows a graphic summary of the genetic variants from this comparison, which identified 87 nonsynonymous SNVs and two indels. Amino acid changes in 9 latency and 25 lytic proteins are shown in Tables 3 and 4 (columns headed GC vs B95.8, ł60%), respectively. The indels were the LMP1 deletion indicated above and an insertion in BOLF1, consisting of a glycine at amino acid 261 relative to B95.8 sequence. We also asked whether there were any changes that occurred in 100% of GC samples relative to the sequence of B95.8. This showed that a subset of the above GC-associated changes in both latent and lytic proteins occurred in all GC-derived EBVs regardless of geography (Tables 3 and 4, GC vs B95.8, 100%), including one change in EBNA1 (Thr524Ile). This change is common in many EBV isolates, indicating that the B95.8 reference sequence is unusual at this position (30, 31). Thr524 is in the DNA binding/dimerization domain in an α-helix important for contacting the DNA (32, 33) although the contribution of Thr524 in DNA recognition has not been determined. In addition, all GC isolates have the BOLF1 insertion at amino acid 261 indicated above. These sequence changes will be important to incorporate when expression clones are generated to study the functions and protein interactions of these EBV proteins in the context of gastric infections.
EBV gene expression in gastric carcinoma.Another important question for understanding the mechanism of cancer induction by EBV is which EBV proteins are expressed. Although many studies have focused on the EBV latent proteins that are consistently expressed in tumors, there have been many reports of detection of lytic proteins in a variety of EBV tumors although the profile of which EBV lytic proteins are expressed and of their frequencies of expression is not clear. Of the 13 EBV-positive gastric tumor samples that we used to assemble EBV genome sequences, eight had whole-transcriptome data [on purified poly(A)-containing RNA] which we used to determine the level of each EBV transcript (in reads per kilobase of transcript per million mapped reads [RPKM]) using the EBV NC_007605 in NCBI as the reference genome. A previous paper had also analyzed EBV transcripts in four GC samples but had not reported the complete profile of EBV lytic protein transcripts (12). Therefore, we attempted to reanalyze these data as well. However, we learned that TCGA had determined that all four samples were from the same patient, and hence three out of four of these duplicate samples (BR-4298, BR-4376, and BR-4271) were removed from the TCGA database (https://portal.gdc.cancer.gov/). Analysis of the remaining sample (BR-4253) showed a lower overall number of reads mapping to the EBV transcriptome (4- to 18-fold less) than for the eight samples we analyzed, which would hinder identification of low-abundance transcripts; hence, this sample was not included in our further analyses.
The EBV transcriptome profiles for each of the eight GC samples are shown in Fig. 4A. Since it is well established that EBNA2, -3A, -3B, and -3C are not expressed in GC, the average RPKM value for these transcripts was used as background in each sample, and transcripts that were above this level in 7 or 8 of the 8 samples are shown in Table 5. As expected based on previous reports, EBNA1 transcripts were readily detected in all samples. We also detected variable levels of LMP1 transcripts and low levels of LMP2A transcripts in 7 out of 8 samples.
EBV gene expression profiles for GC samples. (A) EBV transcriptome read coverage is shown for eight individual GC samples relative to the reference EBV genome (NCBI accession number NC_007605; represented using a log2 scale). The red bars in the top track indicate the location of the EBV repeat regions. The annotation track at the bottom shows EBV expressed genes in red and nonexpressed genes in blue. (B) Expressed genes shown in Table 5 were subjected to the row-wise hierarchical clustering analysis across eight EBV-positive GC transcriptome samples. Pearson correlation was used for obtaining the distance matrix, and the complete-linkage method was used for agglomerating the clusters. The three main clusters shown in red, green, and blue indicate three different groups of genes with highest correlations between their gene expression levels across samples.
EBV transcripts detected in GC samplesa
In addition to latency genes, we identified 18 EBV lytic genes that were consistently expressed (in 7 or 8 out of 8 samples) (Table 5). The expression pattern was not consistent with a lytic or abortive lytic infection because subsets of both early and late EBV proteins were expressed. For example, a subset of only early genes necessary for viral DNA replication were expressed; transcripts for BALF5 (DNA polymerase) and BALF2 (single-stranded DNA binding protein) were readily detectable in all samples, whereas transcripts for BMRF1 (polymerase processivity factor), BSLF1 (primase), BBLF4 (helicase), and BBLF2/BBLF3 (primase accessory protein) were not consistently detected. These results suggest that lytic DNA replication would not occur. However, this does not prevent the expression of specific late lytic genes that would normally be transcribed after DNA replication as transcripts for four late genes (BALF4, LF3, BNRF1, and BPLF1) were consistently detected. Together, the expression profiles suggest that EBV lytic gene expression in GC is regulated differently than in a lytic infection.
For the expressed genes shown in Table 5, we show in Fig. 4B the agglomerative hierarchical clustering of their expression patterns across 8 samples using the Pearson correlation distance measure. Three clusters (shown in red, green, and blue) were identified. All 13 genes in the red and green clusters map to a ∼40-kb region of the genome between LF3 and BNRF1, which includes LMP1 and LMP2A (Fig. 4A). This region contains the most highly expressed genes (i.e., BALF5, BALF4, BALF3, BALF2, BILF1, LF1, LF2, BNLF2a, and BNLF2b) (Table 5), defining a region of the genome that is activated. We note that the read coverage across the BALF4 and BALF5 transcripts could also be associated with the expression of the RPMS1 and A73 genes (situated on the opposite strand from BAFL4/5); however, our analysis shows that portions of BALF4 and BALF5 transcripts that do not overlap the RPMS1 and A73 (BART) transcripts are expressed. Among genes in the green cluster, a subset of six genes (BARF1, BALF2, BALF1, BNRF1, LMP1, and LMP2A) have the strongest positive correlation (mean Pearson's r = 0.85 versus r = 0.53 for overall correlations of genes in the green cluster), suggesting that they may be coordinately regulated. The blue cluster includes EBNA1 and seven lytic genes, five of which (BPLF1, BZLF1, BRLF1, BKRF3, and BKRF4) are spread over the EBV genome outside the activated 40-kb region and two genes (BNLF2a and BNLF2b) that belong to it. BNLF2a and BNLF2b have high positive correlation (mean Pearson's r = 0.92 versus r = 0.45 for overall correlations of genes in the blue cluster) and cluster apart from the rest of the genes in this cluster. Among the genes outside the 40-kb activated region, three genes (BZLF1, BPLF1, and EBNA1) show the highest expression correlation (Pearson's r = 0.93). In contrast, BKRF3 and BKRF4, which are localized close to the EBNA1 gene, show poor correlation with EBNA1 expression (mean Pearson's r = 0.1), which suggests that they are not coregulated with EBNA1.
Association of GC sequence changes with T cell epitopes.Previous studies have identified epitopes in EBV proteins recognized by CD4+ and CD8+ T cells (34). To determine how immune recognition might be altered in GC EBV isolates, we looked at the sequence changes in all of the EBV proteins that we found to be expressed in GC to determine if the amino acid changes correspond to known T cell epitopes. We included all of the changes identified in the second to fourth columns of Tables 3 and 4. By using the NCBI entry NC_007605 (B95.8-Raji) as the reference sequence and the epitope information from Taylor et al. (34), this analysis showed that 21 of these amino acid changes within five different expressed EBV proteins mapped to known epitopes (Table 6). In particular, most of the amino acid changes found to be common in both Asian and American Caucasian GC but uncommon in LCLs (Tables 3 and 4, GC vs LCL) mapped to T cell epitopes.
Association of GC sequence changes with T cell epitopes
DISCUSSION
Whether particular EBV variants are more oncogenic than others is an important question and one that requires analysis of many EBV whole-genome sequences from tumors and healthy people in different geographical locations. We have contributed to this question by generating 13 new EBV sequences from gastric carcinoma, including 10 samples from the United States, 5 of which are from Caucasians. These are the first reported non-Asian GC-derived EBV genome sequences, enabling initial studies on the effect of geography/ethnicity on GC-derived EBV sequences. We have combined these new EBV sequences with all preexisting EBV whole-genome sequences to conduct the most extensive sequence analysis to date of EBV isolates.
Whole-genome phylogenetic tree analysis showed that most GC isolates are most closely related to NPC isolates; however, there was a strong geographical or/and ethnic component in that few American Caucasian EBV isolates were part of this cluster. We also identified a variety of amino acid sequence changes common to GC and NPC isolates but uncommon in other EBV isolates. Many of these result in nonconservative amino acid changes in subsets of the EBV proteins, which might affect their functions and/or host protein interactions, thereby promoting oncogenesis. Of particular interest are the multiple nonconservative amino acid changes in LMP1 and EBNA1, both of which are expressed in NPC and GC. All of the 10 nonconservative changes in LMP1 are within the first 189 amino acids of LMP1, corresponding to the transmembrane region, which can impact the ability of LMP1 to activate the NF-κB pathway (35). EBNA1 was found to have three nonconservative sequence changes (Thr85Ala, His418Leu, and Ala439Thr, all of which are outside the DNA binding domain) that are common in both NPC and GC but uncommon in other EBV isolates. His418Leu and Ala439Thr were previously reported as common changes in NPC isolates (30), but the Thr85Ala change has not been previously identified.
We also identified EBV sequences that are usually different in GC isolates, regardless of geography or ethnicity, compared to EBV genomes that are not from tumors (LCLs). These changes were seen in only three EBV proteins (EBNA1, LMP1, and BPLF1) and, according to our transcriptome analysis, all are consistently expressed in GC. This raises the intriguing possibility that these changes may impact GC by altering the functions or host interactions of these proteins. Interestingly, the changes in EBNA1 and LMP1 largely fell within known T cell epitopes, suggesting that immune pressure could be partially responsible for the sequence changes.
The EBNA1 change is Thr85Ala, which falls in a region of EBNA1 required for transcriptional activation of other EBV latency genes (36–38) and therefore could affect this important EBNA1 function. Eight nonconservative amino acid changes were identified in the transmembrane region in the first 189 amino acids of LMP1, all of which were also found to be common in NPC (identified in the isolate analysis of GC plus NPC versus other EBV isolates). BPLF1 is a deubiquitinase that has been found to contribute to innate immune evasions by interfering with Toll-like receptor signaling (39), as well as by contributing to B cell transformation (40). The roles of the two BPLF1 amino acids that are commonly altered in GC isolates (Lys515Gly and Ser405GLy) have yet to be determined.
Another important question in understanding how EBV induces GC is determining which EBV proteins are expressed. While the EBV latency proteins that are expressed in GC have been well characterized, there have also been reports of the presence of EBV lytic transcripts and proteins in the absence of a full lytic infection (12). However, a comprehensive analysis of EBV transcripts in GC has not been reported. For eight of the GC WGS samples from which we generated EBV genome sequences, transcriptome sequencing data (RNA-Seq) were also available, enabling the determination of which EBV genes are transcribed in the context of GC. We detected consistent expression of three latency proteins, EBNA1, LMP1, and LMP2A. This was expected although the frequency of LMP2A expression was higher than the previously reported 50% detection rate (41).
In addition, we identified specific subsets of lytic genes that are consistently expressed in GC. The expression profiles do not fit with conventional EBV lytic infection or abortive lytic infection since specific subsets of early and late genes are expressed. Consistent with previous studies on EBV expression in GC (2, 12), we observed a cluster of genes (BAFL3, BALF4, BALF5, BILF1, LF1, LF2, and BNLF2a) that were highly activated. This cluster of highly activated genes has also been reported in NPC and Burkitt's lymphoma although BALF5 transcripts are not detected in Burkitt's lymphoma (6–8). However, in addition to this cluster, we now identify several other transcripts from lytic genes, from immediate early, early, and late gene classes, that are consistently expressed at levels similar to those of LMP1 and LMP2A. In a lytic infection, expression of the late genes requires viral DNA replication. However, in the GC samples analyzed here, several of the early viral proteins needed for viral DNA replication are not expressed, and yet a subset of late genes (5) are expressed. Our data suggest that there are novel mechanisms of regulating expression of specific lytic genes in the context of GC. Interestingly, the transcription of BPLF1 was previously shown to be regulated in a manner distinct from that of most late genes (42), which may enable its expression in the absence of lytic infection, as we have observed in the GC samples.
Several lytic EBV proteins that are expressed in GC have functions that could contribute to tumorigenicity. As mentioned above, BPLF1 interferes with Toll-like receptor signaling in innate immunity and can promote cell transformation (39, 40). In addition, BARF1 is known to stimulate the proliferation of GC cells (43, 44). The highly expressed BILF1 is a seven-transmembrane, constitutively active, G protein-coupled receptor with transforming activity (45, 46). BILF1 and BNLF2a have also been shown to cooperate in immune evasion by inhibiting the presentation of viral antigens (47). Similarly, LF2 has been found to antagonize type I interferon signaling, suggesting that it would be important for avoiding host immune responses (48). BALF1 is a Bcl-2 homologue that increases tumorigenicity and cell survival (49) and has also been reported to be expressed in Burkitt's lymphoma cell lines and NPC samples (50). Finally, BNRF1 induces centrosome amplification leading to chromosome instability and therefore would be expected to increase the risk of oncogenesis (51). Overall our data support a model in which expression of specific EBV lytic proteins contributes to tumorigenesis in gastric and perhaps other cancers.
MATERIALS AND METHODS
EBV identification using whole-genome sequencing data.The CaPSID (Computational Pathogen Sequence Identification) bioinformatics platform (52) (developed by our group) was used to identify EBV in whole-genome sequencing data of gastric adenocarcinoma samples, with additional filtering and alignment steps described below. For each whole-genome-sequenced sample, BAM (53) files containing reads aligned to the human reference sequence (GRCh37/hg19) were downloaded from The Cancer Genome Atlas ([TCGA] Stomach Adenocarcinoma, project code STAD) (25) and from the new PanCancer Analysis of Whole Genomes ([PCAWG] Stomach Adenocarcinoma, project codes STAD-US and GACA-CN) from the International Cancer Genome Consortium (ICGC) (26). Reads that did not map to the human reference were extracted and filtered for low complexity and quality and then aligned in single-end mode using the Bowtie2 aligner (54) to a database containing a complete set of 5,652 NCBI RefSeq viral reference sequences (including the EBV reference sequences NC_007605 and NC_009334) and a filter reference database composed of 5,242 bacterial and 1,138 fungal reference sequences that was downloaded from the NCBI (55). In order to improve the sensitivity and specificity with which viral sequences were detected, reads that did not map to any reference with Bowtie2 were realigned against the same RefSeq viral reference database, using a more sensitive SHRiMP2 aligner with the ability to perform local alignments (56). To reduce the number of potential false positives, we then applied filtering criteria using CaPSID's average gene coverage metric (average gene coverage of >90%) to identify samples in which EBV was present with the highest confidence.
EBV genome assembly.For each individual sample that was identified as harboring EBV (as described in the previous section), reads that did not map to the human reference were realigned (using Bowtie2 and SHRiMP2 as described above) to an initial reference sequence database composed of 57 complete EBV sequences downloaded from the NCBI. Following this realignment step, reads with ambiguous alignments were reassigned to the most probable EBV genome of origin using a statistical model based on the read alignment scores as described in Hong et al. (57). Based on the information about the most probable EBV genome of origin, the read depth coverage, and the overall EBV genome coverage, 13 samples in total were selected for the EBV genome reference-based assembly. Twelve of these adenocarcinoma samples harboring EBV can be downloaded from the ICGC data portal (https://dcc.icgc.org/repositories/) using their unique file identification numbers: FI13619, FI49302, FI24570, FI28165, FI35962, FI48909, FI31442, FI33260, FI19435, FI49266, FI17300, and FI51320; the additional TCGA sample, TCGA-D7-5577-01A-01D-1598-02, can be downloaded from the Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov/). Of these 13 samples, 11 had read depth of coverage ranging between 54× and 176×, 1 sample had a read depth of 14×, and 1 sample had a read depth of 3.4× but even genome coverage (3.4× ± 1.4×). For each of these 13 samples, the top-ranked EBV reference sequence to which the majority of the reads aligned was then used as the input reference sequence for the reference-based assembly. The reference-based assembly was performed using SAMtools (53) for variant calls (with the read base quality parameter threshold set to -Q15), followed by the FastaAlternateReferenceMaker (a tool available from the Genome Analysis Toolkit [GATK]) (58) to generate the newly assembled EBV reference sequence.
EBV gene expression analysis.Additional transcriptome sequencing (RNA-Seq) data [on purified poly(A)-containing RNA] available for 8 out of 13 whole-genome gastric adenocarcinoma (primary solid) sequenced samples that tested positive for EBV were downloaded from PCAWG (26) (Stomach Adenocarcinoma, project code STAD-US [25]). RNA-Seq samples used in this study can be downloaded from the ICGC data portal (https://dcc.icgc.org/repositories/) using their unique file identification numbers: FI35960, FI33258, FI17298, FI31440, FI19433, FI48907, FI28163, and FI49264. Reads that did not map to the human reference were extracted and filtered for low complexity and quality and then aligned to the NCBI EBV reference genome NC_007605 using the Bowtie2 alignment algorithms in single-end mode as previously described in Borozan et al. (59). RNA-Seq analysis and transcript read quantification were performed using the R Biocondoctor packages (60). Levels of gene expression (in units of reads per kilobase of transcript per million mapped reads [RPKM]) were calculated using the formula RPKM = (109 × C)/(N × L), where C is the number of reads mapped to a gene, N is the total number of mapped reads in the experiment, and L is the transcript length in base pairs. For each gene, transcripts in this study were defined over gene coding sequence (CDS) regions. P values were calculated using a two-sample t test with the alternative hypothesis set to “greater” as implemented in the R function t.test (61). P values were then adjusted for multiple testing in order to control for the false discovery rate (FDR) using the Benjamini-Hochberg method as implemented in the R stats package (61).
Mutation analysis.For each EBV sequence, the lists of single nucleotide variants (SNVs) and insertions and deletions (indels) was generated by performing pairwise sequence alignments to the NCBI reference EBV genome (NC_007605) using the EMBOSS Stretcher algorithm (62). Genetic variations among EBV genomes were determined by considering the complete set of variants (i.e., substitutions, insertions, and deletions) using a combination of bioinformatics tools including the VCFtools (63) and custom Python scripts. The statistical significance of the number of occurrences of each variant found in EBV sequences isolated from GC samples was evaluated by comparing it to the number of occurrences of the same variant across EBV sequences found in other cancers or healthy blood and saliva using Fisher's exact test as implemented in the R stats package (61). P values calculated using Fisher's exact test were then adjusted for multiple testing in order to control for the false discovery rate (FDR) using the Benjamini-Hochberg method as implemented in the R stats package (61). Variants considered significant were annotated using the genetic variant annotation and effect prediction toolbox (snpEff) (64) using the NCBI NC_007605 genome as the reference database. Variants that occurred in the repeat regions of the NCBI reference sequence NC_007605 were discarded from further analysis. Phylogenetic analysis and visualization were performed using FastTree-2 (65) and FigTree software (http://tree.bio.ed.ac.uk/software/figtree). The phylogenetic tree (Fig. 1) was rooted using a midpoint rooting. The annotated circular plot of the EBV genome (Fig. 3) was made by using the Circos visualization tool (66).
Accession number(s).Sequence data for the 13 GC EBV genomes were submitted to GenBank under accession numbers MG021314 (GC-variant-1), MG021305.1 (GC-variant-2), MG021315 (GC-variant-3), MG021317 (GC-variant-4), MG021308 (GC-variant-5), MG021307 (GC-variant-6), MG021312 (GC-variant-7), MG021316 (GC-variant-8), MG021310 (GC-variant-9) MG021311 (GC-variant-10), MG021309 (GC-variant-11), MG021313 (GC-variant-12), and MG021306 (GC-variant-13).
ACKNOWLEDGMENTS
V.F. and I.B. received support for their work from the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario. I.B., M.Z., and V.F. performed the work on behalf of the WGS pan-cancer consortium. L.F. is a tier 1 Canada Research Chair in Molecular Virology, and her work is supported by Canadian Institutes of Health Research project grant 153014. We also thank the Stomach Adenocarcinoma project groups (TCGA/STAD-US and ICGC/GACA-CN) that have contributed their data to the PCAWG.
FOOTNOTES
- Received 19 July 2017.
- Accepted 5 October 2017.
- Accepted manuscript posted online 1 November 2017.
- Copyright © 2018 American Society for Microbiology.