Genome-Wide Analysis of 18 Epstein-Barr Viruses Isolated from Primary Nasopharyngeal Carcinoma Biopsy Specimens

ABSTRACT Epstein-Barr virus (EBV) is a ubiquitous gammaherpesvirus that is highly prevalent in almost all human populations and is associated with many human cancers, such as nasopharyngeal carcinoma (NPC), Hodgkin's disease, and gastric carcinoma. However, in these EBV-associated cancers, only NPC exhibits remarkable ethnic and geographic distribution. We hypothesized that EBV genomic variations might contribute to the pathogenesis of different human cancers in different geographic areas. In this study, we collected 18 NPC biopsy specimens from the Hunan Province in southern China and de novo assembled 18 NPC biopsy specimen-derived EBV (NPC-EBV) genomes, designated HN1 to HN18. This was achieved through target enrichment of EBV DNA by hybridization, followed by next-generation sequencing, to reveal sequence diversity. These EBV genomes harbored 20,570 variations totally, including 20,328 substitutions, 88 insertions, and 154 deletions, compared to the EBV reference genome. Phylogenetic analysis revealed that all NPC-EBV genomes were distinct from other EBV genomes. Furthermore, HN1 to HN18 had some nonsynonymous variations in EBV genes including genes encoding latent, early lytic, and tegument proteins, such as substitutions within transmembrane domains 1 and 3 of LMP1, FoP_duplication, and zf-AD domains of ENBA1, in addition to aberrations in noncoding regions, especially in BamHI A rightward transcript microRNAs. These variations might have potential biological significance. In conclusion, we reported a genome-wide view of sequence variation in EBV isolated from primary NPC biopsy specimens obtained from the Hunan Province. This might contribute to further understanding of how genomic variations contribute to carcinogenesis, which would impact the treatment of EBV-associated cancer. IMPORTANCE Nasopharyngeal carcinoma (NPC) is highly associated with Epstein-Barr virus (EBV) infection and exhibits remarkable ethnic and geographic distribution. Hunan Province in southern China has a high incidence rate of NPCs. Here, we report 18 novel EBV genome sequences from viruses isolated from primary NPC biopsy specimens in this region, revealing whole-genome sequence diversity.

E pstein-Barr virus (EBV) is a ubiquitous gammaherpesvirus that is highly prevalent in all human populations (1). It is associated with both nonmalignant diseases such as infectious mononucleosis (2) and a number of human cancers such as undifferentiated nasopharyngeal carcinoma (NPC) (3), Hodgkin's disease (4), and gastric carcinoma (5). However, regarding EBV-associated cancers, only NPC exhibits remarkable ethnic and geographic distribution, with a particularly high prevalence in southern China and Southeast Asia (3). Hunan Province represents a region with a high NPC prevalence in southern China (6)(7)(8). We hypothesized that EBV genomic variations might contribute to the pathogenesis of different human cancers in different geographical locations. However, genome-wide variation analysis of EBV subtypes in Hunan Province has not yet been performed.
In this study, EBV genomes from 18 NPC biopsy specimens collected from Hunan Province, southern China, were de novo assembled through target hybridization enrichment, followed by NGS, which is analogous to human exome sequencing. The sequences of 18 NPC biopsy-specimen-derived EBV (NPC-EBV) genomes were designated HN1 to HN18. Heterogeneity, variations in protein-coding and noncoding regions, and phylogenetic and protein sequence variation analyses were subsequently performed to assess the genomic diversity of NPC-EBV genomes. These data are expected to help clarify the complex role that NPC-EBV genomes play in the pathogenesis of NPC.

RESULTS
Target enrichment, sequencing, and de novo assembly of EBV genomes. In this study, 18 EBV-positive NPC biopsy specimens were collected from Hunan Province in southern China and subjected to capture sequencing. Initially, an average total of 10,000,000 reads (1 gigabase of raw data) were collected from each sample. Subsequently, all sequencing reads from 18 NPC-derived EBV genomes were mapped to the EBV reference sequence (accession number NC007605). The reads that matched the EBV reference sequence were used for EBV genome assembly using Burrows-Wheeler Aligner (BWA) programs (29). All 18 NPC-EBV genomes, designated HN1 to HN18, were successfully sequenced and assembled. The coverage of HN1 to HN18, mapped to the reference EBV sequence, ranged from 98.93 to 99.96%. The average sequencing depth on target for these 18 NPC-EBV genomes was 4,047-fold, ranging from 248-to 11,751fold, with assembled EBV genome sizes ranging from 168,444 bp (HN4) to 171,822 bp (HN15). Details of the sequencing data are listed in Table 1.
Variation analysis of NPC-EBV genomes. A total of 20,570 variations were observed in these 18 EBV genomes compared to the reference EBV sequence, including 20,238 substitutions, 88 insertions, and 154 deletions. Among these variations, 15,440 substitutions, 3 insertions, and 8 deletions were located in the coding regions of the EBV genomes, with a median variation rate of 0.502% and a range of 0.458% (HN10) to 0.583% (HN2) ( Table 2). Figure 1 illustrates the variations in the NPC-EBV genomes (HN1 to HN9 [ Fig. 1A] and HN10 to HN18 [ Fig. 1B]) relative to the reference EBV genome. Compared to the reference EBV genome, the highest incidence of variations was found at kb 95 to 105 of the EBV genome, corresponding to the location of the genes ENBA1 (encoding EBNA1 and related to replication of the EBV genome) and BKRF1 (which is associated with host infection). The second highest incidence of variations was found at kb 165 to 172, corresponding to the location of the LMP1 gene (encoding LMP1, which is known as an oncogenic virus protein) (Fig. 1C).
The EBV genome encodes 86 proteins (30). To investigate genetic variations and their potential roles in the carcinogenesis of NPC, we classified affected proteins and found that nonsynonymous variations in latent membrane protein-encoding genes were of the highest proportion (21.4%), based on all NPC-EBV genome variations, with an average of 120 nonsynonymous substitutions per sample. Variations in tegument and membrane protein-encoding genes, respectively, accounted for 19.6 and 12.0% of mutations ( Fig. 2 and Table 3), supporting the important role of latent EBV infection in NPC oncogenesis.  Phylogenetic analysis of NPC-EBV genomes. Phylogenetic trees representing whole NPC-EBV genomes were constructed from 18 sequences assembled in this study and other published EBV genomes (18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29) (Fig. 3). All NPC-EBV genomes from Hunan Province (HN1 to HN18) clustered in one branch, which was separate from EBV subtypes from other areas, indicating that EBV subtypes show specific geographical distribution.
Previous studies have demonstrated associations between host genetic variations  and certain EBV variations and NPC risk (59). Linkage analysis is a major strategy to identify genes that contribute to NPC risk. However, four linkage studies have been performed, wherein susceptibility loci, including 6p21 (59), 4p15.1-q12 (60), 3p21.31-21.2 (6, 7), and 5p13 (61), were implicated as NPC risk factors in families of southern Chinese origin such as those from Guangdong, Hong Kong, and Hunan. These results are not concordant, since no studies provided supporting linkage evidence for each other. This might be partially explained by the heterogeneity of the population or NPC subtype. EBV is highly prevalent in all human populations and is associated with both nonmalignant disease and some human cancers (32); however, only NPC has remarkable ethnic and geographic distribution, indicating that the EBV genome might also exhibit heterogeneity with different subtypes resulting in different diseases (62,63). We hypothesized that the subtype of EBV that is prevalent in southern China might have unique genetic variations that play an important role in the development of NPC as opposed to other diseases. Actually, NPC-EBV strains have been characterized by genotyping polymorphic markers in the EBER1, EBER2, LMP1, BHRF1, BZLF1, and EBNA1 gene loci; however, genetic variations in the small subsets of genes investigated were not sufficient to assess the geographical distribution of EBV variations and their precise association with disease. NGS technology provides an important and highly efficient method to delineate the genomic sequences of EBV genomes, and as such, several EBV strains derived from Guangdong (GD1 and GD2) and Hong Kong (HKNPC1 to HKNPC9) have been sequenced and assembled (21)(22)(23)(24). We previously reported that the susceptibility locus on chromosome 3p21 in NPC families from Hunan Province (6) is distinct from other susceptibility loci in NPC families from Guangdong (60) and Hong Kong (61); therefore, determining EBV genomic sequences from NPC patients in Hunan Province is worthwhile.
In this study, we sequenced and assembled 18 EBV genomes from NPC patients in Human Province through target hybridization enrichment, followed by an NGS strategy, a technique that is associated with better specificity, lower cost, and a higher depth of coverage (the average sequencing depth was 4,047-fold). This suggests that our results are more accurate and credible. A total of 20,570 variations were observed in these 18 NPC-EBV genomes compared to the EBV reference genome, and the average number of variations in protein coding regions was 261 per EBV genome. Among them, latent genes such as LMP1, EBNA1, and BKRF2 represented nearly one-third of all variations. The EBV reference genome was constructed using B95-8 (infecting marmoset B cells) as the backbone, with an 11-kb deletion segment provided by the Raji (Burkitt's lymphoma cell line) sequence; this suggests that latent EBV infection plays a specific role in NPC tumorigenesis and that this role might be different from that in other human cancers, especially B-cell-lineage malignancies. Actually, there are four different latent infection statuses (latencies I to IV) of EBV in different EBV-associated human cancers, and EBV expresses different latency genes; for example, Burkitt's lymphoma is latency I, and nasopharyngeal carcinoma is latency II.
In this study, we also performed phylogenetic analysis according to whole EBV genomic sequences and based on certain important EBV genes such as LMP1, EBNA1, and BKRF2. Results showed that EBV subtypes in Hunan areas could be distinguished from those of other areas. In addition, EBV subtypes in Southeast Asia and North Africa were shown to cluster as one branch that is separate from subtypes from other areas; this was based on LMP1, EBNA1, and BKRF2 gene sequences, which indicates that LMP1, EBNA1, and BKRF2 can act as potential classification biomarkers for EBV subtypes. Based on BKRF3, BKRF4, and BZLF1, the EBV subtypes in Hunan areas also clustered as one branch, whereas HKNPC6 and HKNPC7 from Hong Kong and Mutu, as well as K4123-Mi from North Africa, clustered as another branch, indicating that subtypes can be differentiated based only on a single gene. Clinical information showed that HKNPC6 and HKNPC7 were stage IV and that the differentiated degree of these two samples was higher than that of other samples, indicating that clinical stages might affect HKNPC6 and HKNPC7 clustering. By analyzing the LMP1-encoding amino acid sequence, insertions/deletions within the transmembrane domain and Pfam domain of LMP1 correlated with geographical distribution of the samples, which was confirmed by phylogenetic analysis, indicating that LMP1 could be a potential biomarker for the classification of various subtypes of EBV.
EBV encodes 44 microRNAs (miRNAs), including the BART cluster and BHRF cluster (64). The largest set is the BART miRNAs, which consists of 40 miRNAs that are expressed in all forms during EBV infection, promoting viral latency or cancer development by targeting both viral and cellular genes (65)(66)(67). Many studies have reported that polymorphisms in human miRNAs or genetic variations in miRNA-binding sites in the 3= untranslated region (UTR) of human genes influence cancer susceptibility or promote cancer progression, affecting prognosis (68). In this study, several variations in EBV miRNAs were found; especially, polymorphisms in the mature miRNA EBV-miR-BART19-5p might influence minimum free energy or potency and stability during binding to the 3= UTR of target genes. This could alter host target gene expression and affect cancer progression. Other variations were identified in pre-miRNA sequences, albeit not in EBV-miRNA mature sequences, but rather located in stem-loop sequences. These variations might lead to structural alterations in pre-miRNAs and could result in the aberrant expression of EBV-miRNAs. The detailed biological function of these EBV mRNA variants is worthy of further study.
In summary, we performed virus sequence capture sequencing to sequence 18 EBV-associated NPC biopsy specimens from the Hunan area. These EBV genomic sequences provide an important reference to further identify variations in EBV. These data will provide a more appropriate standard to classify EBV subtypes and provide a large amount of resources to further define the epidemiological and pathogenic roles of EBV in associated diseases.

MATERIALS AND METHODS
Sample DNA preparation. A total of 18 primary undifferentiated NPC biopsy specimens were collected from the Hunan Cancer Hospital in Changsha, Hunan Province, China. Ethics approval was obtained from the Institutional Review Board. Detailed clinical information about the patients is summarized in the Table 4. DNA extraction was performed using a Qiagen blood and tissue kit according to the manufacturer's protocol (Qiagen, Hilden, Germany) (69)(70)(71). DNA concentrations were determined using a Nano-Drop (Thermo Fisher Scientific, Wilmington, DE).
Workflow for capture sequencing of EBV genomes. In brief, the workflow for capture sequencing of EBV genomes included EBV capturing probe preparation, NPC biopsy DNA library preparation, target capture and index tagging, sequencing, de novo assembly, and bioinformatics analysis. A flow chart for the complete workflow is shown in Fig. 7. These steps are described in further detail as follows.
EBV Genomes from NPC Biopsy Specimens Journal of Virology segments using a Covaris E-210 ultrasonicator (Covaris, Inc., Woburn, MA). Subsequently, the single EBV capture probe was denatured at 94°C. Purified genomic DNA (containing human and EBV genomes) from NPC biopsy specimens was fragmented randomly to produce a 170-to 200-bp DNA library using an ultrasonicator Covaris E-210 (70,72). DNA fragments were end repaired using a DNA terminator end repair kit (Lucigen, Middleton, WI) and purified using a QIAquick PCR purification kit (Qiagen) according to the manufacturer's protocol. Subsequently, 3=-dA tailing was performed using a Klenow fragment (Enzymatics, Beverly, MA). After 10 cycles of PCR amplification, the dA-tailed DNA fragments were purified using 2% polyacrylamide gels (Life Technologies, Carlsbad, CA), and the DNA library was analyzed using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA).
EBV capturing probes and DNA libraries, purified from NPC biopsy specimens, were hybridized at 47°C for 24 h, after which nonhybridized DNA fragments were cleared. The purified fragments were amplified for 16 cycles using AccuPrime Pfx DNA polymerase (Life Technologies), and the sequence library was used for next-generation sequencing. Cluster generation and 150-bp paired-end sequencing reactions were performed using an Illumina HiSeq 2000 sequencer (Illumina, Inc., San Diego, CA) using 101 cycles and in accordance with the manufacturer's instructions.
De novo assembly of EBV genomes. All sequencing reads were de novo assembled using SOAPdenovo software (73), which merged overlapping reads based on the de Bruijn graph algorithm to generate the contigs. The paired-end data were then used to link contigs into scaffolds, and the assembled scaffolds were aligned to the EBV reference genome to fill the gaps, which were mostly located in the repeat sequence regions.
Detection of SNVs. All sequencing reads were aligned to the EBV reference genome using BWA software. Single-nucleotide variations (SNVs) were detect by SAMtools software (74) and applying the following criteria to filter the raw variation results: (i) the minimal depth cover of the variation site was noless than t 4; (ii) the minimal variation quality score was not less than 20; (iii) the distance between two nearby variations was not less than 5; and (iv) the reads supporting the mutated allele were not fewer than 4.
Accession number(s). Sequence data for the 18 NPC-EBV genomes, HN1 to HN18, were submitted to the DDJB and GenBank database under accession numbers AB850643 to AB850660 (Table 1).