ABSTRACT
Persistent infection with oncogenic human papillomaviruses (HPVs) causes cervical cancer, accompanied by the accumulation of somatic mutations into the host genome. There are concomitant genetic changes in the HPV genome during viral infection; however, their relevance to cervical carcinogenesis is poorly understood. Here, we explored within-host genetic diversity of HPV by performing deep-sequencing analyses of viral whole-genome sequences in clinical specimens. The whole genomes of HPV types 16, 52, and 58 were amplified by type-specific PCR from total cellular DNA of cervical exfoliated cells collected from patients with cervical intraepithelial neoplasia (CIN) and invasive cervical cancer (ICC) and were deep sequenced. After constructing a reference viral genome sequence for each specimen, nucleotide positions showing changes with >0.5% frequencies compared to the reference sequence were determined for individual samples. In total, 1,052 positions of nucleotide variations were detected in HPV genomes from 151 samples (CIN1, n = 56; CIN2/3, n = 68; ICC, n = 27), with various numbers per sample. Overall, C-to-T and C-to-A substitutions were the dominant changes observed across all histological grades. While C-to-T transitions were predominantly detected in CIN1, their prevalence was decreased in CIN2/3 and fell below that of C-to-A transversions in ICC. Analysis of the trinucleotide context encompassing substituted bases revealed that TpCpN, a preferred target sequence for cellular APOBEC cytosine deaminases, was a primary site for C-to-T substitutions in the HPV genome. These results strongly imply that the APOBEC proteins are drivers of HPV genome mutation, particularly in CIN1 lesions.
IMPORTANCE HPVs exhibit surprisingly high levels of genetic diversity, including a large repertoire of minor genomic variants in each viral genotype. Here, by conducting deep-sequencing analyses, we show for the first time a comprehensive snapshot of the within-host genetic diversity of high-risk HPVs during cervical carcinogenesis. Quasispecies harboring minor nucleotide variations in viral whole-genome sequences were extensively observed across different grades of CIN and cervical cancer. Among the within-host variations, C-to-T transitions, a characteristic change mediated by cellular APOBEC cytosine deaminases, were predominantly detected throughout the whole viral genome, most strikingly in low-grade CIN lesions. The results strongly suggest that within-host variations of the HPV genome are primarily generated through the interaction with host cell DNA-editing enzymes and that such within-host variability is an evolutionary source of the genetic diversity of HPVs.
INTRODUCTION
Human papillomaviruses (HPVs) constitute a large family of small double-stranded DNA viruses, including over 200 different genotypes that can be distinguished based on greater than 10% difference within the L1 gene sequence of the viral genome (1, 2). The HPV genotypes are phylogenetically classified into five genera (Alpha-, Beta-, Gamma-, Mu-, and Nupapillomavirus) (3), among which the Alphapapillomavirus genus has been most extensively studied because of its clinical significance. At least 13 genotypes of the Alphapapillomavirus genus (HPV16, -18, -31, -33, -35, -39, -45, -51, -52, -56, -58, -59, and -68) are defined as high-risk HPVs (4); they infect anogenital mucosal tissues through sexual contact and cause cervical intraepithelial neoplasia (CIN), eventually resulting in invasive cervical cancer (ICC) in a small fraction of women persistently infected with high-risk HPVs (5).
Individual HPV genotypes harbor minor genetic variations with less than 10% differences in the whole viral genome sequence and are recognized as intratype variants, thus extending the genetic diversity of each genotype and the overall HPV family (6). These intratype variants are grouped by molecular phylogeny into distinct lineages and sublineages, which contain 1.0 to 10.0% and 0.5 to 1.0% nucleotide variations in the complete viral genome sequence, respectively (6). Below the levels of lineages/sublineages, there is another level of genetic variability with less than 0.5% differences in the viral whole-genome sequence, which includes a large number of single-nucleotide polymorphisms within the viral genome.
Indeed, a recent viral genomics study collecting more than 5,000 HPV16-positive cervical specimens has demonstrated surprisingly high levels of variability in HPV16 genome sequences across normal and CIN/ICC samples (7). In that study, nearly 80% of the HPV16 isolates represented unique viral isolates that differ by at least two nucleotides across the whole viral genome sequence, thereby revealing the extremely high genetic diversity of HPV16 among infected individuals. Another study also collected HPV16-positive cervical swabs during a longitudinal follow-up and reported a high level of genetic diversity of HPV16 whole-genome sequences between infected individuals (8). Since the HPV genome replication is completely dependent on host cell high-fidelity DNA polymerases, the high genetic variability observed for HPV16 is surprising and suggests that other mechanisms underlie the elevated mutation rate. However, determining how such between-host diversity arises is challenging. This is due to a lack of thorough understanding of how within-host HPV genetic diversity is generated and how HPV genome mutational dynamics evolve over a long period of persistent infection.
Viral genetic diversity is generally acquired through various mutagenic processes exerted on the viral genome inside the infected cell (9). Intracellular mutagenesis of the viral genome is often the result of viral genome replication by low-fidelity RNA polymerases encoded by RNA viruses; it can also be a result of virus restriction by the host, such as viral genome editing or disruption by host antiviral proteins. One prominent example of these antiviral proteins is the apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) proteins, which are a family of endogenous mutagenic enzymes that convert cytosine to uracil on single-stranded DNA/RNA, including 11 member proteins (AID and APOBEC1, -2, -3A, -3B, -3C, -3D, -3F, -3G, -3H, and -4) (10). APOBEC-mediated mutagenesis of viral genomes has been described for retroviruses (e.g., human immunodeficiency virus) (11, 12) and DNA viruses (e.g., human hepatitis B virus and herpesviruses) (13–16). APOBEC signature mutations were also observed for the HPV16 genome in clinical samples, including CIN and ICC (17–19), albeit with very low frequencies in the total viral population; the APOBEC activity thus generates a quasispecies status of viral genomes in infected cells or tissues. However, the biological and clinical relevance of such low-level APOBEC mutations is unclear in the context of HPV-associated cervical carcinogenesis.
Beyond functioning in its role in the innate immune response to viral infection, recent cancer genomics studies have established a primary role for APOBEC in generating C-to-T and C-to-G mutations in the genomes of multiple malignancies, including bladder, cervical, lung, oropharyngeal, and breast cancers (20–23). Based on analyses of the trinucleotide context surrounding the mutated bases, these studies identified specific mutation patterns attributable to APOBEC in the cancer genome and specifically highlighted APOBEC3B, which is upregulated in cancer tissues, as an endogenous source of such mutations (22, 23). Intriguingly, the involvement of APOBEC3 in HPV-associated carcinogenesis is supported by findings that oncoproteins E6/E7 of high-risk HPVs upregulate APOBEC3A and APOBEC3B expression (24–27).
Here, by employing next-generation sequencing (NGS) techniques, we explore the genetic variability of HPVs in individual clinical specimens, particularly focusing on HPV16 and two Asia-prevalent high-risk HPVs, HPV52 and HPV58. We find extremely high levels of within-host diversity of HPV16/52/58 genomes across different grades of CIN and ICC samples. Through detailed analyses of the patterns of nucleotide substitutions, we suggest a potential involvement of APOBEC in yielding such within-host genetic variations and discuss this in relation to the life cycle and evolution of HPVs.
RESULTS
Within-host variations of HPV genomic sequences.We previously developed an NGS-based analytical pipeline to detect minor nucleotide variations in the HPV genome compared to a major viral genome sequence (here called the reference sequence) for individual clinical specimens (28). In that study, low-frequency variations (≤5%) were observed in the HPV16 genome from a portion of CIN1 and ICC samples. To further comprehensively catalog HPV genomic changes in relation to cervical cancer progression, we designed a cross-sectional study collecting cervical exfoliated cells from patients with different grades of CIN (CIN1 and CIN2/3) and ICC. The study subjects (total, n = 151) consisted of 56 CIN1 (HPV16, n = 13; HPV52, n = 22; HPV58, n = 21), 68 CIN2/3 (HPV16, n = 24; HPV52, n = 23; HPV58, n = 21), and 27 ICC (HPV16, n = 8; HPV52, n = 11; HPV58, n = 8) samples.
A viral genomic landscape of nucleotide variations obtained from the 151 samples is shown in Fig. 1A. A total of 1,052 positions of nucleotide variations were found across whole-genome sequences of HPV16/52/58 compared to each reference sequence, with different numbers of variation per sample (range, 0 to 85; average, 7.0). The majority of the variations (1,014/1,052; 96.4%) were unique to individual samples and not shared among the reference sequences of other samples in the same run, ruling out the possibility of cross talk between samples due to misassignments of index sequences. The patterns of variations throughout the viral genome were classified into three types. The first was a homogeneous pattern showing no nucleotide substitutions with >0.5% frequencies of the total read count (25/151 samples; 16.6%). The second was a pattern showing one to multiple positions of variations (maximum of 38 positions) compared to each reference sequence, constituting the majority of the samples analyzed (123/151 samples; 81.5%). The last was a peculiar pattern exhibiting a large number of specific substitutions, C-to-T and G-to-A substitutions, throughout the viral genome, as observed for samples 006, 116, and 121 (3/151 samples; 2.0%), which are shown in Fig. 1B.
Viral whole-genome landscape of nucleotide variations observed in individual clinical specimens. (A) Positions of nucleotide variations with >0.5% frequencies of the total read count at the corresponding positions in each reference viral genome are indicated on black lines, which represent whole viral genomes of individual clinical samples. Red, orange, blue, and gray circles indicate >10%, 5 to 10%, 2 to 5%, and 0.5 to 2% frequencies, respectively. Asterisks indicate samples showing no nucleotide substitutions with >0.5% frequencies. Red stars indicate samples having a large number of C-to-T and G-to-A substitutions. The genome organizations of HPV16/52/58 are shown below: pE, the early promoter; pL, the late promoter; polyA (early) and polyA (late), the early and late polyadenylation signals, respectively. (B) Profiles of C-to-T and G-to-A substitutions throughout the viral whole genome in samples 006, 116, and 121. The frequencies of substitutions are on the y axes, whereas the x axes display nucleotide positions in the HPV genome. Red and green bars indicate C-to-T and G-to-A substitutions, respectively.
Characteristics of within-host HPV genetic variations.As shown in Fig. 2A, the mean frequency of variations was not significantly different among the histological grades for HPV16 and HPV52 but showed significant difference between CIN1 and CIN2/3 for HPV58 (P value of 0.02 by Kruskal-Wallis test). The mean number of variations per sample was also significantly different between CIN1 and CIN2/3 (P value of 0.0004 by Kruskal-Wallis test) and between CIN1 and ICC for HPV16 (P value of 0.01 by Kruskal-Wallis test), with a decrease in the variation number in CIN2/3 and ICC (Fig. 2B). In contrast, there was no significant difference in the mean number of variations between the histological grades of the HPV52 and HPV58 samples.
Frequency and number of nucleotide variations in the HPV genome. (A) Distribution of variation frequencies observed across the histology for HPV16/52/58. The mean frequency of nucleotide variations is indicated with a bar. (B) Boxplots of the number of nucleotide variations per sample across the histology. Statistically significant differences (P < 0.05) are indicated with asterisks.
As shown in Fig. 3A, the nucleotide variations were distributed in both coding and noncoding regions in HPV genomes. Almost similar patterns of variant distribution across the viral genome were observed between CIN1 and CIN2/3 regardless of HPV type. In contrast, the ICC samples showed slightly different patterns of distribution compared to those of the CIN samples; no variation was detected in E6/E7 and a noncoding region between E5 and L2 for HPV16, and no variation was observed in E5 for HPV58. Overall, larger numbers of variations were detected in E1, E2, L1, and L2 compared to other viral regions; however, the number of nucleotide variations in each viral region significantly correlated with the length of each nucleotide sequence (Fig. 3B). Specifically, the squared correlation coefficients were 0.964 for HPV16, 0.798 for HPV52, and 0.933 for HPV58 (Pearson's test for correlation for HPV16, P value of 0.000003; HPV52, P value of 0.001; HPV58, P value of 0.00002). This indicates that the variations are evenly distributed throughout the HPV genome and do not specifically target any subregions of the viral genome.
Nonsynonymous and synonymous variations in HPV genomic regions. (A) The number of total, nonsynonymous, and synonymous nucleotide variations observed in eight viral genes (E6, E7, E1, E2, E4, E5, L2, and L1), the long control region (LCR), and a noncoding region between E5 and L2 (NC) are shown for each HPV type. (B) The correlation between the variation number and the length of nucleotide sequences in each viral genomic region. (C) Distribution of nucleotide variations between three codon positions (first, second, and third positions) in the viral open reading frames.
The nucleotide variations in the open reading frames (ORFs) of the viral genome showed a predominance of nonsynonymous versus synonymous substitutions (Fig. 3A). The ratios of the total number of nonsynonymous substitutions to that of synonymous substitutions were 6.7 for HPV16, 4.2 for HPV52, and 5.2 for HPV58, which were higher than the 3.3 ratio that is expected to occur at random sequences by chance. These high ratios reflected biased distributions of the substituted nucleotides at the first and second positions of codons in the viral ORFs compared to the third codon position (Fig. 3C), because nucleotide substitutions at the former positions usually result in nonsynonymous substitutions.
Table 1 presents a list of nucleotide variations showing relatively high frequencies (>15%). These variations were detected throughout the viral genome regardless of histology and HPV type, and most of the variations found in the viral ORFs were nonsynonymous substitutions (17/21; 81%). Among the high-frequency variations, C-to-T and G-to-A substitutions were most prevalent (8/24; 33%), followed by C-to-G and G-to-C substitutions (7/24; 29%).
High-frequency nucleotide variations detected in individual clinical specimens
Within-host variations of HPV16 E6/E7.High levels of between-host variability of HPV16 genomic sequences from women with CIN/ICC and without cervical abnormalities have previously been reported (7). This study revealed that E7 sequence conservation is critical for cervical cancer development, which is in sharp contrast to the E6 sequence variability shared between cases and controls. Given this, we focused on within-host variations of E6 and E7 in our HPV16-positive samples. In the CIN samples, a total of 12 and 8 nucleotide variations were detected in the E6 and E7 sequences, respectively. Among these variations, 11 E6 and 6 E7 variations were nonsynonymous/nonsense substitutions, which are shown in Fig. 4: Q3*, C16*, S71C, R77K, D98Y, Q107E, Q107*, D120N, F125L, W132C, and S143L for E6 and H2Y, E33K, E34K, E35K, R77C, and R77L for E7 (an asterisk indicates a stop codon). Among these variants, S71C, R77K, and D120N in E6 and E33K in E7 were identical to those observed in the above-mentioned HPV16 genomics study. Furthermore, most of the nucleotide changes were C-to-T or G-to-A substitutions (10/17; 59%) (Fig. 4).
Nonsynonymous/nonsense variations in HPV16 E6 and E7. Amino acid variations or premature stop codons (indicated with asterisks) as within-host variations are shown in the diagrams of E6 and E7. Blue and yellow circles indicate variations observed in CIN1 and CIN2/3 samples, respectively. Stars indicate C-to-T or G-to-A substitutions. CR1, conserved region 1; CR2, conserved region 2; CR3, conserved region 3.
Mutation signatures in HPV genomes.All nucleotide substitutions detected were classified into six patterns, i.e., C to A, C to G, C to T, T to A, T to C, and T to G (all substitutions are referred to by the pyrimidine of the mutated Watson-Crick base pair), and stratified based on histological grade. Overall, C-to-T and C-to-A substitutions were dominantly observed across all histological grades (Fig. 5A). While C-to-T substitutions were predominantly detected in CIN1, their relative prevalence was decreased in CIN2/3 and further fell below that of C-to-A substitutions.
Mutational signature analysis of substitutions in the HPV genome. (A) Pie charts of the distribution of substitution types across the histology. All substitutions (total, n = 1,052; CIN1, n = 585; CIN2/3, n = 298; ICC, n = 169) are classified into six base substitutions, i.e., C-to-A, C-to-G, C-to-T, T-to-A, T-to-G, and T-to-G substitutions, and the relative proportion of each substitution type to all substitutions is shown in pie charts for each histological grade. (B) Distribution of 96 substitution types across the histology. All substitutions are classified into 96 base substitution types of the trinucleotide sequence context that includes the bases immediately 5′ and 3′ of each mutated base. Different colors are used to display the various types of substitutions. The relative frequencies of specific substitution types are on the y axes, whereas the x axes display different types of substitutions.
Different environmental exposures and endogenous mutagenic processes leave characteristic somatic mutation signatures, which include preference for particular bases adjacent to the location of the mutation, in the cancer genome (29, 30). To uncover the underlying mechanisms for nucleotide substitutions in the HPV genome, the six patterns of substitution were further classified into 96 base substitution types within the trinucleotide context of the bases immediately 5′ and 3′ to each substituted base and stratified by histology. As shown in Fig. 5B, TpCpN (N is any nucleotide), a preferred target sequence for the APOBEC proteins (23), was a primary site for the C-to-T substitutions in HPV genomes in CIN1 and CIN2/3. Although much less prevalent, C-to-G substitutions were similarly observed in the TpCpN context, which suggests that the C-to-G changes are also the result of APOBEC editing; uracils generated by APOBEC deamination can be excised by uracil DNA N-glycosylase, followed by incorporation of cytosines opposite abasic sites by translesion DNA polymerases, thereby resulting in C-to-G substitutions (31). In contrast, no particular preference for trinucleotide sequences was observed for C-to-A substitutions, consistent with previous reports (20, 32). When the substitution classes were stratified by HPV type, no apparent difference was found between HPV16, HPV52, and HPV58 (data not shown), suggesting that mechanisms for generating within-host variations are common among high-risk HPVs.
The strand specificity of nucleotide substitutions was further examined for C-to-T substitutions. As with the overall pattern of nucleotide variations, C-to-T substitutions were distributed throughout the HPV genome and almost evenly detected on both sense and antisense strands of the viral genome (Fig. 6A). Lastly, to assess the strand availability for APOBEC action, the density of TpC dinucleotides was explored across the HPV genome on each strand of the viral genome. As shown in Fig. 6B, the TpC density throughout the HPV16/52/58 genomes showed an inverse correlation between sense and antisense strands of the viral genomes. Intriguingly, these patterns mirrored the distribution of TpC dinucleotides across the genome of BK polyomavirus (BKPyV) (33).
Strand specificity of C-to-T substitutions in the HPV genome. (A) Profiles of all C-to-T substitutions detected in clinical samples on each strand of HPV16/52/58 genomes. The frequencies of C-to-T substitutions are on the y axes, whereas the x axes display nucleotide positions in the HPV genome. Sense, the nontranscribed strand; antisense, the transcribed strand. N, the total number of C-to-T substitutions on the sense or antisense strand. (B) Line graphs of TpC dinucleotide density across the HPV genomes, including all lineages/sublineages for HPV16/52/58, on sense (blue line) or antisense strand (red line) of the viral genome. The gray area represents 95% confidence intervals. The genome organizations of HPV16/52/58 are shown below.
DISCUSSION
This is the first study to comprehensively examine the within-host diversity of HPV genomic sequences in relation to cervical carcinogenesis by employing deep-sequencing techniques. Our bioinformatics pipeline to reconstruct major viral whole-genome reference sequences from short-read sequences yielded only one reference sequence for each clinical specimen and enabled the detection of minor nucleotide variations throughout the whole viral genome. Surprisingly, while the genome of DNA viruses is generally considered to be stable or less mutagenic in infected individuals, the HPV genomes within individual clinical samples showed an astonishingly high level of variations across histological grades and HPV types. We cannot completely rule out the possibility that individuals were coinfected with different HPV variants, thus leading to a quasispecies status. However, based on our result that particular nucleotide changes, such as C-to-T and C-to-A substitutions, were predominantly observed for the majority of samples analyzed, it is reasonable to conclude that high levels of HPV genomic variation are the result of accumulating de novo somatic mutations in the viral genome during persistent infection.
Based on the trinucleotide context of analysis of substituted nucleotides, we conclude that APOBEC-mediated mutagenesis is the most likely cause for the C-to-T substitutions observed in the HPV genome. Single-stranded DNA exposed during cellular DNA replication is reportedly the primary target for APOBEC activity (31, 34–36). Considering these findings, we speculate that HPV DNA replication also creates a window of opportunity for APOBEC-mediated mutagenesis of the viral genome. Interestingly, C-to-T substitutions were predominantly detected in CIN1 lesions, where HPV productive replication actively takes place, generating thousands of progeny viral genomes. This high-level viral replication likely renders the HPV genome more susceptible to attack by APOBEC. In contrast, the decrease in relative prevalence of C-to-T substitutions in CIN2/3 and ICC may reflect low levels of viral genome replication in these lesions or viral integration into the host genome. Although we were not able to address the physical state of the HPV genome in our clinical samples, viral integration would lead to abortive infection with no viral replication, thus making the viral genome less vulnerable to APOBEC-mediated mutagenesis.
Other frequently observed changes, the C-to-A substitutions, are prevalent in genomes of lung, head and neck, and liver cancers (20, 32), tumors that are often associated with tobacco smoking. Tobacco carcinogens such as benzo[a]pyrene form bulky adducts on guanine bases, and these adducts are generally removed by nucleotide excision repair. However, if not repaired correctly, adenine bases are often incorporated on the opposite strand by translesion DNA polymerases, such as DNA polymerases η and ζ, thereby leading to C-to-A substitutions (37, 38). Of particular note, tobacco smoking is also a risk factor for the development of cervical cancer in HPV-infected women (39). Although C-to-A substitutions are not enriched in cervical cancer genomes (32), DNA adducts formed with tobacco carcinogens are detected in the cervical epithelium of smokers (40), and these adducts might be responsible for C-to-A changes observed in the HPV genome.
The impact of within-host variations in the HPV genome on cervical cancer development is currently unclear. However, some de novo somatic variations may be selected by conferring a growth advantage to the infected cell, because several high-frequency nucleotide variations were found in the CIN2/3 and ICC samples (Fig. 2A and Table 1). A small number of variations per sample observed for HPV16 in CIN2/3 and ICC compared to CIN1 may also suggest selection of some variant genomes. Regarding the major viral reference sequences obtained from individual clinical specimens, however, there was no particular nucleotide variation enriched in the CIN2/3 and ICC samples (data not shown). This argues against the scenario of purifying selection of de novo-generated variants leading to cervical cancer development. Large-scale longitudinal studies on HPV genomic sequences monitoring the same patients will be required to conclusively address this issue.
C-to-G substitutions are another characteristic change supposed to be generated through APOBEC editing on cancer genomes (20, 31), and they were also observed for HPV genomes in this study, albeit with a much lower prevalence than C-to-T substitutions (Fig. 5). However, out of 42 C-to-G substitutions detected in the HPV genomes, seven substitutions were high-frequency variations (Table 1), of which five were located within the TpCpN context, suggesting an involvement of APOBEC in generating these high-frequency mutations. Interestingly, the prototype HPV16 genome isolated from a cervical cancer specimen harbors a C-to-G substitution in the TpCpA sequence at nucleotide position 6240, causing an L1 D202H mutation that severely impairs virion assembly (41), which may offer one example of APOBEC action for counteracting HPV infection.
Intriguingly, the nucleotide variations observed in the HPV genome are strongly biased toward nonsynonymous substitutions, potentially affecting normal functions of the viral proteins. This phenomenon might be explained by underrepresentation of TpC dinucleotides, the preferred target site of the APOBEC3 proteins, in mucosal Alphapapillomavirus genomes (42). In particular, TpC depletion is most prominent when the cytidine is at the third codon position in the ORFs of the viral genome (42). It is thought that this TpC reduction is derived from evolutionary history, in which mucosal HPV strains tolerated APOBEC editing, because third codon position changes are usually silent (synonymous) mutations and do not alter protein function. As a consequence of cytidine depletion at the third codon position, C-to-T and C-to-A substitutions are likely forced to occur at the first/second codon positions in the HPV genome (Fig. 3C), leading to the predominance of nonsynonymous versus synonymous substitutions observed across the viral genome.
Recently, the BKPyV genome has been reported to be depleted of TpC dinucleotides, and this depletion is more prominent on the nontranscribed strand of the viral genome or the lagging strand for viral DNA replication (33). These findings suggest that APOBEC editing during virus transcription and/or replication imposed an evolutionary pressure upon BKPyV genomic sequences to reduce the TpC content in the viral genome. Interestingly, similar patterns of TpC depletion were observed for HPV genomic sequences (Fig. 6B), which implies the same mode of APOBEC interaction with the HPV genome as that with BKPyV. Since the lagging strand during cellular DNA replication is considered to be a preferred target for APOBEC-mediated deamination (34, 36), one may expect that in within-host variations of HPV genomes, the lagging strand of the viral genome (corresponding roughly to the sense strand in the viral early region and the antisense strand in the viral late region) is also enriched for C-to-T substitutions compared to its complementary strand. However, such biased distribution apparently was not observed for the clinical samples (Fig. 6A). We speculate that the preference of APOBEC action for the lagging strand is counterbalanced by the low availability of the APOBEC target site (i.e., TpC dinucleotides) on this strand (Fig. 6B), thereby resulting in the unbiased distribution of C-to-T substitutions between two strands of the HPV genome.
Finally, although C-to-T substitutions, likely mediated by APOBEC, mostly result in nonsynonymous mutations in the HPV genome, their presence seems to be insufficient to eliminate virus infection, because the number and frequency of nucleotide substitutions in the CIN1 samples was not a prognostic indicator of whether lesions would regress or progress (data not shown). Regarding the genetic diversity of HPVs, however, it is interesting that several within-host variations of E6 and E7 found in this study were also identified in a recent study that reported high levels of between-host HPV16 genetic diversity (7). In the latter study, most HPV16 variants detected in control women had APOBEC signature mutations compared to the prototype HPV16 (7). We thus propose that the within-host variability generated by APOBEC is a source of such between-host HPV diversity and also has significantly contributed to HPV evolution by generating a huge variety of viral genomic sequences. Although further investigation is required to uncover the evolutionary principles that have shaped and diversified the HPV genome, our results offer a novel perspective on HPV evolution by demonstrating the diversity and richness of the viral genomic sequences in infected individuals.
MATERIALS AND METHODS
Clinical samples.Cervical exfoliated cells were collected in ThinPrep (Hologic, Bedford, MA) using a cytobrush from patients histologically diagnosed with CIN1, CIN2/3, or ICC at three hospitals (Keio University Hospital, Tsukuba University Hospital, and Showa University Hospital). Total cellular DNA was extracted from the cells on a MagNA Pure LC 2.0 (Roche Diagnostic, Indianapolis, IN) and subjected to PGMY09/11 PCR to amplify HPV L1 DNA, followed by reverse blot hybridization for HPV genotyping, as described previously (43). Based on the genotyping results, DNA samples positive for HPV16 (total, n = 45; CIN1, n = 13; CIN2/3, n = 24; ICC, n = 8), HPV52 (total, n = 56; CIN1, n = 22; CIN2/3, n = 23; ICC, n = 11), and HPV58 (total, n = 50; CIN1, n = 21; CIN2/3, n = 21; ICC, n = 8) were subjected to subsequent deep-sequencing analyses. The study protocol was approved by the Ethics Committees at each hospital and the National Institute of Infectious Diseases, and written informed consent for study participation was obtained from each patient.
Deep sequencing of the whole viral genome.Full-circle PCR or overlapping PCR was performed with PrimeSTAR GXL DNA polymerase (TaKaRa, Ohtsu, Japan) to amplify the whole-genome sequences of HPV16/52/58, as described previously (28, 44). The sequences of PCR primers were the following: full-circle PCR for HPV16, HPV16-1742F (5′-TGC TGT CTA AAC TAT TAT GTG TGT CTC-3′) and HPV16-1873R (5′-GCG TGT CTC CAT ACA CTT CA-3′); overlapping PCR for HPV16, HPV16-1744F (5′-TGT CTA AAC TAT TAT GTG TGT CTC CAA TG-3′) and HPV16-5692R (5′-GAT ACT GGG ACA GGA GGC AAG TAG ACA GT-3′); HPV16-5531F (5′-GGG TCT CCA CAA TAT ACA ATT ATT GCT G-3′) and HPV16-1980R (5′-TAT CGT CTA CTA TGT CAT TAT CGT AGG CCC-3′); full-circle PCR for HPV52, HPV52-1758F (5′-ACA CAT ATG GTA ATA GAA CCA CCA AAA-3′) and HPV52-1908R (5′-TAT TGT CAA AGC TAT GCT GTA ATA CTG-3′); overlapping PCR for HPV52, HPV52-1758F and HPV52-5968R (5′-TCC AAG CCT GTA CAG GCC CAC ACC AAC-3′); HPV52-5673F (5′-GTG TAC CTG CCT CCT GTA CCT GTC TCT-3′) and HPV52-1908R; full-circle PCR for HPV58, HPV58-1751F (5′-TAC TAT CAA TTC CTG AAA CAT GTA TGA-3′) and HPV58-1889R (5′-AAT CTA TCT ATC CAT TCT GGT GTT G-3′); overlapping PCR for HPV58, HPV58-1751F, HPV58-5846R (5′-GCC TGA TAC CTT GGG AAC TAA TAC TTT-3′), HPV58-5677F (5′-ACC TGC CTC CTG TGC CTG TGT CTA AGG-3′), and HPV58-1889R. The amplified DNA was subjected to agarose gel electrophoresis and purified with the Wizard gel purification kit (Promega, Madison, WI). The purified DNA was converted to a DNA library using the Nextera XT DNA sample preparation kit (Illumina, San Diego, CA), followed by size selection with SPRIselect (Beckman Coulter, Brea, CA). The multiplexed libraries, including up to 36 samples, were analyzed on a MiSeq sequencer (Illumina) with the MiSeq reagent kit v3 (150 cycles) to yield single-end 150-mer read sequences.
De novo assembly of the viral reference genome.The complete genome sequences of HPV16/52/58 were de novo assembled from the total read sequences using the VirusTAP pipeline (https://gph.niid.go.jp/cgi-bin/virustap/index.cgi) (45). The accuracy of the reconstructed whole-genome sequences was verified by read mapping with Burrows-Wheeler Aligner (BWA) v0.7.12 and subsequent visual inspection by Integrative Genomics Viewer (IGV) v2.3.90.
Detection of nucleotide variations in the viral genome.Nucleotide mismatches compared to the assembled reference genome and positions of variations in each sample were identified by using BWA and SAMtools v1.3.1 with in-house Perl scripts (28). Based on a quality score confidence threshold of a Phred quality score of >30 (error probability of <0.001) that was used to extract variation positions in the read sequences, we defined a position as homogeneous if the variation frequency was <0.5% and a position as heterogeneous if the variation frequency was >0.5%. This threshold value was above the reported rates of cross talk among multiplexed samples in the same run (0.06 to 0.21%) due to misassignments (46), ensuring that the detected variations were not spillover from other samples. The presence of nucleotide substitutions was finally confirmed by manual inspection of mismatched read sequences using IGV.
Mutational signature analysis.All nucleotide substitutions detected were first classified into the six base substitutions, i.e., C-to-A, C-to-G, C-to-T, T-to-A, T-to-G, and T-to-G substitutions, and then into 96 base substitution types of the trinucleotide sequence context that includes the bases immediately 5′ and 3′ of the mutated base. The relative proportion of each substitution type to all substitutions was calculated in each histological grade.
Dinucleotide content analysis.The density of TpC dinucleotides was calculated across HPV16/52/58 genomes, including all variant lineages/sublineages (GenBank accession numbers: HPV16, K02718, AF536179, HQ644236, AF534061, AF536180, HQ644298, AF472509, HQ644257, AY686579, and AF402678; HPV52, X74481, HQ537739, HQ537740, HQ537743, HQ537744, HQ537746, and HQ537748; HPV58, D90400, HQ537752, HQ537758, HQ537762, HQ537764, HQ537774, HQ537768, and HQ537770) (6) using 100-base nonoverlapping windows. Smoothed fitted lines and 95% confidence intervals of these densities were visualized using the ggplot2 package in R, v3.4.0.
Statistical analysis.Statistical analyses were carried out with GraphPad Prism 6 (GraphPad Software, Inc.) and R, v3.4.0. Kruskal-Wallis test was used to examine differences of the mean frequency and number of variations between different groups. Pearson's test for correlation was used to examine correlation between the variation number and the sequence length of viral genomic regions. Two-sided P values were calculated and considered to be significant at a P value of <0.05.
Accession number(s).Short-read sequencing data are available from the DNA Data Bank of Japan (DDBJ), Sequence Read Archive, under accession number DRA006584. The viral reference genome sequences determined from individual clinical samples are available from the DDBJ under the following accession numbers: HPV16, LC368952 to LC368996; HPV52, LC270024 to LC270075 and LC373204 to LC373207; HPV58, LC270076 to LC270088, LC270090 to LC270123, and LC373208 to LC373210.
ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI grant numbers 15K10701, 26460564, and 17K11297 and by Grants-in-Aid for Reemerging Infectious Diseases from the Japan Agency for Medical Research and Development (JP17fk0108218j0202).
We thank Tsuyoshi Sekizuka for his help in bioinformatics analyses and Masamichi Muramatsu for his critical comments on our manuscript.
We have no conflicts of interest to declare.
FOOTNOTES
- Received 4 January 2018.
- Accepted 8 March 2018.
- Accepted manuscript posted online 28 March 2018.
- Copyright © 2018 American Society for Microbiology.