Genotype-Specific Genomic Markers Associated with Primary Hepatomas, Based on Complete Genomic Sequencing of Hepatitis B Virus

ABSTRACT We aimed to identify genomic markers in hepatitis B virus (HBV) that are associated with hepatocellular carcinoma (HCC) development by comparing the complete genomic sequences of HBVs among patients with HCC and those without. One hundred patients with HBV-related HCC and 100 age-matched HBV-infected non-HCC patients (controls) were studied. HBV DNA from serum was directly sequenced to study the whole viral genome. Data mining and rule learning were employed to develop diagnostic algorithms. An independent cohort of 132 cases (43 HCC and 89 non-HCC) was used to validate the accuracy of these algorithms. Among the 100 cases of HCC, 37 had genotype B (all subgenotype Ba) and 63 had genotype C (16 subgenotype Ce and 47 subgenotype Cs) HBV infection. In the control group, 51 had genotype B and 49 had genotype C (10 subgenotype Ce and 39 subgenotype Cs) HBV infection. Genomic algorithms associated with HCC were derived based on genotype/subgenotype-specific mutations. In genotype B HBV, mutations C1165T, A1762T and G1764A, T2712C/A/G, and A/T2525C were associated with HCC. HCC-related mutations T31C, T53C, and A1499G were associated with HBV subgenotype Ce, and mutations G1613A, G1899A, T2170C/G, and T2441C were associated with HBV subgenotype Cs. Amino acid changes caused by these mutations were found in the X, envelope, and precore/core regions in association with HBV genotype B, Ce, and Cs, respectively. In conclusion, infections with different genotypes of HBV (B, Ce, and Cs) carry different genomic markers for HCC at different parts of the HBV genome. Different HBV genotypes may have different virologic mechanisms of hepatocarcinogenesis.

Chronic infection by the hepatitis B virus (HBV) causes an increased risk of hepatocellular carcinoma (HCC) of more than 100-fold (2). The relationship between HBV genotype and viral mutation with hepatocarcinogenesis is controversial. A case-control study from Taiwan suggested that genotype C HBV is more closely associated with cirrhosis and HCC in those who are older than 50 years, whereas genotype B is more common in patients with HCC who are less than 50 years old (18). Our previous cohort study of 426 cases of chronic hepatitis B patients also revealed a higher risk of HCC and liver cirrhosis in genotype C infection (5). On the other hand, reports from Japan and China did not confirm the higher malignant potential of genotype C HBV (27,33).
Recent studies have reported that the prevalence of basal core promoter mutants (A1762T and G1764A) is associated with more aggressive progression of liver disease and development of HCC (19,30,33). Several HBV genes, including truncated pre-S2/S and X genes, have been found in hepatoma tissue (15,16,23). Other hot spot mutations in the core promoter region are G1896A and G1899A (12,30). HBV DNA integration into the host genome may allow persistence of the viral genome in the host and alteration of cell kinetics and cellular metabolism (3,4,29). Whether certain mutations of the HBV genes facilitate the integration of the viral genome and virus-host interaction is not known.
Two major reasons for discrepant results from various studies are (i) the small numbers of patients involved in these studies and (ii) the fact that most studies focus on a particular portion of the HBV genome (22). The aim of the present study was to identify markers in the HBV genome for HCC development by studying the complete genomic sequence of HBV among patients with HCC compared to age-matched individuals presenting with the infection but no HCC development.
(Part of this work has been presented at Digestive Diseases Week, 14 to 19 May 2005, Chicago, IL.)

Patients.
We conducted a case-control study of 100 patients with HBV-related HCC and 100 age-matched HBV-infected patients as controls. All patients who presented with HBV-related HCC to the Joint Hepatoma Clinic, Prince of Wales Hospital, Hong Kong, from July 1999 to December 2000 were studied. HCC was diagnosed by histology or a combination of ultrasonography, computerized tomography or magnetic resonance imaging, and/or hepatic angiography. Age-matched patients with no evidence of HCC in the control group were selected from a cohort of chronic hepatitis B patients recruited from the liver clinics of the same hospital (5). The control cohort was recruited in a time frame similar to that of the cases (from December 1997 to July 2000). They were prospectively followed up until June 2003 with regular ultrasound and alpha-fetoprotein surveillance to confirm the absence of HCC. Serum samples from all patients were stored at −80°C. Informed consent was obtained from all patients, and ethics committee approval was obtained. An independent cohort of patients with known HBV infection (HBsAg positive) with or without HCC was studied to validate the findings of the case-control study.
DNA extraction, amplification, sequencing, and determination of genotype. HBV DNA was extracted from 100 µl of serum using the QIAamp DNA blood mini kit (Qiagen GmbH, Hilden, Germany) according to the manufacturer's instructions. To obtain the full-length HBV DNA sequence, we performed seminested PCR to amplify three overlapping fragments of the HBV genome. For each fragment, 5 µl of the extracted DNA was used with Taq DNA polymerase (Amersham Biosciences, Uppsala, Sweden) and Pfu DNA polymerase (Promega, Madison, WI) in the first-round PCR and with Taq DNA polymerase alone in the second-round PCR. The final PCR product was examined on a 1.0% agarose-ethidium bromide gel run in 1× Tris-borate-EDTA buffer.
For fragment A, PCR was carried out with P1 and P2 primers with a 5-min initial denaturation at 95°C, followed by 10 cycles of amplification (94°C for 36 s, 60°C for 36 s, and 72°C for 2.5 min), then 30 cycles of amplification (94°C for 36 s, 50°C for 36 s, and 72°C for 2.5 min), and a 7-min final extension at 72°C. The sequences of all primers used for PCR and sequencing in this study are shown in Table 1. The PCR product was further amplified in a seminested PCR with P1 and P3. PCR was carried out with a 5-min initial denaturation at 95°C, followed by 10 cycles of amplification (94°C for 36 s, 60°C for 36 s, and 72°C for 2 min), then 30 cycles of amplification (94°C for 36 s, 52°C for 36 s, and 72°C for 2 min), and a 7-min final extension at 72°C. For fragment B, PCR was carried out with the P4 and P5 primers, and the PCR product was further amplified in a seminested PCR with the P5 and P6 primers. For fragment C, PCR was carried out with the P7 and P9 primers. The PCR product was further amplified in a seminested PCR with the P8 and P9 primers. Both strands of PCR products were directly sequenced with the DYEnamic ET Dye Terminator cycling sequencing kit for MegaBACE (Amersham Biosciences, Piscataway, NJ).
Molecular evolutionary analyses. All HBV genomic sequences in this study and typical genome sequences of different genotypes were multiply aligned using CLUSTALW (28) version 1.83 and corrected manually by visual inspection. The numbering of HBV nucleotides started at the EcoRI cleavage site. Genetic distances were estimated by Kimura's two-parameter method (20), and the phylogenetic trees were constructed by the neighbor-joining method (24). The reliability of the pairwise comparison and phylogenetic tree analysis was assessed by bootstrap resampling (13) with 1,000 replicates. Phylogenetic and molecular evolutionary analyses were done using MEGA version 3.0 (21). HBV genotypes and subgenotypes were determined by comparison with 122 full genome sequences downloaded from GenBank.
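As an illustration of the distance measure named above, the following is a minimal sketch of Kimura's two-parameter distance for one pair of aligned sequences. It is not the MEGA implementation used in the study; the function name `k2p_distance` and the choice to skip gapped or ambiguous sites are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with a gap or ambiguity code in either sequence are skipped.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitional differences
    q = transversions / compared    # proportion of transversional differences
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

Applying this function to every pair of aligned genomes gives the distance matrix on which a neighbor-joining tree would be built.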
Data mining framework. The data mining framework is shown in Fig. 1. The process involved seven modules. After the molecular evolutionary analyses, the data were passed to the clustering module to check whether clusters existed, based on the phylogenetic tree analysis. These clusters are possible genotypes or subgroups possessing differences in some nucleotides which do not have any effects on the classification of HCC. If clusters were found, each cluster was analyzed separately for potential genetic marker sites. While genotype B HBV appeared to be a homogenous group, the phylogenetic tree results showed that there exist two subgroups (clusters) in genotype C among the HBV strains collected (Fig. 2) (9). All three (sub)groups (B, Cs, and Ce) were analyzed separately in the learning and classification parts.
For each cluster, the data were divided into training and testing sets. The training samples were then passed to the feature selection module to find the useful features (potential marker sites) for classification. The feature selection was based on the information gains of each aligned site. The details of the information gain calculation are given in the appendix. The main purpose of feature selection was to reduce the number of features used in classification while maintaining acceptable classification accuracy. The sites were then ranked according to their respective information gains, which can reflect their potentials to distinguish between the control and the HCC groups. The ranked and computed information gain of each aligned site can be displayed with the aligned sequences by our viewer tools. The top 10, 20, 30, 40, and 50 ranked (information gain) sites were included as the selected features for the classifier learning module and preprocessing modules in turn to see which one gave the best result.
The selected features were extracted and passed to the classifier learning module, wherein a rule-based classifier was learned. Rule learning tries to learn rules from a set of training data (samples). It can be modeled as a search problem of finding the best rules that classify the training examples with minimum classification error. Generic genetic programming (31), which is a type of evolutionary algorithm (1), was adopted as our search and optimization algorithm to learn the rules. The testing data were then transferred to the preprocessing module with the marker sites selected by the feature selection module. The testing data were preprocessed, and only the part relevant to the selected sites was kept. This part of the testing data was then used for prediction evaluation in the classification module.
Prediction results were output from the classification module. They were then verified against the actual classes given in the testing samples. If the verification results were unsatisfactory, the process was repeated, starting from feature selection.
In the final validation module, when a reasonable classifier was obtained, the classifier could be further validated by testing with previously unanalyzed validation samples.
Statistical analysis. In the case-control study with 100 HCC cases and 100 non-HCC age-matched controls, 90% of the samples were selected randomly as the training set and the remaining 10% formed the testing set in each experiment. For each data set, the experiment was repeated 10 times by picking different training sets. For each learning and evaluation experiment, sensitivity and specificity as defined below were estimated as the fitness or performance indicators of the classification rules. The average sensitivity and specificity of the testing set in the case-control study and of the validation cohort were determined. The 95% confidence intervals (CIs) of the sensitivity and specificity as well as the likelihood ratios were determined based on the performance of the algorithms on the entire data set. The odds ratios (ORs) and 95% CIs for HCC among patients with different numbers of HCC-related mutations were also calculated. When any zero cell occurred in the two-by-two contingency table, we added 0.5, based on the Haldane correction (14), to all of the cells in the calculation of ORs and 95% CIs. The statistical significance was examined at the conventional level of 0.05 by analysis of variance, the chi-square test, or Fisher's exact test as appropriate.
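The performance indicators and the zero-cell handling described above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and the confidence interval for the OR uses the log-OR (Woolf) approximation, which is an assumption because the interval method is not stated here.

```python
import math


def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Sensitivity, specificity, and likelihood ratios from a two-by-two table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else math.inf
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else math.inf
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


def odds_ratio_haldane(a: float, b: float, c: float, d: float, z: float = 1.96):
    """Odds ratio with an approximate 95% CI.

    a: cases with the mutation, b: controls with the mutation,
    c: cases without it,       d: controls without it.
    If any cell is zero, 0.5 is added to all four cells (Haldane correction).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    # Assumption: Woolf's standard error of log(OR) is used for the interval.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, with hypothetical counts of 15 true positives, 5 false negatives, 3 false positives, and 7 true negatives, `diagnostic_metrics(15, 5, 3, 7)` gives a sensitivity of 0.75 and a specificity of 0.70, so the positive likelihood ratio is 0.75/0.30 = 2.5 and the negative likelihood ratio is 0.25/0.70, approximately 0.36.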

RESULTS
Case-control study. The demographic characteristics, clinical diagnoses, and HBV genotypes of the HCC and control groups are listed in Table 2. Among the 100 cases of HCC, 37 had genotype B HBV and 63 had genotype C HBV. In the control group, 51 cases were infected with genotype B HBV and 49 with genotype C HBV. There was a significant male preponderance in both the HCC and control groups for both genotypes. Sixty-seven percent of cases in the HCC group had cirrhosis, compared to only 13% in the control group. The percentages of patients with cirrhosis in the genotype B subgroup (35.1%) and genotype C subgroup (31.7%) were quite similar.
Validation cohort. The validation samples came from a serum bank of patients with known HBV infection (HBsAg positive) with or without HCC. This was an independent cohort of 132 cases, including 43 patients with HCC (18 patients with genotype B and 25 with genotype C HBV infection) and 89 non-HCC subjects (41 with genotype B and 48 with genotype C HBV infection). There was no overlap between this validation cohort and the test cohort described above.
Subgenotype prevalence in subjects. Genotype B HBV appeared to be a homogenous group, and all belonged to subgenotype Ba (26). However, the phylogenetic tree results showed that there existed two subgroups, namely, Ce (found predominantly in East Asia) and Cs (found predominantly in Southeast Asia), in genotype C among the HBV strains collected (Fig. 2). This is in concordance with our previous phylogeny with published full-length sequences in GenBank (9).
The clinical characteristics of patients with genotype B and subgenotypes of genotype C are shown in Table 2. No significant difference in age (P = 0.46), gender (P = 0.06), or presence of HCC (P = 0.11) was observed between patients with genotype B and those with the subgenotypes of genotype C. The proportions of cirrhosis in HCC patients with HBV genotype B, subgenotype Ce, and subgenotype Cs were 65%, 88%, and 62%, respectively. The risk of cirrhosis and HCC for subgenotype Ce was higher than for the others, but the difference was not statistically significant (P = 0.16). These percentages of cirrhosis were much higher than the proportions of cirrhosis in control patients with HBV genotype B (P < 0.001), subgenotype Ce (P < 0.001), or subgenotype Cs (P < 0.001).
HCC-related mutations. Among HCC patients with genotype B HBV, mutations at the following sites were commonly found: A1762T (81.1%) and G1764A (81.1%), C1165T (18.9%), T2712C/A/G (70.3%), and A/T2525C (21.6%). The mutations at these nucleotide positions in the HCC and control groups are shown in Table 3. In the group with HBV subgenotype Ce, the mutations T31C (37.5%), T53C (37.5%), and A1499G (62.5%) were associated with HCC development (Table 3). In the group with HBV subgenotype Cs, the mutations G1613A (38.3%), G1899A (27.7%), T2170C/G (34.0%), and T2441C (21.3%) were associated with HCC development (Table 3). When the patients from the case-control study and the independent validation cohort were combined, the presence of an increasing number of HCC-related mutations in each HBV genotype/subgenotype was associated with an increased risk of HCC (Table 4). All mutations associated with HCC development caused amino acid changes in at least one of the four open reading frames of HBV (Table 5). Amino acid changes in the X region were found only in genotype B HBV. Envelope region amino acid changes were found in HBV subgenotype Ce, whereas precore/core region amino acid changes were found in HBV subgenotype Cs.
FIG. 2. Phylogenetic tree of the full-genome sequences of HBV in the case-control study. All patients were infected with either genotype B or C HBV. Two subgenotypes (Ce and Cs) could be identified in genotype C HBV due to a more than 4% difference in the entire HBV sequence.
Diagnostic algorithm. Using rule learning with evolutionary algorithms, the following algorithms for risk estimation were established. The classification rules for genotype B were as follows:
IF A1762G1764 and T1165, then HCC
IF T1762A1764 and ACG2712, then HCC
IF T1762A1764 and T2712 and C2525, then HCC
ELSE, non-HCC.
Using this algorithm, the sensitivity (95% CI) and specificity (95% CI) of diagnosing HCC in the testing cohort were 0.75 (0.54 to 0.96) and 0.70 (0.42 to 0.98), respectively, and those in the validation cohort were 1.00 (not available) and 0.75 (0.47 to 1.00), respectively. The positive and negative likelihood ratios (95% CIs) for the performance of the algorithm in the testing cohort were 2.50 (0.03 to 4.97) and 0.36 (0.02 to 0.69), respectively, and those in the validation cohort were 4.00 (0.00 to 8.53) and 0.00 (not available), respectively.
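The genotype B rules stated above can be written as a small function for illustration. This is a sketch under stated assumptions rather than the classifier produced by the study: the function name and the input format (a mapping from nucleotide position to base) are hypothetical, and "ACG2712" is read as A, C, or G at position 2712, in line with the T2712C/A/G mutation.

```python
from typing import Mapping


def classify_genotype_b(nt: Mapping[int, str]) -> str:
    """Apply the genotype B rule set to the bases at positions 1165, 1762, 1764, 2525, 2712.

    `nt` maps a nucleotide position (numbered from the EcoRI site) to its base.
    """
    bcp_wild_type = nt[1762] == "A" and nt[1764] == "G"
    bcp_mutant = nt[1762] == "T" and nt[1764] == "A"

    # IF A1762G1764 and T1165, then HCC
    if bcp_wild_type and nt[1165] == "T":
        return "HCC"
    # IF T1762A1764 and ACG2712, then HCC (assumed reading: A, C, or G at 2712)
    if bcp_mutant and nt[2712] in {"A", "C", "G"}:
        return "HCC"
    # IF T1762A1764 and T2712 and C2525, then HCC
    if bcp_mutant and nt[2712] == "T" and nt[2525] == "C":
        return "HCC"
    # ELSE, non-HCC
    return "non-HCC"
```

A sequence carrying the basal core promoter double mutation with T2712 and A2525, for instance, matches none of the three rules and is classified as non-HCC.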
The classification rules for the Cs cluster of genotype C were as follows:
IF A1613 OR A1899 OR CG2170 OR C2441, then HCC
ELSE, control.
Using this algorithm, the sensitivity (95% CI) and specificity (95% CI) of diagnosing HCC in the testing cohort were 0.72 (0.59 to 0.85) and 0.72 (0.58 to 0.86), respectively, and those in the validation cohort were 0.88 (0.73 to 1.00) and 0.63 (0.48 to

DISCUSSION
In this study, we demonstrated that certain genotypes (and subgenotypes) and mutations are associated with development of hepatic carcinogenesis. There seems to be a stratified risk of HCC, with each genotype (or subgenotype) being associated with a certain pattern of mutations. The significance of these genotypes and mutations was verified by use of an independent cohort which was composed of both HCC and non-HCC patients. Using these algorithms, the sensitivity of identifying a high-risk case ranged from 72% to 75% and the specificity ranged from 66% to 72%. Although the use of these algorithms had only moderate discriminatory capability to predict HCC (positive likelihood ratio of 2.21 to 2.57 and negative likelihood ratio of 0.36 to 0.39), our data suggested that different HBV genotypes and subgenotypes might have different predominant carcinogenic mechanisms.
The issue of HBV genotypes has been debated due to discrepant results in previous studies from different countries (18,27). These differences may be explained by a distinct distribution of HBV subgenotypes in different geographical regions. In most Asian countries, only subgroup Ba of HBV is found, while the majority of Japanese patients with HBV have subgroup Bj (9). Genotype C HBV has a higher risk of HCC than genotype B HBV, which is probably related to a delayed HBeAg seroconversion, more active hepatitis, and a higher prevalence of basal core promoter mutations (5,19,32). Among genotype C HBV, there were also differences in the disease activity associated with different subgenotypes (6). Recently, we have shown that subgenotype Ce HBV was associated with the highest risk of HCC independent of other risk factors, including high HBV DNA levels and liver cirrhosis, among a longitudinal cohort of 1,006 chronic hepatitis B patients followed up for 7.7 years (10). The proportion of HCC in patients with subtype adw was found to be higher than that in patients with subtype adr (25). Going beyond attributing HCC to a specific genotype, this study suggests that different genotypes of HBV are associated with different mutations of the viral genome and thus may have separate mechanisms of hepatic carcinogenesis.
The basal core promoter mutant (T1762/A1764) is found to parallel the progression of liver disease and increases the risk of HCC for both genotype B and C HBV (19,33). In common with previous studies, we also found mutation at nucleotides 1762/1764 to be associated with HCC in genotype B HBV infection. The reason why 1762/1764 mutations were not identified as a marker for HCC in genotype C HBV was the high prevalence of mutations at these sites even among the non-HCC patients (8). However, this phenomenon may also mean that a selection pressure on the basal core promoter/X region of the HBV genome in genotype B HBV is associated with the development of HCC. The HCC-associated mutations selected by HBV subgenotype Ce are located in the envelope region, while those selected by HBV subgenotype Cs are located in the precore/core region. These findings offer additional support for the presence of various virologic mechanisms of hepatocarcinogenesis by different HBV genotypes/subgenotypes. The functions of these mutations and their gene products need further investigation.
HBV DNA appears to integrate into host DNA at different sites, exerting direct and indirect effects on the host genes (7). It has also been postulated that the integrated HBV genes can activate cellular genes remote from the site of HBV DNA integration, thereby influencing cellular proliferation and differentiation. This transactivation effect could be mediated through different signal transduction pathways. Identification of HCC-related mutations is only the first step in understanding the viral mechanism of hepatic carcinogenesis. Functional genomic studies of these mutations would have to be carried out in the future to elucidate the effects of these mutations on cell growth and death of hepatocytes.
There are several limitations in this study. First, although patients in the control group were age matched with those in the HCC group, the possibility of developing malignancy in the future cannot be denied. As there is no matching in the disease severity and liver cirrhosis, the HCC-related mutations may have an indirect effect on HCC development through increasing hepatic inflammation and liver cirrhosis. When the algorithms were tested with the independent validation cohort, a very high sensitivity and a satisfactory specificity were reported for both genotype B and C subgenotypes. Second, although this is by far the largest cohort of HCC and non-HCC cases to have full-length viral genomic analysis of HBV compared to previous studies (17,22), the sample size is still relatively small. The 95% CIs for the sensitivity and specificity of the genomic algorithms are still wide. In the future, laboratory methods to detect these mutants in a more robust manner than does full-genome sequencing are needed to facilitate a larger-scale validation study. A larger cohort, preferably from a different geographic location, would also be needed to validate the generalization of our results. Third, we could study only patients with genotype B and subgenotypes Ce and Cs of HBV. We could not study genotype A HBV and genotype D HBV, which are prevalent in Europe and Africa, because of our geographic limitations. Moreover, as most Hong Kong residents are immigrants from China, we did not have the information on the place where the ancestors of the patients acquired the infection. We believe that most of our patients originated from southern China, where HBV subgenotype Cs is more prevalent than subgenotype Ce. However, the methodology adopted in this study could be used in countries with other HBV genotypes for mining of HBV-related mutations. Finally, we have not worked out the functionality of these mutated codons and why they might lead to development of HCC. More work is required to elucidate the virologic and host responses to mutations. We cannot draw a conclusion on the causal relationship between these HBV mutations and HCC.
In conclusion, this study suggests that HBV genotypes B and C demonstrate different point mutations which might be associated with high risk of hepatic carcinogenesis. The difference in the locations of these mutations in the HBV genome may reflect the underlying mechanisms of hepatocarcinogenesis of the different HBV genotype/subgenotypes. The detection of these mutations has shown promising results in the association with a higher cancer risk. By combining this information with other clinical risk factors for HCC, including HBV DNA levels and liver cirrhosis status (10,11), future clinical algorithms can be refined. It is possible that these diagnostic algorithms may shed light on which patients with chronic HBV infection require more frequent screening and surveillance for HCC development.

APPENDIX
The information gain of a feature (attribute) is the reduction in uncertainty (entropy) that results if the attribute is used for classification. Hence, the higher the information gain, the better. The following equation gives the entropy, $E$, of an attribute $X$ with $n$ values, $X_1, \ldots, X_n$, where $P(X_j)$ is the frequency of the value $X_j$:
$$E(X) = \sum_{j=1}^{n} -P(X_j)\,\log_2 P(X_j).$$
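Specific to a typical DNA classification problem, we assumed that the data had $M$ classes, $C_1, \ldots, C_M$. Each aligned site position has $N$ possible nucleotides, $V_1, \ldots, V_N$. We defined $|C_m|$ as the number of sequences in class $C_m$ and $|C_{mi}|$ as the number of sequences in class $C_m$ whose character at the aligned site is $V_i$, which could be A, T, G, or C in our case. The remainder of site $j$, $R(j)$, is the entropy of the class labels that is left after the sequences are partitioned by the nucleotide at that site, weighted by the size of each partition:
$$R(j) = \sum_{i=1}^{N} \frac{\sum_{m=1}^{M} |C_{mi}|}{\sum_{m=1}^{M} |C_m|}\; E(C_{1i}, \ldots, C_{Mi}),$$
where $E(C_{1i}, \ldots, C_{Mi})$ is the entropy of the class distribution among the sequences carrying $V_i$, with class frequencies $|C_{mi}| / \sum_{m=1}^{M} |C_{mi}|$. The information gain, $IG_j$, of the aligned site $j$ is the difference between the original information content $E(C)$ of the data set and the amount of information needed to classify all the unclassified data left in the data set after applying site $j$ for classification:
$$IG_j = E(C) - R(j).$$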
The features were ranked by the information gains, and then the top-ranked features were chosen for classification. A site with higher information gain would contribute more discriminatory power to the classification such that more samples could be distinguished by this site.
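The ranking step can be sketched in a few lines of Python. This is an illustrative example, not the software used in the study: the function names are hypothetical, site indices are 0-based columns of the alignment rather than EcoRI-based nucleotide numbers, and the remainder is assumed to be the usual partition-weighted entropy of the class labels.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels) -> float:
    """Information gain of one aligned site (one alignment column)."""
    total = len(labels)
    remainder = 0.0
    for nucleotide in set(column):
        # Class labels of the sequences that carry this nucleotide at the site.
        subset = [lab for nuc, lab in zip(column, labels) if nuc == nucleotide]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_sites(sequences, labels, top_k):
    """Return the top_k (site_index, information_gain) pairs, highest gain first."""
    n_sites = len(sequences[0])
    gains = []
    for j in range(n_sites):
        column = [seq[j] for seq in sequences]
        gains.append((j, information_gain(column, labels)))
    # Sort by decreasing gain; ties are broken by site index.
    gains.sort(key=lambda item: (-item[1], item[0]))
    return gains[:top_k]
```

In a toy alignment `["AT", "AT", "AC", "AC"]` with labels `["HCC", "HCC", "control", "control"]`, the first column is identical in both classes and has a gain of 0, whereas the second column separates the classes completely and has a gain of 1 bit, so it is ranked first.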