Molecular Evolution of Hepatitis A Virus: a New Classification Based on the Complete VP1 Protein

ABSTRACT Hepatitis A virus (HAV) is a positive-stranded RNA virus in the genus Hepatovirus in the family Picornaviridae. So far, analysis of the genetic variability of HAV has been based on two discrete regions, the VP1/2A junction and the VP1 N terminus. In this report, we determined the nucleotide and deduced amino acid sequences of the complete VP1 gene of 81 strains from France, Kosovo, Mexico, Argentina, Chile, and Uruguay and compared them with the sequences of seven strains of HAV isolated elsewhere. Overall strain variation in the complete VP1 gene was found to be as high as 23.7% at the nucleotide level and 10.5% at the amino acid level. Different phylogenetic methods revealed that HAV sequences form five distinct and well-supported genetic lineages. Within these lineages, HAV sequences clustered by geographical origin only for European strains. The analysis of the complete VP1 gene allowed insight into the mode of evolution of HAV and revealed the emergence of a novel variant with a 15-amino-acid deletion located on the VP1 region where neutralization escape mutations were found. This could be the first antigenic variant of HAV so far identified.

Human hepatitis A virus (HAV) is a hepatotropic member of the Picornaviridae family (31,33). The clinical manifestations of HAV infection in humans can vary greatly, ranging from asymptomatic infection, commonly seen in young children, to fulminant hepatitis, which in some cases can result in death (51). HAV is transmitted primarily by the fecal-oral route, and epidemics are common in regions where sanitation is poor.
The virion is nonenveloped with a capsid composed of three main polypeptides (VP1, VP2, and VP3). There is only one serotype of HAV identified thus far, and the only known antigenic variants are HAV strains collected from Old World monkeys (37,58). Monoclonal antibody studies suggest that there are a limited number of antigenic epitopes which are closely grouped at the surface of the virus (42,43,54).
Three regions of the genome, i.e., the VP3 C terminus, the VP1 amino terminus, and the VP1/2A junction, present some genetic differences. The VP3 C-terminal region is relatively conserved, while the VP1/2A junction is more variable and can be used to distinguish one strain from another (46,47,49,50). The variability of the VP1 amino terminus is intermediate between those of the other two regions. On the basis of the genetic variability observed within the putative VP1/2A junction, seven HAV genotypes have been identified (49). However, an extensive study of South American HAV strains revealed that the VP1 amino terminus contains more informative variable positions than the VP1/2A junction (12).
To gain insight into the molecular epidemiology of HAV, we investigated the genetic variability of HAV strains recovered from different countries using the complete sequences of the VP1 gene. The sequences of 81 HAV isolates from France, Kosovo, Mexico, and three South American countries were determined. The analysis of these sequences allowed us to estimate the mode of HAV evolution, to determine the multiple genotypes cocirculating during epidemic outbreaks, and to document the emergence of a novel variant which could not be detected using either the VP1 amino-terminal region or the VP1/2A junction.

MATERIALS AND METHODS
Virus strains. HAV strains were collected from different geographical areas. The source, origin, year of isolation, and epidemiological data from these and other HAV strains whose sequences were taken from GenBank and used in our analysis are listed in Table 1. The French isolates were obtained from an epidemic outbreak associated with shellfish consumption which occurred north of Bretagne during the winter of 1999 (14); these isolates were taken from sludge samples obtained from a wastewater treatment plant in Nantes (France) and from a virus collection of the Unité de Virologie at the Centre Hospitalier Universitaire, Nantes, France, from 1983 to 2001. Seven serum samples from patients with HAV during an epidemic outbreak during the Kosovo war (1998 to 1999) were collected and processed at the Hôpital Val de Grâce, Paris, France.
South American strains were collected from three countries. Fourteen HAV-positive serum samples were collected at Pereira Rossell Hospital and Asociacion Espanola Primera de Socorros Mutuos during an epidemic outbreak that occurred in Montevideo, Uruguay, from September 1999 to February 2000. Over the same period of time, four stool samples from Chilean patients were collected at Hospital Regional in Valdivia, and nine stool and serum samples from Argentinian patients were collected at Hospital San Juan de Dios in Buenos Aires (Table 1). In September 2000, stool samples were collected in Nantes, France, from a recently arrived Mexican student.
RNA extraction and amplification. Viral RNA was extracted from 140 μl of serum or 400 μl of a stool suspension (0.3 g of stool diluted in 2 ml of water) using the QIAamp viral extraction kit (Qiagen) or the RNeasy plant minikit (Qiagen), respectively. HAV RNA was extracted from sludge samples as previously described (35). Briefly, viral RNA was extracted from the samples, and the amount obtained was quantified by a real-time reverse transcriptase PCR (RT-PCR) method. Typically, there were approximately 6.38 × 10^5 copies of viral RNA/ml (13). The primers used for reverse transcription and sequencing are listed in Table 2. From each viral RNA, an amplicon encompassing the 3′ end of VP3, all of VP1, and the 5′ end of 2A was amplified by using primers HAV1 and HAV2 (Table 2) in a reaction driven by the SuperScript One-Step RT-PCR with Platinum Taq kit (Invitrogen). RT-PCR was performed with a one-tube reaction mixture (25 μl) containing 5 μl of viral RNA, 20 pmol of each primer, a 100 μM concentration of each deoxynucleoside triphosphate, 50 mM Tris-HCl (pH 8.3), 2 mM MgSO4, 20 IU of RNase inhibitor (Promega), and 1 μl of the RT/Platinum Taq Mix (Invitrogen). VP1-specific cDNA was synthesized by incubation of the reaction mixture for 45 min at 42°C and 5 min at 94°C, and it was amplified by 35 cycles, with 1 cycle consisting of 30 s at 94°C, 45 s at 50°C, and 1 min at 68°C. For some samples, the complete VP1 gene was amplified as two overlapping fragments with internal primers. The primers used either to amplify or to sequence the entire VP1 genes are listed in Table 2. The products were purified and sequenced using the Big Dye DNA sequencing kit (Perkin-Elmer). To distinguish natural heterogeneity from possible technical artifacts, both strands of each DNA fragment were sequenced. When discrepancies were observed, the procedure was repeated three times using a different set of primers (Table 2).
Sequence analysis. The entire VP1 nucleotide sequences were aligned using the CLUSTAL W program (57). Distance matrices based on the Kimura two-parameter model were then generated (18) and used to compute neighbor-joining phylogenetic trees. The robustness of each node was assessed by bootstrap resampling (1,000 pseudoreplicates). These methods were implemented by using software from the MEGA program (28).
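As an illustration of the distance measure named above, the following is a minimal Python sketch of the Kimura two-parameter distance for one pair of aligned sequences. It is not the MEGA implementation used in the study; the function name `k2p_distance` and the choice to skip gapped or ambiguous sites are assumptions made for this example. With P the proportion of transitions and Q the proportion of transversions, the distance is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned nucleotide sequences.

    Hypothetical helper for illustration; sites with gaps or ambiguous
    bases are skipped (an assumption, not taken from the paper).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Treat RNA and DNA alphabets alike.
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    transitions = transversions = compared = 0
    for a, b in zip(s1, s2):
        if a not in VALID or b not in VALID:
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Applied to every pair of sequences in an alignment, such a function would yield the kind of distance matrix that a neighbor-joining procedure takes as input.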
Substitution rate analysis. The substitution rate along the VP1 gene was measured using a sliding window by the procedure of Alvarez-Valin et al. (1). Pairwise nucleotide distances (synonymous and nonsynonymous) within each window were estimated by the method of Comeron (11) as implemented in the computer program k estimator, where k is the number of nucleotide substitutions between sequences. For those windows where this method could not be applied, the Jukes-Cantor method (25) was used for correction for multiple hits. The window had a size of 30 codons and a movement of 3.
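The sliding-window scheme can be sketched in Python as follows. This is a simplified illustration, not the procedure of Alvarez-Valin et al. or the Comeron estimator: it counts all nucleotide differences in each window rather than separating synonymous from nonsynonymous changes, and it applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to every window. The function names are hypothetical, and the step of 3 is assumed here to be measured in codons.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction for multiple hits on a raw proportion p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def sliding_window_distances(seq1, seq2, window_codons=30, step_codons=3):
    """Corrected distance in each window along two aligned coding sequences.

    Simplification: all nucleotide differences are pooled; the paper
    estimates synonymous and nonsynonymous distances separately.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    n_codons = len(seq1) // 3
    distances = []
    for start in range(0, n_codons - window_codons + 1, step_codons):
        lo = start * 3
        hi = (start + window_codons) * 3
        diffs = sum(a != b for a, b in zip(seq1[lo:hi], seq2[lo:hi]))
        distances.append(jukes_cantor(diffs / (hi - lo)))
    return distances
```

Under these assumptions, a 900-nucleotide gene (300 codons) gives 91 windows, which corresponds to the window number plotted on the x axis of a profile.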
Nucleotide sequence accession numbers. The complete VP1 sequences obtained from the 81 HAV strains were deposited in GenBank, and their accession numbers are shown in Table 1.

RESULTS
VP1 and picornavirus phylogeny. To assess the utility of the complete VP1 protein to infer the relationships among HAVs and other human picornaviruses, a phylogenetic tree was constructed by using p-distance and the neighbor-joining method with representative strains from the Picornaviridae family taken from databases. The results of this study are shown in Fig. 1. As can be seen, the enteroviruses clustered into four major lineages. This result is consistent with previous phylogeny studies (22,40,44). Moreover, human rhinoviruses form a cluster among the human enterovirus group, consistent with the results of previous phylogeny studies (40). Each lineage was very strongly supported with bootstrap values from 94 to 100% (Fig. 1). Aphthoviruses (foot-and-mouth disease virus), cardioviruses, and hepatoviruses were assigned to different clusters, which were also strongly supported (bootstrap value of 100%).
Phylogenetic analysis of HAV strains. The complete VP1 nucleotide sequences (900 nucleotides) for 81 HAV strains from Argentina, Chile, Mexico, Uruguay, Kosovo, and France isolated from 1983 to 2001 were determined and aligned with those of 10 isolates of HAV taken from the database (Table 1). Phylogenetic trees were generated using Kimura two-parameter distance and the neighbor-joining method (Fig. 2). These results revealed the existence of five genetic groups, strongly supported by bootstrap values.
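The bootstrap support mentioned here rests on resampling alignment columns with replacement and repeating the analysis on each pseudoreplicate. The Python sketch below illustrates only that resampling idea. Instead of rebuilding a neighbor-joining tree for each replicate, it uses a much simpler stand-in: the percentage of replicates in which two sequences are closer to each other (by p-distance) than either is to a third. All function names and the toy criterion are assumptions for illustration, not the MEGA procedure.

```python
import random


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Build one pseudoreplicate by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def grouping_support(alignment, a, b, outgroup, replicates=1000, seed=0):
    """Percentage of pseudoreplicates in which a and b are closer to each
    other than either is to the outgroup (a toy proxy for node support)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_alignment(alignment, rng)
        d_ab = p_distance(rep[a], rep[b])
        if d_ab < p_distance(rep[a], rep[outgroup]) and d_ab < p_distance(rep[b], rep[outgroup]):
            hits += 1
    return 100.0 * hits / replicates
```

In a full analysis, each pseudoreplicate would be passed through the same distance and tree-building steps as the original alignment, and the support for a node would be the percentage of replicate trees that contain it.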
The majority of the human HAV strains included in these studies belong to genotype I. This genotype is subdivided into two well-identified subgenotypes, namely, IA and IB (Fig. 2). Genotype IA, supported by a bootstrap value of 87%, includes strains from Argentina, Chile, Mexico, Uruguay, Kosovo, and France. Nevertheless, genetic heterogeneity was observed within the main IA cluster. The South American strains clustered (even those from the same country) in different branches, and no geographical cluster was found.
Interestingly, different clusters of strains isolated in France were well identified within the main subgenotype IA cluster. While one of these clusters contains strains isolated during an epidemic outbreak north of Bretagne (France), another contains environmental strains isolated from sludge samples recovered near the Loire region (Nantes, France). Remarkably, virus causing sporadic cases clustered separately from the two French clusters cited above (sporadic and epidemic French clusters in Fig. 2). Nevertheless, bootstrap values did not allow us to establish definitive genetic relationships among these strains. Genotype IB, also strongly supported by bootstrap values, contains strains from France which cluster with the genotype IB reference strain, HM-175. Within this genotype, two distinct lineages were found. One cluster included HAV strains isolated from sporadic cases of infection, while the other one was made up of the only isolate of this subgenotype found in the environment (Boue2 strain). Genotype IIIA comprises the simian prototype strain from Panama (PA21 [29]), one strain isolated in the United States (26), and the only genotype IIIA strain isolated in France thus far (14).
Genotypes IV and V are represented by strains recovered from Old World monkey species (37,58). The VP1 sequences from genotypes IV and V were obtained from databases and were included in this analysis. These two genotypes also clustered separately (Fig. 2).
The VP1 sequence from the only example of genotype VII identified so far was recently published (9) and was included in our analysis. This strain was isolated from an African (Sierra Leone) woman who developed fulminant hepatitis.
Surprisingly, strain 9F94 clustered separately from all other strains, even though it was genetically closer to genotype VII than to any other genotype (Fig. 2).
These findings were further confirmed by using both the Tamura-Nei (55) and Jukes-Cantor distances (25) (data not shown).
Because the complete VP1 sequences from genotypes II, IIIB, and VI were not available and could not be included in this study, phylogenetic analysis using the VP1/2A region was performed to ascertain whether strain 9F94 was related to any of these genotypes or subgenotypes (Fig. 3). As can be seen, the 9F94 strain has a close genetic relationship with the only identified genotype II strain of HAV, the CF-53/Berne strain (Fig. 3).
Interlineage and intralineage sequence diversity. Sequence differences within and between HAV types and subtypes are shown in Table 3. In the VP1 region, there is a great level of overall genetic diversity among HAV strains. Within lineages, differences are less than 11.1% at the nucleotide level and 5.6% at the amino acid level. Genetic variations between genotypes range from 10.6 to almost 23.5% at the nucleotide level and from 0.7 to almost 10.5% at the amino acid level.
Amino acid sequence diversity. Despite the extensive nucleotide variation in the HAV strains, the deduced amino acid sequences for all 81 HAV strains showed a high degree of sequence homology (89.5 to 100%) within the VP1 protein.
The majority of differences involve synonymous mutations, with only a few changes resulting in amino acid changes.
Within-gene covariation between synonymous and nonsynonymous substitutions. Figure 5 shows the variation in the rates of synonymous and nonsynonymous substitutions within the HAV VP1 region. We have compared strains of three different subgenotypes (Chile-6, IA; MBB, IB; P27, IIIA). Synonymous distances are significantly higher than nonsynonymous ones for almost all pairwise comparisons (Fig. 5). As a consequence, the nonsynonymous distance/synonymous distance (ka/ks) ratio is very low along the whole sequences; this has usually been associated with purifying selection acting at the level of amino acid conservation.
To obtain a clearer picture of the evolutionary mechanisms underlying the changes in VP1, we analyzed the processes of divergence within phylogenetically independent lineages (by the method of Alvarez-Valin et al. [1]). Therefore, we obtained profiles of synonymous and nonsynonymous distances for pairs of strains of types IA, IB, and IIIA (shown in the phylogenetic tree presented in Fig. 2). The results are shown in Fig. 6.
Comparison of the synonymous substitution profiles in the three different genetic lineages revealed quite different patterns, and no significant association between the ks values was observed. In contrast, the profiles obtained for the three pairs show extremely low rates of nonsynonymous substitution throughout the gene.

DISCUSSION
VP1 Picornaviridae phylogeny. VP1 is the major surface-accessible protein in the mature picornavirus virion (21,30,42,43). Monoclonal antibody-resistant mutants have shown that a number of amino acids within the VP1 protein contribute to the major immunodominant site of HAV (42,43,54). Most escape mutations in other picornaviruses are similarly located on loops connecting β-strands within an eight-strand antiparallel β-barrel structure assumed by other picornaviral proteins (42,43). Therefore, the use of VP1 protein to establish the genetic relationships in the family Picornaviridae, using complete VP1 sequences, will be extremely useful for strain characterization and molecular evolution studies. Using this approach, it was possible to establish a well-defined phylogeny for the family Picornaviridae with very strong statistical and phylogenetic support (Fig. 1).
New classification of HAV based on the VP1 region. Genetic characterization of HAV was based on the traditional criteria applied to poliovirus (45). It was based upon the percentage identity within the putative VP1/2A junction (168 nucleotides) (49), and seven different genotypes, designated I to VII, were determined. Four of these genotypes have been associated with human disease (I, II, III, and VII).
Recently, evolution and phylogenetic analysis of different members of the Picornavirus family took into account the complete VP1 gene (6,20,32,38,40).
In this study, which includes virus strains recovered from six different countries, for the first time, the complete VP1 protein sequence (900 nucleotides) was used to determine the molecular epidemiology and evolution of HAV strains isolated from 1983 to 2001. The results revealed the existence of five genetic groups or genotypes. Each genetic group was strongly supported by bootstrap values. The bootstrap values observed for the phylogenetic trees (Fig. 2) allowed us to differentiate among the different genotypes and subgenotypes and even established some definitive relationships among strains within each subgenotype.
How many genotypes of HAV? So far, HAV strains have been classified into seven distinct genotypes by the method of Robertson et al. (49), who considered only 168 bases of the VP1/2A junction and/or the first 148 bases of the VP1 gene (VP1 N-terminal region). By using this approach, viruses in three of these genotypes (I, II, and VII) were recovered from infected humans, while one genotype (III) contained viruses isolated from humans and owl monkeys. Genotypes I and III comprised the vast majority of human strains studied, while genotypes II and VII were each represented by only a single strain. Unexpectedly, phylogenetic analysis using the complete VP1 region revealed the presence of five distinct genetic groups, all of them supported by high bootstrap values (Fig. 2). This was surprising, since the only sequence not included in our analysis (not available at present) was from genotype VI (JM-55 strain). Strain 9F94, which clustered separately from all other strains included in this analysis (Fig. 1), was shown to be closely related to strain CF-53/Berne (genotype II) when the available sequences of the VP1/2A region were studied (Fig. 3) and belongs to the same genetic lineage as the SLF88 strain (genotype VII). Moreover, sequence differences between HAV genotypes revealed that the least variation observed was found between genotypes II and VII (Table 3). Considering all these observations together, it is tempting to speculate that these two genotypes may be just one genotype, or two subgenotypes of the same type, that type comprising genotypes II and VII as described by Robertson et al. (49).
Further studies are needed to test this hypothesis. Since the complete sequence of the SLF88 strain has been recently published (9), the determination of the complete sequence of strain 9F94 might help to clarify this issue.
HAV variant strain. In the course of this study, a second, unexpected finding emerged: we were able to detect a 45-nucleotide deletion within the VP1 gene of the Uru-3 strain, resulting in a 15-amino-acid deletion (Fig. 4).
The only VP1 deletion (18 nucleotides) reported so far was found in strains adapted to grow in cell culture (4,19,48). As these adapted strains grow in FrhK/4 cells, this deletion appears to be related to the adaptation of the virus to that particular cell line. Since the Uru-3 strain was directly amplified from serum samples without previous passage in cell culture, mutations induced by adaptation to growth in cell culture may be ruled out. Moreover, the Uru-3 45-nucleotide deletion mapped in a VP1 region far removed from the 18-nucleotide deletions associated with the adaptation to grow in cell culture.
The VP1 Uru-3 deletion is intriguing, since there is no evidence of antigenic variation among human HAV strains detected by an immunological method. The only known antigenic variants of HAV are strains collected from Old World monkeys (genotypes IV and V) (37,58). These variants presented two amino acid mutations (Ser102 of VP1 and Asp70 of VP3) which have been identified as part of the immunodominant region in human HAV using escape mutants to monoclonal antibody K24F2 (42).
FIG. 5. Profiles of synonymous (blue line) and nonsynonymous (red line) distances between different HAV genetic lineages. Sequences from strains Chile-6 (subgenotype IA) and MBB (subgenotype IB) (A), strains Chile-6 (subgenotype IA) and P27 (subgenotype IIIA) (B), and strains MBB (subgenotype IB) and P27 (subgenotype IIIA) (C). The x axis depicts the window number, and the y axis depicts distance.
Neutralization escape mutations for the HAV strain HM-175 were identified at Asp70 and Gln74 of the VP3 protein and at Ser102, Val171, and Lys221 of the VP1 protein (42,43), and those for strain HAS15 were identified at Pro65, Asp70, and Ser71 of the VP3 protein and at Asn104, Lys105, and Gln232 of the VP1 protein (36). The deleted region in the Uru-3 strain contains three amino acids (Ser102, Asn104, and Asn105) which were reported to be able to induce an escape response in neutralization experiments (Fig. 4). Moreover, these residues align with recognized immunogenic sites in human rhinovirus 14 (HRV14) (52) and poliovirus type 3 (PV3) (21,34). These results suggest that these residues are part of an immunogenic site that is analogous to neutralization immunogenic sites found in other picornaviruses (HRV14 and PV3). Therefore, it is possible that the deletion found in this strain alters the antigenic structure of this virus. This observation suggests that this strain may be the first antigenic variant of HAV found in humans. Although the deletion of 45 nucleotides would conserve the reading frame, we cannot rule out the possibility that this strain is a defective virus.
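The statement that a 45-nucleotide deletion conserves the reading frame follows from 45 being a multiple of 3 (45 / 3 = 15 codons). The Python sketch below illustrates this on an invented toy sequence, not on the Uru-3 sequence; the helper names `translate` and `delete_region` are hypothetical. It shows that an in-frame deletion removes 15 residues and leaves the downstream residues unchanged, whereas a 44-nucleotide deletion at the same position shifts the frame.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate complete codons of a DNA sequence, reading from position 0."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def delete_region(seq: str, start: int, length: int) -> str:
    """Remove `length` nucleotides beginning at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


# Toy coding sequence (invented for illustration): M, 5 x A, 15 x K, 3 x G.
toy = "ATG" + "GCT" * 5 + "AAA" * 15 + "GGG" * 3

in_frame = translate(delete_region(toy, 18, 45))   # 45 % 3 == 0
shifted = translate(delete_region(toy, 18, 44))    # 44 % 3 != 0
```

In this toy case the in-frame product lacks exactly the 15 lysine residues and keeps the trailing glycines, while the 44-nucleotide deletion changes the residues that follow the deletion point.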
The construction of chimeric full-length cDNA clones carrying the sequences coding for the capsid protein of Uru-3, followed by cell culture growing experiments and monoclonal antibody studies, may allow us to address this issue.
Since all the data obtained from escape mutant studies performed thus far (27,42,43) suggest that the immunodominant antigenic site is formed by the VP3 and VP1 proteins, and only the VP1 region was examined in this study, mutations located in other parts of the genome may complement those located in the VP1 protein. The possibility that a mutation in one of these sites could affect antibody binding cannot be ruled out until the HAV structure is resolved.
Mode of evolution and substitution rates in HAV. The different patterns between the intragenic distributions of synonymous substitutions in the HAV VP1 protein (also observed among different genetic groups [Fig. 5]) suggest that synonymous divergence could be random in the VP1 gene. The distribution of nonsynonymous substitutions shows a completely different situation, with extremely low rates of substitutions compared to those of synonymous substitutions. This suggests that the pattern of divergence observed for HAV VP1 is probably due to selective forces that do not allow amino acid replacements, despite the relatively high rates of synonymous substitutions observed all over the gene. These results show that both kinds of nucleotide substitutions are undergoing quite different modes of change. While synonymous divergence is expected to follow a neutral mode of evolution (Fig. 5 and 6) (even though the well-documented existence of cis-acting regulatory elements within the open reading frames of picornaviruses suggests that further work will be needed to qualify this statement), negative selection appears to be the main force shaping the pattern of nonsynonymous substitutions, selecting against most replacement changes in all protein regions and giving a quite conserved protein.
In contrast, the antigenic sites of multiple serotype viruses, such as the hemagglutinin gene of influenza virus (23), the complete capsid region of serotypes A and C of foot-and-mouth disease virus (20), and the V3 region of human immunodeficiency virus (53), were subject to positive selection. As a consequence, the mode of evolution of HAV appears to explain, at least in part, the presence of only one serological group of HAV so far.