HIV-1 Protease, Reverse Transcriptase, and Integrase Variation

ABSTRACT HIV-1 protease (PR), reverse transcriptase (RT), and integrase (IN) variability presents a challenge to laboratories performing genotypic resistance testing. This challenge will grow with increased sequencing of samples enriched for proviral DNA such as dried blood spots and increased use of next-generation sequencing (NGS) to detect low-abundance HIV-1 variants. We analyzed PR and RT sequences from >100,000 individuals and IN sequences from >10,000 individuals to characterize variation at each amino acid position, identify mutations indicating APOBEC-mediated G-to-A editing, and identify mutations resulting from selective drug pressure. Forty-seven percent of PR, 37% of RT, and 34% of IN positions had one or more amino acid variants with a prevalence of ≥1%. Seventy percent of PR, 60% of RT, and 60% of IN positions had one or more variants with a prevalence of ≥0.1%. Overall 201 PR, 636 RT, and 346 IN variants had a prevalence of ≥0.1%. The median intersubtype prevalence ratios were 2.9-, 2.1-, and 1.9-fold for these PR, RT, and IN variants, respectively. Only 5.0% of PR, 3.7% of RT, and 2.0% of IN variants had a median intersubtype prevalence ratio of ≥10-fold. Variants at lower prevalences were more likely to differ biochemically and to be part of an electrophoretic mixture compared to high-prevalence variants. There were 209 mutations indicative of APOBEC-mediated G-to-A editing and 326 mutations nonpolymorphic treatment selected. Identification of viruses with a high number of APOBEC-associated mutations will facilitate the quality control of dried blood spot sequencing. Identifying sequences with a high proportion of rare mutations will facilitate the quality control of NGS. IMPORTANCE Most antiretroviral drugs target three HIV-1 proteins: PR, RT, and IN. These proteins are highly variable: many different amino acids can be present at the same position in viruses from different individuals. Some of the amino acid variants cause drug resistance and occur mainly in individuals receiving antiretroviral drugs. Some variants result from a human cellular defense mechanism called APOBEC-mediated hypermutation. Many variants result from naturally occurring mutation. Some variants may represent technical artifacts. We studied PR and RT sequences from >100,000 individuals and IN sequences from >10,000 individuals to quantify variation at each amino acid position in these three HIV-1 proteins. We performed analyses to determine which amino acid variants resulted from antiretroviral drug selection pressure, APOBEC-mediated editing, and naturally occurring variation. Our results provide information essential to clinical, research, and public health laboratories performing genotypic resistance testing by sequencing HIV-1 PR, RT, and IN.

A s HIV-1 has spread among humans, it has developed an extraordinary amount of genetic diversity (1). This diversity arises from HIV-1's high mutation rate and predilection for recombination (2,3). Amino acid variants accumulate within an individual as a result of various selective pressures and HIV-1's genetic robustness or tolerance for a large number of different amino acid variants (4,5). The large number of protease (PR), reverse transcriptase (RT), and integrase (IN) amino acid variants has implications for antiretroviral (ARV) therapy and presents a challenge to laboratories performing genotypic resistance testing.
The challenge of HIV-1 genotypic resistance test interpretation is increasing with the adoption of dried blood spot sequencing in low-and middle-income countries and the expansion of nextgeneration sequencing (NGS) in upper-income countries. Dried blood spot samples contain proviral DNA, which is more likely to contain APOBEC-mediated G-to-A hypermutation, an ancient host defense mechanism responsible for lethal mutagenesis (6). NGS technologies are intrinsically more error prone than dideoxynucleotide terminator Sanger sequencing and are at risk of yielding reports of low-abundance variants that result from PCR error (7,8).
We analyzed PR and RT direct PCR Sanger sequences from more than 100,000 individuals and IN direct PCR Sanger sequences from more than 10,000 individuals to characterize the amino acid variation at each amino acid position in these genes. We also analyzed sequences from individuals with known ARV treatment histories to identify those mutations resulting from selective drug pressure. Knowledge of the observed variation and selection pressure on the molecular targets of HIV therapy can be from which they were obtained. Thus, they represent the extent of biochemical similarity between amino acids, which is independent of historical evolution and local sequence context. For notational purposes, amino acid variants were defined as differences from the consensus subtype B amino acid sequence because this is a commonly used reference and because it was nearly always the same as the consensus of all pooled sequences.
We also determined the proportion of times that each amino acid variant occurred as part of an electrophoretic mixture in which two peaks were present on the sequence electropherogram resulting in one of the following ambiguous nucleotide calls: R (combination of A and G), Y (combination of C and T), M (combination of A and C), W (combination of A and T), K (combination of G and T), and S (combination of C and G) (15). Amino acids that always occurred as part of an electrophoretic mixture were excluded.
Nonpolymorphic TSMs. To identify nonpolymorphic treatment-selected mutations (TSMs), we examined the treatment history of the individuals from whom each sequenced virus was obtained. For each drug class-PR inhibitor (PI), nucleoside RT inhibitor (NRTI), nonnucleoside RT inhibitor (NNRTI), and IN strand transfer inhibitor (INSTI)-sequences were characterized as being either from an ARV class-naive individual who received no drugs belonging to the class or an ARV classexperienced individual who received at least one drug from that class. Sequences from individuals of unknown or uncertain treatment history were excluded from this analysis. In sequences from patients with multiple virus isolates, mutations occurring in more than one isolate were counted just once.
We then examined each amino acid variant for its association with ARV selection pressure. The proportion of each variant in ARV-experienced individuals was compared to its proportion in ARV-naive individuals using a chi-square test with Yates' correction. The Holm's method was then used to control the family-wise error rate for multiple-hypothesis testing at an adjusted P value of Ͻ0.01 (16). To exclude TSMs under minimal drug selection pressure, we included only those TSMs that were five times more frequent in ARV-experienced than in ARV-naive individuals. To identify the TSMs that are most specific for ARV selection across subtypes, we identified those TSMs that were nonpolymorphic in the absence of selective drug pressure, defined as occurring at a frequency below 1.0% in ARV-naive individuals infected with viruses belonging to each of the five most common subtypes.
Transmitted drug resistance (TDR) will cause many nonpolymorphic TSMs to appear in virus sequences from untreated individuals. This will cause the proportion of these mutations in ARV-naive individuals to be higher than what would be expected in ARV-naive individuals whose viruses had not experienced selective drug pressure. This in turn will reduce the ratio of the prevalence of these mutations in ARV-experienced individuals divided by their inflated prevalence in ARV-naive individuals. Therefore, we restricted our analysis of ARV-naive sequences to those lacking any of the 93 surveillance drug resistance mutations (SDRMs) that have become established markers of TDR (17). For IN for which the SDRM list is not available, we used major INSTI resistance mutations defined in Stanford HIVDB: T66I/A/K, E92Q, F121Y, G140S/A/C, Y143C/R/H, S147G, Q148H/K/R, and N155H/S. Among RT inhibitor (RTI)-experienced individuals, 75% received NRTIs in combination with an NNRTI, 22% received NRTIs without an NNRTI, and 3% received an NNRTI without an NRTI. The frequent use of NRTIs in combination with an NNRTI makes it difficult to determine for some mutations whether they are selected by NRTIs or NNRTIs. Therefore, we first determined whether RT mutations were treatment selected by comparing the proportions of mutations in sequences from RTI-naive and RTI-experienced individuals. We then determined whether the selection appeared to be primarily associated with NRTIs versus NNRTIs using a previously described approach (18). Those mutations that did not demonstrate a strong significant association with just one class were classified as (i) NRTI associated if their positions are known to be associated with NRTI resistance, (ii) NNRTI associated if their positions are known to be associated with NNRTI resistance, or (iii) undifferentiated RTI associated if their positions were not previously associated with NRTI or NNRTI resistance.
Synonymous and nonsynonymous mutation rates. To determine whether the overall nucleotide mutation rate at a codon influenced the likelihood of developing amino acid variants, we estimated the synonymous and nonsynonymous rates at each codon in PR, RT, and IN for the five most common subtypes. For each subtype, we used FastML (19) to determine the most probable ancestral codon and then compared the codon of each sequence to this codon to estimate the number of synonymous changes/number of potential synonymous changes (dS) and the number of nonsynonymous changes/number of potential nonsynonymous changes (dN). Additionally, we examined each consensus amino acid and TSM to determine the minimum number of nucleotide differences between their respective codons.

Signature mutations indicating APOBEC-mediated editing.
Of 297 PR nucleic acids, 24 GG and GA dinucleotides at 22 amino acid positions were conserved in more than 98% of sequences in each of the most common five subtypes. Canonical APOBEC-mediated changes at these positions-GG¡AG, GA¡AA, and GG¡AA (if GG is followed by G)-would result in 58 different amino acid mutations and two stop codons. Fifty of the 58 mutations occurred in sequences from one or more plasma samples. Of the 50 observed mutations, 32 were strongly associated with one or more stop codon or with a canonical APOBEC-mediated mutation at one or more of the active-site residues D25, G27, G49, G51, and G52. Table S1 in the supplemental material lists the two stop codons and the 32 PR mutations, which our analysis suggests indicate APOBEC-mediated editing.
Of 1,680 RT nucleic acids, 128 GG and GA dinucleotides at 115 amino acid positions were conserved in Ͼ98% of sequences in each of the five most common subtypes. Canonical APOBECmediated changes at these positions would result in 241 different amino acid mutations and 19 stop codons. One hundred eighty of the 245 mutations occurred in sequences from one or more plasma samples. Of the 180 observed mutations, 89 were significantly associated with one or more of stop codons or with a canonical APOBEC-mediated mutation at one of the active-site residues D110, D185, and D186. One of the 89 mutations, M230I, has recently been reported to cause resistance to the NNRTI rilpivirine (20). Table S1 in the supplemental material lists the 19 stop codons and the 88 RT mutations that our analysis suggests indicate APOBEC-mediated editing.
Of the 864 IN nucleic acids, 76 GG and GA dinucleotides at 65 amino acid positions were conserved in Ͼ98% of sequences in each of the five most common subtypes. Canonical APOBECmediated changes at these positions would result in 136 different amino acid mutations and 7 stop codons. Eighty of the 136 mutations occurred in sequences from one or more plasma samples. Of these 80 mutations, 62 were significantly associated with one or more stop codons or with a canonical APOBEC-mediated mutation at one of the active-site residues D64, D116, and E152. One of the 62 mutations, G118R, has recently been reported to reduce susceptibility to multiple INSTIs (21,22). Table S1 in the supplemental material lists the seven stop codons and the 61 IN mutations that our analysis suggests indicate APOBEC-mediated editing.
The local false discovery rate derived from the mixture model described in Materials and Methods was used to classify sequences as hypermutated or nonhypermutated based on the number of signature APOBEC mutations within PR, RT, and IN (see Table S2 in the supplemental material). The presence of one signature mutation predicted risks of hypermutation of 18%, 19%, and 16% for PR, RT, and IN sequences, respectively. The presence of two signature mutations predicted risks of hypermutation of 86%, 79%, and 76%, respectively. The presence of three signature mutations predicted risks of hypermutation of 99.8%, 98.5%, and 97.8%, respectively. Therefore, in our subsequent analyses, we excluded 112 PR, 225 RT, and 81 IN plasma sequences containing two or more signature APOBEC mutations.
Amino acid variation. Overall, we analyzed 110,357 PR sequences obtained from 101,154 individuals, 118,246 RT sequences from 108,681 individuals, and 11,838 IN sequences from 11,156 individuals. Most RT sequences did not encompass the 3= RNase H coding region of RT. Therefore, for our analysis of RT amino acid variability, we included just positions 1 to 400.
Of the 99 PR positions, 47 (47%) had one or more variants occurring at a prevalence of Ն1%, and 69 (70%) had one or more variants occurring at a prevalence of Ն0.1% (Fig. 1). Overall, there were 201 variants occurring at a prevalence of Ն0.1% at these 69 positions (Table 1).
Of the 288 IN positions, 97 (34%) had one or more variants occurring at a prevalence of Ն1%, and 172 (60%) had one or more variants in Ն0.1% of sequences (Fig. 3). Overall, there were 346 variants occurring at a prevalence of Ն0.1% at these 172 positions (Table 1).
Variability between subtypes. At each position, the number of amino acid variants with a prevalence of Ն0.1% was highly correlated between subtypes: The median intersubtype correlation coefficients for the number of variants with a prevalence above 0.1% were 0.85 (P Ͻ 2EϪ16), 0.84 (P Ͻ 2EϪ16), and 0.68 (P Ͻ 2EϪ16) for PR, RT, and IN, respectively ( Fig. 4, 5, and 6).
For amino acid variants with a prevalence of Ն0.1%, the median intersubtype ratio of the prevalence for PR variants was 2.9-  fold (interquartile range [IQR], 1.2-to 4.7-fold); only 5.0% of PR variants had a prevalence in one subtype that differed by Ն10-fold in another subtype (range, 10-to 28-fold). The median intersubtype ratio of the prevalence for RT variants was 2.1-fold (IQR, 1.0to 3.5-fold); only 3.7% of RT variants had a prevalence in one subtype that differed by Ն10-fold in another subtype (range, 10to 39-fold). The median intersubtype ratio of the prevalence for IN variants was 1.9-fold (IQR, 1.2-to 3.0-fold); only 2.0% of IN variants had a prevalence in one subtype that differed by Ն10-fold in another subtype (range, 10-to 51-fold).
Chemical relatedness. There was a strong relationship between the prevalence of an amino acid variant and its biochemical similarity to the consensus amino acid (Table 1). Each 10-fold increase in a variant's prevalence was significantly correlated with the change in BLOSUM62 similarity score: the slopes of a fitted line for each gene were 0.71 (r ϭ 0.47; P Ͻ 2EϪ16), 0.67 (r ϭ 0.41; P Ͻ 2EϪ16), and 0.68 (r ϭ 0.36; P Ͻ 2EϪ16) for PR, RT, and IN, respectively. Similar results were obtained using the BLOSUM80 scoring matrix: the slopes of a fitted line for each gene were 0.81 (r ϭ 0.47; P Ͻ 2EϪ16), 0.77 (r ϭ 0.41; P Ͻ 2EϪ16), and 0.74 (r ϭ 0.35; P Ͻ 2EϪ16) for PR, RT, and IN, respectively.

Mixture analysis.
There was a strong inverse relationship between a variant's prevalence and the proportion of times that it occurred as part of an electrophoretic mixture. Each 10-fold increase in a variant's prevalence was inversely correlated with the change in the proportion of times that it occurred as part of an electrophoretic mixture: the slopes of a fitted line for each gene were Ϫ3.6 (r ϭ 0.14; P Ͻ 2EϪ06), Ϫ5.9 (r ϭ 0.32; P Ͻ 2EϪ16), and Ϫ7.6 (r ϭ 0.43; P Ͻ 2EϪ16) for PR, RT, and IN, respectively. For example, the very rare variants with a prevalence of Ͻ0.01% were present as a part of mixture in 54% to 60% of their occurrences, depending on the gene. In contrast, the most common variants were present as a part of mixture in 7% to 9% of their occurrences, depending on the gene (Table 1).
Nonpolymorphic TSMs. (i) PR. To identify nonpolymorphic PI-selected mutations, we analyzed the proportions of all PR mutations in sequences from 61,593 PI-naive individuals and 15,420 PI-experienced individuals. Within PR, 144 mutations at 57 positions were significantly more common in PI-experienced than PI-naive patients after adjustment for multiple-hypothesis testing by controlling the family-wise error rate (i.e., adjusted P) at Ͻ0.01 (chi-square test; unadjusted P Ͻ 8.8 ϫ 10 Ϫ6 ). Of these 144 mutations, 111 at 41 positions were nonpolymorphic and occurred more than five times more frequently in PI-experienced than PI-naive individuals. Table 2 lists each of the 111 nonpolymorphic TSMs by their position and frequency in ARV-experienced individuals.
Of the 88 PI nonpolymorphic TSMs that were previously reported by us (18), two mutations, I13M and T74K, were no longer found 5-fold more often in treated compared with untreated individuals. One mutation, Q58E, had a prevalence of 1.1% in subtype D viruses from untreated individuals. The 85 mutations in boldface were previously reported by us as nonpolymorphic TSMs, whereas the remaining 26 mutations are newly identified. Ninety-two percent of the sequences containing a novel nonpolymorphic TSM had one or more PI-associated SDRMs.  1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  Within RT, 245 mutations at 116 positions were significantly more common in RTI-experienced than RTI-naive individuals after adjustment for multiple-hypothesis testing by controlling the family-wise error rate (i.e., adjusted P) at Ͻ0.01 (chi-square test; unadjusted P Ͻ3.6 ϫ 10 Ϫ6 ). Of these 245 mutations, 185 mutations at 82 positions were nonpolymorphic and occurred more than five times more frequently in RTI-experienced than RTI-naive individuals. Table 3 lists each of the 95 nonpolymorphic NRTI-selected mutations. Table 4 lists each of the 64 nonpolymorphic NNRTI-selected mutations. Table 5 lists 26 nonpolymorphic RTI-selected mutations that could not be attributed to either NRTI or NNRTI selection pressure alone and that occurred at positions not previously associated with NRTI or NNRTI selection pressure.
Of the 122 RTI nonpolymorphic TSMs that were previously reported by us (18), two mutations, P236L and D237E, were no longer found to be 5-fold more common in treated compared with untreated individuals. One mutation, K43Q, was found to have a prevalence of 2.0% in CRF01_AE viruses from ARV-naive individuals, and another mutation, L228H, was found to have a prevalence of 1.2% in subtype F viruses from ARV-naive individuals. In Tables 3, 4, and 5, the 118 mutations shown in boldface were previously reported by us to be nonpolymorphic TSMs, whereas the remaining 63 are newly identified. Ninety-eight percent of the sequences containing a novel nonpolymorphic TSM in RTI-experienced individuals had one or more RTI-associated SDRMs.   Among the PR TSMs, the minimum numbers of nucleotide differences between the TSM and the consensus amino acid variant were 1 for 67.6% and 2 for 32.4% (i.e., these were 2-bp mutations). Among the RT TSMs, the minimum numbers of nucleotide differences were 1 for 68.4%, 2 for 31.1%, and 3 for 0.6%. Among the IN TSMs, the minimum numbers of nucleotide differences were 1 for 86.7% and 2 for 13.3%.
Overall PR, RT, and IN variability. Forty-seven percent of PR, 37% of RT, and 34% of IN positions had one or more amino acid variants with a prevalence of Ն1%. Seventy percent of PR, 60% of RT, and 60% of IN positions had one or more amino acid variants with a prevalence of Ն0.1%. Although amino acid variants occurred in different proportions in different subtypes, the prevalence of a variant in one subtype rarely differed by more than 10-fold compared with the prevalence of that variant in a different subtype (2.0% of IN variants, 3.7% of RT variants, and 5.0% of PR variants).
In each gene, the more rare the amino acid variant, the more likely it was present as part of an electrophoretic mixture or differed biochemically from the consensus amino acid. Variants that occur frequently as part of electrophoretic mixtures are likely to have reduced replication fitness, explaining their inability to replicate sufficiently to become dominant within an infected individual's circulating virus population (27,28). Although the presence of two electrophoretic peaks at a position is usually a reliable indicator that two nucleotides are present in that virus population, a small secondary peak can also result from PCR error and sequencing artifact (29,30).
Very rare variants had the lowest biochemical similarity to the consensus amino acid at each position and often occurred as part of an electrophoretic mixture. Additionally, these variants were evenly distributed across all positions in PR, RT, and IN-occurring in similar numbers at positions that were highly conserved or 1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  288   0   5   10   Subtype B   1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  288   0   5   10   Subtype C   1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  288   0   5   10   Subtype A   1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  288   0   5   10   Subtype CRF01_AE   1  10  20  30  40  50  60  70  80  90  100  110  120  130  140  150  160  170  180  190  200  210  220  230  240  250  260  270  280  displayed variability at higher mutation thresholds. We propose that it is useful to identify sequences that contain large numbers of such rare variants because a high number of very rare amino acids in a direct PCR dideoxynucleotide terminator Sanger sequence could result from sequencing error or unrecognized frameshifts if the rare amino acids are clustered. Additionally, the presence of a high number of very rare variants in a next-generation deep-sequencing assay would be more consistent with PCR error than quasispecies variation and would suggest that the threshold for identification of low-abundance variants was set too low. Treatment-selected mutations. We previously published an analysis of nonpolymorphic TSMs in PR and the first 350 posi-tions of RT using an earlier data set containing sequences from approximately 25,000 individuals with known ARV treatment histories (18). In this article, we extended our analysis of nonpolymorphic TSMs to IN and to the entire RT. In addition, the numbers of sequences from individuals with known treatment histories in PR and the 5= part of RT were nearly three times higher for PR and RT than those in our previous analysis.
We identified 111 nonpolymorphic PR TSMs: 26 new TSMs and 85 of the 88 previously identified TSMs. The novel PR TSMs are likely to be accessory drug resistance mutations because they nearly always occurred in combination with established PI resistance mutations.
We identified 185 nonpolymorphic RT TSMs: 67 new TSMs and 118 of the 122 previously identified TSMs. The novel RT TSMs were likely to be accessory drug resistance mutations be-  Several mutations in the connection and RNase H domains of RT have been shown to play an accessory role in reducing HIV-1 susceptibility in combination with thymidine analog mutations (TAMs), most likely by slowing the activity of RNase H and thereby allowing more time for TAM-mediated primer unblocking (31). However, only 11 TSMs were identified beyond position 300, including the NRTI-selected mutation A304G, the NNRTIselected mutations Y318F, N348IT, and E404N, and the RTI-selected mutations E302D, E312G, I341F, Q394S, E399G, and Q547G. This is consistent with the much lower number of sequenced viruses extending beyond position 300 obtained from NRTI-and/or NNRTI-experienced individuals.
Four well-characterized accessory INSTI-associated DRMs-L74M, T97A, and G163R/K-were not identified because they were polymorphic in one or more subtypes (32). G118R and R263K, two other highly studied mutations (21,33), were also not