Department of Molecular Biology and
Microbiology, School of Medicine, Tufts University, Boston,
Massachusetts 02111
To study the mechanism of evolution of the human immunodeficiency
virus (HIV) protease gene (pro), we analyzed a database of
213 pro sequences isolated from 11 HIV type 1-infected
patients who had not been treated with protease inhibitors. Variation
in pro is restricted to rare variable bases which are
highly diverse and differ in location among individuals; an average
variable base appears in about 16% of individuals. The average
intrapatient distance per individual variable site, 27%, is similar
for synonymous and nonsynonymous sites, although synonymous sites are
twice as abundant. The latter observation excludes selection for
diversity as an important, permanently acting factor in the evolution
of pro and leaves purifying selection as the only kind of
selection. Based on this, we developed a model of evolution, both
within individuals and along the transmission chain, which explains
variable sites as slightly deleterious mutants slowly reverting to the better-fit variant during individual infection. In the case of a
single-source transmission, genetic bottlenecks at the moment of
transmission effectively suppress selection, allowing mutants to
accumulate along the transmission chain to high levels. However, even
very rare coinfections from independent sources are, as we show, able
to counteract the bottleneck effect. Therefore, there are two possible
explanations for the high mutant frequency. First, the frequency of
coinfection in the natural host population may be quite low.
Alternatively, a strong variation of the best-adapted sequence between
individuals could be caused by a combination of an immune response
present in early infection and coselection.
 |
INTRODUCTION |
Human immunodeficiency virus (HIV)
shows a high level of genetic diversity compared to other viruses
(45). Genetic variation in antigenically important regions
complicates attempts at vaccine development (3, 29, 48, 49),
and drug resistance mutations limit the efficacy of treatment with
replication inhibitors (6, 25). From another perspective, a
pattern of genetic evolution contains important information about
biological factors acting on the virus population. In particular, the
role of the immune response in HIV infection is difficult to assess
experimentally. It remains a subject of intensive debate, and the
mechanism of HIV persistence in vivo in the face of an apparently
vigorous immune response is still unknown. At the same time, the
presence of an at least partly functional immune response can be
inferred from the genetic diversity data. The large proportion of
nonsynonymous substitutions, e.g., in env reveals a powerful
selection for diversity (4, 33, 41), presumably caused by
the immune response and by adaptation to different cell types.
The average genetic distances and relative proportions of different
types of substitutions vary strongly both between and within different
HIV genes, implying that there are different dominant factors of
evolution at different loci. The highest degree of diversity, 3 to 5%
within an individual (1, 21, 47) and 8 to 17% between
individuals from the same geographical location (1, 20, 26),
is observed in the variable regions of the env gene. A
somewhat smaller but still impressive level of intrapatient diversity
has been reported for gag (1 to 2%) (52) and for
pol and pro (0.4 to 1%) (22, 31). In
pro, synonymous and conservative substitutions dominate
overall variation (22), implying a stronger role for
purifying selection than in env.
The aim of this work was to reconstruct the mechanism of evolution in a
relatively conserved HIV gene, such as pro, by using data on
its genetic variation (22). Genetic evolution of a virus population may be affected by a multitude of different factors (19): mutation, selection forces, stochastic factors (random drift), recombination, transmission between individuals, spread of the
virus between infected organs, etc. The main difficulty, as is always
the case with mathematical theories of real experimental systems, is
that it is not known in advance which factors, among many, are of the
greatest importance to the behavior of the system. The quality of
approximations cannot be evaluated until the analysis is over, and it
depends on the parameters one is trying to predict. We thus face a time
paradox: on the one hand, a well-defined set of approximations must be
introduced in the beginning to make the analysis possible and
intuitively clear; on other hand, finding the right set of
approximations (the biological model) is the final aim of such
research. The only escape from the time paradox that we know of is a
dynamic interplay between experiment and theory (38). One
starts from a simple model to match several important experimental
features. After calculating its predictions, one searches for a
contradiction to the experimental data and, when such is found, checks
initial assumptions by changing them, one by one, and estimating what
had changed in the predictions. This process has to be repeated several
times. The resulting model, when all available information is used up,
is considered a working model subject to further experimental tests.
In the present work, we applied this strategy as follows. After several
attempts, we chose a set of experimental features from the database of
Lech et al. (22) that are difficult to explain within a
mathematically and biologically consistent model. This allowed us to
select a working model (or models) from a large number of
possibilities. We found a model of deterministic evolution in an
individual under conditions of purifying selection that agrees with
several experimental features from those we have chosen and used it as
a starting model of the evolution of pro. Next, to describe
genetic variation of the initial virus variant in infecting inocula and
to predict the average frequency of variable sites in individuals, we
had to take transmission between individuals into consideration. We
have determined that such a model can explain a high frequency of
variable sites in an individual, which is a very important experimental
feature. However, this theoretical result turned out to be entirely
dependent on our initial assumption that an animal in the natural host
population is, in 99% of cases, infected from a single source. Even if
coinfection occurred as rarely as in 5% of cases, this would bring the
frequency of variable sites many times below the observed value. We did
not find independent facts that would confirm such a low frequency of
coinfection. Therefore, we tried to find another explanation for the
high frequency of variable sites by searching for a weak spot in our
approximations. We have estimated (approximately) quantitative
contributions from a number of potential factors of HIV evolution to
the frequency of the variable sites and found most of these
contributions to be very small. The factors we considered included the
possibility that the genetic variation was dominated by random drift
(39), temporal changes in the virus population size,
adaptation of the virus sequence to different cell types, nonuniformity
of virus distribution in the body and virus spread between infection
sites, the possibility that the coinfection sources are not
independent, coselection between loci, and the immune response.
Finally, after excluding all apparent possibilities, we could find a
plausible explanation, alternative to a very low coinfection rate in
the natural host: in the presence of the immune response, the best-fit sequence of virus will differ strongly between individuals due to
individual variation in major histocompatibility complex class I
(MHC-I) subtypes, and selection for a variant unrecognized by the
immune system leads to a cascade of compensatory mutations, both
synonymous and nonsynonymous. This mechanism is also seen following the
appearance of drug-resistant mutants.
To make the work available to a broad audience, we parallel qualitative
arguments in Results and Discussion with mathematical derivations,
given in Materials and Methods. We consider in detail four principal
models: deterministic evolution in an individual, the same model with
transmission between individuals, the same model including coinfection
of an infected individual from independent sources, and the model based
on variation in MHC-I subtypes. The other, dead-end models and all
approximations are discussed.
The reader should keep in mind that the present work is about the
coarse-grained picture of evolution of pro in a typical patient; given the complexity of the problem, we did not attempt to
explain the diversity of all HIV genes or all individual variations. Neither did we expect that models and predictions obtained in this work
were necessarily final and seek to substitute mathematical analysis for
experimental data. The hypotheses selected by this work eventually have
to be tested experimentally.
 |
MATERIALS AND METHODS |
Calculation of genetic distances.
The genetic distance at a
chosen base, between or within patients, is mathematically defined as
the probability that two sequences differ at a given base, if sampled
randomly from the same or two different patients, respectively. This
definition is equivalent to the standard definition of the genetic
distance as the average number of pairwise differences between two
samples of a genome segment, except it deals with a single base and
makes sense only after averaging over many such pairs of sequences.
Both genetic distances can be expressed in terms of the frequency of
substitutions. Let fki be the frequency of
substitutions at site i and for patient k, with
respect to any chosen reference sequence, such as a database consensus
sequence, a best-fit sequence, etc. Then, the intrapatient distance for
each patient, Tkiintra, and the interpatient
genetic distances for each pair of patients, Tk1k2iinter,
are given by the respective expressions
|
(1)
|
The intrapatient distance, as follows from its definition, has a
maximum, Tkiintra = 1/2, at
fki = 1/2. The maximum interpatient distance is
Tk1k2iinter = 1, which corresponds to either
fk1i = 1 and fk2i = 0 or vice
versa; the two populations are genetically uniform in opposite alleles.
Note that the right-hand sides of equations 1 do not change upon
replacement of fki with 1
fki; i.e., both genetic distances are
independent of the choice of consensus, as they should be.
Evolution in an individual.
The initial set of deterministic
equations describing the evolution over time of two variants of a virus
has the form
|
(2)
|
|
(3)
|
Here n1 and n2 are
the numbers of mutant and wild-type proviruses, respectively;
1 and
2 are the replication coefficients; trep is the replication cycle time; and
µf and µr are the
forward and reverse mutations rates, respectively. We assume that
trep is the same for the two variants and that
cell death is a random Poisson process. Neither assumption is important for a long-term selection. The selection coefficient s is
defined as the relative difference in fitness, s = (
2/
1)
1, and is assumed to
be constant and to fulfill the double inequality
µf(r)
s
1. The ratio
2/
1 agrees with the standard experimental definition of the relative fitness measured by comparing the
exponential growth rates, at a low multiplicity of infection, of the
virus variant under study and the reference variant. Using the
additional condition that the total population size is constant,
n1 + n2 = N, as is the
case during the asymptomatic stage of HIV infection (see the section on
Verifying approximations), equations 2 and 3 can be replaced by the
single equation
|
(4)
|
where f
n1/N is the mutant
frequency. Equation 4 is nonlinear since the replication coefficients,
i, must depend on the mutant frequency,
f. It is implied that
i depends on
the number of available target cells, which adjusts to keep the total
provirus population constant. Although the clearance rate of infected
cells is assumed to be constant and to be equal for the two genetic
variants, the same equation, equation 4, can be shown to follow in a
more general case as well, except that the definition of the selection
coefficient must include the ratio of the clearance rates. Solving
equation 4 for the arbitrary initial condition f(0) = f0, we obtain
|
(5)
|
where
f(r)
µf(r)/s and error of the order of
(µf(r)/s)2 is allowed in the
right-hand side. Two particular initial conditions in equation 5 yield
reversion and accumulation dependences:
frev(t) = f(t,1) and
facc(t) = f(t,0). Plots of
frev(t) and
facc(t) for a realistic set of
parameters are given below (see Fig. 4). We define the reversion time,
t50, as given by the equation
frev(t50) = 0.5; one
obtains t50 = (trep/s)ln(s/µr).
In the limit of large t, the mutant frequency converges at
the steady-state value, feq = µf/s
1.
Chain of single-clone transmission.
Let us assume that each
individual infects the next individual at time t* after the
moment of his/her own infection with a single genetic variant. Let
fn* be the probability that the virus
variant which infects person n is mutant. The expected value
of the mutant frequency in person n at time t is
given by
|
(6)
|
The probability fn* is equal to the
expectation value of the mutant frequency in the source of infection at
the time of transmission, as given by
|
(7)
|
Together, equations 6 and 7 fully describe the evolution of a
base along the chain of infection. If transmission occurs late, t* > t50 (t*
t50
trep/s), probability
fn* can be shown to converge, after a few
transmissions, to the individual steady-stage value
µf/s. If transmission occurs prior to the
reversion time, t* < t50, fn*
changes slowly with n, and equations 6 and 7 can be
simplified to the differential equation
|
(8)
|
where
|
(9)
|
|
(10)
|
Solving equation 8 at the arbitrary initial condition
f0*, we obtain
|
(11)
|
As follows from equation 11, the probability of occurrence of a
mutant base converges, after approximately neq
transmissions, to a "chain steady-state" value,
fn*
f
*. The
dependence of parameters f
* and
neq on the transmission time t*,
which is exponentially fast, is shown below (see Fig. 5a and b). In the
range of transmission times trep/s < t* < t50, the two parameters are simply related, f
* = µfneq/s.
Coinfection from independent sources.
Let us assume that a
pair of genomes is transmitted from two different persons or animals
with a probability q and that a single genome is transmitted
with a probability 1
q. Equation 6 is replaced by
|
(12)
|
where fcom(t) is the growth
competition curve defined as f(t,1/2) in equation 5 (the
dashed line in Fig. 4) and fn* meets
recursive equation 7. Equation 12 assumes that the two infection sources are epidemiologically distant and, therefore, statistically independent. The chain steady-state value,
f
*, is given by the smaller of two zeros
of the quadratic equation
|
(13)
|
where facc, frev,
and fcom are taken at t = t*.
The dependence f
* versus t*,
q = 0.05, is shown below (see dashed line in Fig. 5a).
The maximum probability of a mutant base,
f
*(t* = 0), is plotted below versus
parameter q (see Fig. 5c). At intermediate values of
q, such that µf + µr
q
1, we obtain the asymptotics
f
* (t* = 0) = 2µf/qs.
As expected, at 1/q
neq, this
expression matches the formula f
* = µfneq/s obtained above for a
singel-clone transmission.
The above derivation assumed that if coinfection with two virus
sequences of similar fitness occurs, the steady-state virus population
after the acute infection phase consists of equal proportions of the
two variants. We also considered a random initial composition of
between 0 and 100% and found out that such a change of model does not
affect finf* at
q
0 and causes an increase in
finf*(q) at larger values of
q by a factor of 1.5 (the upper curve in Fig. 5c shifts
upward). Consequently, the value of q estimated for HIV
below increases
50%.
Coinfection from the same source.
Let us assume that
transmission always occurs from a single source and that one and two
virus variants are transmitted with probabilities 1
q and q, respectively. (Similar considerations apply to
the case of two different sources which are very close epidemiologically.) When two variants are transmitted, the initial steady-state population is assumed to be 50% mutant. In this case, the
two sequences sampled are not statistically independent and equation 12 does not apply. Instead, one can write two separate equations for
expectation values of the mutant frequency,
fn(t), and of its square,
yn(t), respectively:
and
|
(14)
|
where frev(t),
facc(t), and
fcom(t) are given by equation 5 with
f0 = 1, 0, and 1/2, respectively;
parameters fn* and
yn*, are both defined by recursive equation
7. The chain steady-state values, f
* and
y
*, are found from equation 14 and the
usual steady-state conditions fn* = fn
1* = f
* and
yn* = yn
1* = y
*. The general expressions are cumbersome, and
we give them only for the limit t*
0 in which
f
* is maximum. In this case, the
dependence on q cancels out, and we get the same result as
in the case of single-clone transmission (cf. equation 9 at
t*
0):
|
(15)
|
Value of q predicted for HIV.
We consider
positions at which the wild type is A or C since such positions are
predicted to dominate variable sites (see Fig. 5c). The average
observable frequency of a variable base, f
exp, is given by the general expression
|
(16)
|
where
(s) is the distribution density of
s over different bases, and
f
*(s) is the probability of a mutant base
being inserted for wild-type A or C (see Fig. 5c and d). The limits of
the integrals in s implied in equation 16 are the boundaries
of the variable zone in s, which will be specified below.
The factor g(s) is the fraction of patients tested during the time interval in which a base with selection coefficient s is observably variable (see Fig. 3), given by
|
(17)
|
where
(t) is the distribution density of the time
of testing among patients, L = ln(s/µr) = 7.8 (s = 1%), and coefficient
1 depends on the
boundaries of the interval in f when the base is considered
variable. At an average sample size of 20 clones, a site with
0.05 < f < 0.95 is observably diverse, which
yields
3. Assuming that
(t) is constant in
the interval between 1.25 and 8.75 years, and treating L as
a large parameter, from equation 17 we obtain
|
(18)
|
if 0.57
< s <4
, and nearly zero otherwise.
Here
is defined as given by the equation
trepL/
=
=5 years. The
interval in s specified in equation 18 sets the limits of
the integrals in equation 16. Approximating the distribution density
(s) by a constant in this interval of s, and
using the function f
*(s) calculated from
equation 13, we find that equation 16 matches the observed value of
variable sites, f
exp = 0.16,
at q
0.085.
 |
RESULTS |
Two types of selection.
There are several groups of factors
which can shape the evolution and diversity of a virus genome at a
given locus: mutation, random drift (13, 50), selection,
transmission between individuals, and virus spread between different
infection sites within an individual. Transmission and spread between
infection sites create genetic bottlenecks, which modify the action of
other factors, creating founder effects due to random sampling from the
infecting population.
Before proceeding, we will define the terminology that we use for
different types of selection. We divide all selective forces acting on
a separate locus into two components: purifying selection and selection
for diversity. Purifying selection exists due to the fact that a
certain genetic variant, referred to here as the wild type, is better
fit than other variants. The virus fitness is usually defined in
experiments as a relative parameter, the ratio of the exponential
growth rate of the virus variant to that of a reference variant, when
measured at a low multiplicity of infection in culture. For our
purposes, it is more convenient to characterize selection by the
selection coefficient, s, defined as the relative fitness
minus 1, i.e., to the relative difference in the corresponding growth
rates. If this type of selection is the only factor of evolution
present, the population will eventually become uniform in the
better-fit variant.
Selection for diversity comes in two types. The first is time-dependent
selection, existing due to temporal changes in the conditions of virus
growth, e.g., the appearance of a functional immune response to a
specific epitope. The second type of selection for diversity is the
ecological niche effect: different host cells may favor different
wild-type virus sequences. If this type of selection dominates, the
population will gradually assume a constant, albeit highly diverse,
genetic composition, consisting of a mixture of viruses adapted to
replicate in different cell types.
Note that we use the terms "purifying selection" and "selection
for diversity," as is conventional in population biology, to describe
external conditions acting on a population, as opposed to a state or
dynamic of the population. This means, in particular, that the genetic
diversity of a population can increase transiently in the absence of
selection for diversity, e.g., when a wild-type sequence is being fixed
in the population under the influence of purifying selection (see below).
Sequence diversity in pro.
To obtain a more detailed
picture of genetic variation in pro, we analyzed the
database of sequences described by Lech et al. (22).
Sequences were obtained at a single time point from protease-inhibitor-naive patients. The 265 clones of proviral DNA were
derived from peripheral blood mononuclear cells of 13 individuals by 70 cycles of nested PCR amplification, and the consensus sequence for each
patient's virus was obtained. Of the 13 patients, two (06 and 12) had
stop codons in their consensus sequences and were excluded from further
analysis. For the remaining 11 patients, 6% of the sequences contained
deletions, insertions, or stop codons; these sequences were also
filtered out, since the present work addresses point mutations, which
do not render the virus nonviable. The remaining database comprised 213 clones of proviral DNA. We determined the common consensus sequence by noting the most frequent variant for each base. The consensus sequence
starts from the conserved sequence CCTCAgATCACTCTT (PQITL), where g shows a variable base, and is 297 bases long. It is identical to the subtype B consensus in the LANL database (19a). Of
the total number of substitutions, 25% were transversions with respect to the consensus sequence. To simplify further analysis, we ignored transversions, replacing them by the consensus sequence. As a result,
each base was represented by two genetic variants: either A/G or C/T.
(A theory taking into consideration both types of mutation would have
to contain more parameters, namely, the forward and reverse mutation
rates for transversions, which are very low and known with poor
accuracy. This would complicate the theory without much gain in
information.) Finally, to partially correct for errors accumulated
during PCR, we ignored mutations present in a single copy in the entire
database. These sporadic mutations occurred with a frequency of 0.06%
per base, with respect to the consensus, and were distinctly
overrepresented in nonsynonymous changes relative to mutations
appearing two or more times (Fig. 1;
Table 1). This value, as well as the very
small number of mutations which appear exactly twice in the database,
is consistent with a PCR error rate of about 1 per 105
bases per cycle, which does not exceed reported values (2, 44).

View larger version (21K):
[in this window]
[in a new window]
|
FIG. 1.
Intrapatient (a) and interpatient (b) genetic distances,
averaged over patients, at different positions in pro. The
upper and lower histograms in each figure correspond to synonymous and
nonsynonymous sites, respectively. Dots on the upper horizontal line in
panel a show the positions of sporadic mutations (seen only once in the
data set).
|
|
For the database thus filtered, we classified each base as either
variable or conserved. For each variable site and each patient, we
determined the frequency of substitutions with respect to the database
consensus. We also determined, for each patient, the average genetic
distance at a base as the proportion of sequence pairs (randomly
sampled from the virus population) which differ at that base. This
definition is equivalent to the standard definition of the genetic
distance as the average number of pairwise differences, except it
applies to a separate base rather than a long genomic segment. We
determined the interpatient genetic distance in a similar way, except
that the two sequences of a pair were sampled from different patients.
By definition, the intrapatient genetic distance varies between 0 and
1/2 (0 and 50%), corresponding to a uniform population and a 50:50
mixture, respectively. The interpatient distance is 0 when the two
virus populations consist uniformly of the same genetic variant, and it
is 1 (100%) when the two virus populations are composed entirely of
opposite genetic variants. The interpatient distance, as one can show,
cannot be smaller than the average of the two intrapatient distances.
Note that although genetic distances can be mathematically expressed in
terms of substitution frequencies, as used for calculations of genetic
distances in Materials and Methods, the genetic distances do not depend
on the choice of reference sequence (which, in our case, is the subtype
B consensus, the same as the database consensus). This invariance with
respect to reference sequence makes the intrapatient genetic distance a
more adequate parameter for the comparison of theory and experiment
than the substitution frequency. Indeed, as we argue below, the
best-fit sequence and the consensus sequence do not necessarily
coincide, making the choice of a proper reference sequence less than obvious.
Figure 1 shows both kinds of distances averaged over patients (or pairs
of patients) versus the position of a base in pro. For
individual patients, the intrapatient distances at different variable
sites are shown in a gray-scale diagram in Fig.
2. Table 1 shows the intrapatient genetic
distance averaged twice: first over one of three groups of sites for
each patient separately (sites which are variable in the patient, sites
which are variable in any patient, and all sites in pro),
and then over all patients. Since multiple substitutions within the
same codon were rare and did not generally affect the net diversity, we
were able to classify the mutations at each variable base as either
synonymous or nonsynonymous. (When two or more substitutions with
respect to the consensus did occur in the same codon of the same
sequence, we counted these substitutions as if they occurred in
different sequences.) Table 1 shows separate data for the two types of
sites. The principal conclusions from this material are as follows. (i)
The bulk of variation within an individual is contributed by rare
sites. In all individuals combined, only 47 of the 297 total bases were variable. (ii) An average variable site occurs in approximately 16% of
patients (2 of 11). (iii) A typical variable site is highly diverse;
the intrapatient genetic distance is 27% per variable site.
(iv) Synonymous and nonsynonymous variable sites are similarly diverse.
(v) However, synonymous sites are twice as frequent; if variable sites
were chosen at random, the latter ratio would be inverted.

View larger version (40K):
[in this window]
[in a new window]
|
FIG. 2.
Gray-scale diagram of intrapatient genetic distance,
Tki, in different patients at different variable
sites in pro. The intrapatient genetic distance is indicated
by the degree of shading, as shown on the scale on the right. Letters
and numbers under the diagram show consensus nucleotides and positions
in pro, respectively.
|
|
Random drift, in the case of a small effective population size, could
have played a significant role in the pattern of genetic diversity
observed. However, in other work (39), we tested the pro sequences for a characteristic linkage disequilibrium
effect between pairs of sites, a phenomenon expected if the effective virus population is small. The negative result we obtained shows that
stochastic effects do not greatly influence the evolution of separate
bases in an individual quasi-steady-state HIV infection. The
preponderance of synonymous substitutions (observation v in the
previous paragraph) suggests that selection for diversity is not a
dominant factor of evolution in pro during most of the asymptomatic phase of infection. Therefore, we will treat the evolution
of pro in an individual as approximately deterministic and
mostly controlled by purifying selection.
Our starting idea of HIV evolution, to explain the high level of
diversity at individual variable sites and the random difference between infected individuals, is illustrated in Fig.
3. Slightly deleterious pro
mutants emerge spontaneously in some individuals and are passed from
one individual to the next, down the chain of transmission. Sequences
are sampled randomly during transmission, so that the infecting genome
may or may not contain a mutation. In a person who is infected with a
mutunt virus (person A in Fig. 3), the population gradually reverts to
wild type. During the reversion process, when there are comparable
quantities of mutant and wild type, the base will be very diverse. We
will analyze this process in two parts. First we consider evolution in
an individual, and then we include transmission.

View larger version (27K):
[in this window]
[in a new window]
|
FIG. 3.
Model of evolution of a nucleotide along the chain of
infected individuals under purifying selection. The numbers at the top
denote cycles of transmission. X signs denote mutants. A mutant base
appears spontaneously in person n, who infects person A with
the mutant and person B with the wild-type variant. Person A passes the
mutant down the chain to person C, after which his/her own virus
population slowly reverts to the wild type. In person D, who is stably
coinfected with a pair of sequences polymorphous at that base,
selection clears the mutant virus rapidly.
|
|
Model 1: evolution in an individual.
Neglecting transversions
(which we do in both theory and experiment), a base can assume one of
two variants, either A/C or G/T. One of the two variants (by definition
the wild type) is better fit, in the sense defined above. We assume
that the wild-type sequence is the same for all individuals and all
cell types involved. (This and other assumptions are discussed in
detail in Discussion.) Each base is characterized by the relative
difference in fitness between the two variants (the selection
coefficient, s) and the mutation rate (µ). The best
estimate of the HIV mutation rate is µ = 4 · 10
5 for A
G and C
T transitions, and 5 to 10 times
lower for the opposite transitions (24). The selection
coefficient, s, varies strongly among different bases and is
almost never known in vivo; therefore, it will be treated as a fitting
parameter (see its formal definition in Materials and Methods).
To complete the model, we have to make an assumption about the initial
genetic composition at the chosen site. It is known that the virus
population early (a few weeks) postinfection can be either genetically
diverse or uniform, depending on the genomic region. For example, the
V3 and V4 regions of env, which are highly diverse late in
infection, are uniform early in infection (9, 18, 23, 53),
while the region of gag coding for the p17 matrix protein
can have a level of diversity early in infection comparable to that
exhibited late in infection (53). This effect probably
reflects transmission of multiple clones which can coexist on the time
scale of a few weeks due to relatively weak selection conditions for
p17 (53). Since we are not aware of analogous data that
could clarify transmission and early selection conditions in
pro, we will consider all possibilities. In this and the
following section we consider a single-clone transmission.
Subsequently, we will investigate effects of multiple-clone
transmission, including that from the same and different sources.
For the moment, let us assume that the initial, postseroconversion
virus population is either 100% mutant or 100% wild type with respect
to a chosen base. Suppose it is 100% mutant. In the course of a
persistent infection involving numerous infection cycles, wild-type
variants will emerge due to mutation. Since the wild type, by
definition, is selected for and the mutant is selected against, the
population will gradually revert to an almost entirely wild-type
population, as shown in the upper panel of Fig.
4. The resulting steady-state population
will have a small proportion of mutants due to the balance between
selection and mutations (7). We denote
t50 as the time it takes to reach a 50%
composition; t50 is inversely proportional to
the selection coefficient and directly proportional to the time per
replication cycle (see Materials and Methods). The latter time can be
approximated as the average duration of the productive phase of an
infected cell, approximately 2 days (16, 17, 34, 46). If a
base is observed near time t50, it will display
a high degree of diversity. It is reasonable to assume that the mean
sampling time is about 5 years, which is somewhat more than one-half of
the average duration of infection (42). This estimate yields
a selection coefficient s = 0.008; for a patient tested
earlier or later, the respective fitting value of s will be
higher or lower. Figure 4 shows the case in which A or C is wild type;
when G or T is wild type, reversion occurs somewhat faster due to the
higher reverse mutation rate, and we have a slightly lower fitting
value, s = 0.005.

View larger version (35K):
[in this window]
[in a new window]
|
FIG. 4.
Time dependence of the mutant frequency at different
initial populations: purely mutant (reversion) (thicker line in upper
panel), purely wild type (accumulation) (lower panel), or 50% mutant
(growth competition) (dashed line). The thinner curve shows the
intrapatient genetic distance T(t) during
reversion. The horizontal bar is the time interval during which the
base would be classified as variable (5 to 95%). Values shown in the
upper left corner correspond to parameters as follows: the forward
(wild type mutant) and reverse mutation rates, µf and
µr, corresponding to either A or C wild type; the
replication cycle time, trep; and selection
coefficient, s. The reversion half-time,
t50, is given by the equation
t50 = (trep/s)ln(s/µr).
|
|
The reversion model explains almost all experimental features discussed
in the previous section. (i) Variable bases are rare in the sequence,
since such a base must have a very small selection coefficient, on the
order of or less than 0.01. (ii) Variable bases differ among
individuals, since an individual may be randomly infected with a virus
that is either mutant or wild type at a particular base. Also, since
different individuals are tested at different times, most of them are
sampled outside the narrow time interval in which reversion of a
particular base can be observed (Fig. 4). (iii) An average variable
base in an individual is very diverse, since it is in the middle of
reversion. (iv) For the same reason, synonymous and nonsynonymous
variable sites have approximately equal diversities per site. (v)
Synonymous variable sites are more abundant (Table 1), implying that
synonymous substitutions tend to have smaller s values. Note
that according to the model, positions of variable sites are not fixed
but depend on the time of observation.
According to this model, the consensus sequence of the entire data set
for all patients can be either mutant or wild type at a variable site.
A base with a selection coefficient somewhat smaller than 0.008 and
wild type A or C will tend to be observed before it has completed its
reversion in patients who were infected with a virus that is mutant at
that base (for a 5-year observation point). Therefore, if such patients
are sufficiently frequent, the majority of samples at the time of
observation will be mutant at that base. This example illustrates our
previous comment, that wild type and consensus sequences, even if
determined for a very large group of patients, do not necessarily
coincide. (Alternatively, the wild-type sequence may vary among
individuals; this possibility is discussed later.)
It is tempting to use the fact that each variable site is expected to
evolve in the same direction in different patients as an experimental
test of the reversion model. Suppose we compare virus variants from two
patients that share a few variable sites. Since one of the patients has
been infected longer than the other, the mutant frequency is expected
to differ from the first to the second patient in the same direction
for all variable sites. Unfortunately, to use this prediction as a
test, one would have to know the wild-type sequence at these bases,
which, as we just discussed, is unknown. This is also why we have used
the pairwise genetic distance instead of the substitution frequency:
the former does not change if all sequences mutant at a base are
replaced by sequences wild type at the base, and vice versa. In the
database studied, the intrapatient distance increases at some bases
shared by both patients and decreases at others (Fig. 2). This does not
contradict the reversion model in which the genetic distance at a base
is expected to have a maximum in the middle of reversion (Fig. 4). We
once hoped that comparing more than two patients at once might allow
prediction of wild types and, therefore, directions of reversion for
separate bases. Using a Boolean algorithm written for this purpose, we determined that the average proportion of patients sharing the same
variable site is too low for useful analysis (results not shown).
Determining wild types at different sites would require two sequence
sets obtained, in the same group of patients, at two sufficiently
distant time points. We are not aware of such data for pro.
Model 2: chain of single-clone transmission.
In the absolute
sense, the average proportion of patients sharing the same variable
site (the frequency of variable sites in individuals) is still quite
high, 16%, contributing to the high average intrapatient distance in
pro. Can the reversion model explain such a high value? To
answer this question, we have to explicitly consider the evolution of
the virus along the chain of infection (Fig. 3). Suppose that each
individual infects the next individual at a fixed time since the moment
of his/her own infection with, as we still assume, a single genetic
variant, either mutant or wild type at a particular base. A predictable parameter, related to the frequency of variable sites in an individual, is the probability that a base is mutant in the inoculum for person n in the transmission chain.
As our calculation shows (see Materials and Methods), if the
transmission time is shorter than the reversion time (5 years, for the
bases of interest), the probability of a mutant being in the inoculum
will grow with every passage from individual to individual, until it
saturates at a constant value, which can be described as a steady state
in the transmission chain sense (Fig. 5a)
(see Materials and Methods). This value is not equal to the individual
steady-state mutant frequency; rather, it is much higher. In fact, for
very small transmission time intervals, and if the forward mutation
rate is much higher than the reverse mutation rate, the probability of
a mutant being in the inoculum can eventually approach 100%.

View larger version (38K):
[in this window]
[in a new window]
|
FIG. 5.
(a) Probability of a mutant base being in the inoculum
at the chain steady state, f* , as a function
of the transmission time, t*. Solid line, strictly
single-clone infection; dashed line, coinfection from two sources with
probability q = 0.05. (b) Number of cycles required to
reach the chain steady state, neq, as a function
of transmission time, t*. (c) Probability of a mutant base
for a very short transmission time, t* 0, versus the
probability of dual infection, q. Upper and lower curves
correspond to wild-type A/C and G/T, respectively. (d) The same
probability versus s, for a fixed value q = 0.01. trep = 2 days;
µf = 4 · 10 5,
µr = 4 · 10 6 for wild type A/C
and vice versa for G/T.
|
|
On an intuitive level, the predicted strong accumulation of mutants is
a combined effect of the transmission bottleneck, which causes an
initial population to be genetically uniform at a base, and of the
short passage time. Selection does not act on a uniformly mutant
population; its effect becomes important after mutations create a
sufficiently large wild-type subpopulation. Hence, quick passage of the
virus to the next person in the chain, before such a subpopulation can
be created, effectively suppresses selection against mutants.
As we have found (see Materials and Methods), the number of
transmission intervals required to reach the steady state in the transmission chain and the level to which mutants accumulate are proportional. To explain the high level of mutants observed in the
inoculum, the transmission interval has to be rather short, ~1 year
or less (Fig. 5a). In this case, hundreds of transmission cycles will
be required. Thus, the main contributor to the occurrence of mutants in
the inoculum is the virus evolution occurring before the start of the
HIV pandemic, which includes the human endemic and enzootic infection
in the original (animal) host. Although the transmission time and other
parameters may differ between human and animal hosts, this difference
is not expected to affect these conclusions seriously.
Model 3: coinfection.
Thus, if a single clone of virus is
transmitted each time, any base with a small selection coefficient will
eventually become highly diverse. (The selection coefficient has to be
less than the inverse of the number of virus replication cycles between two transmissions [Fig. 5a]) Is this model a realistic explanation for the large number of variable sites in pro? To answer
this question, we reexamined the approximations that were incorporated in the model and found that the theoretical explanation critically depends on the assumption that a recipient is always infected from a
single source. As we show below, even very rare coinfections from two
or more independent sources within a short time interval can prevent
accumulation of mutants and sharply decrease the number of variable sites.
Uniformity of the initial virus population is the factor that, as we
have shown, suppresses selection in the case of single-clone transmission. A uniformly mutant population gives rise to a new uniformly mutant population in the next individual (person C in Fig.
3), which is then passed further along the chain until a rare sampling
of the wild type is obtained for an inoculum. As a result, a typical
transmission chain is expected to consist of long series of individuals
infected with the same initial variant: mutant - mutant -...-mutant - mutant - wild type - wild type -...-wild type - wild type - mutant - mutant -...- mutant - mutant, and so on. Suppose, now, that an
individual (patient D in Fig. 3) is infected, before passing the virus
further, with two sequences from different sources, one from a mutant
series and another from a wild-type series. Suppose, also, that the two
transmission events with viruses of very similar fitness occur within a
short time interval (weeks), so that the virus transmitted first does
not have a significant advantage in establishing infection with respect to the virus transmitted second. (This assumption is indirectly supported by the short transmission time we had to assume above to
explain the accumulation of mutants for single-clone transmission, by
the actual observation that the protection effect in simian immunodeficiency virus (SIV)-infected animals is absent during the
first few weeks postinfection [51], and by the
observation that multiple virus clones can coexist in primary
infections [53].) In this case, both genetic variants
are well represented in the initial population. Therefore, even weak
selection between the two viruses starts to act immediately
[fcom (t) in Fig. 4], and there will be less mutant virus
to transmit to the next recipient. Such an event can interrupt a long
series of "mutant" individuals. Thus, even very rare dual
infections can prevent mutant variants from accumulating. This
qualitative conclusion is confirmed by direct calculation (see
Materials and Methods). The effect of dual infections on diversity
turns out to be especially strong for short transmission time
intervals, when the level of diversity otherwise would be very large
(Fig. 5a). Even if the probability of dual infection is as small as
q = 0.05, the mutant frequency in the inoculum is
decreased by a factor of 10 compared to the case of exclusively
single-clone transmission (Fig. 5c). (Note that the above argument and
corresponding calculation make sense only on average. A typical pair of
sequences from different sources is also heterozygous at many other
sites. The difference in fitness between the two clones is the sum of a
random term contributed from the other sites and of a regular
term due to the chosen site. The random term has a random sign and
vanishes when averaged over many pairs of clones. The same
consideration applies to evolution within an individual: although
competition between a particular pair of clones depends on the entire
configuration of each sequence, evolution of a chosen base can be
considered separately, since configurations of other bases average out
over many pairs of clones. This is the reason why evolution of a site,
in the absence of coselection and in a sufficiently large population,
can be considered separately from other sites, even if recombination is
absent.)
Thus, the accumulation of mutants to high levels occurs only as long as
the spread of infection among the animal or human population is exactly
represented by a branching tree, but it may be efficiently suppressed
if the branches can sometimes merge together (by coinfection), creating
an image of a cluster with loops. The topology of epidemics determines
the outcome of virus evolution.
Transmission of multiple sequences from the same source, unlike
coinfection from independent sources, does not greatly affect the
accumulation of mutants in the transmission chain. We have calculated
the frequency of mutants in an inoculum in the model in which either
one or two clones are transmitted from a single source (see Materials
and Methods). At short transmission time intervals (when the mutant
frequency in the inoculum is maximum), coinfection has no effect at all
on the frequency. (Such an effect may appear for a very large, in terms
of parameter s/µf, number of transmitted
clones and at finite transmission times.) The intuitive reason for this
is clear from Fig. 3: if person D received two or more genomes from
person A, all of them would be mutant. This situation would not differ
from that in which a single mutant genome is received from person A,
since the initial population established in person D would be mutant in
both cases. In mathematical terms, the difference between coinfection
from two epidemiologically distant persons and coinfection from the
same source is that sequences from two sources are statistically
independent (see Materials and Methods) while two sequences sampled
from the same person are not. Since each person has an almost uniform
initial population, they tend to be either both mutant or both wild
type; only rarely is one mutant and the other wild type. Conditions
under which two typical individuals in an infected population can be
considered as epidemiologically independent are discussed below.
Value of q estimated for HIV.
We now return to the
experimental data to find out how small the probability of coinfection,
q, must be to explain the high degree of diversity observed
in pro. Since, as follows from our calculation, the
probability of a mutant base is much less for sites with wild-type G or
T (Fig. 5c), most variable bases are expected to have A or C as the
wild type. If the test time and the replication cycle time did not vary
between patients, the probability of receiving a mutant base in the
inoculum would be equal to the observed frequency at which a variable
site appears in individuals, which is 16%. Since both times, in fact,
vary between patients, the probability of a mutant being in the
inoculum must be higher than 16%, because if an infected person is
tested too early or too late (compared to an average patient), given a
limited sample size (~20), a reverting base will not be detectably diverse (Fig. 4). The variation in replication time, as follows from
statistical analysis of data from 20 patients (17), is relatively small and can be neglected. Still, we have to take into
account the probable broad distribution of the test time among
patients. The time of progression to AIDS for HIV-infected individuals
is almost uniformly distributed between 2 and 14 years (42).
We assume that the test time is somewhat past the middle of the
infection course and is distributed uniformly as well, between 1.3 and
8.7 years, with the average being 5 years. The distribution density of
the selection coefficient around the value s
0.01
(which is, of course, unknown) also enters our calculation. Fortunately, as we had checked, the final answer happens to be rather
insensitive to the shape of the distribution density; as an estimate,
we assume a uniform distribution in the region of interest, which
happens to be (see Materials and Methods) s = 0.005 to
0.05. Under these approximations, we estimate that the observed
experimental frequency of a variable site, 16%, requires a
q value of 0.01 or less (see Materials and Methods). At this value of q, the probability of a mutant being in the
inoculum is close to 100% if the wild type is A or C and around 10%
if it is G or T (Fig. 5c).
The above calculation assumes that if coinfection with two virus
sequences occurs, the virus population after the acute infection phase
contains 50% of each variant. As the period of time between two
infections increases, the symmetry between the two clones may be lost.
A random initial composition of between 0 and 100% may be a better
approximation in this case. As we have checked, such a change of the
model does not affect the mutant frequency in the inoculum at very
small values of q and causes an increase in this parameter
at larger q values by a constant factor of 1.5 (the upper
curve in Fig. 5c shifts upward). As a result, the value of q
estimated above increases by 50%, and the final conclusions are not affected.
The above results suggest a surprisingly strong effect of quite rare
coinfection events on mutant frequency. Given our assumptions, a
probability of coinfection from independent sources of 1% during the
evolutionary history of the virus allows for a high observed frequency
of variable sites, 16%. A 10-fold-higher probability of coinfection
would decrease the latter frequency by a factor of ~10 (cf. Fig. 5c).
On the other hand, lowering q below 1% does not increase
the variable-site frequency further above 16%. Indeed, an HIV
population has to be, according to the model, in the rare-coinfection limits, when the probability of a mutant being in the inoculum is
already close to 100%. The difference between the 16% observed and
100% values is primarily due to fluctuations of the observation time
between patients.
 |
DISCUSSION |
In this paper, we present and evaluate a model designed to explain
the distribution of mutations observed in a relatively conserved region
of the HIV genome. In its simplest form, the model describes the
accumulation of diversity at individual sites as a consequence of
passing of slightly deleterious mutations upon relatively rapid
transmission of the virus from individual to individual, accompanied by
a slow reversion within infected individuals. As described in more
detail below, the model is robust to variation in most of the
assumptions we initially made, with one exception. The accumulation of
mutations during the chain of transmission is exquisitely sensitive to
coinfection of infected individuals prior to further transmission from
epidemiologically distant sources. To sustain the observed variability
in pro, either the frequency of coinfection must be 1% or
less or one of the other assumptions must be changed radically. We
present an alternative explanation at the end of this section.
The scenario we describe here is also a possible mechanism for the
accumulation of deleterious mutations in frequently passaged viruses,
as observed in vitro for vesicular stomatitis virus (5, 10, 11,
43). Another possible explanation (10) is the Muller's ratchet effect, in which the processes of reversion at different loci interfere with each other. This effect of stochastic evolution theory, which applies to populations that are on the order of
the inverse mutation rate, leads to enhanced accumulation of slightly
deleterious mutations (12, 14, 30). The testable differences
between the more simple effect described in this analysis and Muller's
ratchet are follows. (i) The "simple" ratchet acts on an individual
locus, while Muller's ratchet involves two or more loci and can,
therefore, be suppressed if recombination events are frequent (12,
14, 30). (ii) While the Muller's ratchet effect is restricted to
small virus populations (12, 14, 30), the simple ratchet is
expected to work even if the average virus population size between
transmissions is large. (iii) Simple ratchet implies that the reversion
processes at different loci do not have sufficient time to occur,
making their interference (Muller's ratchet) redundant.
As we have found, a high frequency of variable sites in HIV could be
explained as a combined effect of a transmission bottleneck and
frequent passaging between individuals, provided coinfection from
independent sources is extremely rare. The rate of superinfection of
individuals depends on both virological and epidemiological factors and
is hard to estimate, in general. A clue to the rate in modern
times is given by the frequency of coinfection with multiple HIV subtypes in geographical regions where no single subtype
dominates. Infections with two different HIV type 1 (HIV-1) subtypes
(or with HIV-1 and HIV-2) have been directly observed and are
estimated to occur in 10 to 15% of such cases (35, 36). This figure does not include reports of intersubtype recombination (15, 40), which imply coinfection at some indeterminate time in the past.
The fact that coinfection is relatively frequent despite the effect of
superinfection protection detected in SIV-positive animals
(51) suggests either that a narrow time interval between primary infection and establishment of protection exists or that the
protection is only partial in a probabilistic sense; data (51) allow for either interpretation. The two-subtype data
can be used to estimate the overall frequency of coinfection,
q. Assuming that the two subtypes circulate in equal
quantities, and taking into account the fact that cases of double
infection with the same subtype are not included in the above
percentage, we obtain, as a lower bound, q = 0.2 to
0.3, much higher than our prediction of q
0.01. We
have no way of knowing whether this value is representative of
conditions in ancient animal transmission, which, according to the
model, determine the mutant frequency in an inoculum in the modern-day
pandemic; however, there is no obvious barrier to the coinfection rate
which would keep its value as low as 1%. Although we cannot dismiss
the possibility of a very low coinfection rate in the natural host, we
do not know any independent facts to support this model.
Verifying approximations.
The above considerations suggest
that we may have to look elsewhere for an explanation for the high
frequency of variable sites. To find another possibility, we will
reexamine the simplifications common for all models discussed so far.
(i) Ignoring transversions.
We ignored transversions in both
experiment and theory. The danger of this approximation is that a
double transversion may appear as a transition and thus interfere with
our count of transitions. Given the small relative weight of single
transversions in the net variation (25%), this effect is expected to
be but a relatively small correction to the numbers of transitions, on
the order of (0.25)2 (~0.06).
(ii) Neglecting stochastic effects in steady state.
Stochastic
effects include randomness of mutation times and random drift of the
genetic composition. The linkage disequilibrium test (39)
implies that the effective HIV population is larger than
105 infected cells (P = 0.05) and that,
correspondingly, stochastic effects on the evolution of separate bases
within an individual are relatively small. In the same work, we
considered the effect of recombination on the test results and found
that it cannot decrease the lower estimate of the effective population
size by more than a factor of ~2.
(iii) Constant HIV population size.
In fact, in a typical
patient, the virus load expands sharply within 1 to 2 weeks and then
drops by 1 to 2 orders of magnitude over 2 to 3 months during the acute
phase of infection. Then, during the asymptomatic phase of infection,
the virus load climbs back slowly, over a scale of a few years, before
a large expansion at the final stage, leading to AIDS. In this work, we
consider the evolution of bases with a small selection coefficient,
which occurs slowly during the asymptomatic phase of infection. The ongoing slow change in the population size is not important for the
almost deterministic evolution (which we found is the case), since it
does not greatly depend on the population size.
(iv) Same wild type for all cells and individuals.
In fact,
HIV adapts to new cell types. The magnitude of this effect on the HIV
genome after a thousand replication cycles, the relevant time scale, is
hard to infer from the existing literature. It is certain that
adaptation on a scale comparable to the fraction of variable sites in
HIV pro, 16%, can be caused within 1 year by a very adverse
factor, such as an efficient protease inhibitor (8). It
could be also forced by the immune response, which we have not
considered to this point.
(v) "Well-stirred pot" approximation.
We implicitly
replaced the complex topography of virus distribution between and
within different lymphoid organs by an imaginary infected organ with
well-mixed cells infected with different genetic variants. Direct
visualization of two env variants of HIV in the spleen by
selective labeling shows that although infected cells are nonuniformly
distributed into islands, most of these islands are shared by different
variants (37). This result implies that virions mix well
between these islands; i.e., a significant part of the de novo
infection is due to far-travelling virus particles (38). The
conclusion is supported (38) by the large number of virion
particles removed daily from peripheral blood of SIV-infected animals
(28).
(vi) Independent sources of coinfection.
When calculating the
purifying effect of coinfection, we implied that two typical sources of
coinfection in the host population are epidemiologically distant, so
that the probabilities of receiving mutant bases from the two sources
are independent. For epidemiologically close sources, the purifying
effect of coinfection is expected to become weaker. Transmission of two
clones from the same source is a limiting case of nonindependent
coinfection; it does not have a purifying effect (see Materials and
Methods). Whether two typical animals in a population are
epidemiologically distant depends on the coinfection frequency
q and on the size of the animal population. Coinfections
randomize the distribution of virus variants between animals and make
them more independent. If coinfections are very rare (i.e.,
q is much less than 1), a mutant base can pass through a
long chain of animals without converting to the opposite variant. Then,
two animals from even a large population are likely to have the same
initial variant. If, however, the value of q is not too
small, say ~0.2 to 0.3, then the two typical animals are
statistically independent even in a small herd comprising just a few
animals. Therefore, invoking the epidemiological proximity does not
really help to explain the high mutant frequency in inocula; one still
has to postulate that q is very small.
(vii) Absence of group selection.
The possibility of group
selection (as opposed to selection of individual genomes) has been
raised (27). However, at this time, there is no compelling
evidence that such an effect exists for RNA viruses.
(viii) Neglecting coselection.
This approximation is rather
crude for a particular pair of sites. When studying linkage
disequilibrium (39) at close pairs of variable sites, we
found that some haplotypes are 2.5-fold more frequent than one would
expect for independently evolving sites, implying a noticeable
coselection effect. However, when obtaining an average pattern of
variation in a segment of genome as long as pro, containing
over 40 variable sites, contributions from negatively and positively
coselected pairs of bases to the net variation are expected to cancel
out, at least partially. For example, if coselection is responsible for
50% of the relative fitness of haplotypes at a separate pair of sites,
and there are 40 variable sites with a random sign of coselection
between each pair, then the effect of coselection on the relative
fitness at an average site is expected to be, by the laws of
statistics, ~50%/
39 (~8%), i.e., a small correction. This
argument, however, applies only under certain restrictions (see below).
(ix) Neglecting selection for diversity.
The predominance of
synonymous and conservative substitutions in pro in patients
at late stages of asymptomatic infection was the reason why we
initially neglected selection for diversity. This argument does not
imply that immune pressure is completely absent from pro but
rather states that its selective pressure is smaller than the effect of
purifying selection at a typical observation time.
Let us search for weak spots in these approximations. Note, first, that
a few of them have restrictions. First, the wild-type pro
sequence could differ between individuals strongly if, despite the
predominance of synonymous mutations, the immune function was somehow
involved; amino acids serving as binding sites for MHC subtypes
expressed in a person are highly individual. Second, coselection could
be relevant if the 47 variable sites in pro were somehow
dependent on a much smaller number of sites. Third