## ABSTRACT

To assess the dynamics of genetic reversion of live poliovirus vaccine in humans, we studied molecular evolution in Sabin-like poliovirus isolates from Nigerian acute flaccid paralysis cases obtained from routine surveillance. We employed a novel modeling approach to infer substitution and recombination rates from whole-genome sequences and information about poliovirus infection dynamics and the individual vaccination history. We confirmed observations from a recent vaccine trial that VP1 substitution rates are increased for Sabin-like isolates relative to the rate for the wild type due to increased nonsynonymous substitution rates. We also inferred substitution rates for attenuating nucleotides and confirmed that reversion can occur in days to weeks after vaccination. We combine our observations for Sabin-like virus evolution with the molecular clock for VP1 of circulating wild-type strains to infer that the mean time from the initiating vaccine dose to the earliest detection of circulating vaccine-derived poliovirus (cVDPV) is 300 days for Sabin-like virus type 1, 210 days for Sabin-like virus type 2, and 390 days for Sabin-like virus type 3. Phylogenetic relationships indicated transient local transmission of Sabin-like virus type 3 and, possibly, Sabin-like virus type 1 during periods of low wild polio incidence. Comparison of Sabin-like virus recombinants with known Nigerian vaccine-derived poliovirus recombinants shows that while recombination with non-Sabin enteroviruses is associated with cVDPV, the recombination rates are similar for Sabin isolate-Sabin isolate and Sabin isolate–non-Sabin enterovirus recombination after accounting for the time from dosing to the time of detection. Our study provides a comprehensive picture of the evolutionary dynamics of the oral polio vaccine in the field.

**IMPORTANCE** The global polio eradication effort has completed its 26th year. Despite success in eliminating wild poliovirus from most of the world, polio persists in populations where logistical, social, and political factors have not allowed vaccination programs of sustained high quality. One issue of critical importance is eliminating circulating vaccine-derived polioviruses (cVDPVs) that have properties indistinguishable from those of wild poliovirus and can cause paralytic disease. cVDPV emerges due to the genetic instability of the Sabin viruses used in the oral polio vaccine (OPV) in populations that have low levels of immunity to poliovirus. However, the dynamics responsible are incompletely understood because it has historically been difficult to gather and interpret data about evolution of the Sabin viruses used in OPV in regions where cVDPV has occurred. This study is the first to combine whole-genome sequencing of poliovirus isolates collected during routine surveillance with knowledge about the intrahost dynamics of poliovirus to provide quantitative insight into polio vaccine evolution in the field.

## INTRODUCTION

Over the last 50 years, mass vaccination campaigns with oral polio vaccine (OPV) have been a key factor in the dramatic decrease in the incidence of paralytic poliomyelitis and the eradication of wild poliovirus type 2 (1–3). OPV has proven highly effective because of its ease of administration and ability to transmit to secondary contacts (4–7). OPV contains live attenuated poliovirus (Sabin) strains that exhibit significantly reduced neurovirulence and likely reduced transmissibility relative to the neurovirulence and transmissibility of wild-type poliovirus due to key attenuating nucleotide substitutions (1, 7). However, in a vaccine recipient or close contact, the vaccine strains experience significant positive selection pressure that drives reversion of the attenuating nucleotide substitutions. In the absence of sufficient population immunity, the ability to be transmitted allows vaccine-derived poliovirus (VDPV) strains to circulate, evolve, and cause outbreaks of poliomyelitis indistinguishable from those caused by wild-type poliovirus.

Understanding the rates of nucleotide substitution early in the vaccine reversion process is critical to the development of a quantitative understanding of the processes leading to circulating VDPV (cVDPV) emergence. Due to genetic drift and selection, the Sabin strains distributed in OPV evolve to first become Sabin-like viruses that are defined to have genetic sequences that closely match the genetic sequence of the originating vaccine strain (<10 VP1 mutations for types 1 and 3; <6 VP1 mutations for type 2) (8). Sustained persistence in the human population allows further evolution into VDPV, of which the cVDPVs are a subset that are confirmed to be circulating by epidemiological investigation and genetic linkage. The substitution rate of the VP1 gene, estimated from the number of VP1 substitutions between the genome sequence of the sampled isolate and that of the vaccine strain, is used to estimate the transmission time from the time of vaccination to the time of cVDPV detection. For circulating wild-type polioviruses, the neutral substitution rate of the VP1 region of the capsid is approximately 2.8 × 10^{−5} nucleotides (nt) per site per day (9). This is approximately 0.025 nt per day, equivalent to 9 VP1 sites per year. Most of these substitutions are synonymous (*d _{n}*/

*d*= 0.03, where

_{s}*d*is the rate of nonsynonymous substitutions and

_{n}*d*is the rate of synonymous substitutions) (8). However, previous studies have indicated that in Sabin strains, reversion of known attenuating sites can occur within the duration of a single infection and that VP1 substitutions accumulate more quickly than would be expected from the wild-type neutral substitution rate; multiple VP1 substitutions occur within the first month after administration of the vaccine dose (10–15). In addition, most reported cVDPV isolates are recombinants with Sabin-derived capsid and enterovirus C (poliovirus and nonpoliovirus) noncapsid sequences, although the role of selection pressure in this process is not clear (1, 2, 11, 16–26). cVDPV has become a significant source of poliomyelitis. A better understanding of the timeline of events leading to cVDPV emergence may provide a basis for improved estimates of the risk of cVDPV emergence and is critical for improving models of OPV cessation and postcessation outbreak response strategies.

_{s}Here we describe 241 Sabin-like isolates collected in Nigeria via routine acute flaccid paralysis (AFP) surveillance during the period from 2009 through 2012. With the aid of a detailed intrahost model of poliovirus infection (27), we confirm that the VP1 substitutions accumulate at significantly higher rates in Sabin-like isolates than in circulating wild-type poliovirus and that the rates observed in this OPV-experienced population are consistent with the rates observed in newborns (14). We confirm that many key attenuating sites known to be responsible for the low neurovirulence and reduced replicative fitness of the vaccine strains revert in days to weeks during primary vaccine-induced infections. We identified Sabin-like isolates containing consensus recombinant genomes and compared their sequences with those of known cVDPV index case isolates. We found that cVDPV index case isolates are associated with a higher frequency of recombination with non-Sabin enteroviruses but that recombination rates that account for the time since vaccination are comparable in VDPV index case isolates and Sabin-like isolates.

## MATERIALS AND METHODS

Viruses.Viral isolates were obtained from stool samples collected during routine surveillance for AFP in Nigeria from 2009 to 2012 (28). Individuals were drawn from 35 of Nigeria's 36 states and Abuja, the federal capital. The age at the time of paralysis ranged from 0 to 12 years (mean, 2.5 years), the number of reported lifetime OPV doses ranged from 0 to 24 (mean, 4.9), and the precise timing of vaccination was rarely reported. The typical vaccination history contains more than the three recommended doses because polio vaccination in Nigeria is dominated by supplementary immunization activities (SIAs), and roughly 8 SIAs per year were typical during this time period (29–31) (Fig. 1). Isolates that were categorized as “Sabin-like” by intratypic differentiation using real-time reverse transcription-PCR (32, 33) were included in the study. After samples that contained mixtures of multiple poliovirus strains (about 15% of all available samples) were excluded, the initial sample set was reduced to 241 monotypic samples from 238 unique individuals. Because our focus was on mutation in Sabin-like isolates, we also excluded from the study an additional 168 type 1 wild polioviruses, 354 type 3 wild polioviruses, and 216 cVDPV type 2 isolates that were collected within the same time period (29–31).

Poliovirus genome sequencing and assembly.Total nucleic acid was extracted from 200 μl of each virus isolate using a MagNA Pure compact nucleic acid isolation kit and a MagNA Pure compact automated extractor (Roche Applied Science, Indianapolis, IN). For each isolate, two fragments of the genome were amplified using primers 5′ NCR (5′-AAGCAGGCTTTAAAACAGCTCTGGGGTT-3′) and 3′ NCR (5′-TCTCCGAATTAAAGAAAAATTTACCCCTACA-3′) and primer Internal Y7 (5′-GGGTTTGTGTCAGCCTGTAATGA-3′) and Internal Q8(367) (5′-AAGAGGTCTCTRTTCCACAT-3′) (34). Barcoded sequencing libraries were prepared using Nextera DNA sample preparation kits (Epicentre, Madison, WI). Next-generation whole-genome sequencing was performed by the use of 36-bp single-end reads on an Illumina GAIIx instrument (htSEQ, Seattle, WA). The preliminary quality evaluation for each sample was generated using the FASTQC program (version 0.10.1; Babraham Bioinformatics [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]). We preprocessed the raw data to remove the bases or reads with poor quality, reads containing adaptor sequences, and reads containing ambiguous base calls before performing downstream analysis using the CLC Genomics Workbench program (version 6.5.1; CLC Bio). All scores for sample per base sequence quality and sample per sequence quality passed the evaluation with the quality control software performed after the trimming process. A summary of each evaluation of the quality of the trimmed reads was extracted into Excel software tables and is available upon request. See Fig. 2 for more information about sample classification.

Consensus sequences were constructed as follows: the filtered reads were aligned with the bowtie program (35) to the three Sabin reference sequences (sequences with GenBank accession numbers AY184219, AY184220, AY184221), single nucleotide variants were identified using SAMTools (36) and the VarScan program (37), the serotype was determined by maximum mean coverage, and the consensus sequence was defined as the most common nucleotide at each position. Poliovirus serotype was assigned on the basis of the VP1 sequence, according to standard practice (38).

We also constructed *de novo* assemblies of the trimmed reads using both CLC Bio software and the SSAKE algorithm (28). The assemblies generated by CLC Bio were compared to those generated by the SSAKE algorithm. The assemblies generated by CLC Bio were selected for further analysis due to their longer contig lengths and better quality scores (the *de novo* assemblies are available upon request). For each sample, all contigs that were longer than 200 nucleotides and that had a mean coverage depth of more than 10% of that of the longest contig were selected for further analysis. The sequence of each contig was pairwise aligned to the sequences of the three Sabin reference strains with the Needleman-Wunsch algorithm in the MATLAB program (MATLAB and Bioinformatics Toolbox release 2012a; MathWorks, Natick, MA, USA), and positions were labeled as matched or missed. The alignment match/miss was smoothed with a 200-nt window and plotted, and sample identity was determined by visual inspection; recombination breakpoints correspond to regions on a contig where the highest-quality alignment to the reference sequence changes between strains. We identified 10 samples that contained recombinant genomes with noncapsid consensus sequences: 9 contained Sabin enterovirus-Sabin enterovirus recombinants and 1 contained a Sabin enterovirus–non-Sabin enterovirus recombinant (Fig. 3 and Table 1).

To compare our Sabin-like isolates to circulating wild-type and cVDPV strains, we determined the consensus wild-type nucleotide at each position from a global alignment of all wild-type complete genome sequences available in GenBank under the search string “Human poliovirus *x* complete genome,” where *x* is types 1 to 3. The consensus sequence was derived after exclusion by hand of vaccine-derived sequences not associated with circulating outbreaks.

Modeling of duration of infection for each sample by estimation of vaccination dates from the SIA campaign calendar and reported vaccination history.To estimate substitution rates from observations of the number of substitutions, one must also know for each case the time interval from the start of OPV infection to the sampling time. For our samples obtained from surveillance for AFP in Nigeria, the sampling date is known, but the OPV exposure date is generally unavailable. Therefore, we used a Bayesian approach to infer a plausible range of values for the unknown exposure time point. To derive data-informed probability distributions that describe plausible exposure dates for each case, we incorporated several types of data, including shedding duration and lifetime vaccine exposure. Using Bayes' theorem, we posit that, given that shedding was observed on the sampling date, the posterior probability (*P*) that exposure occurred on a certain date (*t*_{exposure}) can be inferred by
1
where *t*_{exposure} is the exposure date, *P*(shedding | *t*_{exposure}) is the probability that an individual will still be shedding on the sampling day given the exposure date, and *p*_{0}(*t*_{exposure}) is the prior probability for the exposure date independent of shedding status (throughout this paper, we obey the convention that *p* denotes probability densities and *P* denotes probabilities).

To inform the exposure date prior probability, *p*_{0}(*t*_{exposure}), we used the fact that most vaccine recipients in Nigeria are reached by short-duration SIA campaigns (http://www.polioeradication.org/Aboutus/Strategy/Supplementaryimmunization.aspx). For each sample, we identified plausible SIA campaign sources of exposure: any campaign that administered a vaccine consistent with the observed poliovirus type in the case's state and district that also took place less than 60 days preceding the sample collection date. For 60% of the samples, at least one plausible campaign source was found. Because the bulk of an SIA campaign occurs within 3 days of the campaign start day, with additional time possibly being needed for wrap-up activities, we represented the probability of a certain exposure date, given that a campaign was observed, with a combination of a uniform and an exponential distribution:
where *t*_{campaign} _{start} is the campaign start date. This distribution was normalized so that it was continuous at *t*_{campaign start} + 3 days. In addition to exposure from SIAs, exposure may also have resulted from unreported routine vaccination, to which we assign a constant probability *P*(*t*_{exposure} | routine), which is abbreviated γ. The routine exposure probability was set relative to the campaign exposure probability so that the odds of exposure by routine vaccination relative to exposure by SIA vaccination roughly match expectations within Nigeria for each serotype. For types 1 and 3, we assumed that γ was equal to 0.15, and for type 2, we assumed that γ was equal to 0.5. The difference between serotypes reflects the fact that bivalent type 1 and 3 OPV is more commonly used than trivalent OPV in campaigns. It is difficult to accurately estimate routine immunization exposure in Nigeria beyond knowing that it is low compared to SIA campaign exposure, but our results were insensitive to this difficult-to-estimate parameter for any plausible value (γ < 0.75). We then represented the complete exposure date prior distribution as the sum of the probability of routine exposure and the probability of SIA campaign exposure as follows:
2
Motivated by the observation that the majority of our samples contained genotypes closely matching those of the vaccine strains, we assumed that the shedding duration distribution for individual poliovirus infections determines the probability of shedding after a certain duration *T*, which is equal to *t* − *t*_{exposure}, *P*(shedding | *T*). The shedding duration was assumed to follow a log-normal distribution:
3
The shedding duration depends on prior immunity to each serotype, as measured by the neutralizing serum antibody titer, log(*N*_{Ab}), such that μ is equal to log(*T*_{max})[1 − *k _{T}* log (

*N*

_{Ab})] and σ is equal to σ

_{T}, where

*T*

_{max}is the mean duration of shedding in the absence of immunity,

*k*is the reduction in shedding duration due to immunity, and σ

_{T}_{T}is the standard deviation of the shedding duration. From published studies (27, 39), we derived the following values:

*T*

_{max}is equal to 25.3 days,

*k*is equal to 0.15, and σ

_{T}_{T}is equal to 0.58 log(days). This shedding duration distribution encompasses the 5- to 60-day characteristic time scale of a poliovirus infection, in which higher antibody titers correlate with shorter infectious durations. Given the sampling date, we used the distribution for shedding given the duration in equation 3 to derive the distribution for shedding given exposure date

*P*(shedding |

*t*

_{exposure}), needed in equation 1.

Because serological information is not routinely collected during surveillance for AFP, we estimated log(*N*_{Ab}) from the self-reported vaccination history and lifetime vaccination campaign exposure using the model developed in reference 27. The history dependence is important because the shedding duration for each individual depends on the individual's immune state and can vary significantly across individuals. First, for each individual, we estimated the number of vaccination attempts for each serotype, and then we modeled the expected antibody titer that would result from that vaccination history (assuming no waning). The estimated number of vaccination attempts for each serotype is as follows: number of vaccinations of serotype *x* = (reported number of lifetime doses)(number of lifetime campaigns with serotype *x*/number of lifetime campaigns), where the number of campaigns is determined using the campaign calendar for Nigeria provided by WHO (29–31), and multivalent vaccines were counted as independent attempts with each serotype contained in the vaccines. For children who reported known routine immunization doses, trivalent OPV was attributed to each routine dose and the campaign exposure accounted for the remaining doses only. After estimating the vaccination history of each individual, we used the antibody-dependent dose-response model in reference 27 to estimate the mean humoral antibody titer. In the model, individuals are challenged according to their vaccine history, where the probability of acquiring infection from vaccination is given by equation 9 of reference 27. We scaled the probability from the model by use of a multiplier of 0.8 to account for vaccine take in Nigerian AFP cases relative to that in healthy trial subjects (27, 40). Successful vaccinations increase the antibody titer (*N*_{Ab}) according to equations 3 and 4 of reference 27. Both the shedding and exposure features of the model are combined using equation 1 to construct the prior for the time interval between the date of exposure and the sampling date for each case.

To check the self-consistency of the immunity model that sets the log(*N*_{Ab}) parameter in equation 3, we compared the number of samples collected and the number of VP1 substitutions for each sample to the modeled estimated mean number of prior infections and estimated mean antibody titer (Fig. 4). We see that for higher levels of estimated immunity, we were less likely to detect any shedding and less likely to detect shedding at longer durations of infection (as measured by the number of VP1 mutations). Both of these observations are consistent with expectations from the literature that shedding duration and shedding probability after exposure decrease with increasing immunity (27, 41).

Estimation of VP1 substitution rate for Sabin-like isolates.For each sample, we then derived the distribution of rate *R* given *k* substitutions from Bayes' theorem:
4
where the probability distribution of each possible shedding duration, *p*(*T* | shedding), was given by equation 1 and *T* is equal to *t* − *t*_{exposure}. For the prior distribution for *R*, *p*_{0}(*R*), we assumed a normal distribution with mean and standard deviation derived from the neutral wild-type VP1 substitution rate of 0.028 ± 0.003 nucleotides per day (9). To represent the probability of each number of VP1 nucleotide substitutions *k* given a particular nucleotide substitution rate *R* and shedding duration *T*, we used a Poisson distribution:
The VP1 substitution rate distribution conditioned on the entire set of samples, *p*(*R*), then follows as the product over *i* sample estimates:
5

Vaccine stock diversity at position 2493 in Sabin type 3.Sequencing-based studies of vaccine stocks have previously indicated that diversity may exist at certain positions, including position 2493 in Sabin type 3, especially between vaccine suppliers (42). Such diversity may affect our estimation of VP1 substitution rates for Sabin type 3; therefore, we accounted for the uncertainty in the number of substitutions at position 2493. In this case, instead of calculating the rates based solely on the observed number of substitutions relative to the reference sequence (which could inappropriately attribute reversion in the vaccine stock to evolution within people), we assumed that the number of substitutions at position 2493 that were attributable to reversion varied from 0 to the observed substitution count. This assumption was represented by a uniform distribution on the number of substitutions, which was used as a prior distribution in calculating the per base rate at position 2493 as well as the total and nonsynonymous VP1 substitution rates. In this study, we refer to position 2493 as a known attenuating site in order to be consistent with the majority of the literature.

Posterior updated exposure date distribution.After estimating VP1 substitution rates, we performed a Bayesian update step to improve the precision of our exposure date distributions for each sample by using the information contained in the number of VP1 substitutions. The updated posterior for each sample is given by
where *p*(*R*) is the posterior VP1 substitution rate distribution.

Estimation of per base fixation rate.To estimate a per base fixation rate, *p*(*R* ∣ *k* = 1), we repeated the previous calculation in equation 5 under the assumptions that substitutions reach fixation and additional substitutions at the same site occur with a negligible probability prior to sampling. In this case, we formulated the likelihood of the rate *R*, *P*(*k* | *R*, *T*) as
6
The prior distribution for the fixation rate, *p*_{0}(*R*), was posed as the weighted sum of a Gaussian distribution centered at the neutral wild-type rate and a uniform distribution:
7
where 풩(μ, σ) is a normal distribution with mean, μ equal to 2.8 × 10^{−5} nt per day and σ equal to 0.28 × 10^{−5} nt per day, (9), and Δ*R* is the range of *R*, which spanned the interval 5 × 10^{−6} < *R* < 1 nt per day. We set γ equal to 10/7,440 to represent the assumption that approximately 10 nucleotide positions were expected to revert at rates significantly exceeding the wild-type neutral rate. This representation assumes that most sites are under weak or no selection (described by the Gaussian prior distribution) but still allows the possibility that some sites are under strong selection pressure (described by the wider uniform prior distribution). We identified sites under strong positive selection as those for which the posterior substitution rate distribution is significantly greater than the prior distribution rate in equation 7 (nonoverlapping 95% highest-density intervals). We found that when the mean posterior per base rate was 3 or more orders of magnitude above the expected wild-type neutral rate, the inference became insensitive to the details of the chosen prior. For rates higher than but closer to the expected neutral rate, the heavy weight of the prior on the Gaussian component near the wild-type neutral rate biases the posterior estimated rate to be low in comparison to that obtained with a uniform prior (γ = 1). This property of the prior encodes a Bayesian approach to multiple-testing correction: the evidence for strong selection at a site has to be sufficient to overcome the prior expectation that the site is neutral with probability (1 − γ) = 0.9987.

To improve estimates of the time between initiation of the dose and VDPV detection (*T*_{VDPV}), we propose the inclusion of information about the higher VP1 substitution rates identified from the Sabin-like isolates. Because there are no known phylogenetic lineages linking Sabin-like virus sequences to circulating outbreaks to provide definitive evidence for how the VP1 substitution rate evolves, we propose using a model where (i) VP1 sites under strong selection evolve independently at the per base rates sampled from the posterior distributions derived above and (ii) all other sites mutate randomly at the wild-type rate of 1% per year. To estimate the time to VDPV using this model, we generated 10^{6} realizations of the model and defined VDPV using a VP1 threshold, *N*_{VDPV}, of 10 substitutions for Sabin types 1 and 3 and 6 substitutions for Sabin type 2 (8). Specifically, we used the following algorithm:

Sample the rate of reversion,

*R*_{site}, for each attenuating VP1 position under strong selection from the corresponding posterior distribution. The total number of sites under strong selection is denoted*N*_{revert}.Sample the time to reversion,

*T*_{revert}, for each attenuating nucleotide from an exponential distribution with rate*R*_{site}as follows: p(T_{revert}∣ k = 1) = R_{site}e^{−Rsite}^{Trevert}.Sample the time

*T*_{background}to any given number of additional VP1 mutations from a gamma distribution with shape parameter α, which is equal to*N*_{VDPV}−*N*, and rate parameter β, which is equal to (0.01/365)[(length of VP1) −*N*_{revert})].If

*T*_{background}is greater than the maximum value of*T*_{revert}, then VDPV has already been generated and*T*_{VDPV}is equal to*T*_{background}.Else, a VDPV has not yet been generated. Sample the time to an additional background VP1 mutation from an exponential distribution with rate β and add it to

*T*_{background}. Repeat until the number of background mutations plus the number of reverted mutations with*T*_{revert}<*T*_{background}is ≥*N*_{VDPV}.If the sum is equal to

*N*_{VDPV}, then*T*_{VDPV}is equal to*T*_{background}.Else,

*T*_{VDPV}is equal to max(*T*_{revert},*T*_{background}).

Recombination rate estimation in samples with Sabin-like isolates.For the consensus recombinants, we estimated a Poisson rate for recombination averaged over all serotypes and recombination breakpoint positions using the methodology described above for the VP1 substitution rate, with *k* being equal to the number of recombination breakpoints. A uniform distribution was assumed for the prior recombination rate distribution. For comparison, we performed a similar analysis on the index case isolates of known cVDPV type 2 lineages in Nigeria (19).

Analysis of within-sample genetic diversity.To characterize whole-genome diversity, we calculated whole-genome entropy for each sample. We processed variant call format (VCF) files by assuming a maximum of two variants at each position (e.g., A and G at Sabin type 2 position 481). The variant frequencies at each position were *p* and 1 − *p*, where *p* is the frequency reported in the VCF file. We calculated Shannon entropy (*H*) as Σ[−*p* log *p* − (1 − *p*) log(1 − *p*)]. Per sample entropies were plotted against the mean ages of infection for the individuals providing samples, and curves were fit using least-squares regression in MATLAB (MathWorks, Natick, MA).

Accession numbers.Consensus genome sequences for all samples are available in GenBank under consecutive accession numbers KJ170425 through KJ170677 (VP1 sequences only for recombinants are available in GenBank under accession numbers KJ170425 to KJ170435; whole-genome sequences are available for all others). The raw sequencing reads and their associated metadata are available in the Short Read Archive under consecutive BioSample accession numbers SAMN04125839 to SAMN04126091.

## RESULTS

Sampled population.We found 241 samples that contained at least one high-quality whole-genome alignment for a single strain (Sabin type 1, *n =* 97; Sabin type 2, *n =* 43; Sabin type 3, *n =* 101). For 3 individuals, serial stool samples expressed Sabin type 1 on one day and Sabin type 3 on another, but none of the stool samples contained a mixture of types. The relative abundances of the frequencies of the three serotypes in the samples are roughly consistent with the relative amounts of trivalent and bivalent (type 1 and 3) OPV administered in Nigeria during the time period sampled (2009 to 2012) (29–31). We also found 10 samples that contained recombinant noncapsid consensus genome sequences (Fig. 3 and Table 1).

The cases studied represent a convenience sample obtained from surveillance for AFP. OPV causes vaccine-associated paralytic poliomyelitis (VAPP) at a rate of roughly 3.8 cases per million births (range, 2.9 to 4.7 cases per million births) (43). Nigeria experienced 7 million births per year during the 4-year period sampled (http://www.unicef.org/infobycountry/nigeria_statistics.html), and so approximately 106 VAPP cases (range, 81 to 132 VAPP cases) are expected in our data set. The classification of an AFP case as VAPP depends on clinical disease status (AFP, residual paralysis at 60-day follow-up) and a history of recent vaccination (onset of paralysis within 4 to 40 days after OPV exposure) (43, 44; http://www.paho.org/hq/index.php?option=com_content&id=1934). Since the timing of vaccination is largely unreported, a formal diagnosis of VAPP is not possible for the majority of cases. However, of the 191 cases that were not lost to 60-day follow-up, 89 exhibited residual paralysis and so had conditions clinically compatible with VAPP. The remaining 102 exhibited no residual weakening and so were not clinically compatible with poliomyelitis. They represented non-polio AFP cases that had been shedding Sabin poliovirus due to recent exposure at the time of paralysis. To test if the properties of the samples might depend on the clinical outcome, we compared the distributions of the number of reported doses by residual paralysis status for each serotype and found no significant differences (*P* > 0.05, pairwise Kolmogorov-Smirnov test). Similarly, there were no significant differences in age.

Estimated duration of infection: intervals between exposure and sampling dates.The posterior exposure date distributions for the three types are shown in Fig. 5A. They represent our data-driven inferences about the typical duration of infection in field samples from patients infected with Sabin-like isolates. The mean interval from the time of exposure to the time of sample collection was 7 days, and there were no significant differences among the three types.

To assess the plausibility of the exposure date estimation procedure, we examined the results for the 13 samples for which a self-reported date of the most recent vaccination was available. When the last reported dose was the source of the sampled infection, the result obtained by the Bayesian method should be consistent with the report. Disagreement may show error in the model or may occur because the child received the infection from a source other than the last reported dose. Figure 5B compares the prior and posterior distributions to the date of exposure, assuming that the last reported dose was the source. The results for 9 of the 13 samples were consistent at the 95% confidence level. The results for the remaining 4 differed in ways that suggest that the reported date of last exposure was not the source of the infection detected in the sample: for the mismatches, the mean reported interval between the time of sampling and the last reported exposure (33.5 days) was significantly longer than the mean reported interval for the matches (8.4 days), but the mean number of VP1 substitutions for the mismatches (*k* = 0.5) was less than the mean number of VP1 substitutions for the matches (*k* = 1.0).

VP1 substitution rates in Sabin-like isolates.The frequencies of total VP1 substitutions for each serotype are shown in Fig. 6A. The mean number of VP1 substitutions was 0.50 for Sabin type 1, 0.63 for Sabin type 2, and 0.94 for Sabin type 3. To test if the VP1 substitution count might be correlated with clinical outcome, we compared the distributions of VP1 mutations by residual paralysis status for each serotype and found no significant differences (*P* > 0.05, pairwise Kolmogorov-Smirnov test).

From the observed VP1 substitution counts (Fig. 6A to C) and our model of the interval between the time of exposure and the time of sampling for each case, we estimated the VP1 substitution rate for each Sabin type. We estimated that substitutions in VP1 accumulate in Sabin-like isolates at rates 2 to 8 times faster than the rate that characterizes the accumulation of substitutions in circulating strains with the wild phenotype (0.28 × 10^{−4} ± 0.03 × 10^{−4} nt per site per day) (9): • Sabin type 1, 0.83 × 10^{−4} nt per site per day (95% highest posterior density [HPD], 0.57 × 10^{−4} < *R* < 1. 15 × 10^{−4} nt per site per day) • Sabin type 2, 1.07 × 10^{−4} nt per site per day (95% HPD, 0.65 × 10^{−4} < *R* < 1.65 × 10^{−4} nt per site per day) • Sabin type 3, 0.85 × 10^{−4} nt per site per day (95% HPD, 0.45 × 10^{−4} < *R* < 1.38 × 10^{−4} nt per site per day) (Fig. 6D) For types 1 and 2, the rate estimates from the full model are closely approximated by dividing the mean number of substitutions per site by the mean estimated sampling interval (7 days). For type 3, the rate estimate is adjusted downward to account for the possibility that some of the substitutions at position 2493 within VP1 were present in the vaccine stock (42) and should not be attributed to within-host evolution (see Materials and Methods).

The rates of nonsynonymous substitutions were comparable to the rates of synonymous ones (Fig. 6E and F) and, thus, were 2 orders of magnitude higher than the wild-type rate of nonsynonymous substitutions of 8.3 × 10^{−7} nt per site per day (9). This is due to the strong selection pressure acting in the sampled population. We also observed that synonymous substitution rates were insignificantly different from the synonymous substitution rate for the wild-type molecular clock, although the mean estimate of the rate for type 2 was twice the rate for the wild type and the estimated mean rates for types 1 and 3 (Fig. 6E).

To check the plausibility of our model-based estimates of the VP1 substitution rates, we compared our VP1 substitution rate for Sabin type 1 to the result reported in a recent longitudinal study of shedding after monovalent OPV type 1 challenge in vaccinated infants in Egypt (14). Both studies produced similar estimates for the expected number of substitutions over time, despite substantial differences in the study designs (Fig. 7).

Per base substitution rates for nonrecombinant samples.To identify the sites under strong positive selection, we estimated the per base substitution rate at each site. Isolates with possible recombinant segments were excluded from this analysis. Results are shown in Fig. 8 and Tables 2 to 4. The rate estimates for each position derived from the full model can be approximated from the fraction of samples that contained a variant, *f*, which is equal to the (number of samples with substitutions)/(total number of samples), and the mean estimated interval between the time of exposure by inverting equation 7, as follows: mean estimated interval ≈ (1/7 days)[*f*/(1 − *f*)].

For 97% of sites in all serotypes, zero consensus substitutions were observed. We identified sites under strong positive selection to be sites for which the posterior estimate of the substitution rate was significantly greater than the prior estimate of the neutral rate (0.28 × 10^{−4} ± 0.03 × 10^{−4} nt per site per day) (9) (see Materials and Methods). For this data set, our definition of strong selection flags all nucleotide positions (sites) for which 3 or more samples contained a substitution at that site (a complete list of all sites with two or more fixed substitutions appears in Table 5). We identified high reversion rates at known attenuating sites in all three Sabin types.

For Sabin type 1 (Fig. 8A; Table 2), we observed high reversion rates for three of the six known neurovirulence-attenuating sites at positions 480, 2749, and 2795 (1, 45). Reversions at nt 2438 and 6203 are consistent with the occurrence of substitutions at the neutral rate, and reversion at nt 935 likely occurs too slowly to have been observed in this data set. Typical reversion rates at key sites for Sabin type 1 are an order of magnitude lower than those for types 2 and 3. Along with the high number of attenuating mutations in Sabin type 1 (1), the reduced reversion rates may be responsible for the reduced VAPP rate in individuals who received Sabin type 1 relative to that in individuals who received the other types (1).

For Sabin type 2 (Fig. 8B; Table 3), both attenuating sites at positions 481 and 2909 reverted quickly with respect to the wild-type baseline rate, as expected (1, 5, 10–12, 14, 16, 46, 47), with estimated mean times to reversion of 6 days and 21 days, respectively. We also observed a high substitution rate at the site at position 398, which has not been associated with neurovirulence but which is a possible second-site substitution associated with reversion at the site at position 481 (1, 48); the substitution is also represented in the consensus sequence of circulating strains derived from available poliovirus type 2 sequences in GenBank. The short time scales of reversion are such that the wild-type phenotype can be attained within the duration of a single infection (41).

For Sabin type 3 (Fig. 8C; Table 4), we observed high reversion rates at the attenuating site at position 472 (1, 49–51). The site at position 472 is believed to be the primary site of neurovirulence attenuation in Sabin type 3, and the reversion rate at that nucleotide is insignificantly different from that of the homologous nucleotide at position 481 in type 2. Because of the vaccine stock diversity issue previously described in Materials and Methods (42, 52), we were unable to provide a precise estimate of the mutation rate for position 2493; the estimate provided averages over all possible substitution counts, consistent with the data described in Materials and Methods. The variation that we observed at the attenuating site at position 2034 was not sufficient to infer a significantly increased reversion rate (94 of 101 isolates had the U residue of the Sabin reference strain; 2 of 101 isolates had the fixed C of the wild-type strain; 5 of 101 isolates had diverse U and C residues [<50% of the reads in the sample reverted]). During replication, the absence of strong selection at this neurovirulence-attenuating site and possible suppression by second-site mutations at positions 2636 and 2637 (49, 53) may be partially responsible for the reduced rate of emergence of cVDPV type 3 relative to that of cVDPV types 1 and 2.

For all types, additional sites other than the known attenuating sites experience significantly higher substitution rates. As indicated in Tables 2 to 4, some of these sites have previously been observed to undergo rapid reversion but have not been associated with neurovirulence or temperature sensitivity (45, 48, 54–57). These sites may be relevant to other aspects of the replicative fitness and wild-type phenotype not described by monkey, mouse, and tissue culture assays.

We also observed significant selection at antigenic sites consisting of position 1941 in Sabin type 1 (position 56 in VP3) and position 1445 in Sabin type 3 (position 165 in VP2) (26, 48, 54–56, 58). To test for an association with prior immunity, we stratified by reported dose history and found no significant difference in the proportion of antigenic variants between cases reporting zero doses and cases reporting at least one dose (*P* > 0.05, two-sample *z* test). While we observed a significant number of antigenic variants in Sabin-like isolates, the antigenic variants were not found among the wild-type strains in GenBank (Tables 2 and 4).

To test if reversion of the primary neurovirulence-attenuating sites in the 5′ noncoding region (NCR) is associated with the residual paralysis clinical outcome, we compared the proportions of samples displaying reverted nucleotides at known attenuating sites in the 5′ NCR (nt 480 for Sabin type 1, nt 481 for Sabin type 2, nt 472 for Sabin type 3) and found no significant differences (*P* > 0.05, two-sample *z* test).

Within-sample genetic diversity increases over time.The variant frequency data provided by Illumina sequencing provided the basis for characterizing within-sample genetic diversity and its evolution over time. Our diversity measure, the Shannon entropy, emphasizes diversification without fixation as opposed to substitution. Sites under strong selection contribute little to the entropy because significant coexistence of variants is unlikely when one variant has a much higher level of fitness. Changes in entropy over time describe the development of the quasispecies cloud, which may have its own contributions to fitness and phenotype that are not well described by fixation of substitutions (59–61).

We found that diversity increases with the estimated age of infection, and it increases faster for type 2 than for types 1 and 3 (Fig. 9). This pattern of diversification rates (type 2 > type 1 = type 3) was the same as that observed for synonymous VP1 substitutions (Fig. 6E) and is suggestive of differences in the underlying mutation rate. Increased replication fidelity has been associated with decreased replicative fitness (62, 63), and thus, this observation may also help explain the higher rate of cVDPV emergence due to type 2. Differences in diversification rates could also be due to differences in the viral population ecology within the gut, which is affected by immune status (41, 64–66). An alternative hypothesis that we cannot rule out is that our assumption that most of the samples are from primary vaccine recipients is more often false for type 2 due to higher levels of secondary spread of type 2 (6), and so we may be underestimating the range of times from the time of exposure to the time of sampling and, thus, overestimating the diversification rate. A more detailed population genetic analysis of the diversity data is in preparation.

Estimated date of VDPV emergence postvaccination.The accelerated early VP1 substitution rates have implications for dating the origin of cVDPV outbreaks. Sequences are characterized as cVDPV on the basis of a VP1 substitution count threshold. Before a case is considered evidence of circulation, Sabin types 1 and 3 must accumulate at least 10 VP1 substitutions and Sabin type 2 must accumulate at least 6 VP1 substitutions (8). The assumption has been that these substitutions accumulate at the wild-type neutral rate of approximately 9 VP1 nt per year, and so it has been assumed that cVDPV lineages must be circulating for about a year (range, 240 days [type 2] to 410 days [types 1 and 3]) (1, 19, 23).

We propose to update the model of the time from the initiation of dosing to VDPV detection to account for strong selection on attenuating sites in VP1. As described in Materials and Methods, we used a model that accounts for the fast mutation of selected sites and the neutral evolution of the rest of the VP1 gene to estimate the distribution of the times from the time of initiation of the dose to the time of attainment of either 6 or 10 substitutions for Sabin type 2 or Sabin types 1 and 3, respectively: for Sabin type 1, 300 days (100 < *T* < 600 days); for Sabin type 2, 210 days (50 < *T* < 460 days); for Sabin type 3, 390 days (140 < *T* < 710 days) (Fig. 10). The estimate of the time to VDPV for type 2 shows that there is a finite probability that 6 VP1 substitutions can occur within the duration of a single infection in an otherwise healthy, immunologically naive individual (41).

Evidence of circulation among samples infected with Sabin-like isolates.To identify possible circulating lineages, we looked for sets of sequences that were closer to each other in genetic distance than they were to their Sabin parent strain with respect to the whole-genome sequence and VP1 only. We identified three pairs of sequences (two pairs for type 3, one pair for type 1) that showed evidence for linkage by descent. Summary statistics based on genetic distances are shown in Table 6, and the shared nucleotide substitutions are listed in Table 7. The evidence is stronger for type 3: the genetic divergence from the Sabin strain was greater for both pairs, and the geographic proximity was closer. All occurred in geographic locations and during periods with a low wild poliovirus incidence (29–31). No known established circulating vaccine-derived polio outbreaks caused by types 1 and 3 have occurred in Nigeria, which indicates that these lineages died out without targeted intervention. For type 2, there was no evidence of circulation of the Sabin-like sequences. Furthermore, there was no evidence that any of the type 2 Sabin-like sequences are ancestral to known cVDPV emergences (not shown).

Recombination rate in Sabin-like samples.Of the 10 consensus recombinant samples (capsid; Sabin type 2, *n =* 2; Sabin type 3, *n =* 8) identified, 2 were phylogenetically related and so we counted the pair as only one sample in the analyses described below. Breakpoints between Sabin strains were concentrated in the 2C and 3C regions (Fig. 3 and Table 1), consistent with previous observations (19, 21, 22, 67). One sample with a Sabin type 3 capsid is a recombinant near nt 4600 in region 2C with a non-Sabin virus; alignment of the *de novo* assembly with available non-Sabin enterovirus sequences from GenBank most closely matched sequences of coxsackieviruses in species enterovirus C (GenBank accession numbers AF499637 [coxsackie A virus type 13], AF499639 [coxsackie A virus type 17], and AF499640 [coxsackie A virus type 13] [68]). The consensus recombinants were systematically older than the viruses in the other samples: the mean estimated date since exposure was 15 days, which is significantly different from the 7-day mean for all samples. The estimated Poisson rate (*R*) to fix recombinant genomes averaged over all types was (184 days)^{−1} [range, (403 days)^{−1} < *R* < (104 days)^{−1}].

To put our Sabin-like recombination fixation results in context, we compared them with those for the index case isolates of known cVDPV emergences in Nigeria (19). We observed that the proportion of Nigerian cVDPV index case isolates that are recombinant in the 3D*pol* region with non-Sabin enteroviruses, 16 of 20, was significantly higher than that for Sabin-like fixed 3D*pol* recombinants: 1 of 9 for all types aggregated (*P* = 0.00005, Fisher's exact test) and 0 of 2 for Sabin type 2 only (*P* = 0.033). However, cVDPV index case isolates are systematically more diverged from Sabin strains than the isolates in this study, and thus, more time prior to cVDPV detection is available for recombination to occur. To account for the difference in the duration of infection/circulation time, we also estimated the recombination rate prior to cVDPV emergence. Twenty of the 23 outbreaks arose from viruses with recombinant genomes containing recombinant 3D*pol* (19). Based on the estimated time intervals between the time of initiation of the dose and the time of detection of the first isolate derived from VP1 substitution counts, we estimated that the recombination fixation rate (*R*) preceding cVDPV emergence, averaged over breakpoint position, is (300 days)^{−1} [range, (510 days)^{−1} < *R* < (210 days)^{−1}]. This is not significantly different from the recombination fixation rate that we inferred for the Sabin-like isolates.

## DISCUSSION

As we approach the eradication of wild poliovirus, cVDPV is of great concern, and many unknowns about the dynamics of reversion persist (1, 23). This study provides the first estimates of the whole-genome substitution rate of Sabin-like isolates from immunologically experienced individuals in a region of endemicity. It also confirms key results from the analysis of van der Sanden et al. of vaccine reversion in newborns (14): that VP1 substitutions in Sabin-like poliovirus shed by primary vaccine recipients occur at rates faster than those observed in circulating wild-type isolates.

The Poisson rate of substitution of the VP1 segment of the genome is approximately 3 to 4 times faster in Sabin-like isolates than in circulating wild strains of poliovirus. Key attenuating sites in the capsid and 5′ noncoding region begin reverting within days of vaccine administration, consistent with previous observations (10–15). Thus, it is possible for the substitutions needed to regain the wild phenotype to revert within a primary vaccine recipient, especially for type 2. This observation is consistent with studies of OPV spread that indicate that Sabin type 2 is more highly transmissible than Sabin type 1 and Sabin type 3 (6, 7). The observation that cVDPV outbreaks are rare, occurring only a few times per year, given the hundreds of millions of OPV doses administered annually, indicates the difficulty of establishing endemic circulation. However, as the world moves to a post-OPV future, decreasing intestinal immunity may make cVDPV emergence easier if OPV is used in response to outbreaks. The observations in this study will be helpful for more accurate modeling of OPV transmission to study outbreak dynamics and anticipate future risks.

The increased rate of VP1 mutations in Sabin-like isolates relative to that in the wild type has implications for inferring the timing between VDPV detection and the initiation of dosing. After accounting for fast reversion at key sites, we infer that type 2 likely requires 210 days of sustained transmission, on average, to reach VDPV emergence (versus 240 days if the wild-type rate is assumed), type 1 requires 300 days, and type 3 requires 390 days (versus 410 days if the wild-type rate for types 1 and 3 is assumed). The length of time required to reach the VDPV detection thresholds, defined empirically to correlate with cVDPV emergence, provides a measure of the number of generations of infection that are required to establish circulation, with type 3 requiring the longest chains of transmission. This is consistent with the rates of cVDPV emergence: type 2 is more common than type 1, which is more common than type 3 (23).

We found no significant associations between the clinical outcome of residual paralysis 60 days after AFP onset (a proxy for VAPP diagnosis) and case age or vaccination history. The lack of association is consistent with recently reviewed data from India which indicate that VAPP in low-income countries commonly occurs even after three or more doses of OPV (43). We also found no correlation between residual paralysis and the number of VP1 substitutions or the presence of reverted attenuating sites in the 5′ NCR. This lack of association of clinical outcome and genotype has been previously reported (49) and is not unexpected. The polioviruses found in the central nervous systems of paralytic cases are often not identical to the polioviruses excreted in stool at the time of paralysis due to possible coinfection with multiple serotypes (69) and due to within-serotype intrahost diversification to target different tissues (59, 70).

We identified circulating type 3 (and possibly type 1) vaccine strains during times or in geographic regions of low wild poliovirus transmission. As there have been no cVDPV outbreaks of those types, we can conclude that these strains that circulated briefly during a period with high rates of use of monovalent type 1 and bivalent type 1 and 3 OPV in vaccination campaigns (29–31) died out without further targeted intervention. For type 2, we did not identify any circulating Sabin-like isolates or any sequences ancestral to known cVDPV isolates. These results support the adequacy of the current operational cVDPV detection criteria for identifying problematic circulating VDPV lineages in regions with sustained vaccination coverage. They also demonstrate that it is possible to detect circulation prior to reaching the operational VDPV detection threshold and that sequencing of all Sabin-like isolates in the period after OPV cessation may be helpful for the early detection of their circulation.

We observed that cVDPV emergence is significantly more likely to be associated with non-Sabin isolate recombination prior to the index case than non-Sabin isolate recombination in Sabin-like isolates but that the recombination rates in index case isolates and Sabin-like isolates are similar. Unfortunately, these data do not provide evidence for or against the hypothesis that recombination with wild enteroviruses is advantageous relative to Sabin isolate-Sabin isolate recombination. Conclusive resolution of the question of selective advantage is unlikely to be possible without additional data about rates of coinfection with poliovirus and nonpoliovirus enteroviruses.

While our study reduces many of the uncertainties around the timing of reversion events, only limited information on the correlations between genotype and human infectivity is available (23). There are large gaps in our understanding of transmissibility in human populations that fundamentally limit our ability to predict the risk of emergence of cVDPV. Contact-tracing studies with longitudinal sampling are necessary to assess transmissibility and its dependence on genotype. Coupled with the substitution rates reported here, quantitative measures of transmissibility would allow us to model in detail the process of cVDPV emergence to better assess the risks and benefits of the vaccination strategies needed to eradicate wild poliovirus, respond to cVDPV, and transition away from OPV.

## ACKNOWLEDGMENTS

We thank Annelet Vincent (CDC) for RNA extractions, Peter Sabo and Eric Haugen (University of Washington; htseq.org) for sequencing, Shanda Boyle (the Bill and Melinda Gates Foundation) for collaboration in initial study design and administration, Alex Upfill-Brown and Hil Lyons (Institute for Disease Modeling) for helpful input into the construction of the exposure dating model, Adi Stern (Tel Aviv University) for interesting discussions about the dynamics of selection, and Jing Shaw for the NCBI submission.

## FOOTNOTES

- Received 15 June 2015.
- Accepted 7 October 2015.
- Accepted manuscript posted online 14 October 2015.
- Address correspondence to Michael Famulare, mfamulare{at}intven.com.
**Citation**Famulare M, Chang S, Iber J, Zhao K, Adeniji JA, Bukbuk D, Baba M, Behrend M, Burns CC, Oberste MS. 2016. Sabin vaccine reversion in the field: a comprehensive analysis of Sabin-like poliovirus isolates in Nigeria. J Virol 90:317–331. doi:10.1128/JVI.01532-15.

## REFERENCES

- Copyright © 2015, American Society for Microbiology. All Rights Reserved.