Discovery of Novel Crustacean and Cephalopod Flaviviruses: Insights into the Evolution and Circulation of Flaviviruses between Marine Invertebrate and Vertebrate Hosts

Some flaviviruses are known to cause disease in vertebrates and are typically transmitted by blood-feeding arthropods such as ticks and mosquitoes. While an ever-increasing number of insect-specific flaviviruses have been described, we have a narrow understanding of flavivirus incidence and evolution. To expand this understanding, we discovered a number of novel flaviviruses that infect a range of crustaceans and cephalopod hosts. Phylogenetic analyses of these novel marine flaviviruses suggest that crustacean flaviviruses share a close ancestor to all terrestrial vector-borne flaviviruses, and squid flaviviruses are the most divergent of all known flaviviruses to date. Additionally, our results indicate horizontal transmission of a marine flavivirus between crabs and sharks. Taken together, these data suggest that flaviviruses move horizontally between invertebrates and vertebrates in ocean ecosystems. This study demonstrates that flavivirus invertebrate-vertebrate host associations have arisen in flaviviruses at least twice and may potentially provide insights into the emergence or origin of terrestrial vector-borne flaviviruses. ABSTRACT Most described flaviviruses (family Flaviviridae) are disease-causing pathogens of vertebrates maintained in zoonotic cycles between mosquitoes or ticks and vertebrate hosts. Poor sampling of flaviviruses outside vector-borne flaviviruses such as Zika virus and dengue virus has presented a narrow understanding of flavivirus diversity and evolution. In this study, we discovered three crustacean flaviviruses (Gammarus chevreuxi flavivirus, Gammarus pulex flavivirus, and Crangon crangon flavivirus) and two cephalopod flaviviruses (Southern Pygmy squid flavivirus and Firefly squid flavivirus). Bayesian and maximum likelihood phylogenetic methods demonstrate that crustacean flaviviruses form a well-supported clade and share a more closely related ancestor with terrestrial vector-borne flaviviruses than with classical insect-specific flaviviruses. In addition, we identify variants of Wenzhou shark flavivirus in multiple gazami crab (Portunus trituberculatus) populations, with active replication supported by evidence of an active RNA interference response. This suggests that Wenzhou shark flavivirus moves horizontally between sharks and gazami crabs in ocean ecosystems. Analyses of the mono- and dinucleotide composition of marine flaviviruses compared to that of flaviviruses with known host status suggest that some marine flaviviruses share a nucleotide bias similar to that of vector-borne flaviviruses. Furthermore, we identify crustacean flavivirus endogenous viral elements that are closely related to elements of terrestrial vector-borne flaviviruses. Taken together, these data provide evidence of flaviviruses circulating between marine vertebrates and invertebrates, expand our understanding of flavivirus host range, and offer potential insights into the evolution and emergence of terrestrial vector-borne flaviviruses. IMPORTANCE Some flaviviruses are known to cause disease in vertebrates and are typically transmitted by blood-feeding arthropods such as ticks and mosquitoes. While an ever-increasing number of insect-specific flaviviruses have been described, we have a narrow understanding of flavivirus incidence and evolution. To expand this understanding, we discovered a number of novel flaviviruses that infect a range of crustaceans and cephalopod hosts. Phylogenetic analyses of these novel marine flaviviruses suggest that crustacean flaviviruses share a close ancestor to all terrestrial vector-borne flaviviruses, and squid flaviviruses are the most divergent of all known flaviviruses to date. Additionally, our results indicate horizontal transmission of a marine flavivirus between crabs and sharks. Taken together, these data suggest that flaviviruses move horizontally between invertebrates and vertebrates in ocean ecosystems. This study demonstrates that flavivirus invertebrate-vertebrate host associations have arisen in flaviviruses at least twice and may potentially provide insights into the emergence or origin of terrestrial vector-borne flaviviruses.

As mosquitoes and ticks are potential reservoirs of vertebrate-infecting flaviviruses (VIFs), biosecurity and surveillance programs have been established to characterize the ecology and diversity of potential VIFs in these hosts (8,9). As a result of these efforts, a number of insect-specific flaviviruses (ISFs), so named for an inability to replicate within vertebrate cells (reviewed by Blitvich and Firth [10]), have been fully sequenced and characterized. As the number of ISFs identified in the literature expands, there is evidence that ISFs exist both in a classical ISF (cISF) clade, in which all members form a monophyletic lineage, and a dual-host-associated ISF (dISF) clade. These dISFs are similar in phenotype to cISFs and show an inability to replicate in mammalian cells, but unlike cISF, dISFs are paraphyletic and phylogenetically related to arboviruses (10). This suggests that vertebrate-infecting and insect-specific flaviviruses may have arisen and been lost multiple times throughout the evolutionary history of flaviviruses. In addition, the International Committee on Taxonomy of Viruses also recognizes a number of flaviviruses isolated from vertebrates, such as rodents or birds, that are classified as being flaviviruses with no known vector (NKV) (reviewed by Blitvich and Firth [11]). These viruses are genetically related to vector-borne flaviviruses, with some experimentally demonstrated to replicate in both mosquito and vertebrate cells (12,13). These are presumed to be vectored by as yet unidentified invertebrate hosts. Not all NKV viruses replicate in mosquito cells; for example, Rio Bravo virus (RBV), Montana myotis leukoencephalitis virus (MMLV), Apoi virus (APOV), and Modoc virus (MODV) do not (14,15). Transmission of these NKV flaviviruses is assumed to be horizontally, through salivary glands or aerosol droplets between mammals. The aforementioned NKV, dISFs, and VIFs are all grouped within one well-supported monophyletic lineage, however; one notable exception to this is Tamana bat virus (TABV), isolated from the Parnell's mustached bat, Pteronotus parnellii (16). Tamana bat virus is not phylogenetically related to other NKV flaviviruses, has never been isolated from other vertebrate or invertebrate species, and groups basally to all flaviviruses. It is sometimes referred to as a vertebrate-only flavivirus.
Metatranscriptomic analyses of invertebrate and vertebrate pools, not typically screened in biosurveillance programs, have resulted in uncovering a remarkable diversity of viruses infecting eukaryotic species (17,18). As a result of these efforts, flaviviruses have been identified outside terrestrial vertebrates and invertebrates for the first time, in two groups of fish. The first of these marine vertebrate flaviviruses, Cyclopterus lumpus virus (CLuV), was identified in tissues of diseased lumpfish (Cyclopterus lumpus) (19). The second marine vertebrate flavivirus is Wenzhou shark flavivirus, identified in a metagenomic analysis of the Pacific spadenose shark, Scoliodon macrorhynchos (17). Spadenose sharks are members of the cartilaginous fish group, and while it is currently unknown if Wenzhou shark flavivirus is responsible for any pathology within the shark host, the flavivirus was reported to be abundant in all tissues (17). Phylogenetic analysis based on the polyprotein of these marine vertebrate flaviviruses suggests that they form a basal and genetically divergent group of flaviviruses along with TABV (20). Additionally, flavivirus fragments have been identified in a metatranscriptomic analysis of the Eastern red scorpionfish, Scorpaena jacksoniensis (20), and the sea spider (Endeis spinosa) (21).
Attempts to evaluate the temporal evolution of flaviviruses using time to most recent common ancestor (tMRCA) analyses have remained controversial (22). In addi-tion, poor sampling of flaviviruses outside vector-borne flaviviruses has hindered our potential for exploring the origins or emergence of terrestrial vector-borne flaviviruses. Flaviviruses that have recently been discovered infecting two large groups of fish present an avenue to explore marine flavivirus evolution and diversity.
Previously, it has been suggested that alphaviruses, single-stranded RNA arboviruses belonging to the family Togaviridae, are likely to have a marine origin (23). Most alphaviruses have a similar vector-borne host range, with mosquitoes as the principle vectors. The hypothesis of emergence of alphaviruses from a marine origin is supported both by phylogenetic evidence of the basal origin of alphaviruses infecting marine mammals, and also by the fact that Southern elephant seal virus (SESV) and salmon pancreatic disease virus (SPDV) have also been isolated from blood-and mucus-eating Lepidohthirus ectoparasites (24). The evidence of flaviviruses infecting sharks suggests that flaviviruses may have circulated long before the emergence of hematophagous mosquitoes, which are part of the infraorder Culicomorpha in the lower Diptera, which emerged ϳ220 million years ago (25). In comparison, vertebrate and arthropod lineages diverged between 573 and 656 million years ago (26). Previous approaches in mining sequencing data have yielded insights into the evolutionary history and host jumps of rhabdoviruses (27) and parvoviruses (28). We hypothesized that there may be potential dual invertebrate-vertebrate flaviviruses missing from the virus records. Therefore, we set out to explore and supplement the flavivirus record with additional metazoans, with a view to gaining insight into the evolution and emergence of flaviviruses.

RESULTS
Discovery and annotation of novel marine invertebrate flaviviruses. To discover divergent flaviviruses, we queried the assembled transcriptomes deposited from all Animalia in the Transcriptome Shotgun Assembly Sequence Database (TSA) hosted by the National Center for Biotechnology Information (NCBI) using the tblastn algorithm flavivirus polyprotein sequences (29). A number of 10-to 12-kb transcripts were identified as having 29% to 27% identity over 28% to 59% of the query length of flavivirus polyproteins, suggesting potential divergent hits. We excluded fragments that were unlikely to form complete single-strand flaviviruses and multisegmented members of the unranked Jingmenvirus taxon (30). Additional viruses were identified in unassembled cephalopod and crustacean RNA sequencing (RNA-Seq) data and subsequently de novo assembled. Genome features and statistics are summarized in Table 1, and metadata of the samples used are available in Table S1.
Six flaviviruses were identified in this study; five novel and one strain of a previously published virus. One 10,669-nucleotide (nt) contig from the transcriptome of the gazami crab or Japanese blue crab, Portunus trituberculatus, was identified with 95% nucleotide sequence identity to Wenzhou shark virus. This transcriptome was produced from eyestalk, gill, heart, hepatopancreas, and muscle tissue from 90 healthy swimming crabs from Shandong, China (31). Reanalysis of raw data of this RNA-Seq library suggested that the flavivirus had an average coverage of 33.50ϫ. Further analysis of the incidence of Wenzhou shark flaviviruses in other P. trituberculatus data sets is covered in later sections of the results.
Three ϳ10to 11-kb putative flaviviruses were identified from wild-caught malacostracan crustaceans. A flavivirus was identified in the brown shrimp, Crangon crangon, from midgut samples originating from Weser estuary, Germany (32). Prediction of the complete polyprotein sequence of Crangon crangon flavivirus (CcFV) suggested that the closest known flavivirus in amino acid identity is the arbovirus Usutu virus (33). Two flaviviruses were identified in amphipod species. Gammarus chevreuxi flavivirus (GcFV) was identified in transcriptomes from two publications on Gammarus chevreuxi in both embryonic and adult samples originating from the Plym estuary, Plymouth, United Kingdom (34,35) and in Gammarus pulex flavivirus (GpFV) identified from a male Gammarus pulex wild-caught from the Bourbre River, France (36). The polyproteins of these viruses were more closely related in amino acid identity to those of the  (37) and also the NKV Apoi flavivirus, the latter of which has been isolated from three different bat species (38). Flaviviruses were also identified from transcriptomes of two different squid species (class Cephalopoda). Fragments of flavivirus-like contigs from the TSA record of the firefly squid, Watasenia scintillans (39), were identified. These contigs originated from the RNA-Seq data of a single arm tip sample, which was downloaded and reassembled into two larger 7.4-and 3 -kb fragments. The final novel flavivirus identified was a 12,567-nt contig from the southern pygmy squid, Idiosepius notoides (40). Both polyproteins from these squid flaviviruses showed higher amino acid identity to those of Tamana bat virus and Cyclopterus lumpus virus than to those of the crustacean flaviviruses reported here. We have tentatively named these Firefly squid flavivirus (FsFV) and Southern pygmy squid flavivirus (SpsFV). For clarity within the document, we will refer to the collective grouping of the cephalopod and crustacean flaviviruses as marine invertebrate flaviviruses (MIFs).
With the exception of FsFV, all MIFs appeared to be coding complete. Prediction of transmembrane domains and conserved flavivirus protein domains in the polyprotein of all MIFs suggest remarkable conservation of genome orientation between these putative flaviviruses and the flavivirus type species (Fig. 1). All MIFs were identified as encoding the conserved flavivirus RNA-dependent RNA polymerase NS5 (pfam00972; E value, Յ3EϪ43) and also the flavivirus DEAD domain of the NS3 helicase (pfam07652; E value, Յ6EϪ20). All but one were identified as having flavivirus glycoprotein, central and dimerization domains (pfam00869; E value, Յ7EϪ10), or flavivirus envelope gly- coprotein E, stem/anchor domain (TIGR04240; E value: Յ2EϪ10) ( Fig. 1). Crustacean flaviviruses were all predicted to have a NS1 protein domain (pfam00948; E value, Յ7EϪ39), FtsJ-like methyltransferases (pfam01728; E value, 6EϪ15), and the S7 peptidase domain of the NS3-protease protein (pfam00949; E value, 4EϪ10). Remarkably, CcFV was predicted to not only have all of these features, but also similarity to the flavivirus nonstructural protein NS4A domain (PF01350; E value, 0.15) and the immunoglobulin-like domain III of flavivirus envelope glycoprotein (cd12149; E value: 3EϪ08). In addition to the conserved protein domain architecture, flaviviruses are known to have conserved N-linked glycosylation sites on the envelope (41,42) and the NS1 protein (43). The predicted N-linked glycosylation sites were similar among the assembled genomes (Fig. 1).
We propose that the viruses presented here bear all the prototypical hallmarks of conventional flaviviruses and, based on the current species demarcation guides for the genus Flavivirus (44), are sufficiently divergent from other known flaviviruses to be considered novel species of the genus.
Translation of the CcFV, GcFV, and GpFV polyprotein is predicted to depend on programmed ؊1 ribosomal frameshift. Initial prediction of the open reading frames of CcFV, GcFV, and GpFV suggests that the genomes of these flaviviruses have two discrete open reading frames ( Fig. 2A) with a stop codon near the C terminus of the NS1 region. If the flaviviruses presented here encoded proteins on two discrete open reading frames, a number of transmembrane domains would not be produced between the predicted NS1 and the NS2 protein region. A similar genome structure was observed in Cyclopterus lumpus virus (CLuV), in which the production of a complete polyprotein is predicted to depend on a programmed Ϫ1 ribosomal frameshift (Ϫ1 PRF) on a "slippery" heptanucleotide sequence proximal to the stop codon (19). Programmed Ϫ1 ribosomal frameshifting has been documented to be used by many flaviviruses for the synthesis of additional proteins (45)(46)(47); however, only in CLuV has a Ϫ1 PRF been predicted for the production of a complete polyprotein.
We screened the genomes of CcFV, GcFV, and GpFV for evidence of Ϫ1 PRF. Three criteria have been suggested that implicate the presence of a potential Ϫ1 PRF (48). First, there is the canonical "slippery" heptanucleotide sequence X_XXY_YYZ, where XXX and YYY represent three identical nucleotides, with Y being A/U, and Z being A/C/U, although there are known exceptions within this sequence (48). This slippery heptanucleotide is followed by a 5-to 9-nt spacer region and then by a downstream stem-loop or pseudoknot structure, which serves as a stimulatory RNA signal for slippage of the ribosome. We were able to identify all these features in these flaviviruses ( Fig. 2B and C) suggesting that each produces a bona fide Ϫ1 PRF. Subsequent slippage of the ribosome at all of these sites would result in the production of a complete and single polyprotein.
Comparisons of the Ϫ1 PRF of CluV and the three novel flaviviruses presented here suggests that Ϫ1 PRF may be a common strategy of marine flaviviruses to produce a single polyprotein, and that the fairly interesting flavivirus open reading frame (fifo) and the other ribosomal slippage proteins produced by VIFs and cISFs may have arisen from a vestige of this protein expression strategy (45,46).
Putative polyprotein cleavage sites. The polyprotein of flaviviruses becomes embedded in multiple positions in the membranes of the endoplasmic reticulum, and for efficient flavivirus budding and maturation, it is proteolytically processed by the catalytically active virus encoded NS2B-NS3 proteinase (5) and host proteases (49). The premembrane (pr) and membrane (M) proteins in the polyprotein localized in the trans-Golgi network are cleaved by the host convertase furin, which cleaves at the highly conserved motif RXR/KR (3). The embedded polyprotein exposed on the luminal side of the membrane is processed by the enzyme signalase (50). Protein cleavage motifs are highly conserved among some flaviviruses, suggesting that co-and posttranslational processing of the polyprotein is similar even in diverse species. We predicted potential host signalase, host furin, and putative virus NS2B-NS3-protease sites for processing of the polyprotein (Table 2). Additionally, we identified weak C-terminal NS1 octapeptide sequences (L/M-V-X-S-X-V-X-A) in four of these flaviviruses. This octapeptide sequence is conserved in flaviviruses and has been demonstrated as important for the cleavage between NS1/NS2A by an unknown host protease (49).
For potential virus NS2B-NS3-protease cleavage sites, we conducted a MUltiple Sequence Comparison by Log-Expectation (MUSCLE) alignment and used wellelucidated YFV and DENV processing sites, proximity to transmembrane domains, and protein domain homology to guide our search. Prediction of many NS2B-NS3-protease cleavage sites for the MIFs described here was challenging, as it appears that residues that are well established for flaviviruses are weakly conserved in basal group members, a finding that has previously been described for the divergent Tamana bat virus (16).
The NS3-Pro protein of many vector-borne flaviviruses and cISF species has been experimentally demonstrated to cleave after two basic amino acid residues (RR/RK/KR or, rarely, QR) before a small amino acid (G/A/S) (51). Some of the predicted sites are identical to canonical NS3-Pro motifs (Table 2), whereas many of the predicted NS3-Pro sites, especially in cephalopod flaviviruses, are at sites never previously been indicated for flaviviruses.
Analysis of the putative protease domains of the novel flaviviruses presented here (Fig. 3) suggests that they all share the trypsin-like serine protease catalytic triad (His, Asp, and Ser) known to all flaviviruses (52). However, boxes 3 and 4 of this domain, which contain residues that are involved in substrate binding and recognition, show reasonable flexibility. This may partly explain the imperfect prediction of the cleavage sites and suggests that the sensitivity and specificity for the canonical two basic amino acid residues may be inaccurate. Phylogenetic placement of the novel marine invertebrate flaviviruses. While there is an ever-growing collection of ISFs, dISFs, and VIFs identified in the literature, most of these viruses are placed within well-supported lineages. Previous genus-wide phylogenetic analyses suggested three large well-supported clades of cISF, VIF, and a basal clade with the divergent flaviviruses Tamana bat flavivirus and Wenzhou shark flavivirus (17). We explored genus-wide phylogenetic relationships between the novel flaviviruses discovered here and representatives of all flaviviruses. We aligned 78 flaviviruses representing a comprehensive set of published species and employed two different phylogenetic inferences, the first being a maximum likelihood (ML) phylogeny using IQ-TREE, and the second being a maximum clade credibility (MCC) tree using the Bayesian Markov chain Monte Carlo (MCMC) method implemented in MrBayes (53). The resultant trees were visualized using FigTree and midpoint rooted for clarity only. Examination of topologies of phylogenetic trees produced the same topology and strong support, indicated by posterior probabilities in the Bayesian tree and high bootstrap values for the ML tree. Importantly, the topologies of these trees are congruent with those of previously produced genus-wide phylogenies (11,13,17,54). For simplicity, we have depicted the Bayesian tree (Fig. 4). Analysis of the posterior probability of the MCC tree suggests that the cephalopod flaviviruses are the most divergent of all known flaviviruses and cluster basally to a clade that encompasses Cyclopterus lumpus virus, Tamana bat flavivirus, and both strains of the Wenzhou shark flavivirus. This implies that cephalopod and marine vertebrate flaviviruses share a more closely related ancestor than any other flaviviruses. The three novel crustacean flaviviruses GcFV, GpFV, and CcFV form a well-supported clade that falls between cISFs and VIFs. This suggests that crustacean flaviviruses and VIFs share a more closely related ancestor with each other than cISFs do with VIFs.
Wenzhou shark flavivirus is abundant in swimming crabs (Portunus trituberculatus), and active replication is supported by a functional RNA interference response. The gazami crab, also known as the swimming crab, P. trituberculatus, is a commercially important crustacean widely distributed in Indian and West Pacific oceans, and it is ubiquitous in southeast/east Asian countries and the north and eastern coastal waters of Australia (55). It is the most widely fished crab in the world, with annual catches exceeding ϳ600,000 metric tons (56). Due to its ubiquity in shallow waters and its commercial fishing value, P. trituberculatus is a widely studied crustacean model. As such, there exists a wealth of RNA-Seq data from P. trituberculatus from a number of different catch locations. Hence, we explored the incidence of Wenzhou shark flavivirus within these crab populations, in addition to P. trituberculatus RNA-Seq data deposited in the short-read archive. RNA-Seq libraries were downloaded and mapped to Wenzhou shark flavivirus, revealing evidence that five additional P. trituberculatus sequencing projects harbored Wenzhou shark flavivirus (Table 3). These P. trituberculatus sequencing projects derive from a range of eastern coastal China geographic origins and share some overlap with the location of the Pacific spadenose shark, host of Wenzhou shark flavivirus, which is listed as Heilongjiang, Eastern China Sea.
One of these libraries was for small RNAs (sRNAs) originating from the testis tissue of P. trituberculatus wild-caught from Weifang, China. During virus infection in diverse arthropods, the riboendonuclease III enzyme Dicer-2 processes double-stranded RNA (dsRNA) produced by viruses during replication into viral-derived short interfering RNAs (vsiRNAs) between 20 and 25 nt in length (57,58). To exclude the likelihood of contamination and also to consider the strandedness and composition of these putative vsiRNA reads, we mapped the small RNAs to the representative genome. Among reads with a 5= adapter and a length of 18 to 30 nt, 11,726 out of 3,302,979 reads (0.3%) mapped to both the sense and antisense strands of the Wenzhou shark flavivirus contig along its entire length (Fig. 5A) (59). Visualizing the size distribution of these mapped reads indicated that 21-to 22-nt reads represented 31% of the total mapped reads, with 22-nt reads as the most abundant size of vsiRNAs (16.8%) (Fig. 5B). The presence of vsiRNA reads originating from both the sense and antisense strands of Wenzhou shark flavivirus in the gazami crab, and also the 21-to 22-nt length bias in the mapping of this library, indicate not only that the virus is present in these crab populations but also that an active RNA interference (RNAi) response is produced against the virus.
Flavivirus endogenous viral elements in crustacean genomes are more closely related to vector-borne flaviviruses than to cISF. Flavivirus endogenous viral elements (EVEs) are fragments of flaviviruses that are known to be integrated within the genomic DNA of a previously infected host (60). Flaviviruses do not encode reverse transcriptase or integrase domains, but flavivirus EVEs are reported in the genomes of two well-known flavivirus vectors, Aedes aegypti and Aedes albopictus (60,61), as well as those of Anopheles mosquitoes (62). We sought to identify previous flavivirus infections in crustacean or cephalopod genomes by identifying these EVEs. In a screen of genomes deposited in the NCBI whole-genome contig database, we identified numerous flavivirus EVEs within crustacean genomes but were unable to identify any in deposited cephalopod genomes. Two representative flavivirus EVEs from the tadpole shrimp, Lepidurus arcticus (63), and the planktonic crustacean Daphnia magna are presented (Fig. 6). Both genomic regions have fragmented NS5 protein homology (pfam00972; E value, Ͻ5EϪ48) and association with retrotransposable (RT) elements, such as RT peptidases or RT domains. The D. magna flavivirus EVE also has a FstJ methyltransferase domain (pfam01728; E value, Ͻ2EϪ11). BLASTX and BLASTN analyses of these regions suggested that they share the highest nucleotide and protein identity to vector-borne and vertebrate-infecting flaviviruses. Attempts to phylogenetically place these EVEs proved difficult, as phylogenies produced using the flaviviral EVEs discovered here indicated low bootstrap support. The existence of highly fragmented

Parry and Asgari Journal of Virology
Analysis of the mononucleotide and dinucleotide composition of novel marine flaviviruses to predict host range. Classic CpG underrepresentation has been demonstrated in all flaviviruses known to infect vertebrates. In comparison, there is no clear selection against or for CpG motifs in cISFs, and they are weakly selected against in dISFs (64). We initially explored the odds ratios of CpG and also UpA, which are typically underrepresented in arthropod mRNAs of marine flaviviruses, and compared these in a genus-wide flavivirus context (Fig. 7). Odds ratios are calculated as the observed ratio of this dinucleotide motif over the expected ratios of individual mononucleotides. When there is no selection against a motif, the odds ratio should approach 1, whereas an odds ratio of a dinucleotide motif that is Յ0.78 or Ն1.23 indicates a statistically significant under or overrepresentation of the motif (65). We calculated and grouped flaviviruses that are known to replicate in vertebrates (n ϭ 62) against both groups of cISF (n ϭ 19) and dISF (n ϭ 10). Comparing the odds ratios of the CpG motif in marine flaviviruses suggested that for most of these viruses CpG is underrepresented, with only Gammarus chevreuxi flavivirus and Firefly squid flavivirus not having statistically unrepresented CpG motifs (Fig. 7A) (66,67). In comparison, there is a genus-wide selection against UpA for all flaviviruses irrespective of host range (Fig. 7B).
Using odds ratios of only two dinucleotide motifs to predict host range is too simplistic; therefore, we sought to use a hierarchical clustering method to assess the natural groupings of all flaviviruses based on mono-and dinucleotide composition. Hierarchical clustering is a statistical clustering analysis that builds clusters out of associations between groups using predictive variables. For this, we calculated the odds ratios of each mononucleotide (n ϭ 4) and dinucleotide (n ϭ 16) bias of the polyprotein of 96 complete or coding-complete flavivirus genomes deposited in GenBank, as well as those of the six flaviviruses from this study. We only considered the open reading frame.
Using the frequencies of each mononucleotide and dinucleotide (20 parameters) as predictive factors, we used the pvclust package, which not only performs the agglomerative hierarchical clustering (HC) analysis but also assesses the certainty of different groupings by calculating P values via multiscale bootstrap resampling (68). Analysis of the resultant dendrogram (Fig. S1) suggests there are two large supergroupings of similarity with composition of nucleotides within the Flavivirus genus, an invertebrate flavivirus supergroup and a vertebrate-associated supergroup. Hierarchical clustering analysis assigned marine flaviviruses into two distinct groups; the first was a vertebrateassociated and vector-borne group with both strains of Wenzhou shark flavivirus and Cyclopterus lumpus virus and Crangon crangon flavivirus. This first grouping was most closely associated with reasonable probabilistic support (85%) to a group of vectorborne flaviviruses that includes ZIKV and DENV. The second marine flavivirus group contained the two Gammarus and two cephalopod flaviviruses. These flaviviruses shared mononucleotide and dinucleotide composition more closely with the cISF grouping. What is important to note is that this HC method was successfully able to assign all dISFs together as one well-supported group, suggesting that HC analysis can sensitively discriminate between dISFs and VIFs.
As the HC analysis provided some degree of confidence in the natural structure of groups, we used the HC results to inform the linear discriminant analyses (LDA). Flaviviruses were then assigned into groups of vertebrate-infecting flaviviruses (VIF-1 and VIF-2), marine vertebrate-associated flaviviruses (MF1), cISF, dISF, and marine invertebrate flaviviruses (MF2) (Fig. 8). Both the HC analysis and the LDA shared complete consensus in assignments to these groups, with the exception of one outlier, New Mapoon virus. These results provide evidence for two different dinucleotide compositions for the marine flaviviruses, a marine invertebrate flavivirus group and a marine vertebrate-associated group.

DISCUSSION
In this study, we have undertaken a metagenomic and phylogenetic approach to further uncover the diversity of flaviviruses. To this end, we have discovered and phylogenetically placed five novel flaviviruses of marine invertebrates. The grouping of crustacean flaviviruses suggests that terrestrial vector-borne flaviviruses and crustacean flaviviruses share a closer common ancestor than any reported and sequenced flaviviruses and also that the flavivirus endogenous elements of crustaceans are closer in nucleotide and protein identity to vector-borne flaviviruses than to cISFs. Furthermore, we provide four lines of evidence for horizontal transmission of Wenzhou shark flavivirus between crabs and Pacific spadenose sharks, i.e., strong nucleotide identity between both strains of Wenzhou shark flavivirus, the geographical overlap of the catch locations of both crab and shark samples, proof of replication within P. trituberculatus through evidence of an RNAi response, and mono-and dinucleotide composition that most closely aligns with that of vector-borne terrestrial flaviviruses. Flaviviruses have gained and lost the ability to infect vertebrates a number of times through their evolutionary history (17). As we have demonstrated that Wenzhou shark flavivirus has jumped from or jumped to P. trituberculatus, this provides evidence of additional dual vertebrate-invertebrate host associations in flavivirus evolution. Previously, it has been demonstrated that rhabdoviruses have occasionally moved between distantly related hosts and then established in these environments by spreading into closely related hosts (27). It is difficult to extrapolate the evolution and emergence of terrestrial vector-borne flaviviruses in relation to the crustacean flaviviruses and cISF and VIF clades. We have limited data to conclude the evolutionary direction of flaviviruses with any certainty. While it is possible to date the crustacean node of EVEs from other closely related Daphnia and crustacean species, no additional genomes for these species are available. As such, we cannot give an accurate estimate as to when these endogenization events may have taken place. As we have discovered a new clade of flaviviruses that are closely related to VIF, the following two possible scenarios could be speculated: (i) terrestrial vector-borne flaviviruses evolved from a crustacean flavivirus ancestor, or (ii) insect-specific flaviviruses and crustacean flaviviruses codiverged, and VIFs subsequently gained an ability to infect vertebrates. However, additional data are required to resolve the scenarios and the direction of evolution.
In invertebrates, upon RNAi response, short interfering RNAs (siRNAs) are produced in response to dsRNA triggers (reviewed in Liu et al. [69] and Asgari [70]). During viral infection, the riboendonuclease III enzyme Dicer-2 processes dsRNA, produced by viruses during replication, into siRNAs between 20 and 25 nt in length (57,58). These virus-derived siRNAs (vsiRNAs) are then loaded into the RNA-induced silencing complex (RISC), and they have been shown to modulate and control virus accumulation and tolerance in the host (71). In invertebrates, the RNAi response against flaviviruses has been well established in mosquitoes infected with both VIF (72)(73)(74) and cISFs (75,76). In crustaceans, delivery of siRNA and dsRNA have been experimentally demonstrated to control white spot syndrome virus (WSSV) (family Nimaviridae) replication in both the Penaeus japonicus shrimp and the whiteleg shrimp, Litopenaeus vannamei, challenged with WSSV (77,78). While these studies have experimentally demonstrated that crustaceans may share the RNAi machinery that targets viral replication, the first report of exogenously produced vsiRNAs was only recently described from small RNA sequencing of the penaeid shrimp Fenneropenaeus chinensis challenged with WSSV (79). In F. chinensis, it was shown that processing of viral dsRNA resulted in 21-to 22-nt vsiRNAs, with the most abundant size being 22 nt. Consistently, we found large numbers of 21to 22-nt vsiRNAs (with a slight bias toward 22 nt) in P. trituberculatus that mapped to both positive-sense and negative-sense RNA strands of Wenzhou shark flavivirus. The mapped reads indicated that vsiRNAs are derived from along the entire length of the virus, suggesting active replication in the invertebrate host. This is the second report of vsiRNAs in crustaceans and indicates that 22-nt sRNA fragments are produced preferentially by the RNAi pathway in crustaceans against viral pathogens.
It appears that some marine flaviviruses contain a Ϫ1 PRF motif to produce a singular polyprotein. Furthermore, it seems possible that the two discrete open reading frames observed in these viruses may produce functional virions with only transmembrane proteins lost within the frameshifted region. Numerous monopartite RNA viruses have structural and nonstructural genes encoded on discrete open reading frames, notably alphaviruses, which encode two discrete open reading frames as a means to control translation of these proteins.
Within the Flavivirus genus, Ϫ1 PRF motifs are accepted and reported in both VIF and cISF species. In cISF members, the alternative reading frame is located near the NS2A/NS2B-coding region, the fairly interesting flavivirus open reading frame (fifo) (45). It has been demonstrated that cISFs encode a downstream fifo between 221 and 293 additional amino acids (aa) in length. This is much larger than the Ϫ1 PRF region that exists in a number of VIFs, which extends the NS1 by additional residues (delineated NS1=) (46,47). Experimentally validated NS1= translation efficiencies in West Nile virus suggests that slippage happens between ϳ20% and 50% of the time, depending on the cell line (47). The existence of this Ϫ1 PRF within the marine flaviviruses reported here occurs within the same regions as those two Ϫ1 PRF regions of cISFs and VIFs. Not all VIFs encode this Ϫ1 PRF, and it appears that the Ϫ1 PRF region has evolved independently or been lost multiple times in the Flavivirus genus. The reasons for this location have been discussed extensively (80). However, identification of this Ϫ1 PRF within the crustacean flaviviruses also presents an alternative to the origin of the NS1= extension and fifo protein as a vestigial or retained genomic feature arisen from a mechanism that was required to produce a complete polyprotein.
Ongoing coevolution of flaviviruses, either with their invertebrate or dual invertebrate-vertebrate hosts, places certain constraints on the coding region of the viral polyprotein that limit the nucleotide and codon usage (64). It is well established that vertebrate mRNAs show an underrepresentation of the UpA and CpG dinucleotides, whereas typical arthropod genes display an underrepresentation of the UpA dinucleotide (66,67). It is also known that the dinucleotide composition is conserved among VIFs and cISFs. Considering that in mammals, CpG motifs modulate the innate and adaptive immune response (81), it has been suggested that this is the reason for CpG motifs being selected against in VIFs, whereas CpG is not selected against in cISFs (82).
Previous studies have shown that mono-and dinucleotide compositions are reasonable predictors of the host range of flaviviruses. In a family-wide analysis of monoand dinucleotide compositions for Flaviviridae, 76% of the virus hosts could be accurately predicted using odds ratios of all possible 16 dinucleotides and the four mononucleotides. (83). Mono-and dinucleotide compositions have also been used in linear discriminant analyses using a training set of RNA viruses with established host ranges (vertebrate-only, invertebrate-only [cISF], vector-borne, plant, and bacterial). Using this approach, it was possible to sensitively assign Anopheles cISFs as an invertebrate-only grouping (84). Importantly, our analyses indicate that dinucleotide compositions of the marine flaviviruses discovered here associate these viruses into tentative marine invertebrate and marine vertebrate host-associated groupings.

CONCLUSIONS
This work presents evidence that flaviviruses infect a range of marine invertebrates and improves our understanding of the potential origins and emergence of invertebrate-vertebrate flaviviruses. The identified marine invertebrate flaviviruses provide insight into flavivirus genome organization, and we demonstrate a clear example of dual host invertebrate-vertebrate flaviviruses in sharks and crabs. This study presents interesting future avenues to explore the emergence of terrestrial vector-borne flaviviruses and scenarios for a jump to terrestrial arthropods and terrestrial vertebrates.

Identification of divergent flaviviruses and endogenous viral elements from published transcriptomes and genome assemblies.
To uncover unknown and divergent flaviviruses infecting other eukaryotic organisms, we queried the polyprotein sequence of Wenzhou shark flavivirus (GenBank accession number AVM87250.1) using the translated BLAST: tblastn algorithm against the Transcriptome Shotgun Assembly Sequence Database and the whole-genome shotgun contigs database hosted by the National Center for Biotechnology Information (NCBI). We restricted the search to Animalia (taxonomic identifier [taxid] 33208) under default parameters. A number of 2-to 3-kb contigs encoding a flavivirus NS5 and NS3 Pfam domain were identified in terrestrial arthropods and excluded, as they appeared to be related to the multisegmented Jingmenvirus group (30). Contigs were then subsequently queried using BLASTx against the nonredundant database. After positive identification, these flavivirus strains were then used to query additional wild-caught published Cephalopoda (taxid 6605) and Crustacea (taxid 6657) RNA-Seq data using the BLASTn search on SRA.
De novo assembly of Gammarus pulex flavivirus and Firefly squid flavivirus, and bioinformatic validation of previously assembled data. For assembly of Firefly squid flavivirus, RNA-Seq data (SRA accession number SRR2960129) was imported into the Galaxy Australia Webserver (https://usegalaxy .org.au/), and quality and adapter trimmed using Trimmomatic (85). Trimmed reads were then de novo assembled using Trinity (Galaxy version 2.4.0.2) (86). Gammarus pulex flavivirus (GPFV) was assembled using the CLC Genomics Workbench (version 10.1.1) with default settings. To obtain mapping statistics and also validate putative flavivirus contigs, original total RNA-Seq data were downloaded with CLC Genomics Workbench, adapter and quality read trimmed (quality score, Ͻ0.05; ambiguous nucleotides, 2), and then remapped to flavivirus contigs using identity (0.9) and length (0.9) mapping criteria.

RNAi analysis.
For analysis of the mapping and profile of virus-derived small RNAs from the testes of P. trituberculatus, we used a previously established workflow (87). Briefly, raw sRNA data were downloaded and trimmed with the same quality and ambiguous nucleotide score. Reads that did not have adapters, and those that were less than 16 nt, were discarded. Clean reads were then mapped to genome and antigenome with the RNA-Seq analysis tool, using strict mapping criteria (mismatch, insertion, and deletion costs: 2:3:3, respectively). Profiles of mapped read sizes were then graphed using Excel.
Maximum likelihood tree inference was made using IQ-TREE with Ultrafast bootstrap approximation (92), using computational resources from the Cyberinfrastructure for Phylogenetic Research Science Gateway (93). A maximum clade credibility tree was estimated using the Bayesian Markov chain Monte Carlo (MCMC) method implemented in MrBayes version 3.2.3, using the same protein substitution model run for 50,000,000 generations (six runs, four chains) sampled every 1,000 generations, with 25% discarded as burn in. Convergence diagnostics were assessed using the information logged, such as the potential scale reduction factor (PSRF; equal to 1). The resultant tree was visualized using FigTree version 1.4 (A. Rambaut; http://tree.bio.ed.ac.uk/software/figtree/).
Analysis of mononucleotide and dinucleotide composition. For analysis of mononucleotide and dinucleotide motifs, we used a total of 103 flavivirus genomes deposited in GenBank. Open reading frames of the flaviviruses were predicted using CLC Genomics Workbench, and frequencies were counted using the length (LEN) function in Excel. Odds ratio of mono-and dinucleotide motifs were calculated as previously described (65), and hierarchical clustering was performed on 16 dinucleotide ratios and 4 mononucleotide ratios for 103 flaviviruses analyzed using the pvclust package, which assesses the uncertainty of hierarchical cluster analysis by calculating P values via multiscale bootstrap resampling in R (68). Prior to using pvclust, data were transposed, then scaled in R. pvclust was run using correlation and Ward's clustering methods with 10,000 bootstraps (method.distϭЉcor," method.hclustϭЉward.D2," nboot ϭ 10000). The groupings ascertained using the HC method were then used as the categories for a linear discriminant analysis using the lda function of the MASS package suite in R studio. All raw data used for this analysis and the Excel calculator are available in Table S2.
Data availability. All data used within this publication are available within text and supplementary files. Sequencing data used to assemble the genome are available in Table S1. Accession numbers of annotated flaviviruses have been deposited in GenBank under the accession numbers MK473875 to MK473881.