Viral bioinformatics: Recombination

This week’s addition to the virology toolbox was written by Danielle Coulson and Chris Upton

Comparing the genomes of viral strains can provide useful insight into evolutionary relationships. Recombination, defined by Posada et al (2001) as the exchange of genetic information between two nucleotide sequences, is quite common in many viruses. Because recombination accounts for much of the genetic diversity observed between viral strains, it is of interest to determine the origins of recombinant sequences and to know which viral strains are likely to have undergone recombination. Several programs exist to detect recombination among genomes and to identify breakpoints in sequences, which mark the boundaries of recombinant regions. Three recombination-detection programs are described here, using HIV-1 strains isolated from Uganda, where subtypes A and D are prominent. Given the co-circulation of these subtypes in Eastern Africa, recombinant A-D strains have arisen.

Recombination analysis tool (RAT)

RAT is a very simple, easy-to-use, cross-platform program that allows for the comparison of multiple sequences and the detection of recombination between them, in a straightforward graphical user interface. It provides a clear graphical output, depicting recombination crossover points by plotting the genetic distance of each sequence from the test sequence as a function of sequence position. A sequence alignment (FASTA format works best, although other alignment files will work) is input, and default parameters may be kept or changed. The default settings are well optimized for analysis; however, a rule of thumb is that the window size should be 10% of the sequence length, and the increment size should be half the window size.
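The rule of thumb above is easy to turn into numbers. The following is only a sketch: the function name rule_of_thumb_window is made up for illustration and is not part of RAT, and the rounding choices are assumptions.

```python
def rule_of_thumb_window(alignment_length):
    """Suggest (window, increment) sizes for a sliding-window analysis.

    Follows the rule of thumb from the text: window = 10% of the
    sequence length, increment = half the window. Both values are
    forced to be at least 1 so very short alignments still work.
    """
    if alignment_length < 1:
        raise ValueError("alignment length must be positive")
    window = max(1, round(alignment_length * 0.10))
    increment = max(1, window // 2)
    return window, increment


# Example: a 9,000-column alignment gives a 900-column window
# moved in steps of 450 columns.
print(rule_of_thumb_window(9000))
```

These values are only a starting point; smaller windows give finer resolution of breakpoints but noisier plots.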

Figure 1. Sequence input window in RAT, where parameters can be adjusted, or left as default values. Here, a FASTA file containing three HIV sequences is input, and the suspected recombinant is selected as the test sequence, to which other sequences will be compared.

RAT is useful in that it allows the user to check for recombination between sequences already thought to be recombinants, as well as to conduct an auto search to find possible recombination spots. After clicking execute, a sequence viewer displays the similarity of all sequences in the alignment to a specified test sequence. A useful option in this display is the ability to select and unselect different sequences, in order to view all sequences at once to find possible recombination sites, or to view two at a time to pin down specific recombination breakpoints. When conducting an auto search, results can be screened based on customized similarity thresholds. All possible recombination points within the threshold are listed, along with the sequences between which they occur.

Figure 2. Output for a specified sequence search, showing the genetic distance of two HIV strains of clade A and D from the test sequence (the AD recombinant strain) on the Y-axis, and the sequence position on the X-axis. A possible recombination spot occurs at position 4874, where the recombinant sequence begins to share higher similarity with clade D than with clade A. This recombinant region appears to end at approximately position 6203.

Finally, the graphical representation of genetic distance between strains can be exported as a JPG file.

RAT works based on a distance method, whereby pairwise comparisons between sequences are performed as a sliding window moves along the length of the sequence. A score is generated based on the similarity between the nucleotides in the current window of the test sequence and the nucleotides in the same window of the other sequences. Although RAT provides useful information, it does not provide statistical support for the results generated. This is also an advantage, however, because the analysis runs very quickly, which is extremely useful for getting an overview of potential recombinant sites.
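To make the sliding-window idea concrete, here is a minimal sketch. It uses plain percent identity as the score, which is an assumption; the exact scoring used by RAT is not described here. The function name window_similarity and the toy sequences are invented for illustration.

```python
def window_similarity(test_seq, other_seq, window=10, step=5):
    """Percent identity between two aligned sequences, window by window.

    Both sequences must come from the same alignment (equal length).
    Returns a list of (window_midpoint, percent_identity) tuples.
    """
    if len(test_seq) != len(other_seq):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(test_seq) - window + 1, step):
        a = test_seq[start:start + window]
        b = other_seq[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y)
        scores.append((start + window // 2, 100.0 * matches / window))
    return scores


# Toy "recombinant": first half matches clade A, second half clade D.
recombinant = "AAAAAAAAAA" + "CCCCCCCCCC"
clade_a = "AAAAAAAAAA" + "GGGGGGGGGG"
clade_d = "TTTTTTTTTT" + "CCCCCCCCCC"

print(window_similarity(recombinant, clade_a, window=10, step=10))
print(window_similarity(recombinant, clade_d, window=10, step=10))
```

In the toy data the clade A score drops and the clade D score rises between the two windows; a crossover of this kind in the plotted lines is what suggests a breakpoint.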

SimPlot

SimPlot (for Windows) is another useful tool for detecting recombination between sequences. Like RAT, it produces similarity plots, but it has more features and is therefore slightly more complex. SimPlot allows for the analysis of up to 10 sequences (although the alignment may have more than this), where each can be used as a query sequence with which to compare the rest, or hidden from the analysis. Other useful and unique features of SimPlot are the ability to ignore sites containing gaps in the alignment when generating the similarity plot, and the ability to identify the sequence position and exact similarity value at any point in the plot you click on. Furthermore, there is a zoom-in feature, as well as options to include titles, legends, grid lines and other useful information as part of the display. SimPlot also allows sequences to be grouped together, and analysis to be performed between groups rather than individual sequences.

Sequences (FASTA format, as well as other common alignment files) are loaded into the program, and those that are desired are selected for analysis. Several options are available for analysis, such as which Distance Model to use, and the number of Bootstrap replicates to use for statistical significance.

Figure 3. Similarity plot generated in SimPlot using one query sequence (recombinantAD) and two other sequences (HIV-1 strains from Clade A1 and Clade D). Possible recombination sites are identified where sequence crossover occurs. By zooming in for a better view and clicking on crossover regions, breakpoints are determined, such as at positions 4481 and 5681 in the alignment. Other potential recombinant regions are also identified that were not obvious in RAT.

Finally, SimPlot provides the option of finding specific recombinant sites. After identifying potential recombinant sequences on the similarity plot, specific sequences within each group can be chosen for an informative site analysis. Overall, SimPlot provides a very effective means of detecting recombination, in an easy-to-use interface with fast results.

Recombination Detection Program (RDP)

RDP (for Windows) is yet another program that allows for detection of recombination amongst aligned sequences. It is unique, however, in that it incorporates several detection methods and analysis algorithms into one well laid-out interface, allowing the user to select whichever method of recombination detection is most suitable and provides the best results.

Figure 4. RDP Overall Display

Furthermore, recombination events that are detected are displayed graphically, with statistical evidence provided, and recombination events are also depicted on phylogenetic trees constructed from proposed recombinant regions. These features allow the user to decipher which events are true recombination events, and discard those that have been incorrectly identified. Importantly, possible recombination events are listed with warnings, to indicate when the program is not confident of the proposed recombination event, its location and sequence breakpoints, or its contributing sequences.

Easy navigation through the sequences is possible, as these are displayed alongside the statistical display of recombinant regions and breakpoints, the schematic display of recombinant sequences, and the dendrogram display.

Figure 5. Sequence display

Figure 6. Schematic sequence display of recombinant regions; each recombinant block can be selected in order to view the supporting evidence for recombination.

Figure 7. Recombination information; displayed here are any relevant warnings suggesting reasons why the recombinant may have been misidentified, as well as statistical evidence from each algorithm used supporting the event.

Figure 8. Graphical representation of recombinant region (shown in pink) as determined by the RDP algorithm. Plotted on the Y-axis is the pairwise identity of each pair of sequences, against their position in the alignment on the X-axis.

By clicking on the various potential recombinant sequence blocks in the schematic sequence display of recombinant regions, graphical evidence will appear for each in the bottom left display area. The pink region here shows the likely recombinant region. Each potential recombinant region will also have its accompanying statistical evidence displayed in the top right corner in the recombination information display. The tree display is quite useful in determining true recombination events; it provides dendrograms of non-recombinant regions and recombinant regions, which can be compared with each other. While the default setting is to create trees using the neighbor joining method (which requires less time), RDP is also able to create trees using UPGMA, least squares, Bayesian and Maximum Likelihood algorithms.

As mentioned, RDP provides statistical evidence for each recombination event, as determined by several different methods. Although the default displays evidence as determined by the method which most strongly suggested recombination in that region, the user can easily see graphical displays of the different methods used to find the breakpoint.

Among the different recombination detection algorithms in the tool are RDP, Geneconv, Bootscan, MaxChi, Chimaera, SiScan, 3Seq, LARD and TOPAL, all of which are optimized to detect recombination in different ways, thus allowing for detection of recombination in a variety of alignments. Furthermore, each type of analysis has several customizable options, set to default values that work well. Manual distance plots, similar to those created by SimPlot and RAT, are also possible, where any selected sequence can be queried against all others in the alignment.

With the vast array of options and analysis preferences that are available in RDP, the average run-time for an alignment is longer than for the other programs; however, much more information is provided. Knowing which detection algorithm best suits the alignment allows the user to run only those algorithms, so the analysis proceeds much faster.

Finally, this program is accompanied by an extremely useful user’s manual, explaining the algorithms that are available and which is best suited to different alignments. The manual also includes a step-by-step guide, which details the process of detecting recombination in sequences, from preliminary hypothesis to finding conclusive, statistically supported recombinant regions.

Example sequences:

Clade A1: HIV-1 isolate 99UGA07072 from Uganda, partial genome

GenBank: AF484478.1

Clade D: HIV-1 isolate 99UGC06443 from Uganda, partial genome

GenBank: AF484479.1

RecombinantAD: HIV-1 isolate 99UGB21875 from Uganda, partial genome

GenBank: AF484480.1

Viral bioinformatics: Dotplot

This week’s addition to the virology toolbox was written by Chris Upton

Dotplots are an extremely useful way of visualizing comparisons of small and large DNA sequences (as well as protein sequences), providing insight into the degree of similarity, deletions, insertions and direct and indirect repeats. In a dotplot, each nucleotide, or small window of nucleotides, of one sequence is compared with every nucleotide of a second sequence. Dotplots can quickly provide an overview of the relationship between sequences.
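The window-against-window comparison can be sketched in a few lines. This is a simplified illustration that requires exact matches; it is not how Dotter works internally (Dotter scores windows and lets you adjust the threshold). The names dotplot and render are invented for this example.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return the set of (i, j) positions where the window of seq_x
    starting at i exactly matches the window of seq_y starting at j."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, window=1):
    """Draw the dotplot as text: seq_x along the top, seq_y down the side."""
    dots = dotplot(seq_x, seq_y, window)
    rows = []
    for j in range(len(seq_y) - window + 1):
        rows.append("".join(
            "*" if (i, j) in dots else "."
            for i in range(len(seq_x) - window + 1)))
    return "\n".join(rows)


# Self-plot of a sequence containing a direct repeat of "ACG":
# the main diagonal plus two off-diagonal dots mark the repeat.
print(render("ACGACG", "ACGACG", window=3))
```

Comparing every window with every other window takes time proportional to the product of the two sequence lengths, which is why plots of whole genomes take noticeably longer than plots of single genes.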

The Dotter program [1] has several very useful features, including the ability to save and reload dotplots, the ability to zoom into particular regions of the plot, an option to create a multi-dotplot by aligning more than two DNA (or protein) sequences, and the ability to adjust the stringency of the matrix being displayed in real-time by changing the greyscale of the dots.

JDotter [2] provides an easy-to-use Java (platform independent) interface to Dotter, giving all the benefits of Dotter in a single web-accessible tool. You can access JDotter here.

Additional background information on nucleic acid dotplots is available.

The first figure is a dotplot of three poxvirus interferon gamma binding proteins plotted against each other. Genes are displayed along the axes. This plot takes a few seconds to calculate.

Here is a dotplot of vaccinia CVA and MVA genomes (~170 kb). Large deletions are present in MVA, a result of >500 passages in chicken embryo fibroblasts. Terminal inverted repeat sequences are obvious in the bottom-left and top-right corners of the plot. A plot for these sequences takes ~10 min to calculate.

Next is a self plot of the Molluscum contagiosum virus genome. Enhancing the “background” shows that it’s not totally random. The “stripes” are caused by segments of DNA with different nucleotide composition. The region that creates the area in the red box has a higher A+T%, and appears to be derived from host sequences: it contains virulence genes.

Another view of the Molluscum contagiosum virus genome self plot – a zoomed-in view of the red box shown in the previous figure. Three of the genes in the “pale stripe” appear to be paralogs, probably resulting from duplications of an ancestral gene acquired from the host [3].

A student pointed me to the Gepard dotplot program [4], which is more suited for large DNA sequences (Gepard: German for “cheetah”, and a backronym for “GEnome PAir – Rapid Dotter”). The self-plot below, for an E. coli genome, took only a couple of minutes to complete. Although it uses a different type of algorithm, the features are similar to Dotter. It is simple to zoom into regions and you can change the parameters for scoring on-the-fly (post-plot).

1. Sonnhammer EL, Durbin R: A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 1995, 167:GC1-10.

2. Brodie R, Roper RL, Upton C: JDotter: a Java interface to multiple dotplots generated by dotter. Bioinformatics 2004, 20:279-281.

3. Da Silva M, Upton C: Host-derived pathogenicity islands in poxviruses. Virol J 2005, 2:30.

4. Krumsiek J, Arnold R, Rattei T: Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 2007, 23:1026-1028. PMID: 17309896

Viral bioinformatics: Introduction + Homology

This week’s addition to the virology toolbox was written by Chris Upton

First, you may be asking yourself – Why viral bioinformatics? Good question! Although it’s true that much in the world of bioinformatics can be applied to all manner of protein and DNA sequences, there are a number of resources that are specific for viruses and there are a number of analyses that every virologist should be familiar with. So this section of the virology toolbox will highlight database resources, some useful tools and analyses, and some pitfalls you want to avoid.

What do I know about viral bioinformatics? Another good question! Well, all I can say is that although I can’t program my way out of a paper bag – I’ve been analyzing DNA and protein sequences of poxviruses for 20+ years and have been “developing” software for analysis of viral genomes for 10+ years. That is, I say what I want the software to do, and a bunch of talented programmers – including many undergraduate students – figure out how to code it. Over the years, this work has been funded by NSERC and PENCE in Canada, and NIH in the USA.

I’ll also be highlighting many of our own tools – why?

  • They were developed precisely for comparative genomics of viruses. That is, genomes from 10-500 kb.
  • They were developed for use by the bench virologist – so they’re fairly straightforward to use.
  • They’re platform independent – will run on Macs, Windoze and LINUX boxes.
  • I know them best and can give good examples of their use.
  • PS. The programs also work on any type of DNA and protein sequences.

My first topic is very simple, but equally important. So watch your language.

Homology = common origin

Phrases like “sequence (structural) homology”, “high homology”, “significant homology”, or even “35% homology” are as common, even in top scientific journals, as they are absurd, considering the above definition.

I took this quote from a book (which seems to be online): Sequence – Evolution – Function: Computational Approaches in Comparative Genomics by Eugene V Koonin and Michael Y Galperin: Kluwer Academic; 2003.

So genes/proteins/sequences are either homologous, or they’re not. No fractions or percentages here!

Try writing: 50% identical. But you also have to say whether you mean nucleotides or amino acids.
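If you want to put a number on it, percent identity is just matches divided by positions compared. Here is a minimal sketch; the function name percent_identity is invented, and skipping columns that contain a gap is only one of several conventions (an assumption here), so always say which one you used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences.

    Columns where either sequence has a gap are skipped, so the
    denominator is the number of columns where both have a residue.
    Works the same way for nucleotides or amino acids.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Five ungapped columns, four of them identical.
print(percent_identity("ACGT-A", "ACTTGA"))
```

The same pair of genes will usually give different values at the nucleotide and amino acid level, which is why you have to say which one you mean.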

While we’re on this topic:

Orthology

Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor. The term “ortholog” was coined in 1970 by Walter Fitch.

Paralogy

Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous. A set of sequences that are paralogous are called paralogs of each other. Paralogs typically have the same or similar function, but sometimes do not: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions (from Wikipedia).

The figure illustrates homologs, orthologs and paralogs.