Viral bioinformatics: Recombination

This week’s addition to the virology toolbox was written by Danielle Coulson and Chris Upton.

Comparing the genomes of viral strains can provide useful insight into their evolutionary relationships. Recombination, defined by Posada et al. (2001) as the exchange of genetic information between two nucleotide sequences, is common in many viruses. Because recombination accounts for much of the genetic diversity observed between viral strains, it is of interest to determine where recombinant sequences originated and which viral strains are likely to have undergone recombination. Several programs exist to detect recombination among genomes and to identify breakpoints, the positions that mark the boundaries of recombinant regions. Three recombination-detection programs are described here, using HIV-1 strains isolated from Uganda, where subtypes A and D are prominent and, given the co-circulation of different strains in Eastern Africa, recombinant A-D strains have arisen.

Recombination analysis tool (RAT)

RAT is a simple, easy-to-use, cross-platform program that compares multiple sequences and detects recombination between them through a straightforward graphical user interface. It provides a clear graphical output, depicting recombination crossover points by plotting the genetic distance between each sequence and the test sequence as a function of sequence position. A sequence alignment is supplied as input (FASTA format works best, although other alignment files will work), and the default parameters may be kept or changed. The default settings are well optimized for analysis; however, a rule of thumb is that the window size should be 10% of the sequence length, and the increment size should be half the window size.
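The rule of thumb above is easy to compute by hand, but a small Python sketch makes the arithmetic explicit. The function name `suggest_window` is made up for this illustration and is not part of RAT; the result is only a starting point for the values entered in the parameter window.

```python
def suggest_window(alignment_length):
    """Return (window, increment) using the rule of thumb from the text:
    window = 10% of the alignment length, increment = half the window."""
    if alignment_length < 1:
        raise ValueError("alignment_length must be positive")
    window = max(2, alignment_length // 10)   # 10% of the length, at least 2
    increment = max(1, window // 2)           # half the window, at least 1
    return window, increment


# Example: an alignment of 9,000 positions
print(suggest_window(9000))  # (900, 450)
```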

Figure 1. Sequence input window in RAT, where parameters can be adjusted, or left as default values. Here, a FASTA file containing three HIV sequences is input, and the suspected recombinant is selected as the test sequence, to which other sequences will be compared.

RAT is useful in that it allows the user to check for recombination between sequences already thought to be recombinants, as well as to run an auto search for possible recombination points. Clicking Execute opens a sequence viewer that displays the similarity of every sequence in the alignment to a specified test sequence. A useful option in this display is the ability to select and unselect individual sequences, either to view all sequences at once and look for possible recombination sites, or to view two at a time and pin down specific recombination breakpoints. When conducting an auto search, results can be screened using customized similarity thresholds. All possible recombination points within the threshold are listed, along with the sequences between which they occur.

Figure 2. Output for a specified sequence search, showing the genetic distance of two HIV strains of clade A and D from the test sequence (the AD recombinant strain) on the Y-axis, and the sequence position on the X-axis. A possible recombination point occurs at position 4874, where the recombinant sequence begins to share higher similarity with clade D than with clade A. This recombinant region appears to end at approximately position 6203.

Finally, the graphical representation of genetic distance between strains can be exported as a JPG file.

RAT uses a distance method, in which pairwise comparisons between sequences are performed as a sliding window moves along the length of the alignment. A score is generated based on the similarity between the nucleotides in the current window of the test sequence and the nucleotides in the same window of each other sequence. Although RAT provides useful information, it does not provide statistical support for the results it generates. The trade-off is speed: the analysis runs very quickly, which is extremely useful for getting an overview of potential recombinant sites.
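The sliding-window idea can be sketched in a few lines of Python. This is an illustration of the general technique, not RAT's actual code: the function names and toy sequences are invented here, the score is simply the fraction of identical positions in each window, and the sequences are assumed to be already aligned to the same length.

```python
def window_similarity(test, ref, window, step):
    """Fraction of identical positions per window between two aligned sequences.
    Returns a list of (window_start, similarity) pairs."""
    if len(test) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(test) - window + 1, step):
        a = test[start:start + window]
        b = ref[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y)
        scores.append((start, matches / window))
    return scores


def crossovers(scores_a, scores_b):
    """Window starts at which the more similar reference switches."""
    points = []
    previous = None
    for (start, sim_a), (_, sim_b) in zip(scores_a, scores_b):
        if sim_a == sim_b:
            continue  # tie: no evidence either way
        closer = "A" if sim_a > sim_b else "B"
        if previous is not None and closer != previous:
            points.append(start)
        previous = closer
    return points


# Toy data: the test sequence matches ref_a in its first half, ref_d in its second
test_seq = "AAAAAAAA" + "CCCCCCCC"
ref_a = "AAAAAAAA" + "GGGGGGGG"
ref_d = "TTTTTTTT" + "CCCCCCCC"

sim_a = window_similarity(test_seq, ref_a, window=4, step=4)
sim_d = window_similarity(test_seq, ref_d, window=4, step=4)
print(sim_a)                     # [(0, 1.0), (4, 1.0), (8, 0.0), (12, 0.0)]
print(crossovers(sim_a, sim_d))  # [8]
```

The crossover is reported at window resolution only, which is why smaller windows and increments give finer breakpoint estimates at the cost of noisier curves.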


SimPlot

SimPlot (for Windows) is another useful tool for detecting recombination between sequences. Like RAT, it produces similarity plots, but it has more features and is therefore slightly more complex. SimPlot allows for the analysis of up to 10 sequences (although the alignment may contain more than this), where each can be used as a query sequence against which to compare the rest, or hidden from the analysis. Other useful features of SimPlot are the ability to ignore sites containing gaps in the alignment when generating the similarity plot, and the ability to click on any point in the plot to see its sequence position and exact similarity value. There is also a zoom feature, as well as options to include titles, legends, grid lines and other useful information in the display. SimPlot also allows sequences to be grouped together, so that analysis can be performed between groups rather than individual sequences.
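To make the gap-handling option concrete, here is a small Python sketch of what ignoring gapped sites amounts to: any alignment column in which at least one sequence has a gap is dropped before similarity is calculated. The function name, the gap characters and the toy alignment are assumptions made for this example; SimPlot's own implementation may differ in detail.

```python
def strip_gap_columns(seqs, gap_chars="-."):
    """Drop every alignment column in which any sequence has a gap.
    `seqs` maps sequence names to aligned sequences of equal length."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    keep = [
        i
        for i, column in enumerate(zip(*seqs.values()))
        if not any(base in gap_chars for base in column)
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in seqs.items()}


# Toy alignment: columns 1 and 2 each contain a gap and are removed
alignment = {
    "query": "AC-GT",
    "cladeA": "ACTGT",
    "cladeD": "A-TGT",
}
print(strip_gap_columns(alignment))
# {'query': 'AGT', 'cladeA': 'AGT', 'cladeD': 'AGT'}
```

Note that positions in the stripped alignment no longer match positions in the original one, so breakpoints read from a gap-stripped plot need to be mapped back before they are compared with other tools.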

Sequences (FASTA format, as well as other common alignment files) are loaded into the program, and those that are desired are selected for analysis. Several options are available for analysis, such as which Distance Model to use, and the number of Bootstrap replicates to use for statistical significance.
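For readers unfamiliar with these two options, the Python sketch below shows what a distance model and bootstrap replicates mean in principle. It is not SimPlot's code. The Jukes-Cantor correction is used as one well-known example of a distance model, and the bootstrap step is a deliberate simplification: it resamples alignment columns with replacement and counts how often one reference is closer to the query than the other, whereas real bootscanning methods typically build a tree for each replicate. All function names and sequences are invented for the illustration.

```python
import math
import random


def p_distance(a, b):
    """Proportion of positions at which two aligned sequences differ."""
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def bootstrap_support(query, ref_a, ref_b, replicates=100, seed=0):
    """Fraction of replicates in which ref_a is closer to query than ref_b.
    Each replicate resamples alignment columns with replacement."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    n = len(query)
    wins = 0
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]
        q = [query[i] for i in cols]
        a = [ref_a[i] for i in cols]
        b = [ref_b[i] for i in cols]
        if p_distance(q, a) < p_distance(q, b):
            wins += 1
    return wins / replicates


# Toy window: the query is identical to ref_a and differs from ref_b everywhere
support = bootstrap_support("ACGTACGT", "ACGTACGT", "TGCATGCA")
print(support)  # 1.0
```

More replicates give a more stable estimate of support but take proportionally longer to run, which is the trade-off behind the replicate setting.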

Figure 3. Similarity plot generated in SimPlot using one query sequence (recombinantAD) and two other sequences (HIV-1 strains from Clade A1 and Clade D). Possible recombination sites are identified where the curves cross over. By zooming in for a better view and clicking on crossover regions, breakpoints can be determined, such as at positions 4481 and 5681 in the alignment. Other potential recombinant regions are also identified that were not obvious in RAT.

Finally, SimPlot provides the option of finding specific recombinant sites. After identifying potential recombinant sequences on the similarity plot, specific sequences within each group can be chosen for an informative site analysis. Overall, SimPlot provides a very effective means of detecting recombination, in an easy-to-use interface with fast results.

Recombination Detection Program (RDP)

RDP (for Windows) is yet another program that detects recombination among aligned sequences. It is unique in that it incorporates several detection methods and analysis algorithms into one well laid-out interface, allowing the user to select whichever method of recombination detection is most suitable and provides the best results.

Figure 4. RDP Overall Display

Furthermore, recombination events that are detected are displayed graphically, with statistical evidence provided, and recombination events are also depicted on phylogenetic trees constructed from proposed recombinant regions. These features allow the user to decipher which events are true recombination events, and discard those that have been incorrectly identified. Importantly, possible recombination events are listed with warnings, to indicate when the program is not confident of the proposed recombination event, its location and sequence breakpoints, or its contributing sequences.

Easy navigation through the sequences is possible, as these are displayed alongside the statistical display of recombinant regions and breakpoints, the schematic display of recombinant sequences, and the dendrogram display.

Figure 5. Sequence display

Figure 6. Schematic sequence display of recombinant regions; each recombinant block can be selected in order to view the supporting evidence for recombination.

Figure 7. Recombination information; displayed here are any relevant warnings suggesting reasons why the recombinant may have been misidentified, as well as statistical evidence supporting the event from each algorithm used.

Figure 8. Graphical representation of the recombinant region (shown in pink) as determined by the RDP algorithm. The pairwise identity of each pair of sequences is plotted on the Y-axis against position in the alignment on the X-axis.

Clicking on any of the potential recombinant sequence blocks in the schematic sequence display brings up graphical evidence for that block in the bottom left display area. The pink region here shows the likely recombinant region. Each potential recombinant region also has its accompanying statistical evidence displayed in the recombination information display in the top right corner. The tree display is quite useful in determining true recombination events; it provides dendrograms of non-recombinant and recombinant regions for comparison. While the default setting is to create trees using the neighbor joining method (which requires less time), RDP is also able to create trees using UPGMA, least squares, Bayesian and Maximum Likelihood algorithms.

As mentioned, RDP provides statistical evidence for each recombination event, as determined by several different methods. Although the default displays evidence as determined by the method which most strongly suggested recombination in that region, the user can easily see graphical displays of the different methods used to find the breakpoint.

The recombination detection algorithms in the tool include RDP, Geneconv, Bootscan, MaxChi, Chimaera, SiScan, 3Seq, LARD and TOPAL. Each detects recombination in a different way, which allows recombination to be found in many different kinds of alignment. Furthermore, each type of analysis has several customizable options, set to default values that work well. Manual distance plots, similar to those created by SimPlot and RAT, are also possible, where any selected sequence can be queried against all others in the alignment.

With the vast array of options and analysis preferences available in RDP, the average run-time for an alignment is longer than for the other programs; however, much more information is provided. Knowing which detection algorithms best suit the alignment allows the user to run only those, so that the analysis proceeds much faster.

Finally, this program is accompanied by an extremely useful user’s manual, which explains the available algorithms and which is best suited to different alignments. The manual also includes a step-by-step guide that details the process of detecting recombination in sequences, from preliminary hypothesis to conclusive, statistically supported recombinant regions.

Example sequences:

Clade A1: HIV-1 isolate 99UGA07072 from Uganda, partial genome

GenBank: AF484478.1

Clade D: HIV-1 isolate 99UGC06443 from Uganda, partial genome

GenBank: AF484479.1

RecombinantAD: HIV-1 isolate 99UGB21875 from Uganda, partial genome

GenBank: AF484480.1
