TWiV 358: Virology and proteomics with Ileana Cristea

On episode #358 of the science show This Week in Virology, Vincent meets up with Ileana at Princeton University to talk about how her laboratory integrates molecular virology, mass spectrometry-based proteomics, and bioinformatics to unravel the interplay between virus and host.

You can find TWiV #358 at

Bioinformatics Workshop on Virus Evolution and Molecular Epidemiology

from Brian Foley:

18th International BioInformatics Workshop on Virus Evolution and Molecular Epidemiology
University of Florida, Emerging Pathogens Institute
Gainesville, Florida, USA
August 25th – August 30th, 2013
Bioinformatics Methods Applied to Virology and Epidemiology

Announcing the organization of the international workshop on Virus Evolution and Molecular Epidemiology (VEME) in 2013, hosted by the Emerging Pathogens Institute in the warm city of Gainesville and sponsored by several local partners.

We plan to organize a ‘Phylogenetic Inference’ module that offers the theoretical background and hands-on experience in phylogenetic analysis for those who have little or no prior expertise in sequence analysis. An ‘Evolutionary Hypothesis Testing’ module is targeted to participants who are familiar with alignments and phylogenetic trees and would like to extend their expertise to likelihood and Bayesian inference in phylogenetics, coalescent and phylogeographic analyses (‘phylodynamics’) and molecular adaptation. A ‘Large Dataset Analysis’ module will cover the more complex analysis of full genomes, huge datasets of pathogens including Next Generation Sequencing data, and combined analyses of pathogen and host. Practical sessions in these modules will involve software such as PHYLIP, PAUP*, PHYML, MEGA, PAML or HYPHY, TREE-PUZZLE, SplitsTree, BEAST, MrBayes, Simplot and RDP3.

We recommend that participants buy The Phylogenetic Handbook as a guide during the workshop and bring their own data set.

For further information and applications check this website.

Abstract and application deadline is April 30th.

Selections will be made by end of May 2013.

The registration fee of 1000 USD covers attendance, lunches and coffee breaks.

Participation is limited to 25 scientists in each module and depends on a selection procedure based on the submitted abstract and statement of motivation. A limited number of grants are available for scientists who have difficulty attending for financial reasons.

TWiV 192: Viral tertulia

On episode #192 of the science show This Week in Virology, Vincent, Alan, and Rich answer listener email about bioinformatics, insects, influenza, laboratory classes, commensalism, reproducibility of data, and more.

You can find TWiV #192 at

Viral bioinformatics: Sequence searcher

This week’s addition to the virology toolbox was written by Chris Upton

Sequence Searcher is a Java program that allows users to search for specific sequence motifs in protein or DNA sequences. For example, it can be used to identify restriction enzyme cleavage sites or find similar sequence patterns among multiple sequences. Most searches run in a few seconds.

Sequence Searcher is part of the suite of programs available at the University of Victoria.

Help files:

Some of the key features of Sequence Searcher include:

  • Searching through multiple sequences
  • Use of regular expressions or fuzzy search patterns.
  • Searching for patterns on both strands of a DNA sequence
  • Graphical representation of results and ability to save search results
  • It can run on multiple computer platforms (Java)

For DNA, the search looks for the motif from the 5’ to the 3’ end of the top strand, and then from the 5’ to the 3’ end of the bottom strand. As a result, bases are numbered from 1 starting at the 5’ end of whichever strand (top or bottom) the match is on.

Regular expression and fuzzy pattern searches are available:

Fuzzy searches provide the option for the program to allow a certain number of mismatches from a sequence input at any position.  Note that the maximum number of mismatches that the program allows is 40% of the length of the sequence motif.

Regular expression allows for inputs of precise motifs along with considerable user-specified flexibility at specific positions.
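To make the two search types concrete, here is a minimal Python sketch of the ideas described above. It is not Sequence Searcher's code: the function names (`reverse_complement`, `regex_search`, `fuzzy_search`) are made up for illustration, only A/C/G/T are handled (no ambiguity codes), and the regular expression search reports non-overlapping matches only. Positions are 1-based from the 5’ end of whichever strand the hit is on, and the fuzzy search enforces the 40% mismatch cap mentioned above.

```python
import re

# Complement table for plain A/C/G/T only (an assumption; IUPAC codes are ignored).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the bottom strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def regex_search(seq, pattern):
    """Regex search on both strands.

    Returns (strand, start, stop, match) tuples; start/stop are 1-based
    and counted from the 5' end of the strand the match is on.
    """
    hits = []
    for strand, s in (("top", seq), ("bottom", reverse_complement(seq))):
        for m in re.finditer(pattern, s):
            hits.append((strand, m.start() + 1, m.end(), m.group()))
    return hits


def fuzzy_search(seq, motif, max_mismatches):
    """Find windows that differ from the motif at no more than max_mismatches positions."""
    limit = int(0.4 * len(motif))  # cap described in the post: 40% of motif length
    if max_mismatches > limit:
        raise ValueError("max_mismatches may not exceed 40% of the motif length")
    hits = []
    for strand, s in (("top", seq), ("bottom", reverse_complement(seq))):
        for i in range(len(s) - len(motif) + 1):
            window = s[i:i + len(motif)]
            mismatches = sum(a != b for a, b in zip(window, motif))
            if mismatches <= max_mismatches:
                hits.append((strand, i + 1, i + len(motif), window, mismatches))
    return hits
```

For example, `regex_search("ATGCCC", "GGG")` finds nothing on the top strand but reports a hit at positions 1 to 3 of the bottom strand, because the bottom strand read 5’ to 3’ is GGGCAT.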


Figure 1. The input tab is where you can import DNA or protein sequences (must be in FASTA format) and type in the specific pattern to search for within the sequence(s). The search type can be selected as “Regular expression” or “Fuzzy” by using the drop down menu.


Figure 2. When a search has been completed, the results tab is presented in a table format. The results in the table can be sorted depending on the column header (sequence, match, start, stop, confidence, and strand). The results can also be filtered by sequence and strand by selecting the drop down menus at the top.

Marass, F., & Upton, C. (2009). Sequence Searcher: A Java tool to perform regular expression and fuzzy searches of multiple DNA and protein sequences. BMC Research Notes, 2 (1). DOI: 10.1186/1756-0500-2-14

Viral bioinformatics: Multiple sequence alignment – Base-By-Base (BBB) editor

This week’s addition to the virology toolbox was written by Chris Upton (Disclaimer: BBB was developed in the Upton lab, so this is a biased review.)

This will be a multi-part posting describing the key features of BBB.

Base-By-Base (BBB) is a Java (platform independent) multiple sequence alignment (MSA) editor. Development was begun many years ago to provide a virologist-friendly tool to work with MSAs of proteins, gene sequences and also genome sequences (up to about 500 kb). Over the years we have added many unique features, as they have been needed by our own research with a variety of virus genomes; development has been driven by the users – who spend a lot of their day looking at a variety of MSAs!

BBB is available here.

The program is Open Source and freely available to all academic labs.

Help files:

Key features:

  1. The program edits MSAs and uses a unique system to display differences between sequences in an MSA. This makes it easier for the user to spot mis-aligned regions that need correcting (Figure 1A-D). The differences shown can be between adjacent pairs of sequences, between each sequence and a consensus, or between each sequence and the top sequence (Figures 1A and B).
  2. Sequences can be temporarily hidden/revealed. See “eyes” at left of window.
  3. 3-frame translations can be displayed for DNA sequences (Figure 2A).
  4. Top or Bottom strand can be shown for DNA sequences; compare Figures 2A and B.
  5. Sequence annotations can be read from a GenBank file, or added by the user (Figure 3). Users can also use a text window (Edit menu: Edit MSA notes) to jot down notes about an alignment; these are saved with the alignment in the .bbb file.
  6. You can also edit sequences! i.e. change the nucleotides or amino acids; see Figure 4. This can be very useful when you need to edit an assembled/annotated sequence for an occasional sequencing error.

Figure 1A. Differences between sequences are high-lighted. Blue=nt substitution; Green=nt deletion; Red=nt insertion.

Figure 1B. Edited version of Figure 1A (my opinion).

Figure 1C. Same alignment as 1B, but differences are set for compare against top sequence.

Figure 1D. Same alignment and differences setting, but Sequence-2 has been moved to the bottom of the alignment (use arrows on left edge of window).

Figure 2A. Two sequences have been hidden. 3-frame translation is shown. Top strand is shown by default.

Figure 2B. Switched to bottom strand display (use top right button; 5’Top3’ in Fig. 2A). Notice direction of arrows in aa translation.

Figure 3. Gene annotations from GenBank file (pink); mouse-over gives gene annotation (#083) and nucleotide position. User-added comments: Blue=comment on Top strand; comments associated with the strand not currently displayed are shown as outlines.

Figure 4A. Editing a sequence. First, select a region.

Figure 4B. Editing a sequence. Second, delete the selection.

Figure 4C. Editing a sequence. Third, add new nucleotides.


  1. Save alignments as .bbb files on your local computer. These documents can be reloaded back into BBB.
  2. You’ll find the Preferences menu, under the File menu.
  3. You can edit the names of the sequences (Edit menu)
  4. Paper+Pencil icon is for editing; Paper+arrow icon is for selecting.

Viral Bioinformatics: Multiple sequence alignment – Jalview

This week’s addition to the virology toolbox was written by Chris Upton

The Jalview package: a multiple alignment editor.

This software is primarily aimed at the alignment of protein sequences. Some of the key features are:

  • It allows you to edit the alignment
  • It has functions to display associated protein structures
  • It can connect to software to predict protein secondary structure
  • It’s under active development
  • Jalview has great documentation and tutorials
  • More: Overview, Documentation


  1. Although you can install Jalview on your computer very easily, using the Start with Java Web Start button is even easier and ensures you always have the latest version of the software.
  2. There is also an Applet version of Jalview that is intended to be an alignment viewer – it doesn’t have all the functionality.

If you use Jalview in your work, you should cite the Jalview 2 publication:

• Waterhouse, A.M., Procter, J.B., Martin, D.M.A., Clamp, M., Barton, G.J. (2009), Jalview version 2: A Multiple Sequence Alignment and Analysis Workbench. Bioinformatics 25:1189-91.

• Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004), The Jalview Java Alignment Editor. Bioinformatics 20: 426-7.

Viral bioinformatics: Introduction to multiple sequence alignment

This week’s addition to the virology toolbox was written by Chris Upton

Generating multiple sequence alignments (MSA) is one of the most commonly used bioinformatics techniques. The “sequences” to be compared can be DNA (promoters, genes, genomes) or proteins. Note that the length and number of sequences to be aligned has an impact on the methods (algorithms) that can be used; what is suitable for aligning 20 proteins probably won’t work for alignment of 5 poxvirus genomes (200 kb each).

Some useful links:

So you see, there are lots of options (did you say: “too many!”?). Further confusion may arise because 1) the same algorithm may be used in many different software programs, and 2) referencing a software package may give no clue to the algorithm used. For many molecular biologists, Clustal is synonymous with sequence alignment. However, newer algorithms such as T-Coffee and MUSCLE are often offered in current software packages, and may be faster and more accurate.

Specialized alignment tools are almost always needed for long, genome sized DNA sequences.

In this set of posts, I’ll provide some information on favorite general MSA tools (that are free) that should be useful to the average molecular virologist. The lists noted above provide a multitude of tools, but many are for specific analyses.

Viral bioinformatics: Recombination

This week’s addition to the virology toolbox was written by Danielle Coulson and Chris Upton

Comparing genomes of viral strains can provide very useful insight into evolutionary relationships. Recombination, defined by Posada et al (2001) as the exchange of genetic information between two nucleotide sequences, is quite common in many viruses. Because recombination accounts for much of the genetic diversity observed between viral strains, it is of interest to decipher the origins of recombinant sequences and to know which viral strains are likely to have undergone recombination. Several programs exist to detect recombination among genomes and to identify breakpoints in sequences, which mark the boundaries of recombinant regions. Three recombination-detection programs are described here, using HIV-1 strains isolated from Uganda, where subtypes A and D are prominent. Recombinant A-D strains have arisen in Eastern Africa, given the co-circulation of different types of strains.

Recombination analysis tool (RAT)

RAT is a very simple, easy-to-use cross-platform program that allows for the comparison of multiple sequences and the detection of recombination between them, in a straightforward graphical user interface. It provides a clear graphical output, depicting recombination crossover points between sequences by plotting the genetic distance between each sequence and the test sequence as a function of sequence position. A sequence alignment (FASTA format works best, although other alignment files will work) is input, and default parameters may be maintained or changed. The default settings are well optimized for analysis; however, a rule of thumb is that window size should be 10% of the sequence length, and the increment size should be half the window size.

Figure 1. Sequence input window in RAT, where parameters can be adjusted, or left as default values. Here, a FASTA file containing three HIV sequences is input, and the suspected recombinant is selected as the test sequence, to which other sequences will be compared.

RAT is useful in that it allows the user to check for recombination between sequences already thought to be recombinants, as well as to conduct an auto search to find possible recombination spots. By clicking execute, a sequence viewer will display the similarity of all sequences in the alignment as compared to a specified test sequence. Useful in this display is the option to select and unselect different sequences, in order to view all sequences at once to find possible recombination sites, or to view two at a time to decipher specific recombination breakpoints. When conducting an auto search, results can be screened based on customized similarity thresholds. All possible recombination points within the threshold are listed, along with the sequences between which they occur.

Figure 2. Output for specified sequence search, showing the genetic distance of two HIV strains of clade A and D from the test sequence (the AD recombinant strain) on the Y-axis, and the sequence position on the X-axis. A possible recombination spot occurs at position 4874, where the recombinant sequence now shares higher similarity with clade D than clade A. This recombinant region appears to end at approximately position 6203.

Finally, the graphical representation of genetic distance between strains can be exported as a JPG file.

RAT works based on a distance method, whereby pairwise comparisons between sequences are performed as a sliding window moves along the length of the sequence. A score is generated based on the similarity between the nucleotides in the current window of the test sequence and the nucleotides in the same window of the other sequences. While useful information is provided in the RAT program, it does not provide statistical support for the results generated. However, this is an advantage as it allows analysis to proceed very quickly, which is extremely useful to get an overview of potential recombinant sites.
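The sliding-window idea can be sketched in a few lines of Python. This is only an illustration of the general distance approach described above, not RAT's actual scoring: the function names are hypothetical, the score is a simple fraction of identical columns (distance would be one minus this value), and gap characters are compared like any other character, which is a simplification. The `rule_of_thumb` helper encodes the window and increment guideline given earlier (10% of the sequence length, and half the window).

```python
def rule_of_thumb(length):
    """Window = 10% of the sequence length, increment = half the window."""
    window = max(1, length // 10)
    step = max(1, window // 2)
    return window, step


def window_similarity(test, other, window, step):
    """Compare two aligned sequences window by window.

    Returns a list of (window start, fraction identical) pairs, with the
    start position 1-based. Both sequences must come from the same
    alignment, so they have the same length.
    """
    if len(test) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(test) - window + 1, step):
        a = test[start:start + window]
        b = other[start:start + window]
        identical = sum(x == y for x, y in zip(a, b))
        scores.append((start + 1, identical / window))
    return scores
```

Running `window_similarity` for the test sequence against each of the other sequences and plotting the scores on the same axes gives the kind of crossover picture shown in Figure 2: a breakpoint is suggested where the curve for one parent drops while the curve for the other rises.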


SimPlot

SimPlot (for Windows) is another useful tool for detecting recombination between sequences. Like RAT, it produces similarity plots, but it has more features and is therefore slightly more complex. SimPlot allows for the analysis of up to 10 sequences (although the alignment may have more than this), where each can be used as a query sequence with which to compare the rest, or hidden from the analysis. Other useful and unique features of SimPlot are the ability to ignore sites containing gaps in the alignment when generating the similarity plot, and the ability to click on any point in the plot to see its sequence position and exact similarity value. Furthermore, there is a zoom in feature, as well as options to include titles, legends, grid lines and other useful information as part of the display. SimPlot also allows sequences to be grouped together, and analysis to be performed between groups rather than individual sequences.

Sequences (FASTA format, as well as other common alignment files) are loaded into the program, and those that are desired are selected for analysis. Several options are available for analysis, such as which Distance Model to use, and the number of Bootstrap replicates to use for statistical significance.

Figure 3. Similarity plot generated in SimPlot using one query sequence (recombinantAD) and two other sequences (HIV-1 strains from Clade A1 and Clade D). Possible recombination sites are identified where sequence crossover occurs. By zooming in and clicking on crossover regions, breakpoints can be determined, such as at positions 4481 and 5681 in the alignment. Other potential recombinant regions are also identified that were not obvious in RAT.

Finally, SimPlot provides the option of finding specific recombinant sites. After identifying potential recombinant sequences on the similarity plot, specific sequences within each group can be chosen for an informative site analysis. Overall, SimPlot provides a very effective means of detecting recombination, in an easy-to-use interface with fast results.

Recombination Detection Program (RDP)

RDP (for Windows) is yet another program that allows for detection of recombination amongst aligned sequences. However, it is unique in that it incorporates several detection methods and analysis algorithms into one well laid-out interface, allowing the user to select whichever method of recombination detection is most suitable and provides the best results.

Figure 4. RDP Overall Display

Furthermore, recombination events that are detected are displayed graphically, with statistical evidence provided, and recombination events are also depicted on phylogenetic trees constructed from proposed recombinant regions. These features allow the user to decipher which events are true recombination events, and discard those that have been incorrectly identified. Importantly, possible recombination events are listed with warnings, to indicate when the program is not confident of the proposed recombination event, its location and sequence breakpoints, or its contributing sequences.

Easy navigation through the sequences is possible, as these are displayed alongside the statistical display of recombinant regions and breakpoints, the schematic display of recombinant sequences, as well as the dendrogram display.

Figure 5. Sequence display

Figure 6. Schematic sequence display of recombinant regions; each recombinant block can be selected in order to view the supporting evidence for recombination.

Figure 7. Recombination information; displayed here are any relevant warnings suggesting reasons why the recombinant may have been misidentified, as well as statistical evidence from each algorithm used supporting the event.

Figure 8. Graphical representation of recombinant region (shown in pink) as determined by the RDP algorithm. Plotted on the Y axis is the pairwise identity of each pair of sequences, against their position in the alignment on the X axis.

By clicking on the various potential recombinant sequence blocks in the schematic sequence display of recombinant regions, graphical evidence will appear for each on the bottom left display area. The pink region here shows the likely recombinant region. Each potential recombinant region will also have its accompanying statistical evidence displayed in the top right corner in the recombination information display. The tree display is quite useful in determining true recombination events; it provides dendrograms of non-recombinant regions and recombinant regions to compare. While the default setting is to create trees using the neighbor joining method (which requires less time), RDP is also able to create trees using UPGMA, least squares, Bayesian and Maximum Likelihood algorithms.

As mentioned, RDP provides statistical evidence for each recombination event, as determined by several different methods. Although the default displays evidence as determined by the method which most strongly suggested recombination in that region, the user can easily see graphical displays of the different methods used to find the breakpoint.

Among the different recombination detection algorithms in the tool are RDP, Geneconv, Bootscan, MaxChi, Chimaera, SiScan, 3Seq, LARD and TOPAL, all of which are optimized to detect recombination in different ways, thus allowing for detection of recombination in a variety of alignments. Furthermore, each type of analysis has several customizable options, set to default values that work well. Manual distance plots, similar to those created by SimPlot and RAT, are also possible, where any selected sequence can be queried against all others in the alignment.

With the vast array of options and analysis preferences that are available in RDP, the average run-time for an alignment is longer than for the other programs; however, much more information is provided. Knowing which detection algorithms best suit the alignment allows the user to run only those, so the analysis proceeds much faster.

Finally, this program is accompanied by an extremely useful user’s manual, explaining the algorithms that are available and which is best suited to different alignments. The manual also includes a step by step guide, which details the process of detecting recombination in sequences, from preliminary hypothesis, to finding conclusive statistically supported recombinant regions.

Example sequences:

Clade A1: HIV-1 isolate 99UGA07072 from Uganda, partial genome

GenBank: AF484478.1

Clade D: HIV-1 isolate 99UGC06443 from Uganda, partial genome

GenBank: AF484479.1

RecombinantAD: HIV-1 isolate 99UGB21875 from Uganda, partial genome

GenBank: AF484480.1

Viral bioinformatics: Dotplot

This week’s addition to the virology toolbox was written by Chris Upton

Dotplots are an extremely useful way of visualizing comparisons of small and large DNA sequences (as well as protein sequences), providing insight into the degree of similarity, deletions, insertions and direct and indirect repeats. In a dotplot, each nucleotide, or small window of nucleotides, of one sequence is compared with every nucleotide of a second sequence. Dotplots can quickly provide an overview of the relationship between sequences.
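For readers who like to see the idea in code, here is a toy Python sketch of a dotplot. It is a naive all-against-all comparison for short sequences, not the algorithm used by any of the programs discussed in this post, and the function names and the `min_matches` parameter are made up for illustration. A dot is placed wherever a window starting at position i of one sequence matches the window starting at position j of the other at enough positions.

```python
def dotplot(seq_x, seq_y, window=1, min_matches=None):
    """Build a dot matrix as a list of rows of booleans.

    rows[j][i] is True when the windows starting at seq_x[i] and seq_y[j]
    agree at min_matches or more positions. By default every position in
    the window must match.
    """
    if min_matches is None:
        min_matches = window
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = []
        for i in range(len(seq_x) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(matches >= min_matches)
        rows.append(row)
    return rows


def render(rows):
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in rows)
```

Printing `render(dotplot(a, b, window=3))` for two related sequences shows an unbroken diagonal where they are similar, a shifted diagonal after an insertion or deletion, and off-diagonal lines for repeats. The cost grows with the product of the two sequence lengths, which is one reason genome-sized plots need more specialized programs.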

The Dotter program [1] has several very useful features including the ability to save and reload dotplots, the ability to zoom into particular regions of the plot, an option to create a multi-dotplot by aligning more than two DNA (or protein) sequences and permitting users to adjust the stringency of the matrix being displayed in real-time by changing the greyscale of the dots.

JDotter [2] provides an easy to use Java (platform independent) interface to Dotter giving all the benefits of Dotter in a single web-accessible tool. You can access JDotter here.

Additional background information on nucleic acid dotplots is available.

The first figure is a dotplot of three poxvirus interferon gamma binding proteins plotted against each other. Genes are displayed along the axes. This plot takes a few seconds to calculate.

Here is a dotplot of vaccinia CVA and MVA genomes (~170 kb). Large deletions are present in MVA, a result of >500 passages in chicken embryo fibroblasts.  Terminal inverted repeat sequences are obvious in the bottom-left and top-right corners of the plot. A plot for these sequences takes ~ 10 min to calculate.

Next is a self plot of the Molluscum contagiosum virus genome. Enhancing “background” shows that it’s not totally random. The “stripes” are caused by segments of DNA with different nucleotide composition. The region that creates the area in the red box has a higher A+T%, and appears to be derived from host sequences: it contains virulence genes.

Another view of the Molluscum contagiosum virus genome self plot – a zoomed-in view of the red box shown in the previous figure. Three of the genes in the “pale stripe” appear to be paralogs, probably resulting from duplications of an ancestral gene acquired from the host [3].

A student pointed me to the Gepard dotplot program [4], which is more suited for large DNA sequences (Gepard: German, “cheetah”, Backronym for “GEnome PAir – Rapid Dotter”). The self-plot below, for an E. coli genome, took only a couple of minutes to complete. Although it uses a different type of algorithm, the features are similar to Dotter. It is simple to zoom into regions and you can change the parameters for scoring on-the-fly (post-plot).

1. Sonnhammer EL, Durbin R: A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 1995, 167:GC1-10.

2. Brodie R, Roper RL, Upton C: JDotter: a Java interface to multiple dotplots generated by dotter. Bioinformatics 2004, 20:279-281.

3. Da Silva M, Upton C: Host-derived pathogenicity islands in poxviruses. Virol. J 2005, 2:30.

4. Krumsiek J, Arnold R, Rattei T. Gepard: A rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 2007; 23(8): 1026-8. PMID: 17309896

Viral bioinformatics: Introduction + Homology

This week’s addition to the virology toolbox was written by Chris Upton

First, you may be asking yourself – Why viral bioinformatics? Good question! Although it’s true that much in the world of bioinformatics can be applied to all manner of protein and DNA sequences, there are a number of resources that are specific for viruses and there are a number of analyses that every virologist should be familiar with. So this section of the virology toolbox will highlight database resources, some useful tools and analyses, and some pitfalls you want to avoid.

What do I know about viral bioinformatics? Another good question! Well, all I can say is that although I can’t program my way out of a paper bag – I’ve been analyzing DNA and protein sequences of poxviruses for 20+ years and have been “developing” software for analysis of viral genomes for 10+ years. That is, I say what I want the software to do, and a bunch of talented programmers – including many undergraduate students – figure out how to code it. Over the years, this work has been funded by NSERC and PENCE in Canada, and NIH in the USA.

I’ll also be highlighting many of our own tools – why?

  • They were developed precisely for comparative genomics of viruses. That is, genomes from 10-500 kb.
  • They were developed for use by the bench virologist – so they’re fairly straightforward to use.
  • They’re platform independent – will run on Macs, Windoze and LINUX boxes.
  • I know them best and can give good examples of their use.
  • PS. The programs also work on any type of DNA and protein sequences.

My first topic is very simple, but equally important. So watch your language.

Homology = common origin

Phrases like “sequence (structural) homology”, “high homology”, “significant homology”, or even “35% homology” are as common, even in top scientific journals, as they are absurd, considering the above definition.

I took this quote from a book (which seems to be online): Sequence – Evolution – Function: Computational Approaches in Comparative Genomics by Eugene V Koonin and Michael Y Galperin: Kluwer Academic; 2003.

So genes/proteins/sequences are either homologous, or they’re not. No fractions or percentages here!

Try writing: 50% identical. But you also have to say whether you mean nucleotides or amino acids.
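If you want to compute the number yourself, a minimal Python sketch looks like this. The function name is hypothetical, and the choice of denominator is an assumption: here only columns where neither sequence has a gap are counted, but other conventions exist (alignment length, or length of the shorter sequence), so always say which one you used. The same function works for nucleotide or amino acid alignments, which is exactly why you need to state which you mean.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical positions between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumed
    convention; others divide by the full alignment length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared
```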

While we’re on this topic:


Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor. The term “ortholog” was coined in 1970 by Walter Fitch.


Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous. A set of sequences that are paralogous are called paralogs of each other. Paralogs typically have the same or similar function, but sometimes do not: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions (from Wikipedia).

The figure illustrates homologs, orthologs and paralogs.