VISTA Exercises
An electronic version of these instructions can be found at http://hazelton.lbl.gov/vista/exercises.shtml.
These exercises will familiarize you with many features of the VISTA tools for Comparative Genomics. You will learn how to submit your own genomic sequences for comparative analysis (mVISTA, rVISTA, gVISTA) and investigate comparative data for your own sequences and pre-computed whole-genome alignments (VISTA Browser, VISTA Point, Whole-Genome rVISTA).
All exercises start at the VISTA portal: http://genome.lbl.gov/vista/
Section 1.
Examples of detailed comparative analysis.
I. VISTA Browser and VISTA Point
VISTA Point and VISTA Browser allow users to interactively visualize a variety of whole genome alignments and quickly identify highly conserved regions.
Exercise 1.
Find the minimum percent conservation identity for which all of the exons on the LDL Receptor gene are conserved between Human (Feb. 2009) and Mouse (July 2007) assemblies. Retrieve the coordinates of the conserved regions.
Hint: The RefSeq name of the LDL Receptor gene is LDLR. Click on the icon above the curve to change parameters.
Answer: Exons are conserved at 57% minimum conservation identity.
Step-by-Step
- Go to http://genome.lbl.gov/vista/. Click on the "VISTA Point" link located in the middle of the page. Make sure "Human Feb. 2009" is selected in the base genome box, and enter "LDLR" in the position box (note that you can only enter a RefSeq gene name or a chromosome coordinate in the position box). Inspect the list to find the LDL Receptor Gene: in this case it is the first match. Click on it to load the human/mouse comparison.
- Identify the strand on which LDLR is transcribed, the coding exons and UTRs (they are marked on the annotation track above the curve, and colored according to the color legend in the lower left-hand corner). Are all the exons and UTRs conserved? No, 2 coding exons and 1 UTR are not conserved.
- Try adjusting the parameters of the human/mouse comparison to get all the exons and UTRs marked as conserved. Select the button above the curve to adjust parameters. A description of the parameters is available from the "Help" pages at "Changing Curve Parameters". In this case, you will want to try lowering the "Cons Identity". Experiment with parameter values until you get all the exons to be marked as conserved. Lowering the conservation to 57% identifies all coding exons/UTR as conserved in the human/mouse comparison.
Exercise 2.
Identify the human coordinates of the non-coding regions in the HOXA3 gene that are conserved between Human (March 2006 assembly), Mouse and Chicken. Find the coordinates of the chicken genomic interval that aligns to human HOXA3.
Hint: Use VISTA Browser, VISTA-Point and the longest isoform.
Step-by-Step
- In the "position" box of the VISTA Browser entry page, enter "HOXA3" and click "Go." Three matches will come up – double-click the last match (the other two matches are alternatively spliced forms of the gene, which cover only a part of the region we want). Identify the strand of the HOXA3 gene, its exons and UTRs. Can you tell from the display how long the interval is? (Answer: 21 Kb)
- Remove VISTA curves corresponding to Human-Rhesus, Human-Dog, Human-Horse, and Human-Rat comparisons. To do this, click on the curve you want to remove and select the "Remove VISTA Curve" button from the top menu. Now only the Human-Mouse and Human-Chicken comparisons are displayed. Since we are looking at a relatively short interval, we can change the visualization to display each curve on only one row using the "# rows" from the left control panel.
- Identify regions that are highly conserved in all three species (human, mouse and chicken). You will notice that some of the highly conserved sequences are non-coding (pink-colored). Those areas might seem like good candidates for further analysis.
- Click on the button ("alignment details") in the toolbar at the top of the screen. A new browser window, called "VISTA-Point", will open with detailed information regarding the segment of the alignment you were looking at.
- In this window, you can see a lot of data on the aligned regions, including their genomic coordinates. A detailed description of all the options available from VISTA-Point can be found in the Help pages http://pipeline.lbl.gov/vista_help/help.html#vistapoint
- To retrieve the coordinates of regions conserved between human and chicken, click on the "Get CNS" link at the top of the table and select the Human-Chicken pair. The legend for this table is in the top line. The coordinates of conserved non-coding sequences are those marked as "intergenic" or "intron". Note that clicking on the links on this page will give you the sequences of the conserved regions, with retrieval options that facilitate the design of PCR primers for further studying these sequences.
- You should still have the VISTA-Point window open for the Human-Chicken HOXA3 alignment (if not, bring up the alignment again in the browser and click the alignment details icon button).
- To find the coordinates of the chicken genomic interval that aligns to human HOXA3, look for the "Location on chicken" column. In this case, 2 major chicken contigs align to the human interval.
II. rVISTA
rVISTA is a tool that predicts transcription factor binding sites by combining a search of the Transfac database (Full Edition) with comparative sequence analysis. We will now perform rVISTA analysis on the HOXA3 alignment to find predicted transcription factor binding sites.
Note that this is not the only way to use rVISTA. In addition to using it through this page, you can submit to rVISTA directly by going to the main VISTA site and submitting an existing alignment, or you can align two sequences with the main VISTA program (mVISTA) and automatically submit to rVISTA from there.
Please remember that rVISTA has a 20Kb limit on the length of aligned sequences. If the sequence you want to analyze is larger than 20Kb, zoom in on the sequence until you have an interval smaller than 20Kb.
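The length check described above can be done before submitting. The following is a minimal sketch, not part of the VISTA tools: it parses a position string in the "chr:start-end" style used in these exercises and compares its length to the 20Kb limit. The function names are hypothetical, and treating both coordinates as inclusive (and the limit itself as allowed) is an assumption.

```python
import re

RVISTA_MAX_BP = 20_000  # the 20Kb limit stated in the text


def parse_interval(position):
    """Parse a string such as 'chr3:52,461,570-52,463,047' into (chrom, start, end)."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", position)
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def interval_length(position):
    """Length in bp, assuming both coordinates are inclusive."""
    _, start, end = parse_interval(position)
    return end - start + 1


def fits_rvista(position, limit=RVISTA_MAX_BP):
    """True if the interval is no longer than the limit (assumed inclusive)."""
    return interval_length(position) <= limit


# The TNNC1 intron interval used in Section 3 of these exercises.
print(interval_length("chr3:52,461,570-52,463,047"))  # 1478
print(fits_rvista("chr3:52,461,570-52,463,047"))      # True
```

If the check fails, zoom in as described above until the interval is short enough.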
III. Whole Genome rVISTA
Whole Genome rVISTA is designed to aid the analysis of gene expression studies by scanning the regulatory regions of genes exhibiting similar expression patterns. In the current implementation, a gene's regulatory region is defined as the sequences upstream of the transcription start site, up to 5kb.
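To make the definition of the regulatory region concrete, here is a small sketch of how an upstream interval could be computed from a transcription start site (TSS). It is an illustration only and not the server's implementation: the function name, the 1-based inclusive coordinates, and the example TSS positions are all assumptions.

```python
def upstream_region(tss, strand, size=5000):
    """Return (start, end) of the region up to `size` bp upstream of a TSS.

    Coordinates are assumed to be 1-based and inclusive, and the TSS base
    itself is excluded. On the '+' strand upstream means smaller coordinates;
    on the '-' strand it means larger coordinates.
    """
    if strand == "+":
        start, end = tss - size, tss - 1
    elif strand == "-":
        start, end = tss + 1, tss + size
    else:
        raise ValueError("strand must be '+' or '-'")
    # Do not run off the start of the chromosome.
    return max(start, 1), end


# Hypothetical TSS positions, for illustration only.
print(upstream_region(10_000, "+"))  # (5000, 9999)
print(upstream_region(10_000, "-"))  # (10001, 15000)
```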
Exercise 3.
Find the 5 transcription factors which are most overrepresented in the 5 kb upstream of the transcription start site of the following mouse genes: Runx1, Tpm2, Mthfr, Tpbg, Armcx2, Lox, Pdk3, Bcat1, Cdc25c, Dpysl3, Gatm, Tnc, Gpx7, Tfpi2, Adam12, Tubb3
Step-by-Step
- On the main VISTA page (http://genome.lbl.gov/vista/), click on "Whole Genome rVISTA".
- Click on "GO" TFBS in Mouse February 2006 (mm8) assembly conserved in the alignment with the Human March 2006 (hg18) assembly.
- In the pop-up menu determining the size of the upstream region scanned by Whole Genome rVISTA, select "5000" (this is the default value).
- Input the gene names in the rectangular input field. Note that the names provided in this exercise are RefSeq names.
- Select "I am submitting gene names" and click on "Submit".
- The top table in the "Results" page shows all transcription factors that are overrepresented in the 5000 bp upstream of all the 16 genes submitted for this exercise at a p-value cutoff of 0.005. Note that not all the 16 genes need to have binding sites for a given transcription factor. The bottom table shows the transcription factors overrepresented upstream of each gene individually.
Exercise 4.
Which of the above genes are regulated by both of the top two overrepresented transcription factors?
Hint: Use "TFBS in Mouse February 2006 (mm8) assembly conserved in the alignment with the Human March 2006 (hg18) assembly".
Answer: Runx1, Lox, Tpm2, Tpbg, Armcx2.
Step-by-Step
- Click on show genes where TITF1 is present and show genes where HIF1 is present to obtain the list of genes where these two transcription factors are overrepresented. Compare the two lists to find the genes present in both lists.
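If the two gene lists are long, the comparison can be done with a set intersection once you have copied the lists from the results page. This is a minimal sketch; the function name is hypothetical and the factor and gene names below are placeholders, not results from the server.

```python
def genes_with_all_factors(genes_by_factor, factors):
    """Return the sorted genes that appear in the gene list of every factor."""
    gene_sets = [set(genes_by_factor[factor]) for factor in factors]
    return sorted(set.intersection(*gene_sets))


# Placeholder lists: paste in the genes shown for each transcription factor.
genes_by_factor = {
    "TF_A": ["GeneA", "GeneB", "GeneC"],
    "TF_B": ["GeneB", "GeneC", "GeneD"],
}

shared = genes_with_all_factors(genes_by_factor, ["TF_A", "TF_B"])
print(shared)  # ['GeneB', 'GeneC']
```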
Section 2.
Obtaining pair-wise and multiple alignments of your sequences.
Introduction to the comparative analysis servers.
I. mVISTA tool - Align and compare your sequences from multiple species
Click on the mVISTA link at http://genome.lbl.gov/vista/.
We suggest using the following examples to explore mVISTA.
- Pairwise alignment:
- NC_007499 / NC_014056 chloroplasts
- DQ126339 / AY150271 phages
- Multiple alignment:
- leptin (Lep), mRNA:
NM_000230 (Homo sapiens) / NM_173928 (Bos taurus) / NM_008493 (Mus musculus)
- reproductive and respiratory syndrome viruses:
EF536003 / AF066183 / EF484033 / EF484031 / AY366525
Submission to mVISTA: Sequence data fields.
To obtain a multiple sequence alignment for your sequences, submit your sequences to the mVISTA server. Sequences can be uploaded in FASTA format from a local computer using the "Browse" button or, if available in GenBank, they can be retrieved by inputting the corresponding GenBank accession number in the "GENBANK identifier" field.
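Before uploading a local file, it can help to confirm that it parses as FASTA. The sketch below is a generic FASTA reader under common conventions (a header line starting with ">" followed by one or more sequence lines). It does not reproduce the checks the mVISTA server performs, and the function name and example records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            if not name:
                raise ValueError("empty FASTA header")
            if name in records:
                raise ValueError(f"duplicate header: {name}")
            records[name] = []
        else:
            if name is None:
                raise ValueError("sequence data found before the first header")
            records[name].append(line)
    # Join the sequence lines of each record into one string.
    return {header: "".join(parts) for header, parts in records.items()}


example = ">seq1 test\nACGT\nacgt\n>seq2\nNNNN\n"
for header, sequence in read_fasta(example).items():
    print(header, len(sequence))
```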
Submission to mVISTA: Choice of alignment program.
Three genomic alignment programs are available in mVISTA. "LAGAN" is the only program that produces real multiple alignments of finished sequences. Note that if some of the sequences are not ordered and oriented in a single sequence, your query will be redirected to AVID to obtain all-against-all pairwise alignment. "AVID" and "Shuffle-LAGAN" produce only all-against-all pair-wise alignments.
Submission to mVISTA: Additional options
- "Name": Select the names for your species that will be shown in the legend. It is advisable to use something meaningful, such as the name of an organism, the number of your experiment, or your database identifier. When using a GenBank identifier to input your sequence, it will be used by default as the name of the sequence.
- "Annotation": If a gene annotation of the sequence is available, you can submit it in a simple plain text format to be displayed on the plot.
- "RepeatMasker": Masking a base sequence will result in better alignment results. You can submit either masked or unmasked sequences. If you submit a masked sequence in which the repetitive elements are replaced by the letter "N", select the "one-celled/do not mask" option in the pull-down menu. mVISTA also accepts softmasked sequences, where repetitive elements are shown as lowercase letters while the rest of the sequence is shown in capital letters. In this case, you need to select the "softmasked" option in the menu. If your sequences are unmasked, mVISTA will mask repeats with RepeatMasker. If you do not want your sequence to be masked, select "one-celled/do not mask".
- Leave the "Find potential transcription factor binding sites using rVISTA" and "Use translated anchoring in LAGAN/Shuffle-LAGAN" options unchecked.
- If you know the phylogenetic tree relating the species you are submitting, enter it at "Pairwise phylogenetic tree for the sequences", otherwise LAGAN will calculate the tree automatically.
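If you are unsure which masking option applies to a sequence, a quick heuristic check can help. The sketch below only looks at letter case and the presence of "N"; it is an assumption-laden guess, not the logic used by mVISTA or RepeatMasker. In particular, an "N" may be an ambiguous base rather than a masked repeat, and an all-lowercase sequence is treated here as unmasked. The function names are hypothetical.

```python
def masking_style(sequence):
    """Guess how a DNA sequence is masked, based on case and N content."""
    has_lower = any(c.islower() for c in sequence)
    has_upper = any(c.isupper() for c in sequence)
    if has_lower and has_upper:
        return "softmasked"  # mixed case: lowercase assumed to mark repeats
    if "N" in sequence.upper():
        return "N-masked"    # repeats assumed to be replaced by N
    return "unmasked"


def hard_mask(sequence):
    """Convert a softmasked sequence to an N-masked one (lowercase becomes N)."""
    return "".join("N" if c.islower() else c for c in sequence)


print(masking_style("ACGTacgtACGT"))  # softmasked
print(masking_style("ACGTNNNNACGT"))  # N-masked
print(hard_mask("ACGTacgtACGT"))      # ACGTNNNNACGT
```

A result of "softmasked" corresponds to the "softmasked" menu option and "N-masked" to "one-celled/do not mask"; for "unmasked" sequences the choice depends on whether you want mVISTA to run RepeatMasker.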
Click "Submit" to send the data to the mVISTA server. If the mVISTA server finds problems with the submitted files, you will receive a message stating the type of problem; if not, you will receive a message saying that submission was successful. You will receive email from vista@lbl.gov indicating your personal Web link to the location where you can access the results of your analysis.
mVISTA: retrieval of the results.
Clicking on the link found in the body of the email takes you to the results page. It lists every organism you submitted, and provides you with three viewing options using each organism as base. These three options are:
- "VISTA-Point", which provides all the detailed information: sequences, alignments, conserved sequence statistics and other results for sequence comparisons. This is where you retrieve the coordinates of conserved regions.
- "Vista Browser", an interactive visualization tool that can be used to dynamically browse the resulting alignments and view a graphical presentation of conservation.
- "PDF", a static VISTA plot of all pair-wise alignments; it is not relevant to multiple comparisons.
It is important to note that mVISTA shows the results of all pair-wise comparisons between one species chosen as the base (reference) sequence and all other submitted sequences.
mVISTA: results in VISTA Point
The VISTA Point brings you to the results of your analysis in text format.
You can read a detailed description of this tool: http://pipeline.lbl.gov/vista_help/help.html#vistapoint
mVISTA: results in VISTA Browser
Vista Browser is an interactive Java applet designed to visualize pairwise and multiple alignments using the mVISTA (default) and RankVISTA scoring schemes, and to identify regions of high conservation across multiple species. Clicking on the "Vista Browser" link will launch the applet with the corresponding organism selected as base. The VISTA Browser main page displays all pairwise comparisons between the base sequence and all other submitted sequences using the mVISTA scoring scheme, which measures conservation based on the number of identical nucleotides (% conservation) in a 100bp window. Multiple pair-wise alignments sharing the same base sequence can be displayed simultaneously, one under another. The plots are numbered, so that you can identify each plot in the list underneath the VISTA panel. The many additional features of the browser are described in detail in the online help pages, accessed by clicking on the "Help" button.
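The idea behind the mVISTA scoring scheme, percent identity in a sliding window, can be illustrated with a short sketch. This is a simplification and not the VISTA implementation: it slides the window over alignment columns rather than base-sequence coordinates, counts a column as a match only when both sequences have the same non-gap character, and uses a hypothetical function name.

```python
def window_identity(aligned_base, aligned_other, window=100):
    """Percent identity for every full window of alignment columns.

    Both inputs are rows of a pairwise alignment of equal length, with '-'
    marking gaps. Returns one score per window start position.
    """
    if len(aligned_base) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")
    # 1 for a column where both rows carry the same non-gap character.
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_base.upper(), aligned_other.upper())
    ]
    if len(matches) < window:
        return []
    count = sum(matches[:window])
    scores = [100.0 * count / window]
    # Slide the window one column at a time, updating the running count.
    for i in range(window, len(matches)):
        count += matches[i] - matches[i - window]
        scores.append(100.0 * count / window)
    return scores


# Tiny example with a 4-column window instead of 100 bp.
print(window_identity("ACGTACGT", "ACGTTCGT", window=4))
# [100.0, 75.0, 75.0, 75.0, 75.0]
```

Plotting these scores against position gives a curve of the same general kind as the ones shown in the browser.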
By clicking on the "Alignment details" button, you can quickly shift to the VISTA Point and retrieve the coordinates of conserved sequences. To print the plot, click the "Print" button. The first time you do this, you will get a dialog box to confirm that you indeed requested that something be sent to the printer. This is a security measure in Java intended to handicap malicious code. Click "yes". A standard printing dialog box will appear. Proceed as you would with any other printing job.
To save the plot, click the "save as" button. In the menu that will appear, select the file type you want, adjust parameters such as image width if desired, and press "ok". If you have pop-up blocking software such as the Google toolbar or a later version of the IE browser, you may need to hold down the CTRL key while clicking the OK button.
II. wgVISTA (Whole-Genome VISTA)
wgVISTA is a set of programs for comparing DNA sequences up to 10 megabases long from two species and visualizing these alignments with annotation information. Although a different algorithm supports this tool, the submission page is very similar to mVISTA.
Click on the wgVISTA link at http://genome.lbl.gov/vista/.
We suggest using the following examples to explore wgVISTA.
- 2 strains of Bifidobacterium: NC_012814 / CP001606 (Browse the precomputed alignment)
- 2 strains of Bartonella: NC_008783 / NC_005955 (Browse the precomputed alignment)
III. gVISTA - compare your sequences against whole-genome assemblies
Exercise.
Align this mouse sequence (NM_011037, Mus musculus Pax2 mRNA) with the human assembly of March 2006. With which area of the human genome does this sequence align and with how many exons does this sequence align? Are there conserved non-coding regions?
Step-by-Step
- Click gVISTA link at http://genome.lbl.gov/vista/.
- On the "gVISTA Submit" page, type in the GenBank identifier "NM_011037" in the "GenBank Identifier" text box and choose the base genome "Human Mar. 2006". Click "Submit".
- Click on the link given in the next window. This will load the results when they are ready.
- There will be several results; find the longest one (4947bp) and click the "VISTA Browser" link.
Note: the Java applet may take a little while to load and may open in a new window. You will see that the sequence aligns with the PAX2 gene region of the human genome. You will note that there are large regions of non-coding sequence that are highly conserved (colored areas, such as the red intron and light blue UTR regions).
- Your view will probably not show the entire PAX2 gene. Click the Zoom Out button (magnifying glass with "-") until you see the entire gene region. You will notice this gene has 11 exons.
- Take a closer look at the 3'UTR region by clicking to the left of the last exon end, holding down (or right-clicking) and dragging the mouse over the 3'UTR region and releasing. Compare conserved regions across species (colored peaks). Are their sequences conserved across mammals and not chicken? Note that the default view should have several mammalian species and chicken (if not, add them in the "organisms" pull-down menu to the left). Most of the 3'UTR is conserved across mammals (except the Horse), but not chicken.
Section 3.
Additional exercises on detailed comparative analysis
Exercise 1.
How many putative MEF2 TF binding-sites does rVISTA identify in the region between the first and second exon of TNNC1 (skeletal muscle troponin gene)? How many are aligned / conserved with mouse?
Answer: rVISTA has identified: 13 MEF2 potential binding sites (blue bars above alignment, MEF2_all). 8 aligned MEF2 (red, MEF2_aligned), conserved MEF2 (green, MEF2_conserved).
Step-by-Step
- Click on the VISTA Browser link at http://genome.lbl.gov/vista.
- Make these choices: Clade=vertebrate; genome=human, release=March 2006. This will be the "base" genome that other species' genomes will be compared to.
- Type in the gene symbol "tnnc1" into the position field. Ensure that the radio button for the VISTA Browser (requires Java2) is selected. Click the "Submit" button to submit the query.
- When the page loads, check the area below the image frame on the right to see the species list section with the up/down buttons. We will want to change this to look at only mouse. Delete all the currently loaded species (should be 6) except for Mouse Jul.2007. You do this by selecting the species name in the lower right panel, and then clicking the "-" button at the top left of the window. You will now see the entire region expands into three lines.
- Select the region between the first and second exon of tnnc1. Exons can be identified as the blue boxes on the annotation line. We have to select the area between the bottom right and the next exon. Click near to the end of the first exon (bottom row) and drag the cursor to the beginning of the second exon (second row). Release the mouse button and wait until the browser reloads. (alternatively, you could have simply typed the location: chr3:52,461,570-52,463,047 into the position box on the left)
- Note: Leave this window open, you will need it for exercise 2.
- On the newly loaded page, click on the human/mouse alignment diagram to select it, and click the "i" button in the icon bar, or choose View > Details from the menu near the top of the window. This will take you to a new window and VISTA-Point (which might load in a previously opened browser window)
- You are now in VISTA-Point that has 3 sections, a navigation section, a graphic table and an alignment table. You might notice that the graphic table includes species in addition to mouse. The section we will be viewing now is the alignment table at the bottom of the window. Here you can obtain sequence and view tools.
- Take a quick look at the panel that is labeled "Location on Mouse Jul. 2007 July 2007". Click the "Sequence (softmasked)" link. You will find here the aligned mouse sequence in FASTA format. Now click the browser back button to return to VISTA-Point.
- Start the rVISTA analysis, by choosing "Human Mar. 2006- Mouse Jul. 2007" in the rVISTA pulldown menu in the tools section at the bottom alignments table.
- Type in your email address and click the "Submit Query" button.
- On the subsequent page, select the TRANSFAC matrices radio button, and choose the vertebrates checkbox. Click "Submit".
- Select the MEF2 factor matrix checkbox. You may have to scroll a bit to find it. Click "Submit" at the bottom of the form, not in the left navigation area.
- You will receive an email containing the rVISTA results. This email contains a link that opens the results page of your analysis. The mail usually arrives in a few minutes, but that may vary. Click that link to load your results.
- Select MEF2 in the result page. Click "Submit".
- Choose the "length of the sequence in one row": 0.1 kbp
- In the "Binding sites to visualize" column click on the "conserved", "aligned" and "all" checkboxes. Click "Submit".
- A new page will load, with bars to identify possible MEF2 binding sites as tick marks along the top, and the human/mouse alignment diagrammed below. Count the number of potential transcription factor sites identified.
Exercise 2.
Determine if there is variation in a highly conserved (>90%) region in the intron studied in Exercise 1 using the VISTA tracks in UCSC Genome Browser.
Step-by-Step
- Complete up to and including step 6 of Exercise 1.
- In the VISTA Browser window, click the "BROWSERS" button in the top navigation bar.
- This will open up the VISTA annotation tracks within the UCSC Genome browser (this may open behind open browser windows). This allows you to view this data in a broader genomic context. Here we will want to look at the VISTA track for Mouse compared to Human, the gene annotation and SNPs.
- To simplify the view, scroll to under the graphic display and click the "hide all" button. This will hide all other UCSC annotation tracks.
- In the VISTA Tracks control section of the browser, perform the following actions:
- Change the "mode" menus for all species except mouse to "hide". These menus are below the graphical view and there is one for each "annotation" track.
- Change the Mouse (track 80) "conservation" menu to "90" and then click any refresh button
- The resulting view will have the VISTA Mouse track with the regions conserved with the human genome greater than 90% highlighted in red. To view this with gene annotation and SNPs, perform the following actions: In the "Genes and Gene Prediction Tracks" section, change the "UCSC Genes" menu to "full".
- In the Variation and Repeats section, change the "SNPs (130)" menu to "pack". Click "refresh".
- Viewing the resulting image, you will see there is one SNP in this conserved region (rs72965257) and a total of 6 in the region viewed (depending on how much you selected in earlier exercise steps). Click on the image for that SNP to be taken to the details about it and for further analysis. You can view this VISTA data in context with many other annotation tracks at the UCSC Genome Browser.
