If a human being had actually looked at his blood, anywhere along the way, instead of just running tests through the computer... parasites would have jumped right out at them.
"Failure to Communicate." House M.D.
There is a case in the TV show House M.D. in which a patient is infected by a parasite; Dr. House's team runs many tests and finds no clue. In the end, Dr. House tells the team to look at the blood sample under a microscope instead of at numbers from instruments. The team does so and instantly sees the parasite. Even though the medical details may not be very accurate, the scene left a profound impression on me. The same situation can arise in bioinformatics, where we run many tests to identify targets of interest or to evaluate our confidence in conclusions. However, these tests all rest on assumptions, and it is always worth checking the data visually to get an intuitive sense of whether the data fit those assumptions. The Integrative Genomics Viewer (IGV) is a popular choice for visualizing sequencing data. In this post, I'll share some of my tricks for making IGV even more powerful.
Case 1: There are a bunch of loci that need to be checked visually
One common case is that you have a list of loci you want to look at, but checking them one by one in IGV is very inconvenient. The authors of IGV implemented a minimal language (called batch script) that lets you write a script file telling IGV which regions to visit (goto), to capture each of them (snapshot), and where to save the output (snapshotDirectory). If you have bedtools installed on your machine, then with a list of loci stored in a bed file (loci.bed), you can use the following command to generate a batch script file and load it in IGV:
-slop indicates the number of flanking base pairs on
The output is redirected to batch.script. Then, in the IGV application, after loading all the tracks you want to see, click Run Batch Script... and the snapshots will be saved to
Tip: IGV can be quite slow when capturing these snapshots; to speed things up, you can split loci.bed into multiple files, generate a batch script for each of them, and run the batch scripts in separate IGV instances.
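To make the batch script format and the splitting tip concrete, here is a minimal Python sketch that writes the three commands mentioned above (snapshotDirectory, goto, snapshot) directly from a bed file. It is a stand-in for illustration, not the bedtools command used in this post: the function names, the batch_N.script file names, and the snapshot naming scheme are all hypothetical, and the slop argument only mimics the idea of adding flanking base pairs.

```python
from pathlib import Path


def read_bed(path):
    """Yield (chrom, start, end) from a BED file, skipping headers and blanks."""
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            yield chrom, int(start), int(end)


def batch_script(loci, snapshot_dir, slop=0):
    """Build the text of an IGV batch script for an iterable of BED intervals."""
    lines = [f"snapshotDirectory {snapshot_dir}"]
    for chrom, start, end in loci:
        # BED starts are 0-based; IGV locus strings are 1-based.
        left = max(1, start + 1 - slop)
        right = end + slop
        lines.append(f"goto {chrom}:{left}-{right}")
        # Hypothetical naming scheme: one PNG per locus.
        lines.append(f"snapshot {chrom}_{left}_{right}.png")
    return "\n".join(lines) + "\n"


def write_batch_scripts(bed_path, out_dir, snapshot_dir, slop=0, n_chunks=1):
    """Write up to n_chunks batch scripts (one per IGV instance); return paths."""
    loci = list(read_bed(bed_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_chunks):
        chunk = loci[i::n_chunks]  # round-robin split keeps chunks balanced
        if not chunk:
            continue
        path = out_dir / f"batch_{i}.script"
        path.write_text(batch_script(chunk, snapshot_dir, slop))
        paths.append(path)
    return paths
```

For example, write_batch_scripts("loci.bed", "scripts", "/path/to/snapshots", slop=200, n_chunks=4) would produce up to four script files, each of which can be run in its own IGV instance.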
Case 2: Frequently used annotations are not listed in the default server
The IGV team maintains a fabulous web server with some commonly used annotations (like gene annotation from the GENCODE project) and datasets (like ChIP-seq alignments from the ENCODE project); by simply selecting an annotation from Load from Server..., you can load it into your current session. One small pitfall with this function is that the annotations and datasets are not always up to date, and you may want some frequently used (or customized) files of your own to be listed there as well. In this case, you should consider setting up your own data server for IGV.
Step 1: Copy precompiled data files from IGV
You can get a copy of all the genome files that IGV currently uses from their GitHub repo:
git clone https://github.com/igvteam/igv.git
Step 2: Install and configure a web server
If you already have a web server, you can skip to Step 3. Mac users can install Nginx with Homebrew:
brew install nginx
By default, the configuration file for Nginx (installed by Homebrew) is located at
/usr/local/etc/nginx/nginx.conf. In the
http section, add a new server configuration as follows:
Reload the configuration for the changes to take effect:
Create a new directory, /nas1/references. Now the structure of this folder is something like
Step 3: Save new annotations and modify data files
Let's assume you have processed a new annotation file (e.g., GENCODE v35 for hg38, processed with the pipeline we mentioned in the previous post). You can now move the file to /nas1/references/annotations. Then you need to modify the default genome and data registry:
Change the content of
Add the following item to
<Resource name="Gencode V35" path="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz" index="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz.tbi" hyperlink="http://www.gencodegenes.org/"/>
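If you add many annotations, typing these entries by hand gets error-prone. The sketch below builds a Resource line like the one above with Python's standard library. The function name resource_entry is hypothetical, and it assumes the tabix index sits next to the data file with a .tbi suffix, as in the example entry.

```python
import xml.etree.ElementTree as ET


def resource_entry(name, url, hyperlink=None):
    """Return a <Resource .../> line for an indexed file served over HTTP."""
    # Assumption: the index lives at the same URL with ".tbi" appended.
    attrs = {"name": name, "path": url, "index": url + ".tbi"}
    if hyperlink:
        attrs["hyperlink"] = hyperlink
    # ElementTree escapes special characters in attribute values for us.
    return ET.tostring(ET.Element("Resource", attrs), encoding="unicode")


if __name__ == "__main__":
    print(resource_entry(
        "Gencode V35",
        "http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz",
        hyperlink="http://www.gencodegenes.org/",
    ))
```

The printed line can then be pasted into the registry file in the same place as the hand-written entry.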
Step 4: Change the data server setting in IGV
Now in IGV, click
Advanced, and replace the previous value in
Data registry url with
http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt. Finally, save the changes and restart IGV; you should now be able to see and load the newly added annotations.
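As far as I understand, IGV replaces each $$ in the registry URL with the ID of the currently selected genome (for example, hg38). If the new annotations don't show up, a quick sanity check is to expand the template yourself and see whether your server actually serves that file. The helper names below are hypothetical, and the substitution rule is my assumption about IGV's behavior rather than something taken from its source code.

```python
import urllib.error
import urllib.request


def registry_url(template, genome_id):
    """Expand the $$ placeholders the way IGV is assumed to: with the genome ID."""
    return template.replace("$$", genome_id)


def is_reachable(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    url = registry_url("http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt", "hg38")
    print(url, "reachable" if is_reachable(url) else "NOT reachable")
```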
Always load bed files with indexes
If you don't prepare an index for a bed file, IGV will try to read every interval into memory and then build an index itself, which consumes a great deal of memory and takes a very long time. So a good practice for visualizing genomic intervals in IGV is to use tabix to index the interval files first and then load them into IGV. Let's say we have an interval bed file, test_file_1.bed.gz, with 18M records; after loading this file into IGV without an index, IGV takes more than 25GB of memory!
But if you use tabix test_file_1.bed.gz to generate the index first and then feed IGV the same file, it takes only 2GB!
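One thing to keep in mind is that tabix expects its input to be compressed with bgzip and sorted by position, with all records of a chromosome stored together. The sketch below is a small pre-flight check for the sorting part only; the function name is_tabix_ready is hypothetical, and the check cannot tell whether the file was compressed with bgzip or with plain gzip.

```python
import gzip


def is_tabix_ready(bed_gz_path):
    """Return True if records are grouped by chromosome and sorted by start."""
    finished = set()  # chromosomes whose block of records has already ended
    current, last_start = None, -1
    with gzip.open(bed_gz_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start = line.split("\t")[:2]
            start = int(start)
            if chrom != current:
                if chrom in finished:
                    return False  # chromosome reappears after another one
                if current is not None:
                    finished.add(current)
                current, last_start = chrom, start
            elif start < last_start:
                return False  # start positions go backwards
            else:
                last_start = start
    return True
```

If the check fails, sort the intervals by chromosome and start position and compress them again before running tabix.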