If a human being had actually looked at his blood, anywhere along the way, instead of just running tests through the computer... parasites would have jumped right out at them.

"Failure to Communicate." House M.D.

In one episode of the TV show House M.D., a patient is infected by a parasite; Dr. House's team runs many tests and finds no clue. In the end, Dr. House tells the team to look at the blood sample under a microscope instead of at numbers from instruments. The team does so and instantly sees the parasite. Even though the medical details may not be very accurate, the scene left a profound impression on me. The same situation can arise in bioinformatics, where we run many tests to identify targets of interest or to evaluate our confidence in conclusions. However, we should be aware that these tests all rest on assumptions, and it's always worthwhile to check the data visually to get an intuitive sense of whether the data fit those assumptions. Integrative Genomics Viewer (IGV) is a popular choice for visualizing sequencing data. In this post, I'll share some of my tricks for making IGV even more powerful.

Case 1: There are a bunch of loci that need to be checked visually

One common case is that you have a list of loci that you want to look at, but checking them one by one in IGV is very inconvenient. The authors of IGV implemented a minimal language (called batch script) that lets you write a script file telling IGV which regions to visit (goto), when to capture them (snapshot), and where to save the output (snapshotDirectory). If you have bedtools installed on your machine and a list of loci stored in a bed file (loci.bed), you can use the following command to generate a batch script file and then load it in IGV:

# -slop is the number of flanking base pairs added on both the
# left and right of each region of interest in the captured
# images (0 keeps the regions as they are).
# I recommend setting the output image format to svg or eps,
# because the resolution of the png files is quite low.
bedtools igv -path path_to_store_snapshots \
-i loci.bed -slop 200 \
-img svg > batch.script

The output is redirected to batch.script. Then, in the IGV application, after loading all the tracks that you want to see, click Tools>Run Batch Script... and select the file; the snapshots will be saved to path_to_store_snapshots.
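For reference, the batch script is plain text built from the three commands mentioned above. The sketch below writes a similar script with awk, in case bedtools is not available. It is an approximation, not a reimplementation of bedtools igv: the function name make_igv_batch and the snapshot file naming are made up for illustration, and start coordinates are shifted by one on the assumption that the bed file is 0-based.

```shell
# make_igv_batch BED OUT_DIR [SLOP]
# Print an IGV batch script that visits every interval in BED and
# saves one SVG snapshot per interval into OUT_DIR.
# (Hypothetical helper; not part of IGV or bedtools.)
make_igv_batch() {
    bed=$1
    out_dir=$2
    slop=${3:-0}
    awk -v dir="$out_dir" -v slop="$slop" '
        BEGIN { FS = "\t"; print "snapshotDirectory " dir }
        # Skip header/comment lines that bed files may carry
        /^(#|track|browser)/ { next }
        NF >= 3 {
            start = $2 + 1 - slop      # assumed 0-based bed start
            if (start < 1) start = 1   # do not run past the chromosome start
            end = $3 + slop
            printf "goto %s:%d-%d\n", $1, start, end
            printf "snapshot %s_%d_%d.svg\n", $1, $2, $3
        }
    ' "$bed"
}

# Usage:
#   make_igv_batch loci.bed path_to_store_snapshots 200 > batch.script
```

The image format follows from the file extension given to snapshot, so changing .svg to .png in the sketch would produce png files instead.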

Tip: IGV can be quite slow when capturing these snapshots; to speed up the process, you can split loci.bed into multiple files, generate a batch script for each of them, and run the batch scripts in separate IGV instances.
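One way to do the splitting is with the standard split utility. This is a minimal sketch; split_loci and the loci_part_ prefix are hypothetical names, and the bed file is assumed to be non-empty and free of header lines.

```shell
# split_loci BED N_PARTS PREFIX
# Split BED into at most N_PARTS files of roughly equal line count,
# named PREFIXaa, PREFIXab, ... (the default suffixes of split).
# (Hypothetical helper for illustration.)
split_loci() {
    bed=$1
    n_parts=$2
    prefix=$3
    total=$(( $(wc -l < "$bed") ))
    # Ceiling division so that no more than N_PARTS files are created
    per_file=$(( (total + n_parts - 1) / n_parts ))
    split -l "$per_file" "$bed" "$prefix"
}

# Usage, one batch script per chunk (requires bedtools):
#   split_loci loci.bed 4 loci_part_
#   for part in loci_part_??; do
#       bedtools igv -path path_to_store_snapshots \
#           -i "$part" -slop 200 -img svg > "$part.script"
#   done
```

Each resulting .script file can then be run in its own IGV instance through Tools>Run Batch Script....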

Case 2: Frequently used annotations are not listed on the default server

The IGV team maintains a fabulous web server with some commonly used annotations (like gene annotations from the GENCODE project) and datasets (like ChIP-seq alignments from the ENCODE project); by simply selecting an annotation from File>Load from Server..., you can load it into your current session. One small pitfall is that these annotations and datasets are not always up to date, and you may want some frequently used (or customized) files of your own to be listed there as well. In this case, you should consider setting up your own data server for IGV.

Step 1: Copy precompiled data files from IGV

You can get a copy of all the genome files that IGV currently uses from their GitHub repo:

git clone https://github.com/igvteam/igv.git

Step 2: Install and configure a web server

If you already have a web server, you can move on to step 3. Mac users can install Nginx with Homebrew:

brew install nginx

By default, the configuration file for Nginx (installed by Homebrew) is located at /usr/local/etc/nginx/nginx.conf. In the http section, add a new server configuration as follows:

server {
    listen 80;
    listen [::]:80;
    # if you have a domain, replace `ref.yaobio.com` with your own domain
    # if you don't have one, and only want to access the web server locally
    # you can replace it with localhost
    server_name ref.yaobio.com;

    location / {
        # replace this path with where you put the data files from IGV
        root /nas1/references;
        index index.html index.htm;
    }
}

Reload the configuration to apply the changes:

nginx -s reload

Create a new directory named annotations in /nas1/references. The structure of this folder should now look something like this:

  • references
    • db
      • 1kg_ref
      • ...
      • hg19
      • hg38
      • mm10
      • ...
    • sizes
      • 1kg_ref.chrom.sizes
      • ...
      • hg19.chrom.sizes
      • hg38.chrom.sizes
      • mm10.chrom.sizes
      • ...
    • annotations
    • genomes.tab
    • genomes.txt

Step 3: Save new annotations and modify data files

Let's assume you have a new annotation file ready (e.g., GENCODE v35 for hg38, processed with the pipeline we mentioned in the previous post). Move the file, together with its index, to /nas1/references/annotations. Then modify the default genome and data registry:

  1. Change the content of db/hg38/hg38_dataServerRegistry.txt from https://s3.amazonaws.com/igv.org.genomes/hg38/hg38_annotations.xml to http://ref.yaobio.com/db/hg38/hg38_annotations.xml (http rather than https, because the server configured above only listens on port 80)

  2. Add the following item to db/hg38/hg38_annotations.xml:

    <Resource name="Gencode V35" path="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz" index="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz.tbi" hyperlink="http://www.gencodegenes.org/"/>
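The first of these two changes can also be scripted, which helps when several genomes need the same treatment. The sketch below is one possible way to do it, assuming the registry file is a plain list of URLs, one per line; set_registry_url is a hypothetical helper name.

```shell
# set_registry_url REGISTRY_FILE OLD_URL NEW_URL
# Replace the line OLD_URL with NEW_URL in a *_dataServerRegistry.txt file.
# (Hypothetical helper; the registry is treated as a plain list of URLs.)
set_registry_url() {
    registry=$1
    old_url=$2
    new_url=$3
    tmp_file="$registry.tmp"
    # Compare whole lines literally so that dots and slashes in the URLs
    # are not interpreted as pattern characters.
    awk -v old="$old_url" -v new="$new_url" '
        $0 == old { print new; next }
        { print }
    ' "$registry" > "$tmp_file" && mv "$tmp_file" "$registry"
}

# Usage (NEW_URL is the address of hg38_annotations.xml on your own server):
#   set_registry_url db/hg38/hg38_dataServerRegistry.txt \
#       https://s3.amazonaws.com/igv.org.genomes/hg38/hg38_annotations.xml \
#       "$NEW_URL"
```

Lines that do not match the old URL exactly are left untouched, so running the helper twice is harmless.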

Step 4: Change the data server setting in IGV

Now in IGV, click View>Preferences>Advanced and replace the value of Data registry url with http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt. Finally, save the changes and restart IGV; you should now be able to see and load the newly added annotations.
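The $$ in the registry URL is a placeholder that IGV fills in with the ID of the current genome. If the new annotations don't show up, it can help to expand the template by hand and check that the resulting URL is reachable. The helper below only mimics that substitution for debugging; expand_registry_url is a hypothetical name, and the genome ID is assumed to contain no characters that are special to sed (such as / or &).

```shell
# expand_registry_url TEMPLATE GENOME_ID
# Print TEMPLATE with every "$$" replaced by GENOME_ID.
# (Hypothetical helper that mimics the substitution for checking URLs.)
expand_registry_url() {
    printf '%s\n' "$1" | sed 's/\$\$/'"$2"'/g'
}

# Usage (single quotes keep the shell from expanding $$ itself):
#   expand_registry_url 'http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt' hg38
```

Requesting the printed URL with curl -I is then a quick way to confirm that the web server actually serves the registry file.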

General tips

Always load bed files with indexes

If you don't prepare an index for a bed file, IGV will try to read every interval into memory and then build an index itself, which means very high memory consumption and a long wait. So a good practice for visualizing genomic intervals in IGV is to index these interval files with tabix first (tabix expects a position-sorted, bgzip-compressed file) and then load them into IGV. Let's say we have an interval file test_file_1.bed.gz with 18M records; after loading this file into IGV without an index, IGV takes more than 25GB of memory!

But if you use tabix test_file_1.bed.gz to generate the index first, and then feed IGV with the same file, it only takes 2GB!
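If you are starting from an uncompressed bed file, the preparation can be sketched as follows. The function names are made up for illustration, and index_bed assumes that bgzip and tabix (both shipped with htslib) are on your PATH and that the file has no header lines.

```shell
# sort_bed BED
# Print BED sorted by chromosome, then numerically by start position,
# which is the order tabix expects.
sort_bed() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$1"
}

# index_bed BED
# Write BED.gz (bgzip-compressed) and BED.gz.tbi next to the input.
# Requires bgzip and tabix from htslib.
index_bed() {
    sort_bed "$1" | bgzip -c > "$1.gz" && tabix -p bed "$1.gz"
}

# Usage:
#   index_bed test_file_1.bed   # then load test_file_1.bed.gz in IGV
```

Keep the .tbi file in the same directory as the .gz file so that IGV can find it when you load the intervals.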