If a human being had actually looked at his blood, anywhere along the way, instead of just running tests through the computer… parasites would have jumped right out at them.

“Failure to Communicate.” House M.D.

There is a case in the TV show House M.D. in which a patient is infected by a parasite; Dr. House’s team runs many tests and finds no clue. In the end, Dr. House tells the team to look at the blood sample under a microscope instead of relying on numbers from instruments. The team does so and instantly sees the parasite. Even though the medical details may not be very accurate, the episode left a profound impression on me. The same situation can arise in bioinformatics, where we run many tests to identify targets of interest or to evaluate our confidence in conclusions. However, we should be aware that these tests all rest on assumptions, and it’s always beneficial to check the data visually to get an intuitive sense of whether they fit those assumptions. Integrative Genomics Viewer (IGV) is a powerful tool for visualizing sequencing data. In this post, I’ll share some of my tricks for making IGV even more useful.

Case 1: There are a bunch of loci that need to be checked visually

If you need to visually check many regions of interest in IGV, you can use a batch script to automate the process. The IGV batch script language allows you to generate a script file that tells IGV which regions to display and where to save the output. Instead of learning the details about this minimal language yourself, you can directly use bedtools to create a batch script for a list of loci in a bed file. In the following example, genomic loci in loci.bed will first be extended 200 bp both upstream and downstream, then a batch script covering these regions will be saved to the file batch.script.

# snapshots will be saved to `path_to_store_snapshots`
# -slop sets the number of flanking base pairs added to both
# the left and right of each region of interest in the
# captured images (0 keeps the regions as they are)
# I recommend setting the output image format to svg or eps,
# because the resolution of png snapshots is low
bedtools igv -path path_to_store_snapshots \
-i loci.bed -slop 200 \
-img svg > batch.script

After loading the tracks you want to see in IGV, click Tools>Run Batch Script... and load the batch.script file; IGV will start capturing snapshots of each locus. If the process is slow, you can split the BED file, generate a batch script for each subset, and load them into separate instances of IGV.
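To make it clearer what such a batch script contains, here is a minimal sketch that builds a similar file with plain shell and awk. The function name `bed_to_igv_batch`, the snapshot file naming, and the conversion from 0-based BED starts to 1-based IGV coordinates are my own assumptions for illustration; the exact output of bedtools igv may differ. Only the `snapshotDirectory`, `goto`, and `snapshot` commands come from the IGV batch language.

```shell
# Hypothetical helper: print an IGV batch script for every locus in a BED file.
# Usage: bed_to_igv_batch BED SNAPSHOT_DIR [SLOP] [FORMAT]
bed_to_igv_batch() {
    igv_bed=$1
    igv_outdir=$2
    igv_slop=${3:-0}      # flanking base pairs added on both sides
    igv_fmt=${4:-svg}     # snapshot file extension (svg, eps, or png)

    # Tell IGV where to write the snapshots
    printf 'snapshotDirectory %s\n' "$igv_outdir"

    awk -v slop="$igv_slop" -v fmt="$igv_fmt" '
        /^(#|track|browser)/ { next }          # skip header lines
        NF >= 3 {
            start = $2 + 1 - slop              # assumed 0-based to 1-based shift
            if (start < 1) start = 1           # never run off the chromosome start
            end = $3 + slop
            printf "goto %s:%d-%d\n", $1, start, end
            printf "snapshot %s_%d_%d.%s\n", $1, start, end, fmt
        }' "$igv_bed"
}

# Example (same settings as the bedtools command above):
#   bed_to_igv_batch loci.bed path_to_store_snapshots 200 svg > batch.script
```

If you want to split the work across several IGV instances, something like `split -l 500 loci.bed loci_part_` produces the smaller BED files to feed into this function or into bedtools igv; the chunk size of 500 is arbitrary.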

Case 2: Frequently used annotations are not listed in the default server

The IGV team maintains a fabulous web server with some commonly used annotations (like gene annotations from the GENCODE project) and datasets (like ChIP-seq alignments from the ENCODE project); by simply selecting the annotations of interest from File>Load from Server..., you can load them into your current session. One small pitfall with this function is that the annotations or datasets are not always up to date, and you may also want some frequently used or customized files of your own listed there. In this case, you should consider setting up your own data server for IGV.

Step 1: Copy precompiled data files from IGV

You can get a copy of all genome files that IGV is currently using from their GitHub repo:

git clone https://github.com/igvteam/igv.git
cd igv
# the IGV team removed the genome files in commit 218f873,
# so we need to check out the commit just before the deletion,
# which is commit beb4f48
git checkout beb4f48e04

After checkout, you can copy the entire igv/genomes folder to a new place (assuming it’s /nas1/references) and set up your data server.

Step 2: Install and configure a web server

If you already have a web server, you can move on to step 3. Mac users can install Nginx with Homebrew:

brew install nginx

By default, the configuration file for Nginx (installed by Homebrew) is located at /usr/local/etc/nginx/nginx.conf. In the http section, add a new server configuration as follows:

server {
    listen 80;
    listen [::]:80;
    # if you have a domain, replace `ref.yaobio.com` with your own domain
    # if you don't have one, and only want to access the web server locally
    # you can replace it with localhost
    server_name ref.yaobio.com;

    location / {
        # replace this path with where you put the data files from IGV
        root /nas1/references;
        index index.html index.htm;
    }
}

Reload the configuration to make the changes take effect:

nginx -s reload

Create a new directory (annotations) in /nas1/references. Now the structure of this folder is something like

  • references
    • db
      • 1kg_ref
      • hg19
      • hg38
      • mm10
    • sizes
      • 1kg_ref.chrom.sizes
      • hg19.chrom.sizes
      • hg38.chrom.sizes
      • mm10.chrom.sizes
    • annotations
    • genomes.tab
    • genomes.txt
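Before pointing IGV at the server, it can help to confirm that the folder really has this layout. The following is a small sketch of such a check; the function name `check_igv_data_root` is hypothetical, and the list of expected entries is simply taken from the tree above.

```shell
# Hypothetical helper: verify that a data root contains the entries listed above.
# Prints each missing entry to stderr and returns non-zero if any is absent.
check_igv_data_root() {
    data_root=$1
    missing=0
    for entry in db sizes annotations genomes.tab genomes.txt; do
        if [ ! -e "$data_root/$entry" ]; then
            printf 'missing: %s/%s\n' "$data_root" "$entry" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   mkdir -p /nas1/references/annotations
#   check_igv_data_root /nas1/references && echo "layout looks fine"
```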

Step 3: Save new annotations and modify data files

Let’s assume you have a new annotation file processed (e.g., processed GENCODE v35 for hg38 with the pipeline we mentioned in the previous post); now, you can move the file to /nas1/references/annotations. Then you need to modify the default genome and data registry:

  1. Change the content of db/hg38/hg38_dataServerRegistry.txt from https://s3.amazonaws.com/igv.org.genomes/hg38/hg38_annotations.xml to http://ref.yaobio.com/db/hg38/hg38_annotations.xml (plain http, because the server configured above only listens on port 80)

  2. Add the following item to db/hg38/hg38_annotations.xml:

    <Resource name="Gencode V35" path="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz" index="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz.tbi" hyperlink="http://www.gencodegenes.org/"/>
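If you add annotations regularly, typing these entries by hand gets error-prone. As a rough sketch, a small shell function can print the entry for you; the name `make_resource_entry` is hypothetical, and it assumes that the index sits next to the data file with a `.tbi` suffix, as in the example above.

```shell
# Hypothetical helper: print a <Resource> entry for an annotation file.
# Usage: make_resource_entry NAME BASE_URL FILE HYPERLINK
make_resource_entry() {
    res_name=$1     # label shown in IGV
    res_base=$2     # URL of the annotations folder, without a trailing slash
    res_file=$3     # bgzip-compressed annotation file
    res_link=$4     # page describing the resource
    # Assumption: the tabix index is served as FILE.tbi from the same folder
    printf '<Resource name="%s" path="%s/%s" index="%s/%s.tbi" hyperlink="%s"/>\n' \
        "$res_name" "$res_base" "$res_file" "$res_base" "$res_file" "$res_link"
}

# Example:
#   make_resource_entry "Gencode V35" http://ref.yaobio.com/annotations \
#       gencode.v35.annotation.sorted.gtf.gz http://www.gencodegenes.org/
```

The printed line still has to be pasted into db/hg38/hg38_annotations.xml by hand, inside the existing root element.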

Step 4: Change the data server setting in IGV

Now in IGV, click View>Preferences>Advanced, and replace the previous value in Data registry url with http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt. Finally, save the changes, and restart IGV; you should be able to see and load newly added annotations into IGV.
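The `$$` in the registry URL is a placeholder that IGV replaces with the ID of the current genome. If the new annotations do not show up, a quick sanity check is to expand the URL yourself and see whether the file is reachable. Here is a minimal sketch; the function name `expand_registry_url` is hypothetical.

```shell
# Hypothetical helper: replace every "$$" in a registry URL template
# with a genome ID, mimicking the substitution IGV performs.
# Usage: expand_registry_url TEMPLATE GENOME_ID
expand_registry_url() {
    printf '%s\n' "$1" | awk -v id="$2" '{ gsub(/\$\$/, id); print }'
}

# Example (quote the template so the shell leaves "$$" untouched):
#   url=$(expand_registry_url 'http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt' hg38)
#   curl -fsS "$url"    # should print the URL of hg38_annotations.xml
```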

General tips

Always load bed files with indices

Always create indices for BED files before loading them into IGV. Otherwise, IGV reads every interval into memory and builds an index on the fly, consuming excessive memory and time. You can use tabix to index interval files beforehand, which greatly reduces both memory usage and loading time. For example, take a BED file test_file_1.bed.gz with 18M records: after loading this file into IGV without an index, IGV takes more than 25GB of memory!

But if you use tabix test_file_1.bed.gz to generate the index first, and then feed IGV with the same file, it only takes 2GB!
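Keep in mind that tabix only works on files that are sorted by position and compressed with bgzip (not plain gzip). If your BED file is not prepared that way yet, the steps can be sketched as below. The function names `sort_bed` and `index_bed` are hypothetical, and the sketch assumes an uncompressed BED file without header lines.

```shell
# Hypothetical helper: sort a BED file by chromosome, then by start position.
sort_bed() {
    LC_ALL=C sort -k1,1 -k2,2n "$1"
}

# Hypothetical helper: sort, bgzip-compress, and tabix-index a BED file.
# Writes FILE.gz and FILE.gz.tbi next to the input; needs bgzip and tabix.
index_bed() {
    sort_bed "$1" | bgzip -c > "$1.gz" && tabix -p bed "$1.gz"
}

# Example:
#   index_bed test_file_1.bed    # creates test_file_1.bed.gz and test_file_1.bed.gz.tbi
```

Load the resulting .bed.gz file into IGV and keep the .tbi file in the same folder so that IGV can find it.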

Conclusion

IGV is a valuable tool for visualizing sequencing data, and these tips and tricks can help you make the most of its capabilities. By using batch scripting, setting up a personal data server, and optimizing bed file loading with indices, you can streamline your bioinformatics workflows and gain more insights from your data.