Human

Currently, there are two widely used releases: GRCh38 (hg38) and GRCh37 (hg19).

GRCh38 (hg38)

Sequences

File name: GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz (MD5 checksum: a08035b6a6e31780e96a34008ff21bd6)
Local path: /References/Sequences/human/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz
Remote backup: OSF
Description: This file contains sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly (PA);
2. mitochondrial genome from the GRCh38 non-nuclear assembly;
3. unlocalized scaffolds from PA;
4. unplaced scaffolds from PA;
5. Epstein-Barr virus (EBV) sequence.

Recipe:

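wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
# rename to match the file name listed above
mv GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz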
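
Every file on this page is listed with an MD5 checksum, so a download can be checked before it is used. A minimal sketch, assuming GNU coreutils `md5sum` is installed (macOS ships `md5` instead); `verify_md5` is a hypothetical helper name and not part of the original recipe:

```shell
# verify_md5 FILE EXPECTED: succeed only if the MD5 digest of FILE equals EXPECTED.
# Hypothetical helper; assumes GNU coreutils md5sum.
verify_md5() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual, expected $2)" >&2
        return 1
    fi
}

# Usage with the file name and checksum listed above:
# verify_md5 GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz a08035b6a6e31780e96a34008ff21bd6
```

The function returns a non-zero status on mismatch, so it can gate later steps in a script with `&&`.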

Annotations

Gene annotations

There are three major releases of gene annotations for Homo sapiens:

GENCODE/Ensembl annotation: The GENCODE annotation is built from the Ensembl annotation, so gene annotations are the same in both releases. The only exception is that genes common to the pseudoautosomal regions (PAR) of human chromosomes X and Y appear twice in the GENCODE GTF, while the Ensembl file lists them only for chromosome X. Gene and transcript IDs are the same in both releases except for annotations in the PAR regions. Compared with other annotations, GENCODE provides higher coverage of non-coding regions.
RefSeq Gene (RefGene): Annotations for well-characterized genes (mostly protein-coding genes). Projects like Gene Ontology, KEGG, and MSigDB (Molecular Signatures Database, the gene sets for GSEA) use this annotation for gene identifiers, so RefGene may be the preferred annotation if you want to do enrichment analysis with GO/KEGG/GSEA.
UCSC Known genes: Automatically generated annotations (based on protein sequences from Swiss-Prot), mostly for protein-coding genes.
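
One way to see how the annotations differ in non-coding coverage is to tally gene records by biotype. A rough sketch, assuming a GENCODE-style GTF whose attribute column contains a `gene_type "...";` entry (Ensembl GTFs name this attribute `gene_biotype`); `count_gene_types` is a hypothetical helper name:

```shell
# count_gene_types: read an uncompressed GTF on stdin and print
# "<gene_type><TAB><number of gene records>", most frequent first.
count_gene_types() {
    awk -F'\t' '$3 == "gene" {
        # Pull the value of gene_type out of the attribute column (column 9).
        if (match($9, /gene_type "[^"]+"/)) {
            type = substr($9, RSTART + 11, RLENGTH - 12)
            n[type]++
        }
    }
    END { for (t in n) print t "\t" n[t] }' | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Usage:
# zcat gencode.v24.annotation.gtf.gz | count_gene_types
```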

GENCODE

File name: gencode.v24.annotation.gtf.gz (MD5 checksum: 17395005bb4471605db62042b992893e)
Local path: /References/Annotations/human/hg38/gencode.v24.annotation.gtf.gz
Remote backup: OSF
Description: GENCODE comprehensive annotation release 24. Downloaded from GENCODE’s website.

Recipe:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz

File name: gencode.v24.segmented.tssup1kb.bed.gz (MD5 checksum: 972a57431c6209667d5aac41bbb01ebd)
Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssup1kb.bed.gz
Remote backup: OSF
Description: Genomic segmentations (promoter, 5_UTR, exon, intron, 3_UTR, and intergenic region) based on GENCODE v24. Promoters are defined as the 1 kb upstream of each TSS (per transcript).

Recipe:

# promoters for protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_p.bed
# promoters for non-protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_np.bed
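
The strand handling in `bedtools flank -l 1000 -r 0 -s` is the part of this recipe that is easiest to misread: with `-s`, "left" means upstream, so the window sits before the start of a plus-strand transcript and after the end of a minus-strand one. The same arithmetic as a plain awk sketch, for illustration only: `upstream_flank` is a hypothetical name, and unlike bedtools it takes no genome file, so it clamps at 0 but not at the chromosome end.

```shell
# upstream_flank SIZE: read BED6 (tab-separated) on stdin and print the
# SIZE-bp window immediately upstream of each feature, keeping other columns.
upstream_flank() {
    awk -v n="$1" 'BEGIN { FS = OFS = "\t" }
    {
        if ($6 == "-") { s = $3; e = $3 + n }                     # minus strand: upstream lies past the end
        else           { s = $2 - n; if (s < 0) s = 0; e = $2 }  # plus strand: upstream lies before the start
        $2 = s; $3 = e
        print
    }'
}

# Toy input: one transcript per strand, both spanning 5000-6000.
printf 'chr1\t5000\t6000\ttxA\tpromoter\t+\nchr1\t5000\t6000\ttxB\tpromoter\t-\n' | upstream_flank 1000
# chr1  4000  5000  txA  promoter  +
# chr1  6000  7000  txB  promoter  -
```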

File name: gencode.v24.segmented.tssflanking500b.bed.gz (MD5 checksum: 2e624c3bc2330beb81464558ead1a11e)
Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssflanking500b.bed.gz
Remote backup: OSF
Description: Genomic segmentations based on GENCODE v24. Promoters are defined as TSS ± 500 bp (per transcript).

Recipe:

# promoters for protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
    bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > promoters_500bp.bed
# promoters for non-protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
    bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > np_promoters_500bp.bed
# intergenic
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="gene" {print $1,$4-1,$5,$10,$16,$7}' | \
    tr -d '";' | \
    bedtools slop -i - -g hg38.genome -l 500 -r 0 -s | \
    sortBed -g ../hg38.genome | \
    bedtools complement -i stdin -g ../hg38.genome | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,".","intergenic",".",$2,$3,"141,160,203"}' > intergenic_500bp.bed
# exons
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$18,"exon",$7}' | \
    tr -d '";' | \
    sortBed -g ../hg38.genome | \
    mergeBed -i - -c 4,5,6 -o distinct,distinct,distinct -s | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"231,138,195"}' > exons.bed
# introns
zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5,$16,"intron",$7}' | \
    tr -d '";' | \
    sortBed -g ../hg38.genome | \
    subtractBed -a stdin -b exons.bed | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"255,217,47"}' > introns.bed
# UTR, perl script from https://davetang.org/muse/2012/09/12/gencode/
get_35_utr.pl gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($5=="3_UTR"){print $1,$2,$3,$4,$5,$6,$2,$3,"166,216,84"}else{print $1,$2,$3,$4,$5,$6,$2,$3,"252,141,98"}}' > utr.bed

cat intergenic_500bp.bed promoters_500bp.bed np_promoters_500bp.bed utr.bed introns.bed exons.bed | sort -k1,1 -k2,2n | bgzip > gencode.v24.segmented.tssflanking500b.bed.gz
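
A quick sanity check on the merged file is to total the intervals and bases assigned to each category in column 5. A minimal sketch, assuming the nine-column layout written by the recipe above; `bases_per_category` is a hypothetical helper. Because categories can overlap (for example promoters and exons), the totals are not expected to sum to the genome size.

```shell
# bases_per_category: read a tab-separated BED on stdin and print
# "<category><TAB><number of intervals><TAB><total bases>" per value of column 5.
bases_per_category() {
    awk 'BEGIN { FS = "\t" }
    { bp[$5] += $3 - $2; n[$5]++ }
    END { for (c in bp) printf "%s\t%d\t%.0f\n", c, n[c], bp[c] }' | sort
}

# Usage:
# zcat gencode.v24.segmented.tssflanking500b.bed.gz | bases_per_category
```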

RefGene

File name: refseq.ver109.20190125.annotation.gtf.gz (MD5 checksum: 848813de5b516e0f328046ef9c931091)
Local path: /References/Annotations/human/hg38/refseq.ver109.20190125.annotation.gtf.gz
Remote backup: OSF
Description: RefSeq annotation in GTF format that has been remapped to use the same set of UCSC-style sequence identifiers used in the FASTA files. The annotation is NCBI Homo sapiens Updated Annotation Release 109.20190125 from 25 January 2019.

Recipe:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz refseq.ver109.20190125.annotation.gtf.gz
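
Since the downloaded GTF is named after the full analysis set while the FASTA above is the no-alt analysis set, it may be worth confirming that every sequence name in the annotation exists in the reference. A sketch, assuming a two-column `.genome` file like the one described in the Note section; `missing_contigs` is a hypothetical helper:

```shell
# missing_contigs GENOME GTF: print each sequence name used in GTF
# (pass "-" to read the GTF from stdin) that is absent from column 1 of GENOME.
# Assumes GENOME is non-empty.
missing_contigs() {
    awk -F'\t' 'NR == FNR { known[$1] = 1; next }
    /^#/ { next }
    !($1 in known) && !seen[$1]++ { print $1 }' "$1" "$2"
}

# Usage:
# zcat refseq.ver109.20190125.annotation.gtf.gz | missing_contigs hg38.genome -
```

An empty result means all annotated sequences are present; any names printed would need to be filtered out or handled before running tools that require matching sequence names.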

Other annotations

Repeat Masker

File name: rmsk.bed.gz (MD5 checksum: ae12aefbef9d4f5bc7695158a67d9a55)
Local path: /References/Annotations/human/hg38/rmsk.bed.gz
Remote backup: OSF
Description: Repeat Masker from UCSC. The following fields were selected (listed in output column order):
- genoName (Genomic sequence name)
- genoStart (Start in genomic sequence)
- genoEnd (End in genomic sequence)
- repName (Name of repeat)
- repFamily (Family of repeat)
- strand (Relative orientation + or -).

Recipe:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
gunzip rmsk.txt.gz
# columns: genoName, genoStart, genoEnd, repName, repFamily, strand
gawk 'BEGIN{OFS="\t"} {print $6,$7,$8,$11,$13,$10}' rmsk.txt | \
    sort -k1,1 -k2,2n | \
    bgzip > rmsk.bed.gz

Generic

Sequences

Primary assembly:
rRNA: Human ribosomal DNA complete repeating unit, GenBank accession: U13369.1.

Annotations

Motif databases (MEME)

File name: motif_databases.12.19.tgz (MD5 checksum: f5ffcaecc07570ee19dba20b82d7bd73)
Local path: /References/Annotations/human/generic/motif_databases.12.19.tgz
Remote backup: OSF
Description: Motif databases for MEME suite (updated 28 Oct 2019).

Recipe:

wget http://alternate.meme-suite.org/meme-software/Databases/motifs/motif_databases.12.19.tgz

Note

For all FASTA files, three standard companion files are also generated:
- .fai: Index that allows fast random access to any sequence in the indexed FASTA file. It is generated with the following command:
  samtools faidx input.fa
- .genome: Table with two columns giving the name and length of each chromosome.
  cut -f1,2 input.fa.fai > size.genome
- .dict: Sequence dictionary, generated with Picard:
  java -jar picard.jar CreateSequenceDictionary \
  R=input.fa \
  O=input.dict
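
To make the content of the .genome table concrete, the sketch below computes the same two columns (sequence name and length) directly from a FASTA file with awk. It is an illustration, not a replacement for `samtools faidx`; `fasta_lengths` is a hypothetical helper and it assumes plain (uncompressed) FASTA on stdin.

```shell
# fasta_lengths: read FASTA on stdin and print "<name><TAB><length>" per sequence.
fasta_lengths() {
    awk '/^>/ {
        if (name != "") print name "\t" len
        name = substr($1, 2)   # first word of the header, without ">"
        len = 0
        next
    }
    { len += length($0) }
    END { if (name != "") print name "\t" len }'
}

# Toy input with two short sequences.
printf '>chrA demo\nACGT\nAC\n>chrB\nGGG\n' | fasta_lengths
# chrA  6
# chrB  3
```
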
There are two types of promoters in both gencode.v24.segmented.tssup1kb.bed and gencode.v24.segmented.tssflanking500b.bed:
- Promoters for protein-coding genes (denoted as promoter in these files)
- Promoters for non-protein-coding genes (denoted as promoter(NP))

Collection of commonly used references in bioinformatic analysis

2020-04-22