Human

Currently, two releases are widely used: GRCh38 (hg38) and GRCh37 (hg19).

GRCh38 (hg38)

Sequences

  • File name: GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz (MD5 checksum: a08035b6a6e31780e96a34008ff21bd6)
  • Local path: /References/Sequences/human/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz
  • Remote backup: OSF
  • Description: This file contains sequences for the following:
    1. chromosomes from the GRCh38 Primary Assembly (PA);
    2. mitochondrial genome from the GRCh38 non-nuclear assembly;
    3. unlocalized scaffolds from PA;
    4. unplaced scaffolds from PA;
    5. Epstein-Barr virus (EBV) sequence.
  • Recipe:
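    wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
    mv GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz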
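Each entry lists an MD5 checksum, so a download can be verified before it is used. The following is a minimal sketch, assuming GNU coreutils `md5sum` is installed; the helper name `verify_md5` is hypothetical and not part of the original recipe.

```shell
# verify_md5 FILE EXPECTED_MD5
# Returns 0 when the MD5 digest of FILE matches EXPECTED_MD5, 1 otherwise.
# (Hypothetical helper; assumes GNU coreutils md5sum.)
verify_md5() {
    local file="$1" expected="$2" actual
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual, expected $expected)" >&2
        return 1
    fi
}

# Example, using the checksum listed for the sequence file:
# verify_md5 GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz a08035b6a6e31780e96a34008ff21bd6
```

The same call works for any other file in this collection that lists a checksum.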

Annotations

Gene annotations

There are three major sources of gene annotation for Homo sapiens:

  • GENCODE/Ensembl annotation: The GENCODE annotation is built from the Ensembl annotation, so gene annotations are the same in both releases. The only exception is that genes common to the pseudoautosomal regions (PAR) of chromosomes X and Y appear twice in the GENCODE GTF, whereas the Ensembl file lists them only on chromosome X. Gene and transcript IDs are the same in both releases, except for annotations in the PAR regions. Compared with other annotations, GENCODE provides higher coverage of non-coding regions.
  • RefSeq Gene (RefGene): Annotations for well-characterized genes (mostly protein-coding genes). Projects like Gene Ontology, KEGG and MSigDB (Molecular Signatures Database, gene sets for GSEA) use identifiers from this annotation. RefGene may therefore be the preferred annotation if you want to do enrichment analysis with GO/KEGG/GSEA.
  • UCSC Known genes: Automatically generated annotations (based on protein sequences from Swiss-Prot), mostly for protein-coding genes.
GENCODE
  • File name: gencode.v24.annotation.gtf.gz (MD5 checksum: 17395005bb4471605db62042b992893e)

  • Local path: /References/Annotations/human/hg38/gencode.v24.annotation.gtf.gz

  • Remote backup: OSF

  • Description: GENCODE comprehensive annotation release 24. Downloaded from GENCODE’s website.

  • Recipe:

    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz
  • File name: gencode.v24.segmented.tssup1kb.bed.gz (MD5 checksum: 972a57431c6209667d5aac41bbb01ebd)

  • Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssup1kb.bed.gz

  • Remote backup: OSF

  • Description: Genomic segmentations (promoter, 5_UTR, exon, intron, 3_UTR, and intergenic region) based on GENCODE v24; promoters were defined as the 1 kb upstream of each TSS (per transcript).

  • Recipe:

    # promoters for protein-coding genes
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_p.bed
    # promoters for non-protein-coding genes
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_np.bed
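To make the strand handling in this recipe explicit: `bedtools flank -l 1000 -r 0 -s` emits the 1 kb immediately upstream of each transcript, where upstream means left of the start on the + strand and right of the end on the - strand. The awk sketch below reproduces that arithmetic for illustration only. It assumes BED input with the strand in column 6, and unlike bedtools it clips only at coordinate 0, not at chromosome ends; the function name `upstream_window` is hypothetical.

```shell
# upstream_window WIDTH < in.bed > out.bed
# For each BED record, print the WIDTH bp immediately upstream of the
# feature's 5' end: left of start on "+", right of end on "-".
# (Hypothetical helper; no clipping at chromosome ends.)
upstream_window() {
    awk -v w="$1" 'BEGIN{FS=OFS="\t"}
    {
        if ($6 == "-") { s = $3; e = $3 + w }   # upstream lies at higher coordinates
        else           { s = $2 - w; e = $2 }   # upstream lies at lower coordinates
        if (s < 0) s = 0                        # clip at the chromosome start
        $2 = s; $3 = e
        print
    }'
}
```

For example, a + strand transcript at chr1:5000-6000 yields the window 4000-5000, while the same interval on the - strand yields 6000-7000.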
  • File name: gencode.v24.segmented.tssflanking500b.bed.gz (MD5 checksum: 2e624c3bc2330beb81464558ead1a11e)

  • Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssflanking500b.bed.gz

  • Remote backup: OSF

  • Description: Genomic segmentations based on GENCODE v24; promoters were defined as TSS ± 500 bp (per transcript).

  • Recipe:

    # promoters for protein-coding genes
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
    bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > promoters_500bp.bed
    # promoters for non-protein-coding genes
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
    bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
    bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > np_promoters_500bp.bed
    # intergenic
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t"} $3=="gene" {print $1,$4-1,$5,$10,$16,$7}' | \
    tr -d '";' | \
    bedtools slop -i - -g hg38.genome -l 500 -r 0 -s | \
    sortBed -g ../hg38.genome | \
    bedtools complement -i stdin -g ../hg38.genome | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,".","intergenic",".",$2,$3,"141,160,203"}' > intergenic_500bp.bed
    # exons
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$18,"exon",$7}' | \
    tr -d '";' | \
    sortBed -g ../hg38.genome | \
    mergeBed -i - -c 4,5,6 -o distinct,distinct,distinct -s | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"231,138,195"}' > exons.bed
    # introns
    zcat gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5,$16,"intron",$7}' | \
    tr -d '";' | \
    sortBed -g ../hg38.genome | \
    subtractBed -a stdin -b exons.bed | \
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"255,217,47"}' > introns.bed
    # UTR, perl script from https://davetang.org/muse/2012/09/12/gencode/
    get_35_utr.pl gencode.v24.annotation.gtf.gz | \
    awk 'BEGIN{OFS="\t";FS="\t"}{if ($5=="3_UTR"){print $1,$2,$3,$4,$5,$6,$2,$3,"166,216,84"}else{print $1,$2,$3,$4,$5,$6,$2,$3,"252,141,98"}}' > utr.bed

    cat intergenic_500bp.bed promoters_500bp.bed np_promoters_500bp.bed utr.bed introns.bed exons.bed | sort -k1,1 -k2,2n | bgzip > gencode.v24.segmented.tssflanking500b.bed.gz
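The awk commands in this recipe pick attributes by whitespace position (for example `$18` for the gene name and `$14` for the gene type on transcript lines), which depends on the attribute order of this particular GENCODE release. A more defensive sketch looks attributes up by key instead. The function name `gtf_attr` is hypothetical, and the parsing assumes the usual `key "value";` attribute syntax with no semicolons inside values.

```shell
# gtf_attr FEATURE KEY < annotation.gtf
# Print the value of attribute KEY for every record whose feature type
# (column 3) equals FEATURE. Only the first occurrence of KEY is used.
gtf_attr() {
    awk -v feature="$1" -v key="$2" 'BEGIN{FS="\t"}
    $3 == feature {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            a = attrs[i]
            sub(/^ +/, "", a)                   # trim leading spaces
            if (index(a, key " ") == 1) {       # attribute starts with: key<space>
                v = substr(a, length(key) + 2)
                gsub(/"/, "", v)                # drop the surrounding quotes
                print v
                break
            }
        }
    }'
}

# Example:
# zcat gencode.v24.annotation.gtf.gz | gtf_attr transcript gene_name | head
```

Matching on the key followed by a space keeps `transcript_id` and `transcript_type` apart.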
RefGene
  • File name: refseq.ver109.20190125.annotation.gtf.gz (MD5 checksum: 848813de5b516e0f328046ef9c931091)
  • Local path: /References/Annotations/human/hg38/refseq.ver109.20190125.annotation.gtf.gz
  • Remote backup: OSF
  • Description: RefSeq annotation in GTF format that has been remapped to use the same set of UCSC-style sequence identifiers used in the FASTA files. The annotation is NCBI Homo sapiens Updated Annotation Release 109.20190125 from 25 January 2019.
  • Recipe:
    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
    mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz refseq.ver109.20190125.annotation.gtf.gz
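Because the annotation is meant to use the same sequence identifiers as the FASTA file, it can be worth confirming this after download. The following is a minimal sketch, assuming a non-empty two-column `.genome` table for the FASTA (see the Note section); the function name `missing_contigs` is hypothetical. Since this GTF is distributed alongside the full analysis set, it may name sequences that are absent from the no-alt FASTA listed above.

```shell
# missing_contigs GENOME_TABLE < annotation.gtf
# Print each sequence name used in the GTF that is absent from the
# first column of GENOME_TABLE (sorted, one name per line).
# (Hypothetical helper; GENOME_TABLE must not be empty.)
missing_contigs() {
    awk 'BEGIN{FS="\t"}
    NR == FNR { known[$1] = 1; next }   # first input: the .genome table
    /^#/ { next }                       # skip GTF header comments
    !($1 in known) { print $1 }' "$1" - | sort -u
}

# Example:
# zcat refseq.ver109.20190125.annotation.gtf.gz | missing_contigs hg38.genome
```

Empty output means every sequence name in the annotation is present in the genome table.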

Other annotations

Repeat Masker
  • File name: rmsk.bed.gz (MD5 checksum: ae12aefbef9d4f5bc7695158a67d9a55)
  • Local path: /References/Annotations/human/hg38/rmsk.bed.gz
  • Remote backup: OSF
  • Description: Repeat Masker from UCSC. The following fields were selected (listed in output column order):
    • genoName (Genomic sequence name)
    • genoStart (Start in genomic sequence)
    • genoEnd (End in genomic sequence)
    • repName (Name of repeat)
    • repFamily (Family of repeat)
    • strand (Relative orientation + or -).
  • Recipe:
    wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
    gunzip rmsk.txt.gz
    gawk 'BEGIN{OFS="\t"} {print $6,$7,$8,$11,$13,$10}' rmsk.txt | \
    sort -k1,1 -k2,2n | \
    bgzip > rmsk.bed.gz

Generic

Sequences

  • Primary assembly:
  • rRNA: Human ribosomal DNA complete repeating unit, GenBank accession code: U13369.1.

Annotations

Motif databases (MEME)

  • File name: motif_databases.12.19.tgz (MD5 checksum: f5ffcaecc07570ee19dba20b82d7bd73)
  • Local path: /References/Annotations/human/generic/motif_databases.12.19.tgz
  • Remote backup: OSF
  • Description: Motif databases for MEME suite (updated 28 Oct 2019).
  • Recipe:
    wget http://alternate.meme-suite.org/meme-software/Databases/motifs/motif_databases.12.19.tgz

Note

  • For every FASTA file, three companion files are also generated:
    • .fai: an index that allows fast random access to any sequence in the indexed FASTA file. It is generated with the following command:
      samtools faidx input.fa
    • .genome: a two-column table giving the name and length of each sequence. It is derived from the .fai index:
      cut -f1,2 input.fa.fai > input.genome
    • .dict: a sequence dictionary, as required by Picard and GATK tools:
      java -jar picard.jar CreateSequenceDictionary \
      R=input.fa \
      O=input.dict
  • There are two types of promoters in both gencode.v24.segmented.tssup1kb.bed and gencode.v24.segmented.tssflanking500b.bed:
    • Promoters of protein-coding genes (denoted as promoter in these files)
    • Promoters of non-protein-coding genes (denoted as promoter(NP))
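For reference, the two-column `.genome` table described in this note can also be derived directly from an uncompressed FASTA file when `samtools` is not at hand. This is an illustrative sketch, not the method used for the files in this collection; the function name `fasta_sizes` is hypothetical.

```shell
# fasta_sizes < input.fa > input.genome
# Print "name<TAB>length" for each FASTA record, using the first word of
# the header line as the sequence name.
fasta_sizes() {
    awk 'BEGIN{OFS="\t"}
    /^>/ {
        if (name != "") print name, len   # flush the previous record
        name = substr($1, 2)              # drop ">" and any description
        len = 0
        next
    }
    { sub(/\r$/, ""); len += length($0) } # tolerate CRLF line endings
    END { if (name != "") print name, len }'
}

# Example:
# zcat input.fa.gz | fasta_sizes > input.genome
```

The output should match `cut -f1,2 input.fa.fai` for a well-formed FASTA file.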