CGAT 0.6.2 - Computational Genomics Analysis Tools

CGAT is a collection of tools for the computational genomicist written in the Python language (Should work with Python 2.7, but we only actively support Python 3.6+). The tools have been developed and accumulated in various genome projects (Heger & Ponting, 2007, Warren et al., 2008) and NGS projects (Ramagopalan et al., 2010). The tools are in continuous development. The tools work from the command line, but can readily be installed within frameworks such as Galaxy.

The documentation below covers the script published in Bioinformatics.

Detailed instructions on installation, on usage and a tool reference are below, followed by a Quickstart guide.

Quickstart

Please install the CGAT-apps using the following Installation instructions for dependencies and troubleshooting.

CGAT-apps are run from the unix command line. Lets assume we have the results of the binding locations of a ChIP-Seq experiment (chipseq.hg19.bed) in bed format and we want to know, how many binding locations are intronic, intergenic and within exons.

Thus, we need to create a set of genomic annotations denoting intronic, intergenic regions, etc. with respect to a reference gene set. Here, we download the GENCODE geneset (Harrow et al., 2012) in GTF format from ENSEMBL (Flicek et al., 2013).

The following unix statement downloads the ENSEMBL gene set containing over-lapping transcripts, and outputs a set of non-overlapping genomic annotations in gff format (annotations.gff) by piping the data through various CGAT tools:

wget ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz
| gunzip
| awk '$2 == "protein_coding"'
| cgat gff2ff --genome-file=hg19 --method=sanitize --skip-missing
| cgat gtf2gtf --method=sort --sort-order=gene
| cgat gtf2gtf --method=merge-exons --with-utr
| cgat gtf2gtf --method=filter --filter-method=longest-gene
| cgat gtf2gtf --method=sort --sort-order=position
| cgat gtf2gff --genome-file=hg19 --flank-size=5000 --method=genome
| gzip
> annotations.gff.gz

Note

The statements above need an indexed genome. To create such an indexed genome for hg19, type the following:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
| index_fasta.py hg19 - > hg19.log

CGAT-apps can be chained into a single work flow using unix pipes. The above sequence of commands in turn (1) reconciles UCSC and ENSEMBL naming schemes for chromosome names, (2) merges all exons of alternative transcripts per gene, (3) keeps the longest gene in case of overlapping genes and (4) annotates exonic, intronic, intergenic and flanking region (size=5kb) within and between genes.

Note that the creation of annotations.gff.gz goes beyond simple interval intersection, as gene structures have to be normalized from multiple possible alternative transcripts to a single transcript that is chosen by the user depending on what is most relevant for the analysis.

Choosing different options can provide different sets of answers. Instead of merging all exons per gene, the longest transcript might be selected by replacing (2) with gtf2gtf --method=filter --filter-method=longest-transcript. Or, instead of genomic annotations, regulatory domains such as defined by GREAT might be obtained by removing (3) and replacing (4) with gtf2gff --method=great-domains.

The generated annotations in annotations.gff can then be used to count the number of transcription factor binding sites using bed-tools or other interval intersections. Here, we will use another CGAT tool, gtf2table, to do the counting and classification:

zcat /ifs/devel/gat/tutorial/data/srf.hg19.bed
| cgat bed2gff --as-gtf
| cgat gtf2table --counter=classifier-chipseq --gff-file=annotations.gff.gz

The scripts follow a consistent naming scheme centered around common genomic formats. Because of the common genomic formats, the tools can be easily combined with other tools such as bedtools (Quinlan and Hall, 2010) or UCSC Tools (Kuhn et al. 2013).

Disclaimer

This collection of scripts is the outcome of 10 years working in various fields in bioinformatics. It contains both the good, the bad and the ugly. Use at your own risk.