CGAT 0.6.2 - Computational Genomics Analysis Tools
CGAT is a collection of tools for the computational genomicist, written in Python. The tools should work with Python 2.7, but only Python 3.6+ is actively supported. They have been developed and accumulated in various genome projects (Heger & Ponting, 2007; Warren et al., 2008) and NGS projects (Ramagopalan et al., 2010), and are in continuous development. The tools work from the command line, but can readily be installed within frameworks such as Galaxy.
The documentation below covers the scripts published in Bioinformatics.
Detailed instructions on installation, on usage and a tool reference are below, followed by a Quickstart guide.
- Mission statement
- Installation instructions
- Using CGAT Tools
- Tool map
- Tool reference
- Contributing to CGAT code
- Release Notes
Please install CGAT-apps by following the Installation instructions, which also cover dependencies and troubleshooting.
CGAT-apps are run from the unix command line. Let's assume we have the binding locations from a ChIP-Seq experiment (chipseq.hg19.bed) in bed format and we want to know how many binding locations are intronic, intergenic and within exons.
Thus, we need to create a set of genomic annotations denoting intronic, intergenic regions, etc. with respect to a reference gene set. Here, we download the GENCODE geneset (Harrow et al., 2012) in GTF format from ENSEMBL (Flicek et al., 2013).
The following unix statement downloads the ENSEMBL gene set containing overlapping transcripts, and outputs a set of non-overlapping genomic annotations in gff format (annotations.gff.gz) by piping the data through various CGAT tools:
wget -O - ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz | gunzip | awk '$2 == "protein_coding"' | cgat gff2gff --genome-file=hg19 --method=sanitize --skip-missing | cgat gtf2gtf --method=sort --sort-order=gene | cgat gtf2gtf --method=merge-exons --with-utr | cgat gtf2gtf --method=filter --filter-method=longest-gene | cgat gtf2gtf --method=sort --sort-order=position | cgat gtf2gff --genome-file=hg19 --flank-size=5000 --method=genome | gzip > annotations.gff.gz
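The `awk '$2 == "protein_coding"'` step in the statement above keeps only records whose second column is `protein_coding`. As a rough Python equivalent (a minimal sketch on made-up example records; the function name `filter_by_source` is hypothetical and not part of CGAT):

```python
import io


def filter_by_source(lines, source="protein_coding"):
    """Yield GTF lines whose second tab-separated column equals `source`."""
    for line in lines:
        if line.startswith("#"):
            # Skip header/comment lines.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[1] == source:
            yield line


# Two invented GTF records: one protein-coding, one pseudogene.
sample = io.StringIO(
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n'
    '1\tpseudogene\texon\t300\t400\t.\t+\t.\tgene_id "G2";\n'
)
kept = list(filter_by_source(sample))
print(len(kept))
```

Only the first record passes the filter; everything else is dropped before the CGAT tools see it.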
The statements above need an indexed genome. To create such an indexed genome for hg19, type the following:
wget -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz | index_fasta.py hg19 - > hg19.log
CGAT-apps can be chained into a single workflow using unix pipes. The above sequence of commands in turn (1) reconciles UCSC and ENSEMBL naming schemes for chromosome names, (2) merges all exons of alternative transcripts per gene, (3) keeps the longest gene in case of overlapping genes and (4) annotates exonic, intronic, intergenic and flanking regions (size=5kb) within and between genes.
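To make step (2) concrete, the idea of merging the exons of alternative transcripts per gene can be sketched in a few lines of Python. This is only an illustration of the concept on invented coordinates, not the implementation used by `gtf2gtf`; the function name `merge_exons` and the half-open `(start, end)` convention are assumptions made for the example.

```python
from collections import defaultdict


def merge_exons(exons):
    """Merge overlapping or directly adjacent exons per gene.

    `exons` is an iterable of (gene_id, start, end) tuples with
    half-open coordinates. Returns {gene_id: [(start, end), ...]}.
    """
    by_gene = defaultdict(list)
    for gene_id, start, end in exons:
        by_gene[gene_id].append((start, end))

    merged = {}
    for gene_id, intervals in by_gene.items():
        intervals.sort()
        result = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = result[-1]
            if start <= last_end:
                # Extend the previous interval instead of starting a new one.
                result[-1] = (last_start, max(last_end, end))
            else:
                result.append((start, end))
        merged[gene_id] = result
    return merged


# Exons from two hypothetical transcripts of geneA, plus a single-exon geneB.
exons = [
    ("geneA", 100, 200),
    ("geneA", 150, 300),
    ("geneA", 500, 600),
    ("geneB", 50, 80),
]
merged = merge_exons(exons)
print(merged)
```

After merging, each gene is represented by one set of non-overlapping exons, which is what makes the later exonic/intronic annotation unambiguous.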
Note that the creation of annotations.gff.gz goes beyond simple interval intersection, as gene structures have to be normalized from multiple possible alternative transcripts to a single transcript that is chosen by the user depending on what is most relevant for the analysis.
Choosing different options can provide different sets of answers. Instead of merging all exons per gene, the longest transcript might be selected by replacing (2) with
Or, instead of genomic annotations, regulatory domains such as those defined by GREAT might be obtained by removing (3) and replacing (4) with
The generated annotations in annotations.gff.gz can then be used to count the number of transcription factor binding sites per annotation using bedtools or other interval-intersection tools. Here, we will use another CGAT tool, gtf2table, to do the counting and classification:
zcat /ifs/devel/gat/tutorial/data/srf.hg19.bed | cgat bed2gff --as-gtf | cgat gtf2table --counter=classifier-chipseq --gff-file=annotations.gff.gz
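Conceptually, the counting step assigns each binding site to an annotation class and tallies the classes. The sketch below illustrates this with a simple largest-overlap rule on invented intervals. It is not the logic of the `classifier-chipseq` counter, whose actual rules may differ; the names `classify`, `annotations` and `peaks` are hypothetical.

```python
from collections import Counter


def classify(peak, annotations):
    """Return the label of the annotation with the largest overlap.

    `peak` is (chrom, start, end); `annotations` is a list of
    (chrom, start, end, label) tuples, all half-open.
    Peaks without any overlap are labelled 'unannotated'.
    """
    chrom, start, end = peak
    best_label, best_overlap = "unannotated", 0
    for a_chrom, a_start, a_end, label in annotations:
        if a_chrom != chrom:
            continue
        overlap = min(end, a_end) - max(start, a_start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


# Invented, non-overlapping annotations on a toy chromosome.
annotations = [
    ("chr1", 0, 1000, "intergenic"),
    ("chr1", 1000, 1200, "exon"),
    ("chr1", 1200, 2000, "intron"),
    ("chr1", 2000, 2300, "exon"),
]
# Invented binding sites.
peaks = [
    ("chr1", 100, 200),
    ("chr1", 1100, 1150),
    ("chr1", 1500, 1600),
    ("chr1", 1190, 1300),  # spans an exon/intron boundary
    ("chr2", 10, 20),      # chromosome without annotations
]
counts = Counter(classify(p, annotations) for p in peaks)
print(counts)
```

Because the annotations are non-overlapping, every base of a peak falls into at most one class, so a rule like this never has to break ties between nested features.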
The scripts follow a consistent naming scheme centered around common genomic formats. Because they share these formats, the tools can be easily combined with other tools such as bedtools (Quinlan and Hall, 2010) or UCSC Tools (Kuhn et al., 2013).
- Contributing to CGAT code
- Style Guide
- Importing CGAT scripts into galaxy
This collection of scripts is the outcome of 10 years of work in various fields of bioinformatics. It contains the good, the bad and the ugly. Use at your own risk.