Using CGAT Tools

Command line usage

CGAT tools are written for command line usage and share a consistent interface, which makes them easy to integrate into pipelines. Tools are accessed through the cgat front-end, which is installed on your PATH.

To get a list of all available commands, type:

cgat --help

Command line help for individual tools is available through each tool’s --help option:

cgat gff2gff --help

Logging

CGAT scripts write logging information as comment lines starting with # to stdout or, with the --log option, to a separate log file.

Below is an example of logging output:

# output generated by /ifs/devel/andreas/cgat/beds2beds.py --force-output --exclusive-overlap --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
# exclusive                               : True
# filename_update                         : None
# ignore_strand                           : False
# loglevel                                : 1
# method                                  : unmerged-combinations
# output_filename_pattern                 : 030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz
# output_force                            : True
# pattern_id                              : (.*).bed.gz
# stderr                                  : <open file '<stderr>', mode 'w' at 0x2ba70e0c2270>
# stdin                                   : <open file '<stdin>', mode 'r' at 0x2ba70e0c2150>
# stdlog                                  : <open file '030m.intersection.tsv.log', mode 'a' at 0x1f1a810>
# stdout                                  : <open file '<stdout>', mode 'w' at 0x2ba70e0c21e0>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tracks                                  : None

The header contains information about:

  • the script name (beds2beds.py)

  • the command line options (--force-output --exclusive-overlap --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz)

  • the time when the job was started (Thu Mar 29 13:06:33 2012)

  • the host on which it was executed (cgat150.anat.ox.ac.uk)

  • a unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)

  • the pid of the job (16649)

  • the system specification (Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64)

Once a script has completed successfully, it writes a footer to the log file. Below is typical output:

# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd

The footer contains information about:

  • the fact that the job has finished (job finished)

  • the time it took to execute (11 seconds)

  • when it completed (Thu Mar 29 13:06:44 2012)

  • some benchmarking information (11.36  0.45  0.00  0.01), which is user time, system time, child user time and child system time

  • the unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)

The unique job id can be used to easily retrieve matching information from a concatenation of log files.
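As an illustration of this kind of retrieval, the following is a minimal Python sketch that pairs header and footer lines by job id. The regular expressions are modelled only on the two example lines shown above, and the function name collect_jobs is hypothetical; it is not part of CGAT.

```python
import re
from collections import defaultdict

# Patterns modelled on the example header and footer lines shown above.
STARTED = re.compile(
    r"^# job started at (?P<when>.+?) on (?P<host>\S+) -- (?P<job_id>\S+)\s*$"
)
FINISHED = re.compile(
    r"^# job finished in (?P<seconds>\d+) seconds at (?P<when>.+?) -- "
    r"(?P<user>[\d.]+)\s+(?P<system>[\d.]+)\s+"
    r"(?P<child_user>[\d.]+)\s+(?P<child_system>[\d.]+) -- "
    r"(?P<job_id>\S+)\s*$"
)


def collect_jobs(lines):
    """Group 'job started' and 'job finished' records by their job id."""
    jobs = defaultdict(dict)
    for line in lines:
        match = STARTED.match(line)
        if match:
            jobs[match.group("job_id")].update(
                started=match.group("when"), host=match.group("host")
            )
            continue
        match = FINISHED.match(line)
        if match:
            jobs[match.group("job_id")].update(
                finished=match.group("when"),
                seconds=int(match.group("seconds")),
                # user, system, child user, child system
                times=tuple(
                    float(match.group(name))
                    for name in ("user", "system", "child_user", "child_system")
                ),
            )
    return dict(jobs)


# The two lines below are copied from the example output above.
EXAMPLE_LOG = """\
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
"""

jobs = collect_jobs(EXAMPLE_LOG.splitlines())
for job_id, record in jobs.items():
    print(job_id, record)
```

Jobs that have a start record but no finish record end up without a "finished" key, which is one simple way to spot runs that did not complete.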

The logging level is set with the --verbose option. A level of 0 means no logging output, 1 outputs information messages only, and 2 also outputs debugging information.

I/O redirection

Most scripts work by reading data from stdin and outputting data to stdout. Both can be redirected to files with the -I/--stdin and -O/--stdout options. stderr can be redirected with -E/--stderr.
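The convention can be sketched with a small, hypothetical Python filter that accepts the same -I/--stdin and -O/--stdout option names. Only the option names come from the text above; the parser, the upper-casing filter and the file names are assumptions for illustration and do not reproduce CGAT's own option handling.

```python
import argparse
import os
import sys
import tempfile


def build_parser():
    # Hypothetical parser: only the option names are taken from the text.
    parser = argparse.ArgumentParser(description="upper-case every input line")
    parser.add_argument("-I", "--stdin", dest="stdin_path", default=None,
                        help="read input from this file instead of stdin")
    parser.add_argument("-O", "--stdout", dest="stdout_path", default=None,
                        help="write output to this file instead of stdout")
    return parser


def run(argv):
    """Read lines from stdin (or -I), write upper-cased lines to stdout (or -O)."""
    args = build_parser().parse_args(argv)
    infile = open(args.stdin_path) if args.stdin_path else sys.stdin
    outfile = open(args.stdout_path, "w") if args.stdout_path else sys.stdout
    try:
        for line in infile:
            outfile.write(line.upper())
    finally:
        # Close only the handles this function opened itself.
        if infile is not sys.stdin:
            infile.close()
        if outfile is not sys.stdout:
            outfile.close()


# Demonstration with temporary files (the file names are illustrative).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "in.txt")
target = os.path.join(workdir, "out.txt")
with open(source, "w") as handle:
    handle.write("chr1\tgene_a\n")

run(["-I", source, "-O", target])

with open(target) as handle:
    result = handle.read()
```

Without the two options the same function reads stdin and writes stdout, so the script can be used unchanged inside a unix pipe.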

Indexing genomes

Many CGAT tools require genomic information. Some need the actual genomic sequence, but many only need the chromosome sizes. Many tools therefore have the required option --genome-file.

The --genome-file argument points to an indexed fasta file. CGAT tools can read two kinds of index: one built with the supplied index_fasta.py script, or one built with the samtools faidx command.
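To show what chromosome size information such an index carries, here is a minimal Python sketch that reads sizes from a samtools faidx index (.fai), whose first two tab-separated columns are the sequence name and its length. The function name read_fai_sizes and the example values are made up for illustration; how CGAT tools read the index internally is not described here.

```python
import io


def read_fai_sizes(handle):
    """Return {sequence name: length} from the lines of a .fai index.

    A .fai line has five tab-separated columns:
    name, length, offset, bases per line, bytes per line.
    """
    sizes = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        name, length = line.rstrip("\n").split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# Made-up index contents for two sequences, used only for demonstration.
example_fai = io.StringIO(
    "chr1\t1000\t6\t60\t61\n"
    "chr2\t500\t1029\t60\t61\n"
)

sizes = read_fai_sizes(example_fai)
print(sizes)
```

With a real index, the io.StringIO object would be replaced by an open file handle on the .fai file that samtools faidx writes next to the fasta file.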

Pipeline usage

We use a lightweight workflow system called ruffus, but others, such as Galaxy (see GalaxyInstallation), are equally possible. These systems allow CGAT tools to be run in an automated fashion.

Because they work with unix pipes, CGAT tools can also easily be run in parallel. For example, we have a script called farm.py (not part of the CGAT collection, but within the CGAT repository) that splits the input data and runs the separate chunks on our compute cluster. Below is a simple example of running the command:

zcat geneset.gtf.gz |
cgat gtf2table --counter=length --log=log |
gzip > out.tsv.gz

in parallel on the cluster, running one job per chromosome:

zcat geneset.gtf.gz |
farm.py --split-at-column=1 "cgat gtf2table --counter=length --log=log" |
gzip > out.tsv.gz
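The split, run and concatenate idea behind this can be sketched in a few lines of Python. This is not farm.py's implementation: the function names, the stand-in worker (which is not gtf2table) and the example rows are all assumptions made for illustration, and the chunks are processed one after another rather than as cluster jobs. The column is treated as 1-based, which matches splitting a GTF file by chromosome with --split-at-column=1.

```python
from itertools import groupby


def split_at_column(lines, column=1, sep="\t"):
    """Yield (value, chunk) for each run of consecutive lines that share
    the same value in the given 1-based column.

    The input is assumed to be sorted by that column.
    """
    def key(line):
        return line.split(sep)[column - 1]

    for value, chunk in groupby(lines, key=key):
        yield value, list(chunk)


def feature_lengths(chunk):
    """Stand-in worker: emit 'chromosome<TAB>length' for each GTF-like line."""
    out = []
    for line in chunk:
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[3]), int(fields[4])
        out.append(f"{fields[0]}\t{end - start + 1}")  # GTF ends are inclusive
    return out


def run_per_chunk(lines, worker, column=1):
    """Apply worker to each chunk and concatenate the results in input order."""
    results = []
    for _, chunk in split_at_column(lines, column):
        # On a cluster, each of these calls would be submitted as its own job.
        results.extend(worker(chunk))
    return results


# Made-up GTF-like rows, sorted by chromosome.
rows = [
    'chr1\tsrc\texon\t100\t199\t.\t+\t.\tgene_id "g1";',
    'chr1\tsrc\texon\t300\t349\t.\t+\t.\tgene_id "g1";',
    'chr2\tsrc\texon\t10\t19\t.\t-\t.\tgene_id "g2";',
]

chunks = [(name, len(chunk)) for name, chunk in split_at_column(rows)]
lengths = run_per_chunk(rows, feature_lengths)
```

Because each line is processed independently of the others in this sketch, the concatenated output is the same as running the worker once over the whole input; that property is what makes this kind of splitting safe.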