Using CGAT Tools
Command line usage
CGAT tools are written for command line usage with a consistent
interface that makes them amenable to integration in pipelines.
Tools can be accessed through the cgat front-end, which will be installed in your PATH.
To get a list of all available commands, type:
cgat --help
Command line help for individual tools is available through each tool's --help option:
cgat gff2gff --help
Logging
CGAT scripts write logging information as comments starting with a # to stdout, or to a separate log file (--log).
Below is an example of logging output:
# output generated by /ifs/devel/andreas/cgat/beds2beds.py --force-output --exclusive-overlap --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
# exclusive : True
# filename_update : None
# ignore_strand : False
# loglevel : 1
# method : unmerged-combinations
# output_filename_pattern : 030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz
# output_force : True
# pattern_id : (.*).bed.gz
# stderr : <open file '<stderr>', mode 'w' at 0x2ba70e0c2270>
# stdin : <open file '<stdin>', mode 'r' at 0x2ba70e0c2150>
# stdlog : <open file '030m.intersection.tsv.log', mode 'a' at 0x1f1a810>
# stdout : <open file '<stdout>', mode 'w' at 0x2ba70e0c21e0>
# timeit_file : None
# timeit_header : None
# timeit_name : all
# tracks : None
The header contains information about:

- the script name (beds2beds.py)
- the command line options (--force-output --exclusive-overlap --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz)
- the time when the job was started (Thu Mar 29 13:06:33 2012)
- the location it was executed on (cgat150.anat.ox.ac.uk)
- a unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)
- the pid of the job (16649)
- the system specification (Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64)
Once a script has completed successfully, it writes a footer to the log file. Below is typical output:
# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36 0.45 0.00 0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
The footer contains information about:

- confirmation that the job has finished (job finished)
- the time it took to execute (11 seconds)
- when it completed (Thu Mar 29 13:06:44 2012)
- some benchmarking information (11.36 0.45 0.00 0.01), which is user time, system time, child user time and child system time
- the unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)
The unique job id can be used to easily retrieve matching information from a concatenation of log files.
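As a sketch of how such a lookup might work, the snippet below collects every line mentioning a given job id from a concatenation of log files. The log text is a shortened, partly hypothetical sample following the header and footer formats shown above (the second job id is invented for illustration):

```python
def lines_for_job(log_text, job_id):
    """Return all log lines that mention a given unique job id."""
    return [line for line in log_text.splitlines() if job_id in line]

# Shortened, partly hypothetical concatenation of two log files.
logs = """\
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# job started at Thu Mar 29 13:07:00 2012 on cgat150.anat.ox.ac.uk -- 00000000-0000-0000-0000-000000000000
# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36 0.45 0.00 0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
"""

matching = lines_for_job(logs, "e1c16e80-03a1-4023-9417-f3e44e33bdcd")
```

Because the id appears in both the header and the footer, this pairs up start and finish records even when many jobs log to the same file.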
The logging level can be set with the --verbose option. A level of 0 means no logging output, 1 outputs information messages only, and 2 additionally outputs debugging information.
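The semantics of the three levels can be summarised as a simple threshold check. This is not CGAT's implementation, only a sketch of the rule described above:

```python
# Message severities: info messages are level 1, debug messages level 2.
INFO, DEBUG = 1, 2

def should_emit(message_level, loglevel):
    """Return True if a message should be written under the given
    --verbose loglevel (0: silent, 1: info only, 2: info and debug)."""
    return loglevel >= message_level
```

So at the default level 1, info messages pass and debug messages are suppressed.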
I/O redirection
Most scripts work by reading data from stdin and writing data to stdout. Both can be redirected to files with the -I/--stdin and -O/--stdout options. stderr can be redirected with -E/--stderr.
Indexing genomes
Many CGAT tools require genomic information: some require the actual genomic sequence, while many only need information about chromosome sizes. Thus, many tools have the obligatory option --genome-file.
The --genome-file argument points to an indexed fasta file. CGAT tools can read two different indices: files indexed with the supplied index_fasta.py script (Index fasta formatted files) or with the samtools faidx command.
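For the samtools flavour, chromosome sizes can be read directly from the .fai index, a tab-separated file whose columns are name, sequence length, byte offset, bases per line and bytes per line. The sketch below parses only that flavour (CGAT's own index_fasta.py format differs); the index lines are a toy example, not from a real genome:

```python
def read_fai_sizes(fai_lines):
    """Parse chromosome sizes from a samtools faidx .fai index.
    Each line holds: name, length, offset, linebases, linewidth."""
    sizes = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        sizes[fields[0]] = int(fields[1])
    return sizes

# A toy .fai index (values are illustrative, not from a real genome).
fai = ["chr1\t1000\t6\t60\t61", "chr2\t800\t1030\t60\t61"]
sizes = read_fai_sizes(fai)
```

This is all that tools needing only chromosome sizes have to look at; tools that need the sequence itself use the offset and line-width columns to seek into the fasta file.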
Pipeline usage
We use a light-weight workflow system called ruffus, but others, such as galaxy (see GalaxyInstallation), are equally possible. These systems allow CGAT tools to be run in an automated fashion.
Using unix pipes, CGAT tools can also easily be run in parallel. For example, we have a script called farm.py (not part of the CGAT collection, but within the CGAT repository) that allows input data to be split and separate chunks to be run on our compute cluster. Below is a simple example of running the command:
zcat geneset.gtf.gz
| cgat gtf2table --counter=length --log=log
| gzip > out.tsv.gz
in parallel on the cluster, running one job per chromosome:
zcat geneset.gtf.gz
| farm.py --split-at-column=1
"cgat gtf2table --counter=length --log=log"
| gzip
> out.tsv.gz
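The chunking idea behind --split-at-column=1 can be sketched as grouping consecutive input lines by the value in a given column, here the chromosome. This is not farm.py's actual implementation, only an illustration of the splitting strategy, and it assumes the input is already grouped by that column, as a GTF sorted by chromosome would be:

```python
import itertools

def split_at_column(lines, column=0):
    """Group consecutive tab-separated lines by the value in `column`
    (0-based here; farm.py's --split-at-column is 1-based).
    Each group becomes one chunk that could run as a separate job."""
    keyfunc = lambda line: line.split("\t")[column]
    return [(key, list(group)) for key, group in itertools.groupby(lines, keyfunc)]

# Toy GTF-like lines, grouped by chromosome in the first column.
lines = [
    "chr1\tsource\texon\t1\t100",
    "chr1\tsource\texon\t200\t300",
    "chr2\tsource\texon\t1\t50",
]
chunks = split_at_column(lines, column=0)
```

Each chunk would then be fed to its own invocation of the quoted command, one job per chromosome, and the per-chunk outputs concatenated afterwards.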