Tool reference

This page summarizes prominent tools within the CGAT Code collection. The tools are grouped loosely by functionality.

Genomic intervals/features

beds2counts - compute overlap stats between multiple bed files

Compute overlap statistics of multiple bed files.

- get sequences from bed file

Transform interval data in a bed formatted file into a fasta formatted file of sequence data.

- convert bed to gff/gtf

Convert between interval data formats. Convert a bed formatted file to a gff or gtf formatted file.

- manipulate gff files

Work on gff formatted files with genomic features. This tool sorts/renames feature files, reconciles chromosome names, and more.
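To make "sorts" and "reconciles chromosome names" concrete, here is a minimal Python sketch. It is not the CGAT implementation. The function names and the rule of toggling a "chr" prefix are assumptions chosen for illustration only.

```python
def reconcile_contig(name, reference_names):
    """Return the spelling of `name` used in `reference_names`, or None."""
    if name in reference_names:
        return name
    # Assumption: mismatches come from a missing or extra "chr" prefix.
    alt = name[3:] if name.startswith("chr") else "chr" + name
    return alt if alt in reference_names else None


def sort_and_reconcile(gff_lines, reference_names):
    """Rename contigs to the reference spelling and sort by contig and start."""
    records = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        contig = reconcile_contig(fields[0], reference_names)
        if contig is None:
            continue  # drop features on contigs unknown to the reference
        fields[0] = contig
        records.append(fields)
    # gff column 4 (index 3) holds the start coordinate.
    records.sort(key=lambda f: (f[0], int(f[3])))
    return ["\t".join(f) for f in records]
```

Features on contigs that cannot be matched are dropped here; a real tool might instead report them.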

bed2bed - manipulate bed files

Filter or merge interval data in a bed formatted file.

- compute the overlap graph between two bed files

Compare two sets of genomic intervals and output a list of overlapping features.

- summary of bed file contents

Compute summary statistics of genomic intervals.
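As an illustration of the kind of summary meant here, the following Python sketch counts intervals and bases per contig from bed lines. It is a simplified stand-in, not the CGAT tool: `bed_summary` and its output layout are hypothetical, and overlapping intervals are counted twice.

```python
from collections import defaultdict


def bed_summary(bed_lines):
    """Return per-contig interval count, total bases and mean interval length."""
    counts = defaultdict(int)
    bases = defaultdict(int)
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and the optional bed header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split("\t")[:3]
        counts[contig] += 1
        # bed coordinates are zero-based and half-open, so length is end - start.
        bases[contig] += int(end) - int(start)
    return {
        contig: {
            "intervals": counts[contig],
            "bases": bases[contig],
            "mean_length": bases[contig] / counts[contig],
        }
        for contig in counts
    }
```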

Annotate genomic intervals (composition, peak location, overlap, …).

- decompose bed files

Decompose multiple sets of genomic intervals into various intersections and unions.

- count differences between several bed files

Compare multiple interval data sets. The tool computes all-vs-all pairwise overlap summaries. Permits incremental updates of the similarity table.

- convert from gff/gtf to bed

Convert between formats.

split_gff - split a gff file into chunks

Split a file in gff format into smaller files. The script ensures that overlapping intervals remain in the same file.

- compute genomic coverage of gff intervals

This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.

- output sequences from genomic features

Output genomic sequences from intervals.

- compute histograms from intervals in gff or bed format

Compute distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals.

- count features, etc. in gff file

Summarize features within a gff formatted file.

- convert from gff to psl

Convert between formats.

Gene sets

- convert a transcript set to genomic features

Translate a gene set into genomic annotations such as introns, intergenic regions, regulatory domains, etc.
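The simplest of these derived annotations, introns, can be sketched in a few lines of Python. This is an illustrative sketch rather than the CGAT code; `introns_from_exons` is a hypothetical helper, and exons are assumed to be half-open `(start, end)` pairs belonging to a single transcript.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons of one transcript.

    `exons` is a list of half-open (start, end) intervals on one strand.
    """
    exons = sorted(exons)
    introns = []
    # Walk over neighbouring exon pairs; each gap between them is an intron.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns
```

Intergenic regions follow the same pattern one level up: sort gene extents per contig and report the gaps between neighbouring genes.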

Annotate transcripts in a gtf formatted file. Annotations can be in reference to a second gene set (fragments, extensions), aligned reads (coverage, intron overrun, …) or densities.

- annotate genomic bases from a gene set

Annotate each base in the genome according to its use within a transcript. Outputs lists of junctions.

- manipulate transcript models

Merge exons/transcripts/genes, filter transcripts/genes, rename transcripts/genes, …

- convert gtf file to a tab-separated table

Convert a gene set in gtf format to tabular format.

- compare two gene sets

Compare two gene sets and output common and unique lists of genes.

- compute overlap between multiple gtf files

Compare multiple gene sets. The tool computes all-vs-all pairwise overlap of exons, bases and genes. Permits incremental updates of the similarity table.
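The idea of an all-vs-all comparison at the base level can be sketched as follows. This Python example is only an illustration under simplifying assumptions: the function names are hypothetical, intervals are half-open `(contig, start, end)` tuples, and every covered base is stored in a set, which is practical only for small inputs.

```python
from itertools import combinations


def covered_bases(intervals):
    """Return the set of (contig, position) pairs covered by the intervals."""
    return {
        (contig, pos)
        for contig, start, end in intervals
        for pos in range(start, end)
    }


def pairwise_base_overlap(gene_sets):
    """Return the number of shared bases for every pair of named gene sets."""
    bases = {name: covered_bases(ivs) for name, ivs in gene_sets.items()}
    return {
        (a, b): len(bases[a] & bases[b])
        for a, b in combinations(sorted(bases), 2)
    }
```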

Sequence data

- interleave two fastq files

Interleave paired reads from two fastq files into a single fasta file.

- index fasta formatted files

Build an index for a fasta file. Pre-requisite for many CGAT tools.
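To show what such an index buys you, here is a minimal Python sketch that records where each sequence starts in the file. It does not reproduce the index format used by CGAT; `build_fasta_index` and the `(offset, length)` layout are assumptions for illustration.

```python
import io


def build_fasta_index(handle):
    """Map sequence name -> (byte offset of first sequence line, sequence length).

    `handle` must be a binary file-like object positioned at the start.
    """
    index = {}
    name, start, length = None, 0, 0
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            if name is not None:
                index[name] = (start, length)
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0].decode() if tokens else ""
            start, length = handle.tell(), 0
        else:
            length += len(line.strip())
    if name is not None:
        index[name] = (start, length)
    return index


# Usage with an in-memory file; a real file would be opened with mode "rb".
example = io.BytesIO(b">s1 desc\nACGT\nAC\n>s2\nGG\n")
example_index = build_fasta_index(example)
```

With the offsets in hand, a reader can `seek` straight to a sequence instead of scanning the whole file.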

Count kmer content in a set of fasta sequences.
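K-mer counting itself is a small sliding-window loop. The Python sketch below illustrates the technique and is not the CGAT implementation; `read_fasta` and `count_kmers` are hypothetical helper names, and counts are pooled over all sequences.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_kmers(lines, k):
    """Count every overlapping k-mer across all sequences (case-insensitive)."""
    counts = Counter()
    for _, seq in read_fasta(lines):
        seq = seq.upper()
        # Slide a window of width k along the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```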

Compute features of sequences in fasta formatted files.

- compare contents of two fasta files

Compare two sets of sequences. Outputs missing, identical and fragmented sequences.

- segment sequences

Segment sequences based on G+C content, gaps, …

- concatenate sequences from multiple fasta files

Concatenate sequences from multiple files.

- create sequence variants from a set of sequences

In-silico creation of variants of protein coding sequences.

NGS data

- build meta-gene profile for a set of transcripts/genes

Compute meta-gene profiles from aligned reads in a bam formatted file. Also accepts bed or bigwig formatted files.

- modify bam files

Operate on bam formatted files: filtering, stripping, setting flags.

- convert bam formatted file to bed formatted file

Convert bam formatted file of genomic alignments into genomic intervals. Permits merging of paired read data and filtering by insert-size.

- output fastq files from a bam-file

Save sequence and quality information from a bam formatted file.

- compute peak shape features from a bam-file

Compute read densities over a collection of intervals. Also accepts bed or bigwig formatted files.


Compute summary statistics of a bam formatted file.

- convert bam to wig/bigwig file

Convert read coverage in a bam formatted file into a wiggle or bigwig formatted file.

- compare bam file against gene set

Compute stats on exon over-/underrun and spliced reads.

- count context that reads map to

Compute coverage of reads within multiple interval types.

- compute coverage correlation between bam files

Outputs side-by-side comparison of residue level counts between multiple bam formatted files.

- manipulate fastq files

Perform quality score conversion between fastq formatted files.

- interleave two fastq files

Interleave paired end data.

- compute stats on reads in fastq files

Output bases below quality threshold, number of N’s, quality score distribution.

- manipulate (merge/reconcile) fastq files

Ensure that paired read fastq formatted files are consistent after filtering on the individual files.

- compare multiple bam files against each other

Perform read-by-read comparison of two bam-files.

Variants

- manipulate vcf files

Sort a vcf file.
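Sorting a vcf file amounts to keeping the header in place and ordering the records by position. The Python sketch below illustrates this and is not the CGAT implementation; `sort_vcf` is a hypothetical helper, and contigs are ordered lexicographically (so "chr10" sorts before "chr2"), which is a simplification.

```python
def sort_vcf(lines):
    """Return header lines first, then records sorted by contig and position."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        # vcf columns 1 and 2 are CHROM and POS.
        chrom, pos = line.split("\t")[:2]
        return chrom, int(pos)

    return header + sorted(records, key=sort_key)
```

A real tool would more likely order contigs as they appear in the header or in a reference index.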

Genomics

- compare two chain formatted files

Report how many residues map to the same locations, how many map to different locations, etc.

Output coverage statistics for a UCSC liftover chain file.