Tool reference

This page summarizes prominent tools within the CGAT Code collection. The tools are grouped loosely by functionality.

Genomic intervals/features

beds2counts - compute overlap stats between multiple bed files

Compute overlap statistics of multiple bed files.

bed2fasta.py - get sequences from bed file

Transform interval data in a bed formatted file into a fasta formatted file of sequence data.

bed2gff.py - convert bed to gff/gtf

Convert interval data from a bed formatted file to a gff or gtf formatted file.

gff2gff.py - manipulate gff files

Work on gff formatted files with genomic features. This tool sorts/renames feature files, reconciles chromosome names, and more.

bed2bed - manipulate bed files

Filter or merge interval data in a bed formatted file.
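As an illustration of what merging interval data means, the following is a minimal sketch that collapses overlapping or adjacent intervals. It is not the bed2bed implementation; the function name `merge_intervals` and the tuple layout are assumptions made for this example only.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (contig, start, end) intervals.

    Coordinates are assumed to be 0-based and half-open, as in bed files.
    """
    merged = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


example = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr2", 10, 20),
    ("chr1", 400, 500),
]
print(merge_intervals(example))
```

Sorting first means each interval only needs to be compared with the most recently merged one.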

bed2graph.py - compute the overlap graph between two bed files

Compare two sets of genomic intervals and output a list of overlapping features.

bed2stats.py - summary of bed file contents

Compute summary statistics of genomic intervals.

<no title>

Annotate genomic intervals (composition, peak location, overlap, …)

beds2beds.py - decompose bed files

Decompose multiple sets of genomic intervals into various intersections and unions.

diff_bed.py - count differences between several bed files

Compare multiple interval data sets. The tool computes all-vs-all pairwise overlap summaries and permits incremental updates of the similarity table.

gff2bed.py - convert from gff/gtf to bed

Convert between formats.

split_gff - split a gff file into chunks

Split a file in gff format into smaller files. The script ensures that overlapping intervals remain in the same file.
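The constraint that overlapping intervals stay together can be sketched as follows. This is an illustrative approach, not the script's actual code: the function `split_into_chunks` and its `chunk_size` parameter are hypothetical, and intervals are reduced to (contig, start, end) tuples.

```python
def split_into_chunks(intervals, chunk_size):
    """Split (contig, start, end) intervals into chunks.

    A chunk is closed only once it holds at least ``chunk_size`` intervals
    and the next interval does not overlap anything seen so far on the
    same contig, so overlapping intervals are never separated.
    """
    chunks, current = [], []
    last_contig, max_end = None, 0
    for contig, start, end in sorted(intervals):
        overlaps = contig == last_contig and start < max_end
        if current and len(current) >= chunk_size and not overlaps:
            chunks.append(current)
            current = []
        if contig != last_contig:
            last_contig, max_end = contig, end
        else:
            max_end = max(max_end, end)
        current.append((contig, start, end))
    if current:
        chunks.append(current)
    return chunks


features = [
    ("chr1", 0, 100),
    ("chr1", 50, 150),
    ("chr1", 120, 200),
    ("chr1", 300, 400),
    ("chr1", 500, 600),
]
for chunk in split_into_chunks(features, chunk_size=2):
    print(chunk)
```

In this sketch a chunk may exceed the requested size when a run of overlapping intervals straddles the boundary.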

gff2coverage.py - compute genomic coverage of gff intervals

This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.

gff2fasta.py - output sequences from genomic features

Output genomic sequences from intervals.

gff2histogram.py - compute histograms from intervals in gff or bed format

Compute distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals.

gff2stats.py - count features, etc. in gff file

Summarize features within a gff formatted file.

gff2psl.py - convert from gff to psl

Convert between formats.

Gene sets

gtf2gff.py - convert a transcript set to genomic features

Translate a gene set into genomic annotations such as introns, intergenic regions, regulatory domains, etc.
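To make one of these annotations concrete, the sketch below derives introns as the gaps between consecutive exons of a single transcript. It is only an illustration under simplifying assumptions (one transcript, 0-based half-open coordinates); the function name `exons_to_introns` is hypothetical and does not reflect how gtf2gff.py is written.

```python
def exons_to_introns(exons):
    """Return the intervals lying between consecutive exons.

    ``exons`` is a list of (start, end) pairs for one transcript,
    0-based and half-open.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            # A gap between two exons is an intron.
            introns.append((prev_end, next_start))
    return introns


print(exons_to_introns([(100, 200), (500, 600), (300, 400)]))
```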

<no title>

Annotate transcripts in a gtf formatted file. Annotations can be in reference to a second gene set (fragments, extensions), aligned reads (coverage, intron overrun, …) or densities.

gtf2fasta.py - annotate genomic bases from a gene set

Annotate each base in the genome according to its use within a transcript. Outputs lists of junctions.

gtf2gtf.py - manipulate transcript models

Merge exons/transcripts/genes, filter transcripts/genes, rename transcripts/genes, …

gtf2tsv.py - convert gtf file to a tab-separated table

Convert a gene set in gtf format to a tabular format.

gtfs2tsv.py - compare two genesets

Compare two gene sets - output common and unique lists of genes.

diff_gtf.py - compute overlap between multiple gtf files

Compare multiple gene sets. The tool computes all-vs-all pairwise overlap of exons, bases and genes and permits incremental updates of the similarity table.

Sequence data

fastqs2fasta.py - interleave two fastq files

Interleave paired reads from two fastq files into a single fasta file.
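Interleaving can be sketched as reading one record from each file in turn and writing both as fasta entries. The example below assumes plain four-line fastq records and equally long inputs; the helper names `read_fastq` and `interleave_to_fasta` are made up for this illustration and are not part of the CGAT code.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) from a four-line-per-record fastq stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record[0].rstrip()[1:], record[1].rstrip()


def interleave_to_fasta(handle1, handle2):
    """Return fasta text with reads from both streams alternating."""
    entries = []
    pairs = zip(read_fastq(handle1), read_fastq(handle2))
    for (name1, seq1), (name2, seq2) in pairs:
        entries.append(f">{name1}\n{seq1}\n")
        entries.append(f">{name2}\n{seq2}\n")
    return "".join(entries)


reads1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n")
reads2 = io.StringIO("@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n")
print(interleave_to_fasta(reads1, reads2), end="")
```

Quality lines are dropped because fasta has no place for them.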

index_fasta.py - Index fasta formatted files

Build an index for a fasta file. Pre-requisite for many CGAT tools.

fasta2kmercontent.py

Count kmer content in a set of fasta sequences.
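For readers unfamiliar with k-mer counting, a minimal sketch follows. It counts every overlapping substring of length k across a set of sequences; the function `count_kmers` is a hypothetical name for this example and says nothing about the options or output format of fasta2kmercontent.py.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count overlapping k-mers across all sequences (case-insensitive)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


counts = count_kmers(["ACGTAC", "acg"], 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```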

<no title>

Compute features of sequences in fasta formatted files.

diff_fasta.py - compare contents of two fasta files

Compare two sets of sequences. Outputs missing, identical and fragmented sequences.

fasta2bed.py - segment sequences

Segment sequences based on G+C content, gaps, …

fastas2fasta.py - concatenate sequences from multiple fasta files

Concatenate sequences from multiple files.

fasta2variants.py - create sequence variants from a set of sequences

In-silico creation of variants of protein coding sequences.

NGS data

bam2geneprofile.py - build meta-gene profile for a set of transcripts/genes

Compute meta-gene profiles from aligned reads in a bam formatted file. Also accepts bed or bigwig formatted files.

bam2bam.py - modify bam files

Operate on bam formatted files - filtering, stripping, setting flags.

bam2bed.py - convert bam formatted file to bed formatted file

Convert bam formatted file of genomic alignments into genomic intervals. Permits merging of paired read data and filtering by insert-size.

bam2fastq.py - output fastq files from a bam-file

Save sequence and quality information from a bam formatted file.

bam2peakshape.py - compute peak shape features from a bam-file

Compute read densities over a collection of intervals. Also accepts bed or bigwig formatted files.

Purpose

Compute summary statistics of a bam formatted file.

bam2wiggle.py - convert bam to wig/bigwig file

Convert read coverage in a bam formatted file into a wiggle or bigwig formatted file.

bam_vs_gtf.py - compare bam file against gene set

Compute stats on exon over-/underrun and spliced reads.

bam_vs_bed.py - count context that reads map to

Compute coverage of reads within multiple interval types.

bam_vs_bam.py - compute coverage correlation between bam files

Outputs side-by-side comparison of residue level counts between multiple bam formatted files.

fastq2fastq.py - manipulate fastq files

Perform quality score conversion between fastq formatted files.
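Quality score conversion usually means changing the ASCII offset used to encode Phred scores. The sketch below converts a quality string from an offset of 64 to an offset of 33; it is an illustration only, and the function `convert_quality` and its parameters are assumptions rather than the interface of fastq2fastq.py.

```python
def convert_quality(quality, from_offset=64, to_offset=33):
    """Re-encode a fastq quality string with a different ASCII offset."""
    converted = []
    for char in quality:
        score = ord(char) - from_offset
        if score < 0:
            # The character cannot belong to the assumed input encoding.
            raise ValueError(f"character {char!r} is below offset {from_offset}")
        converted.append(chr(score + to_offset))
    return "".join(converted)


print(convert_quality("hhhB@"))
```

Raising on negative scores catches the common mistake of converting a file that already uses the lower offset.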

fastqs2fasta.py - interleave two fastq files

Interleave paired end data.

fastq2table.py - compute stats on reads in fastq files

Output bases below quality threshold, number of N’s, quality score distribution.

fastqs2fastqs.py - manipulate (merge/reconcile) fastq files

Ensure that paired read fastq formatted files are consistent after filtering on the individual files.

diff_bam.py - compare multiple bam files against each other

Perform read-by-read comparison of two bam-files.

Variants

vcf2vcf.py - manipulate vcf files

Sort a vcf file.

Genomics

diff_chains.py - compare two chain formatted files

Report how many residues map to the same locations, to different locations, etc.

<no title>

Output coverage statistics for a UCSC liftover chain file.