Tool reference

This page summarizes prominent tools within the CGAT Code collection. The tools are grouped loosely by functionality.

Genomic intervals/features

beds2counts - compute overlap stats between multiple bed files: Compute overlap statistics of multiple bed files.
bed2fasta.py - get sequences from bed file: Transform interval data in a bed formatted file into a fasta formatted file of sequence data.
bed2gff.py - convert bed to gff/gtf: Convert between interval data. Convert a bed formatted file to a gff or gtf formatted file.
gff2gff.py - manipulate gff files: Work on gff formatted files with genomic features. This tool sorts/renames feature files, reconciles chromosome names, and more.
bed2bed - manipulate bed files: Filter or merge interval data in a bed formatted file.
bed2graph.py - compute the overlap graph between two bed files: Compare two sets of genomic intervals and output a list of overlapping features.
bed2stats.py - summary of bed file contents: Compute summary statistics of genomic intervals.
<no title>: Annotate genomic intervals (composition, peak location, overlap, …)
beds2beds.py - decompose bed files: Decompose multiple sets of genomic intervals into various intersections and unions.
diff_bed.py - count differences between several bed files: Compare multiple interval data sets. The tool computes all-vs-all pairwise overlap summaries. Permits incremental updates of the similarity table.
gff2bed.py - convert from gff/gtf to bed: Convert between formats.
split_gff - split a gff file into chunks: Split a file in gff format into smaller files. The script ensures that overlapping intervals remain in the same file.
gff2coverage.py - compute genomic coverage of gff intervals: This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.
gff2fasta.py - output sequences from genomic features: Output genomic sequences from intervals.
gff2histogram.py - compute histograms from intervals in gff or bed format: Compute distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals.
gff2stats.py - count features, etc. in gff file: Summarize features within a gff formatted file.
gff2psl.py - convert from gff to psl: Convert between formats.
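
The guarantee described for split_gff, that overlapping intervals remain in the same file, can be illustrated with a minimal sketch. This is not the tool's implementation: the function name `split_into_chunks`, the `chunk_size` parameter and the tuple layout are assumptions made for illustration, and coordinates are assumed to be half-open (bed style). The idea is to start a new chunk only at a boundary between clusters of overlapping intervals.

```python
def split_into_chunks(intervals, chunk_size):
    """Split (contig, start, end) tuples into chunks of at least
    chunk_size intervals without separating overlapping intervals.

    Hypothetical helper for illustration only; coordinates are assumed
    to be half-open, so touching intervals do not count as overlapping.
    """
    chunks = []
    current = []
    last_contig, last_end = None, 0

    for contig, start, end in sorted(intervals):
        # The interval belongs to the running cluster if it lies on the
        # same contig and starts before the furthest end seen so far.
        overlaps = contig == last_contig and start < last_end

        # Close the current chunk only at a cluster boundary.
        if current and not overlaps and len(current) >= chunk_size:
            chunks.append(current)
            current = []

        current.append((contig, start, end))

        if overlaps:
            last_end = max(last_end, end)
        else:
            last_contig, last_end = contig, end

    if current:
        chunks.append(current)
    return chunks


intervals = [
    ("chr1", 0, 100),
    ("chr1", 50, 150),
    ("chr1", 140, 200),
    ("chr1", 300, 400),
    ("chr2", 0, 10),
]
for number, chunk in enumerate(split_into_chunks(intervals, chunk_size=2)):
    print(number, chunk)
```

With a requested chunk size of 2, the first chunk still holds three intervals, because the third interval overlaps the cluster formed by the first two. Chunks can therefore exceed the requested size whenever a cluster is larger than the limit.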

Gene sets

gtf2gff.py - convert a transcript set to genomic features: Translate a gene set into genomic annotations such as introns, intergenic regions, regulatory domains, etc.
<no title>: Annotate transcripts in a gtf formatted file. Annotations can be in reference to a second gene set (fragments, extensions), aligned reads (coverage, intron overrun, …) or densities.
gtf2fasta.py - annotate genomic bases from a gene set: Annotate each base in the genome according to its use within a transcript. Outputs lists of junctions.
gtf2gtf.py - manipulate transcript models: Merge exons/transcripts/genes, filter transcripts/genes, rename transcripts/genes, …
gtf2tsv.py - convert gtf file to a tab-separated table: Convert a gene set in gtf format to tabular format.
gtfs2tsv.py - compare two genesets: Compare two gene sets - output common and unique lists of genes.
diff_gtf.py - compute overlap between multiple gtf files: Compare multiple gene sets. The tool computes all-vs-all pairwise overlap of exons, bases and genes. Permits incremental updates of the similarity table.
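
To make the idea of comparing two gene sets concrete (common and unique lists of genes, as described for gtfs2tsv.py), here is a minimal sketch. It is not the tool's code: the function names `gene_ids` and `compare_gene_sets` are hypothetical, and the sketch assumes standard gtf lines with nine tab-separated fields and a `gene_id "..."` attribute in the last field.

```python
import re

# Matches the gene_id attribute of a gtf attribute field.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_ids(gtf_lines):
    """Collect the set of gene identifiers from an iterable of gtf lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed lines
        match = GENE_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_gene_sets(gtf_a, gtf_b):
    """Return sorted lists of genes common to both sets and unique to each."""
    ids_a, ids_b = gene_ids(gtf_a), gene_ids(gtf_b)
    return {
        "common": sorted(ids_a & ids_b),
        "unique_a": sorted(ids_a - ids_b),
        "unique_b": sorted(ids_b - ids_a),
    }
```

Any iterable of lines works as input, for example two open file handles passed as `compare_gene_sets(open(path_a), open(path_b))`, where `path_a` and `path_b` are placeholders for the two gtf files.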

Sequence data

fastqs2fasta.py - interleave two fastq files: Interleave paired reads from two fastq files into a single fasta file.
index_fasta.py - Index fasta formatted files: Build an index for a fasta file. Pre-requisite for many CGAT tools.
fasta2kmercontent.py: Count kmer content in a set of fasta sequences.
<no title>: Compute features of sequences in fasta formatted files
diff_fasta.py - compare contents of two fasta files: Compare two sets of sequences. Outputs missing, identical and fragmented sequences.
fasta2bed.py - segment sequences: Segment sequences based on G+C content, gaps, …
fastas2fasta.py - concatenate sequences from multiple fasta files: Concatenate sequences from multiple files.
fasta2variants.py - create sequence variants from a set of sequences: In-silico creation of variants of protein coding sequences.
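
As an illustration of what counting kmer content in a set of fasta sequences involves (the task listed for fasta2kmercontent.py), the following sketch reads fasta records and counts overlapping k-mers per sequence. It is a simplified stand-in rather than the tool's implementation; `read_fasta` and `count_kmers` are hypothetical names, and the example sequences are made up.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (name, sequence) pairs from a fasta formatted file handle."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].strip(), []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def count_kmers(sequence, k):
    """Count all overlapping k-mers in a sequence (case-insensitive)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nggg\n")
for name, sequence in read_fasta(example):
    print(name, dict(count_kmers(sequence, 2)))
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so sequences shorter than k contribute none.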

NGS data

bam2geneprofile.py - build meta-gene profile for a set of transcripts/genes: Compute meta-gene profiles from aligned reads in a bam formatted file. Also accepts bed or bigwig formatted files.
bam2bam.py - modify bam files: Operate on bam formatted files - filtering, stripping, setting flags.
bam2bed.py - convert bam formatted file to bed formatted file: Convert bam formatted file of genomic alignments into genomic intervals. Permits merging of paired read data and filtering by insert-size.
bam2fastq.py - output fastq files from a bam-file: Save sequence and quality information from a bam formatted file.
bam2peakshape.py - compute peak shape features from a bam-file: Compute read densities over a collection of intervals. Also accepts bed or bigwig formatted files.
Purpose: Compute summary statistics of a bam formatted file.
bam2wiggle.py - convert bam to wig/bigwig file: Convert read coverage in a bam formatted file into a wiggle or bigwig formatted file.
bam_vs_gtf.py - compare bam file against gene set: Compute stats on exon over-/underrun and spliced reads.
bam_vs_bed.py - count context that reads map to: Compute coverage of reads within multiple interval types.
bam_vs_bam.py - compute coverage correlation between bam files: Outputs side-by-side comparison of residue level counts between multiple bam formatted files.
fastq2fastq.py - manipulate fastq files: Perform quality score conversion between fastq formatted files.
fastqs2fasta.py - interleave two fastq files: Interleave paired end data.
fastq2table.py - compute stats on reads in fastq files: Output bases below quality threshold, number of N’s, quality score distribution.
fastqs2fastqs.py - manipulate (merge/reconcile) fastq files: Ensure that paired read fastq formatted files are consistent after filtering on the individual files.
diff_bam.py - compare multiple bam files against each other: Perform read-by-read comparison of two bam-files.
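
The quality score conversion listed for fastq2fastq.py can be sketched as a change of ASCII offset. The example below converts a quality string from the Phred+64 encoding (older Illumina pipelines) to Phred+33 (Sanger). It only illustrates the principle: `convert_quality` is a hypothetical function, and which encodings the actual tool supports is not stated on this page.

```python
def convert_quality(quality, from_offset=64, to_offset=33):
    """Re-encode a fastq quality string from one ASCII offset to another.

    Hypothetical helper for illustration; the default offsets correspond
    to Phred+64 input and Phred+33 output.
    """
    scores = [ord(char) - from_offset for char in quality]
    if any(score < 0 for score in scores):
        # A negative score means the input is not in the assumed encoding.
        raise ValueError("quality string is not encoded with offset %d" % from_offset)
    return "".join(chr(score + to_offset) for score in scores)


# 'h' encodes Phred score 40 with offset 64; with offset 33 it becomes 'I'.
print(convert_quality("hhh@"))
```

The range check matters in practice: characters below the source offset indicate that the file already uses a different encoding, and converting it anyway would silently corrupt the scores.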

Variants

vcf2vcf.py - manipulate vcf files: Sort a vcf file.

Genomics

diff_chains.py - compare two chain formatted files: Report how many residues map to the same locations, to different locations, etc.
<no title>: Output coverage statistics for a UCSC liftover chain file.