Tool reference¶
This page summarizes prominent tools within the CGAT Code collection. The tools are grouped loosely by functionality.
Genomic intervals/features¶
- beds2counts - compute overlap stats between multiple bed files
Compute overlap statistics of multiple bed files.
- bed2fasta.py - get sequences from bed file
Transform interval data in a bed formatted file into a fasta formatted file of sequence data.
- bed2gff.py - convert bed to gff/gtf
Convert between interval data. Convert a bed formatted file to a gff or gtf formatted file.
- gff2gff.py - manipulate gff files
Work on gff formatted files with genomic features. This tool sorts/renames feature files, reconciles chromosome names, and more.
- bed2bed - manipulate bed files
Filter or merge interval data in a bed formatted file.
- bed2graph.py - compute the overlap graph between two bed files
Compare two sets of genomic intervals and output a list of overlapping features.
- bed2stats.py - summary of bed file contents
Compute summary statistics of genomic intervals.
- <no title>
Annotate genomic intervals (composition, peak location, overlap, …)
- beds2beds.py - decompose bed files
Decompose multiple sets of genomic intervals into various intersections and unions.
- diff_bed.py - count differences between several bed files
Compare multiple sets of interval data. The tool computes all-vs-all pairwise overlap summaries and permits incremental updates of the similarity table.
- gff2bed.py - convert from gff/gtf to bed
Convert between formats.
- split_gff - split a gff file into chunks
Split a file in gff format into smaller files. The script ensures that overlapping intervals remain in the same file.
- gff2coverage.py - compute genomic coverage of gff intervals
This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.
- gff2fasta.py - output sequences from genomic features
Output genomic sequences from intervals.
- gff2histogram.py - compute histograms from intervals in gff or bed format
Compute distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals.
- gff2stats.py - count features, etc. in gff file
Summarize features within a gff formatted file.
- gff2psl.py - convert from gff to psl
Convert between formats.
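To illustrate the kind of interval operation these tools perform, the following is a minimal sketch of merging overlapping intervals read from bed formatted lines. It is not the implementation used by any CGAT tool; the function names `parse_bed` and `merge_intervals` are hypothetical, and the sketch assumes the usual bed convention of 0-based, half-open coordinates, so intervals that merely touch are merged as well.

```python
def parse_bed(lines):
    """Yield (contig, start, end) tuples from bed formatted lines.

    Hypothetical helper: only the first three columns are used, and
    comment, track and browser lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        yield fields[0], int(fields[1]), int(fields[2])


def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals on the same contig."""
    merged = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            previous = merged[-1]
            merged[-1] = (contig, previous[1], max(previous[2], end))
        else:
            merged.append((contig, start, end))
    return merged


if __name__ == "__main__":
    example = [
        "chr1\t10\t20",
        "chr1\t15\t30",
        "chr1\t40\t50",
        "chr2\t5\t8",
    ]
    for contig, start, end in merge_intervals(parse_bed(example)):
        print(contig, start, end, sep="\t")
```

Sorting first keeps the merge to a single pass, because every interval only needs to be compared with the most recently emitted one.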
Gene sets¶
- gtf2gff.py - convert a transcript set to genomic features
Translate a gene set into genomic annotations such as introns, intergenic regions, regulatory domains, etc.
- <no title>
Annotate transcripts in a gtf formatted file. Annotations can be in reference to a second gene set (fragments, extensions), aligned reads (coverage, intron overrun, …) or densities.
- gtf2fasta.py - annotate genomic bases from a gene set
Annotate each base in the genome according to its use within a transcript. Outputs lists of junctions.
- gtf2gtf.py - manipulate transcript models
Merge exons/transcripts/genes, filter transcripts/genes, rename transcripts/genes, …
- gtf2tsv.py - convert gtf file to a tab-separated table
Convert a gene set in gtf format to tabular format.
- gtfs2tsv.py - compare two genesets
Compare two gene sets - output common and unique lists of genes.
- diff_gtf.py - compute overlap between multiple gtf files
Compare multiple gene sets. The tool computes all-vs-all pairwise overlap of exons, bases and genes and permits incremental updates of the similarity table.
Sequence data¶
- fastqs2fasta.py - interleave two fastq files
Interleave paired reads from two fastq files into a single fasta file.
- index_fasta.py - Index fasta formatted files
Build an index for a fasta file. Pre-requisite for many CGAT tools.
- fasta2kmercontent.py
Count kmer content in a set of fasta sequences.
- <no title>
Compute features of sequences in fasta formatted files
- diff_fasta.py - compare contents of two fasta files
Compare two sets of sequences. Outputs missing, identical and fragmented sequences.
- fasta2bed.py - segment sequences
Segment sequences based on G+C content, gaps, …
- fastas2fasta.py - concatenate sequences from multiple fasta files
Concatenate sequences from multiple files.
- fasta2variants.py - create sequence variants from a set of sequences
In-silico creation of variants of protein coding sequences.
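As an illustration of what counting kmer content in a set of fasta sequences involves, here is a small sketch in plain Python. It does not reproduce the behaviour or options of fasta2kmercontent.py; the helper names `read_fasta` and `count_kmers` are hypothetical, and the sketch assumes overlapping kmers counted on the forward strand only.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from fasta formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_kmers(sequence, k):
    """Count all overlapping kmers of length k in one sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    fasta = [">seq1", "ACGT", "AC", ">seq2", "GGGG"]
    for header, sequence in read_fasta(fasta):
        print(header, dict(count_kmers(sequence, 2)))
```

A sequence shorter than k simply yields an empty counter, which keeps the per-sequence output uniform.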
NGS data¶
- bam2geneprofile.py - build meta-gene profile for a set of transcripts/genes
Compute meta-gene profiles from aligned reads in a bam formatted file. Also accepts bed or bigwig formatted files.
- bam2bam.py - modify bam files
Operate on bam formatted files - filtering, stripping, setting flags.
- bam2bed.py - convert bam formatted file to bed formatted file
Convert bam formatted file of genomic alignments into genomic intervals. Permits merging of paired read data and filtering by insert-size.
- bam2fastq.py - output fastq files from a bam-file
Save sequence and quality information from a bam formatted file.
- bam2peakshape.py - compute peak shape features from a bam-file
Compute read densities over a collection of intervals. Also accepts bed or bigwig formatted files.
- Purpose
Compute summary statistics of a bam formatted file.
- bam2wiggle.py - convert bam to wig/bigwig file
Convert read coverage in a bam formatted file into a wiggle or bigwig formatted file.
- bam_vs_gtf.py - compare bam file against gene set
Compute stats on exon over-/underrun and spliced reads.
- bam_vs_bed.py - count context that reads map to
Compute coverage of reads within multiple interval types.
- bam_vs_bam.py - compute coverage correlation between bam files
Outputs side-by-side comparison of residue level counts between multiple bam formatted files.
- fastq2fastq.py - manipulate fastq files
Perform quality score conversion between fastq formatted files.
- fastqs2fasta.py - interleave two fastq files
Interleave paired end data.
- fastq2table.py - compute stats on reads in fastq files
Output bases below quality threshold, number of N’s, quality score distribution.
- fastqs2fastqs.py - manipulate (merge/reconcile) fastq files
Ensure that paired read fastq formatted files are consistent after filtering on the individual files.
- diff_bam.py - compare multiple bam files against each other
Perform read-by-read comparison of two bam-files.
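The per-read statistics mentioned above for fastq files (bases below a quality threshold, number of N's) can be sketched as follows. This is an illustrative example, not the code of fastq2table.py; the function name `fastq_stats`, the default threshold of 20 and the Phred+33 quality encoding are assumptions, and the sketch expects well-formed four-line records.

```python
def fastq_stats(lines, threshold=20, offset=33):
    """Return one dictionary of simple statistics per fastq record.

    Assumes four lines per record and Phred+offset quality encoding.
    """
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        name, sequence, _, quality = lines[i:i + 4]
        scores = [ord(character) - offset for character in quality]
        records.append({
            "read": name[1:],  # drop the leading '@'
            "length": len(sequence),
            "n_count": sequence.upper().count("N"),
            "below_threshold": sum(1 for score in scores if score < threshold),
        })
    return records


if __name__ == "__main__":
    example = ["@r1", "ACGN", "+", "I5#!"]
    for record in fastq_stats(example):
        print(record)
```

Passing the offset as a parameter makes it straightforward to handle files that use a different quality encoding, such as Phred+64.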
Variants¶
- vcf2vcf.py - manipulate vcf files
Sort a vcf file.
Genomics¶
- diff_chains.py - compare two chain formatted files
Count how many residues map to the same locations, to different locations, etc.
- <no title>
Output coverage statistics for a UCSC liftover chain file.