diff_gtf.py - compute overlap between multiple gtf files

Tags

Genomics Intervals Genesets GTF Comparison

Purpose

This script compares multiple set of gtf files. It computes the overlap between bases, exons and genes between each pair of gtf files.

If results from a previous run are present, existing pairs are not re-computed but simply echoed.

The output is a tab-separated table with counts for each pair of files being compared. The fields are:

Column

Content

set

Name of the set

ngenes_total

number of genes in set

ngenes_ovl

number of genes overlapping

ngenes_unique

number of unique genes

nexons_total

number of exons in set

nexons_ovl

number of exons overlapping

nexons_unique

number of unique exons

nbases_total

number of bases in gene set

nbases_ovl

number of bases overlapping

nbases_unique

number of unique bases

Each of these fields will appear twice, once for each of the pair of files. Hence ngenes_unique1 will be the number of genes in set1 that have no exons that overlap with any exons in set2, and vice versa for ngenes_unique2. And on for each field in the table above. This makes a total of 9*2=18 fields containing counts, each starting with an n.

A further set of 18 fields each start with a p and are the corresponding percentage values.

Options

-s, --ignore-strand

Ignore strand infomation so that bases overlap even if exons/genes are on different strands

-u, --update=FILENAME

Read in previous results from FILENAME and only output comparisons that are missing.

-p, --pattern-identifier=PATTERN

Provide a regular expression pattern for converting a filename into a set name for the output. The regular expression should capture at least one group. That group will be used to identify that file in the output table (see examples)

Examples

For example if we have two gtf_files that look like:

first_set_of_genes.gtf:
1    protein_coding  exon    1       10      .       +       .       gene_id "1"; transcript_id "1"
1    protein_coding  exon    20      30      .       +       .       gene_id "1"; transcript_id "1"

second_set_of_genes.gtf:
1    protein_coding  exon    25      35      .       +       .       gene_id "1"; transcript_id "1"
2    protein_coding  exon    100     200     .       +       .       gene_id "2"; transcript_id "3"

Then the command:

python diff_gtf.py *.gtf --pattern-identifier='(.+)_of_genes.gtf' > out.tsv

would produce an output file that has a single row with set1 being “second_set” and set2 being “first_set” (these are extracted using that –pattern-identifier option). It will report that set1 contains 2 genes and set2 1 gene. That for each set one of these genes overlaps with the other set. For set1 it will report that 1 gene is unique and that no genes are unique for set2 and so on for exons and bases.

If we want to add a third file to the comparison, “third_set_of_genes.gtf”, we don’t need to redo the comparison between first_set_of_genes.gtf and second_set_of_genes.gtf:

python diff_gtf.py --update=out.tsv *.gtf.gz > new.tsv

This will output a table with a row for third_set vs second_set and third_set vs second_set, along with the comparison of first_set and second_set that will simply be copied from the previous results. It is important to include all files on the command line that are to be output. Any comparisons in out.tsv that correspond to files that are not given on the command line will not be output.

Usage

::

cgat diff_gtf.py GTF GTF [GTF [GTF […]]] [OPTIONS] cgat diff_gtf GTF1 –update=OUTFILE [OPTIONS]

where GTF is a gtf or compressed gtf formated file and OUTFILE is the results from a previous run. At least two must be provided unless –update is present.

Type:

python diff_gtf.py --help

for command line help.

Command line options

usage: diff-gtf [-h] [--version] [-s] [-u FILENAME_UPDATE] [-p PATTERN_ID]
                [-g] [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                [--log-config-filename LOG_CONFIG_FILENAME]
                [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                [-E STDERR] [-S STDOUT]
diff-gtf: error: argument -?: expected one argument