diff_gtf.py - compute overlap between multiple gtf files¶

Tags: Genomics Intervals Genesets GTF Comparison

Purpose¶

This script compares multiple set of gtf files. It computes the overlap between bases, exons and genes between each pair of gtf files.

If results from a previous run are present, existing pairs are not re-computed but simply echoed.

The output is a tab-separated table with counts for each pair of files being compared. The fields are:

Column	Content
set	Name of the set
ngenes_total	number of genes in set
ngenes_ovl	number of genes overlapping
ngenes_unique	number of unique genes
nexons_total	number of exons in set
nexons_ovl	number of exons overlapping
nexons_unique	number of unique exons
nbases_total	number of bases in gene set
nbases_ovl	number of bases overlapping
nbases_unique	number of unique bases

Each of these fields will appear twice, once for each of the pair of files. Hence ngenes_unique1 will be the number of genes in set1 that have no exons that overlap with any exons in set2, and vice versa for ngenes_unique2. And on for each field in the table above. This makes a total of 9*2=18 fields containing counts, each starting with an n.

A further set of 18 fields each start with a p and are the corresponding percentage values.

Options¶

-s, --ignore-strand: Ignore strand infomation so that bases overlap even if exons/genes are on different strands
-u, --update=FILENAME: Read in previous results from FILENAME and only output comparisons that are missing.
-p, --pattern-identifier=PATTERN: Provide a regular expression pattern for converting a filename into a set name for the output. The regular expression should capture at least one group. That group will be used to identify that file in the output table (see examples)

Examples

For example if we have two gtf_files that look like:

first_set_of_genes.gtf:
1    protein_coding  exon    1       10      .       +       .       gene_id "1"; transcript_id "1"
1    protein_coding  exon    20      30      .       +       .       gene_id "1"; transcript_id "1"

second_set_of_genes.gtf:
1    protein_coding  exon    25      35      .       +       .       gene_id "1"; transcript_id "1"
2    protein_coding  exon    100     200     .       +       .       gene_id "2"; transcript_id "3"

Then the command:

python diff_gtf.py *.gtf --pattern-identifier='(.+)_of_genes.gtf' > out.tsv

would produce an output file that has a single row with set1 being “second_set” and set2 being “first_set” (these are extracted using that –pattern-identifier option). It will report that set1 contains 2 genes and set2 1 gene. That for each set one of these genes overlaps with the other set. For set1 it will report that 1 gene is unique and that no genes are unique for set2 and so on for exons and bases.

If we want to add a third file to the comparison, “third_set_of_genes.gtf”, we don’t need to redo the comparison between first_set_of_genes.gtf and second_set_of_genes.gtf:

python diff_gtf.py --update=out.tsv *.gtf.gz > new.tsv

This will output a table with a row for third_set vs second_set and third_set vs second_set, along with the comparison of first_set and second_set that will simply be copied from the previous results. It is important to include all files on the command line that are to be output. Any comparisons in out.tsv that correspond to files that are not given on the command line will not be output.

Usage¶

::: cgat diff_gtf.py GTF GTF [GTF [GTF […]]] [OPTIONS] cgat diff_gtf GTF1 –update=OUTFILE [OPTIONS]

where GTF is a gtf or compressed gtf formated file and OUTFILE is the results from a previous run. At least two must be provided unless –update is present.

Type:

python diff_gtf.py --help

for command line help.

Command line options¶

usage: diff-gtf [-h] [--version] [-s] [-u FILENAME_UPDATE] [-p PATTERN_ID]
                [-g] [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                [--log-config-filename LOG_CONFIG_FILENAME]
                [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                [-E STDERR] [-S STDOUT]
diff-gtf: error: argument -?: expected one argument