diff_gtf.py - compute overlap between multiple gtf files¶
- Tags
Genomics Intervals Genesets GTF Comparison
Purpose¶
This script compares multiple set of gtf files. It computes the overlap between bases, exons and genes between each pair of gtf files.
If results from a previous run are present, existing pairs are not re-computed but simply echoed.
The output is a tab-separated table with counts for each pair of files being compared. The fields are:
Column |
Content |
set |
Name of the set |
ngenes_total |
number of genes in set |
ngenes_ovl |
number of genes overlapping |
ngenes_unique |
number of unique genes |
nexons_total |
number of exons in set |
nexons_ovl |
number of exons overlapping |
nexons_unique |
number of unique exons |
nbases_total |
number of bases in gene set |
nbases_ovl |
number of bases overlapping |
nbases_unique |
number of unique bases |
Each of these fields will appear twice, once for each of the pair of files. Hence ngenes_unique1 will be the number of genes in set1 that have no exons that overlap with any exons in set2, and vice versa for ngenes_unique2. And on for each field in the table above. This makes a total of 9*2=18 fields containing counts, each starting with an n.
A further set of 18 fields each start with a p
and are the corresponding
percentage values.
Options¶
- -s, --ignore-strand
Ignore strand infomation so that bases overlap even if exons/genes are on different strands
- -u, --update=FILENAME
Read in previous results from FILENAME and only output comparisons that are missing.
- -p, --pattern-identifier=PATTERN
Provide a regular expression pattern for converting a filename into a set name for the output. The regular expression should capture at least one group. That group will be used to identify that file in the output table (see examples)
Examples
For example if we have two gtf_files that look like:
first_set_of_genes.gtf:
1 protein_coding exon 1 10 . + . gene_id "1"; transcript_id "1"
1 protein_coding exon 20 30 . + . gene_id "1"; transcript_id "1"
second_set_of_genes.gtf:
1 protein_coding exon 25 35 . + . gene_id "1"; transcript_id "1"
2 protein_coding exon 100 200 . + . gene_id "2"; transcript_id "3"
Then the command:
python diff_gtf.py *.gtf --pattern-identifier='(.+)_of_genes.gtf' > out.tsv
would produce an output file that has a single row with set1 being “second_set” and set2 being “first_set” (these are extracted using that –pattern-identifier option). It will report that set1 contains 2 genes and set2 1 gene. That for each set one of these genes overlaps with the other set. For set1 it will report that 1 gene is unique and that no genes are unique for set2 and so on for exons and bases.
If we want to add a third file to the comparison, “third_set_of_genes.gtf”, we don’t need to redo the comparison between first_set_of_genes.gtf and second_set_of_genes.gtf:
python diff_gtf.py --update=out.tsv *.gtf.gz > new.tsv
This will output a table with a row for third_set vs second_set and
third_set vs second_set, along with the comparison of first_set and
second_set that will simply be copied from the previous results. It is
important to include all files on the command line that are to be
output. Any comparisons in out.tsv
that correspond to files that
are not given on the command line will not be output.
Usage¶
- ::
cgat diff_gtf.py GTF GTF [GTF [GTF […]]] [OPTIONS] cgat diff_gtf GTF1 –update=OUTFILE [OPTIONS]
where GTF is a gtf or compressed gtf formated file and OUTFILE is the results from a previous run. At least two must be provided unless –update is present.
Type:
python diff_gtf.py --help
for command line help.
Command line options¶
usage: diff-gtf [-h] [--version] [-s] [-u FILENAME_UPDATE] [-p PATTERN_ID]
[-g] [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]
diff-gtf: error: argument -?: expected one argument