gtfs2tsv.py - compare two genesets

Tags

Python

Purpose

This script compares two genesets, supplied as two GTF-formatted files (both required), and outputs lists of shared and unique genes.

The results of the comparison are divided into sections. Each section is written to a separate output file whose name is determined by the --output-filename-pattern option. The sections are:

genes_ovl

Table with overlapping genes

genes_total

Summary statistics of overlapping genes

genes_uniq1

List of genes unique to set 1

genes_uniq2

List of genes unique to set 2
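The following is a minimal sketch of how the four sections relate to one another; it is not the script's actual implementation. It assumes that two genes count as overlapping when any pair of their exons on the same contig shares at least one base, and it ignores strand. The overlap criterion the script really applies is not stated here. The function names and the fields reported under genes_total are hypothetical; the coordinates are taken from the a.gtf and b.gtf excerpts in the Usage section.

```python
def overlaps(a, b):
    """True if two (contig, start, end) exons share at least one base.

    Coordinates are treated as inclusive, as in GTF. Strand is ignored,
    which is an assumption of this sketch.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def compare_genesets(set1, set2):
    """Compare two genesets given as {gene_id: [(contig, start, end), ...]}."""
    # Every pair of genes (one from each set) with at least one overlapping exon.
    ovl = sorted(
        (g1, g2)
        for g1, exons1 in set1.items()
        for g2, exons2 in set2.items()
        if any(overlaps(e1, e2) for e1 in exons1 for e2 in exons2)
    )
    matched1 = {g1 for g1, _ in ovl}
    matched2 = {g2 for _, g2 in ovl}
    return {
        "genes_ovl": ovl,
        # Hypothetical summary fields; the real section may report others.
        "genes_total": {
            "set1": len(set1),
            "set2": len(set2),
            "overlapping_pairs": len(ovl),
        },
        "genes_uniq1": sorted(set(set1) - matched1),
        "genes_uniq2": sorted(set(set2) - matched2),
    }


# Exons from the a.gtf excerpt (set 1) and the b.gtf excerpt (set 2).
set1 = {
    "ENSG00000225373": [("19", 66346, 66509), ("19", 60521, 60747), ("19", 60105, 60162)],
}
set2 = {
    "ENSG00000225373": [("19", 66320, 66492)],
    "ENSG00000267111": [("19", 68403, 69146)],
    "ENSG00000267588": [("19", 71161, 71646)],
}

result = compare_genesets(set1, set2)
for section, content in result.items():
    print(section, content)
```

Under these assumptions, the one gene in set 1 overlaps its namesake in set 2, set 1 has no unique genes, and the two lincRNA genes are unique to set 2.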

Options

--output-filename-pattern

This option sets the pattern from which the output filename of each section listed under Purpose is derived.
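As an illustration only, the sketch below assumes the pattern contains a single ``%s`` placeholder that is replaced by the section name. That convention is not stated in this document and should be checked against the output of --help. The helper name ``section_filename`` and the pattern ``comparison_%s.tsv`` are hypothetical.

```python
def section_filename(pattern, section):
    """Expand a filename pattern for one section.

    Assumes the pattern holds exactly one %s placeholder; this is an
    assumption of the sketch, not documented behaviour of the script.
    """
    return pattern % section


sections = ["genes_ovl", "genes_total", "genes_uniq1", "genes_uniq2"]
filenames = {s: section_filename("comparison_%s.tsv", s) for s in sections}

for section, filename in filenames.items():
    print(section, "->", filename)
```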

Usage

Example:

head a.gtf::

  19 processed_transcript exon 66346 66509 . - . gene_id "ENSG00000225373";
  transcript_id "ENST00000592209"; exon_number "1"; gene_name "AC008993.5";
  gene_biotype "pseudogene"; transcript_name "AC008993.5-002";
  exon_id "ENSE00001701708";

  19 processed_transcript exon 60521 60747 . - . gene_id "ENSG00000225373";
  transcript_id "ENST00000592209"; exon_number "2"; gene_name "AC008993.5";
  gene_biotype "pseudogene"; transcript_name "AC008993.5-002";
  exon_id "ENSE00002735807";

  19 processed_transcript exon 60105 60162 . - . gene_id "ENSG00000225373";
  transcript_id "ENST00000592209"; exon_number "3"; gene_name "AC008993.5";
  gene_biotype "pseudogene"; transcript_name "AC008993.5-002";
  exon_id "ENSE00002846866";

head b.gtf::

  19 transcribed_processed_pseudogene exon 66320 66492 . - .
  gene_id "ENSG00000225373"; transcript_id "ENST00000587045"; exon_number "1";
  gene_name "AC008993.5"; gene_biotype "pseudogene";
  transcript_name "AC008993.5-001"; exon_id "ENSE00002739353";

  19 lincRNA exon 68403 69146 . + . gene_id "ENSG00000267111";
  transcript_id "ENST00000589495"; exon_number "1"; gene_name "AC008993.2";
  gene_biotype "lincRNA"; transcript_name "AC008993.2-001";
  exon_id "ENSE00002777656";

  19 lincRNA exon 71161 71646 . + . gene_id "ENSG00000267588";
  transcript_id "ENST00000590978"; exon_number "1"; gene_name "MIR1302-2";
  gene_biotype "lincRNA"; transcript_name "MIR1302-2-001";
  exon_id "ENSE00002870487";
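Each record in the excerpts above follows the nine-column GTF layout: eight fixed fields followed by a list of ``key "value";`` attributes. The sketch below shows one way to split such a record in Python. It is not how the script itself reads its input, and the function name ``parse_gtf_line`` is hypothetical. Real GTF files are tab-separated; the sketch splits on any whitespace so that it also accepts the space-separated form printed here.

```python
import re


def parse_gtf_line(line):
    """Split one GTF record into its fixed fields and an attribute dict."""
    # The first eight fields never contain whitespace; the ninth holds
    # all attributes and is kept intact by limiting the number of splits.
    contig, source, feature, start, end, score, strand, frame, attrs = line.split(None, 8)
    # Attributes have the form: key "value";
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', attrs))
    return {
        "contig": contig,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# First record of a.gtf, joined onto a single line.
record = parse_gtf_line(
    '19 processed_transcript exon 66346 66509 . - . gene_id "ENSG00000225373"; '
    'transcript_id "ENST00000592209"; exon_number "1"; gene_name "AC008993.5"; '
    'gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; '
    'exon_id "ENSE00001701708";'
)
print(record["attributes"]["gene_id"], record["start"], record["end"])
```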

python gtfs2tsv.py a.gtf b.gtf > out.tsv

head out.tsv::

  contigs source feature start end score strand frame gene_id transcript_id attributes
  19 processed_transcript exon 66345 66509 . - . ENSG00000225373 ENST00000592209 exon_number "1";
  gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002";
  exon_id "ENSE00001701708"
  19 processed_transcript exon 60520 60747 . - . ENSG00000225373 ENST00000592209 exon_number "2";
  gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002";
  exon_id "ENSE00002735807"

Type:

python gtfs2tsv.py --help

for command line help.

Command line options

usage: gtfs2tsv [-h] [--version] [-e] [-f] [-p] [-s] [--timeit TIMEIT_FILE]
                [--timeit-name TIMEIT_NAME] [--timeit-header]
                [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                [--log-config-filename LOG_CONFIG_FILENAME]
                [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]