gff2stats.py - count features, etc. in gff file

Tags

Genomics Intervals GFF GTF Summary

Purpose

This script generates summary statistics over features, source, gene_id and transcript_id in one or more gff or gtf formatted files.

Usage

Input is either a gff or gtf file; gtf input must be specified with the --is-gtf option.

Example:

python gff2stats.py --is-gtf example.gtf > example_sum.tsv

cat example.gtf

19  processed_transcript  exon  66346  66509  .  -  .  gene_id "ENSG00000225373"; transcript_id "ENST00000592209" ...
19  processed_transcript  exon  60521  60747  .  -  .  gene_id "ENSG00000225373"; transcript_id "ENST00000592209" ...
19  processed_transcript  exon  60105  60162  .  -  .  gene_id "ENSG00000225373"; transcript_id "ENST00000592209" ...
19  processed_transcript  exon  66346  66416  .  -  .  gene_id "ENSG00000225373"; transcript_id "ENST00000589741" ...

cat example_sum.tsv

track  contigs  strands  features  sources  genes  transcripts ...
stdin  1        2        4         23       2924   12752       ...

The counters used depend on the file type. For a gff file, the implemented counters are:

  1. number of intervals per contig, strand, feature and source
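The gff counting described above can be illustrated with a minimal sketch. This is not the code used by gff2stats; the function name count_gff_intervals and the example records (including their coordinates) are hypothetical, and the only assumption about the input is the standard tab-separated gff column order (contig, source, feature, start, end, score, strand, frame, attributes).

```python
from collections import Counter


def count_gff_intervals(lines):
    """Tally intervals per contig, strand, feature and source.

    Hypothetical helper for illustration only, not part of gff2stats.
    """
    names = ("contig", "strand", "feature", "source")
    counts = {name: Counter() for name in names}
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete gff record
        counts["contig"][fields[0]] += 1
        counts["source"][fields[1]] += 1
        counts["feature"][fields[2]] += 1
        counts["strand"][fields[6]] += 1
    return counts


# Made-up records, used only to exercise the function.
example = [
    "##gff-version 2",
    "19\tprocessed_transcript\texon\t100\t200\t.\t-\t.\t",
    "19\tprocessed_transcript\texon\t300\t400\t.\t-\t.\t",
    "19\tprotein_coding\tCDS\t500\t600\t.\t+\t0\t",
]
counts = count_gff_intervals(example)
for name, counter in counts.items():
    print(name, dict(counter))
```

Keeping one Counter per category mirrors the idea of independent counters: each record contributes exactly one interval to each of the four tallies.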

For a gtf file, the additional implemented counters are:

  1. number of genes, transcripts, single exon transcripts

  2. summary statistics for exon numbers, exon sizes, intron sizes and transcript sizes
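As a rough illustration of the gtf counters listed above, the sketch below groups exon records by transcript_id and derives gene, transcript and single exon transcript counts together with exon and intron sizes. It is an assumption-laden sketch rather than the script's own implementation: summarise_gtf, the attribute regular expression and the example records are hypothetical, and intron sizes are taken as the gaps between consecutive exons of one transcript using 1-based inclusive coordinates.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def summarise_gtf(lines):
    """Summarise exon records of a gtf file (hypothetical helper)."""
    exons = defaultdict(list)  # transcript_id -> list of (start, end)
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        genes.add(attrs["gene_id"])
        exons[attrs["transcript_id"]].append((int(fields[3]), int(fields[4])))

    exon_sizes, intron_sizes = [], []
    for intervals in exons.values():
        intervals.sort()
        # Coordinates are 1-based and inclusive, hence the +1.
        exon_sizes.extend(end - start + 1 for start, end in intervals)
        # An intron is the gap between two consecutive exons.
        intron_sizes.extend(
            next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:])
        )
    return {
        "genes": len(genes),
        "transcripts": len(exons),
        "single_exon_transcripts": sum(len(v) == 1 for v in exons.values()),
        "exon_sizes": exon_sizes,
        "intron_sizes": intron_sizes,
    }


# Made-up records: one gene with a three-exon and a single-exon transcript.
example = [
    '19\tsrc\texon\t501\t550\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
    '19\tsrc\texon\t100\t200\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
    '19\tsrc\texon\t301\t400\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
    '19\tsrc\texon\t100\t150\t.\t-\t.\tgene_id "G1"; transcript_id "T2";',
]
summary = summarise_gtf(example)
print(summary)
```

Summary statistics such as the mean or median can then be computed from the exon_sizes and intron_sizes lists, for example with the standard library statistics module.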

The output is a tab-separated table.

Options

The default action of gff2stats is to count over contigs, strands, features and sources. This assumes the input file is a gff file.

There is a single option for this script:

``--is-gtf``

The input file is in gtf format. The output will therefore contain summaries over exon numbers, exon sizes, intron sizes and transcript sizes in addition to the number of genes, transcripts and single exon transcripts.

Type:

python gff2stats.py --help

for command line help.

Command line options

usage: gff2stats [-h] [--version] [--is-gtf] [--timeit TIMEIT_FILE]
                 [--timeit-name TIMEIT_NAME] [--timeit-header]
                 [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                 [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]