gff32gtf.py - various methods for converting gff3 files to gtf

Tags

Python

Purpose

Provide a range of methods for converting GFF3 formatted files to valid GTF format files.

Background

While the various flavours of the GFF format are supposedly backward compatible, this compatibility is broken between GTF2.2 and GFF3. GTF requires gene_id and transcript_id fields to be present in every record; GFF3 does not. Furthermore, key/value tags in the attributes field of GTF are delimited by a space (“ ”), whereas in GFF3 they are delimited by “=”.
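The difference in attribute syntax can be illustrated with a minimal sketch. The helper names `parse_gff3_attributes` and `format_gtf_attributes` are hypothetical and are not taken from gff32gtf.py; the sketch also assumes simple attribute values without escaped characters.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column ("key=value;key=value") into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if not item:
            continue  # tolerate a trailing ";"
        key, _, value = item.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


def format_gtf_attributes(gene_id, transcript_id, extra=None):
    """Render a GTF attributes column: key "value"; pairs, mandatory ids first."""
    pairs = [("gene_id", gene_id), ("transcript_id", transcript_id)]
    pairs.extend((extra or {}).items())
    return " ".join('%s "%s";' % (key, value) for key, value in pairs)


# Illustrative values only.
attrs = parse_gff3_attributes("ID=exon1;Parent=mRNA1")
gtf = format_gtf_attributes("gene1", "mRNA1", {"exon_id": attrs["ID"]})
print(gtf)
```

Note that the GFF3 record alone does not say which gene the exon belongs to; the gene_id passed in here has to come from one of the conversion methods described under Usage.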

Conversion is non-trivial. GFF3 records are hierarchical. To find the gene_id and transcript_id, one must traverse the hierarchy to the correct point. Furthermore, records can have multiple parents.

While the standard structure is

    Gene -> mRNA -> Exon
                 -> CDS

this is not mandatory, and the conversion may need to be done in a different way.

Usage

Example:

python gff32gtf.py --method=[METHOD] [options]

There are several ways in which the conversion can be done:

hierarchy

By default, this script reads in the entire GFF3 file and then, for each entry, traverses the hierarchy until it finds an object of type GENE_TYPE (“gene” by default) or an object with no parent. This becomes the gene_id. Any object of TRANSCRIPT_TYPE encountered on the way is set as the transcript_id. If no such object is encountered, the object directly below the gene object is used as the transcript_id. Objects that belong to multiple transcripts or genes are duplicated.

This method requires ID and Parent fields to be present.

Because this method reads the whole file in, it uses the most memory, although see --read-twice and --by-chrom for tricks that might help.
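The traversal described for this method can be sketched roughly as follows. This is not the script's actual implementation: the function name `resolve_ids`, the shape of the `records` mapping, and the default transcript type of "mRNA" are all assumptions made for illustration.

```python
def resolve_ids(feature_id, records, gene_type="gene", transcript_type="mRNA"):
    """Return every (gene_id, transcript_id) pair reachable from one feature.

    `records` maps ID -> {"type": str, "parents": [ID, ...]} (hypothetical layout).
    """
    results = []

    def walk(current, below, transcript):
        record = records[current]
        if record["type"] == transcript_type:
            transcript = current
        if record["type"] == gene_type or not record["parents"]:
            # Reached the gene (or a parentless object). If no transcript-type
            # object was seen, use the object directly below the gene.
            if transcript is None:
                transcript = below
            results.append((current, transcript))
            return
        # Multiple parents yield multiple pairs, i.e. the record is duplicated.
        for parent in record["parents"]:
            walk(parent, current, transcript)

    walk(feature_id, None, None)
    return results


# Illustrative hierarchy: one exon shared by two transcripts of the same gene,
# and one exon whose parent is not of the transcript type.
records = {
    "gene1": {"type": "gene", "parents": []},
    "mRNA1": {"type": "mRNA", "parents": ["gene1"]},
    "mRNA2": {"type": "mRNA", "parents": ["gene1"]},
    "exon1": {"type": "exon", "parents": ["mRNA1", "mRNA2"]},
    "gene2": {"type": "gene", "parents": []},
    "ncRNA1": {"type": "ncRNA", "parents": ["gene2"]},
    "exon2": {"type": "exon", "parents": ["ncRNA1"]},
}
shared = resolve_ids("exon1", records)
fallback = resolve_ids("exon2", records)
```

Because the lookup needs every potential parent to be available, the whole ID/Parent table has to be in memory before any record can be resolved, which is why this method is the most memory-hungry.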

set-field

The gene_id and transcript_id fields are set to the value of a provided field. Records that don’t have these fields are discarded. By default:

transcript_id=ID gene_id=Parent
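As a rough sketch of this behaviour, assuming the attributes of a record are available as a dict (the function name `set_field` and its signature are hypothetical, not the script's API):

```python
def set_field(attributes, gene_field="Parent", transcript_field="ID"):
    """Return (gene_id, transcript_id) taken from the named fields.

    Returns None when either field is missing, signalling that the
    record should be discarded.
    """
    if gene_field not in attributes or transcript_field not in attributes:
        return None
    return attributes[gene_field], attributes[transcript_field]


# Illustrative records only.
kept = set_field({"ID": "mRNA1", "Parent": "gene1"})
dropped = set_field({"ID": "gene1"})  # no Parent field
```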

set-pattern

As above, but the gene_id and transcript_id values are built from a string format pattern that involves the fields of the record.
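A minimal sketch of the idea follows. It assumes Python %-style named placeholders such as `%(ID)s`; the pattern syntax that the script actually accepts is not stated here, and the function name `set_pattern` is hypothetical. Discarding records that lack a referenced field is likewise an assumption carried over from set-field.

```python
def set_pattern(attributes, gene_pattern, transcript_pattern):
    """Build (gene_id, transcript_id) by filling patterns from record fields.

    Returns None when a pattern refers to a field the record lacks.
    """
    try:
        return gene_pattern % attributes, transcript_pattern % attributes
    except KeyError:
        return None


# Illustrative record and patterns only.
attrs = {"ID": "exon1", "Parent": "mRNA1"}
ids = set_pattern(attrs, "gene_%(Parent)s", "%(Parent)s.%(ID)s")
missing = set_pattern({"ID": "exon1"}, "gene_%(Parent)s", "%(ID)s")
```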

set-none

transcript_id and gene_id are set to None.

Command line options

usage: gff32gtf [-h] [-m {hierarchy,set-field,set-pattern,set-none}]
                [-g GENE_TYPE] [-t TRANSCRIPT_TYPE] [-d]
                [--gene-id GENE_FIELD_OR_PATTERN]
                [--transcript-id TRANSCRIPT_FIELD_OR_PATTERN]
                [--parent-field PARENT] [--read-twice] [--by-chrom]
                [--fail-missing-gene] [--timeit TIMEIT_FILE]
                [--timeit-name TIMEIT_NAME] [--timeit-header]
                [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                [--log-config-filename LOG_CONFIG_FILENAME]
                [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                [-E STDERR] [-S STDOUT]