gtf2tsv.py - convert gtf file to a tab-separated table

Tags

Genomics Genesets

Purpose

Converts a GTF-formatted file into a tab-separated table. Unlike a plain GTF file, the output includes column headers, which is useful when importing gene models into a database.
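Because the output carries a header row, it can be loaded into a database table without declaring the column names by hand. The following is a minimal sketch using Python's standard csv and sqlite3 modules, not part of the script itself. The function name `load_tsv`, the table name `gtf`, and the abbreviated sample rows are hypothetical, and every column is stored as text for simplicity.

```python
import csv
import io
import sqlite3


def load_tsv(conn, handle, table="gtf"):
    """Create `table` from the header row of a tab-separated file and load its rows."""
    # QUOTE_NONE keeps the double quotes inside GTF attribute values untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Quote column names so that reserved words such as "end" are accepted.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader)
    return header


# Hypothetical, abbreviated sample in the layout written by gtf2tsv.
sample = io.StringIO(
    "contig\tstart\tend\tgene_id\n"
    "chr19\t66345\t66509\tENSG00000225373\n"
    "chr19\t60520\t60747\tENSG00000225373\n"
)
conn = sqlite3.connect(":memory:")
load_tsv(conn, sample)
rows = conn.execute('SELECT contig, "start", "end" FROM gtf ORDER BY "start"').fetchall()
```

In practice the numeric columns would probably be cast to integers before or after loading; the sketch leaves them as text to stay short.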

Note that coordinates are converted to 0-based open/closed notation (all on the forward strand).
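As an illustration of the coordinate change, the sketch below assumes the usual GTF convention of 1-based, inclusive coordinates and reads "0-based open/closed" as a half-open interval (start included, end excluded). The function names are hypothetical and are not taken from the script.

```python
def gtf_to_zero_based(start, end):
    """Convert 1-based inclusive GTF coordinates to a 0-based half-open interval."""
    if start < 1 or end < start:
        raise ValueError("invalid GTF interval: {}-{}".format(start, end))
    # Only the start shifts; the inclusive end equals the exclusive end numerically.
    return start - 1, end


def interval_length(start, end):
    """Length of a 0-based half-open interval."""
    return end - start


zero_start, zero_end = gtf_to_zero_based(66346, 66509)
```

With half-open intervals the feature length is simply `end - start`, which is one reason this notation is convenient for downstream arithmetic.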

By default, the gene_id and transcript_id are extracted from the attributes field into separate columns. If -f/--attributes-as-columns is set, all fields in the attributes will be split into separate columns.
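The default behaviour can be pictured with the sketch below, which pulls gene_id and transcript_id out of a GTF attributes field and keeps the rest as text. It is an assumption-laden illustration rather than the script's own parser: the function names are hypothetical, values are assumed not to contain semicolons, and all remaining values are re-quoted on output.

```python
def parse_gtf_attributes(field):
    """Parse a GTF attributes field into a dict of key -> value (insertion ordered)."""
    attributes = {}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def split_ids(field):
    """Return gene_id, transcript_id and the remaining attributes re-joined as text."""
    attributes = parse_gtf_attributes(field)
    gene_id = attributes.pop("gene_id", "")
    transcript_id = attributes.pop("transcript_id", "")
    rest = "; ".join('{} "{}"'.format(k, v) for k, v in attributes.items())
    return gene_id, transcript_id, rest


field = ('gene_id "ENSG00000225373"; transcript_id "ENST00000592209"; '
         'exon_number "1"; gene_name "AC008993.5";')
gene_id, transcript_id, rest = split_ids(field)
```

Splitting every attribute into its own column would use the full dictionary returned by `parse_gtf_attributes` instead of only the two identifiers.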

The script also implements the reverse operation, converting a tab-separated table into a gtf formatted file.
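A rough sketch of what the reverse direction involves is given below. It assumes the table uses the column names shown in the usage example, that start coordinates are 0-based and must be shifted back to 1-based, and that gene_id and transcript_id are written first in the attributes field. The function name `row_to_gtf` is hypothetical.

```python
def row_to_gtf(row):
    """Turn one table row (a dict keyed by column name) into a GTF line."""
    start = int(row["start"]) + 1  # back to 1-based inclusive coordinates
    attributes = 'gene_id "{}"; transcript_id "{}";'.format(
        row["gene_id"], row["transcript_id"])
    extra = row.get("attributes", "").strip()
    if extra:
        # Append the remaining attributes, ending the field with one semicolon.
        attributes += " " + extra.rstrip(";") + ";"
    return "\t".join([
        row["contig"], row["source"], row["feature"],
        str(start), str(row["end"]),
        row["score"], row["strand"], row["frame"],
        attributes,
    ])


row = {
    "contig": "chr19", "source": "processed_transcript", "feature": "exon",
    "start": "66345", "end": "66509", "score": ".", "strand": "-", "frame": ".",
    "gene_id": "ENSG00000225373", "transcript_id": "ENST00000592209",
    "attributes": 'exon_number "1"',
}
line = row_to_gtf(row)
```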

When using the -m/--output-map option, the script will output a table mapping gene identifiers to transcripts or peptides.
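Conceptually, a transcript-to-gene map is the set of unique (transcript_id, gene_id) pairs found in the attributes of the input records. The sketch below illustrates the idea with regular expressions; the function names are hypothetical, and the coordinates, source and strand in the sample lines are placeholders rather than real annotation.

```python
import re

GENE_RE = re.compile(r'\bgene_id "([^"]+)"')
TRANSCRIPT_RE = re.compile(r'\btranscript_id "([^"]+)"')


def transcript2gene(lines):
    """Collect a transcript_id -> gene_id mapping from GTF lines."""
    mapping = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        attributes = line.rstrip("\n").split("\t")[8]
        gene = GENE_RE.search(attributes)
        transcript = TRANSCRIPT_RE.search(attributes)
        if gene and transcript:
            mapping[transcript.group(1)] = gene.group(1)
    return mapping


def format_map(mapping):
    """Render the mapping as a tab-separated table with a header row."""
    rows = ["transcript_id\tgene_id"]
    rows.extend("{}\t{}".format(t, g) for t, g in sorted(mapping.items()))
    return "\n".join(rows)


# Placeholder records: only the identifiers matter for this illustration.
template = 'chr19\tsource\texon\t100\t200\t.\t+\t.\tgene_id "{}"; transcript_id "{}";'
lines = [
    template.format("ENSG00000141934", "ENST00000269812"),
    template.format("ENSG00000141934", "ENST00000269812"),  # second exon, same pair
    template.format("ENSG00000176695", "ENST00000318050"),
]
table = format_map(transcript2gene(lines))
```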

USING GFF3 FILE: The script can also convert gff3 formatted files to tsv files when the options --is-gff3 and --attributes-as-columns are specified. Currently only the full GFF3 to tsv conversion is implemented. The script could be extended to output only the attributes, i.e. --output-only-attributes.
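GFF3 stores attributes as `key=value` pairs separated by semicolons, with reserved characters percent-encoded, so turning them into columns means collecting every key seen across the records. The sketch below shows one way to do that; the function names and sample records are hypothetical, and this is not claimed to be how the script does it.

```python
from urllib.parse import unquote


def parse_gff3_attributes(field):
    """Parse a GFF3 attributes field (key=value;key=value) into a dict."""
    attributes = {}
    for item in field.strip().split(";"):
        if not item.strip():
            continue
        key, _, value = item.partition("=")
        # GFF3 percent-encodes reserved characters such as ';' and '='.
        attributes[unquote(key.strip())] = unquote(value.strip())
    return attributes


def attributes_as_columns(fields):
    """Return (header, rows) with one column per attribute key seen in any record."""
    parsed = [parse_gff3_attributes(field) for field in fields]
    header = []
    for attributes in parsed:
        for key in attributes:
            if key not in header:
                header.append(key)
    # Records lacking a key get an empty cell in that column.
    rows = [[attributes.get(key, "") for key in header] for attributes in parsed]
    return header, rows


header, rows = attributes_as_columns([
    "ID=gene1;Name=abc%3Bdef",
    "ID=mRNA1;Parent=gene1",
])
```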

Usage

Example:

cgat gtf2tsv < in.gtf

contig  source                feature  start  end    score  strand  frame  gene_id          transcript_id    attributes
chr19   processed_transcript  exon     66345  66509  .      -       .      ENSG00000225373  ENST00000592209  exon_number "1"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00001701708"
chr19   processed_transcript  exon     60520  60747  .      -       .      ENSG00000225373  ENST00000592209  exon_number "2"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002735807"
chr19   processed_transcript  exon     60104  60162  .      -       .      ENSG00000225373  ENST00000592209  exon_number "3"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002846866"

To build a map between gene and transcript identifiers, type:

cgat gtf2tsv --output-map=transcript2gene < in.gtf

transcript_id    gene_id
ENST00000269812  ENSG00000141934
ENST00000318050  ENSG00000176695
ENST00000327790  ENSG00000141934

To run the script to convert a gff3 formatted file to tsv, type:

zcat file.gff3.gz | cgat gtf2tsv --is-gff3 --attributes-as-columns > outfile.tsv

Type:

cgat gtf2tsv --help

for command line help.

Command line options

usage: gtf2tsv [-h] [--version] [-o] [-f] [--is-gff3] [-i]
               [-m {transcript2gene,peptide2gene,peptide2transcript}]
               [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
               [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]