gtf2tsv.py - convert gtf file to a tab-separated table

Tags: Genomics Genesets

Purpose

Convert a gtf formatted file to a tab-separated table. Unlike a plain gtf formatted file, the output has column headers, which can be useful when importing the gene models into a database.
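
As an illustration of why the column headers help, the following sketch loads a headed tab-separated table into an SQLite database, using the header row to name the columns. It is only an assumption about how such an import might look: the function `load_table` and the table name `gene_models` are hypothetical and not part of the script.

```python
import csv
import io
import sqlite3


def load_table(connection, handle, table="gene_models"):
    """Create a table named after the header row and insert all data rows.

    Hypothetical helper: every column is stored as untyped text.
    """
    # QUOTE_NONE keeps the double quotes inside the attributes column intact.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    columns = ", ".join('"%s"' % name for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute('CREATE TABLE "%s" (%s)' % (table, columns))
    connection.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)
    connection.commit()


# Small in-memory example with a reduced set of columns.
example = io.StringIO(
    "contig\tstart\tend\tgene_id\n"
    "chr19\t66345\t66509\tENSG00000225373\n"
)
connection = sqlite3.connect(":memory:")
load_table(connection, example)
rows = connection.execute(
    "SELECT contig, gene_id FROM gene_models").fetchall()
print(rows)
```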

Note that coordinates are converted to 0-based open/closed notation (all on the forward strand).
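
A minimal sketch of this conversion, assuming the input follows the usual gtf convention of 1-based, inclusive coordinates and that the output interval includes the start and excludes the end. The function name `gtf_to_zero_based` and the input values are hypothetical.

```python
def gtf_to_zero_based(start, end):
    """Convert a 1-based, inclusive interval to a 0-based, half-open one.

    Assumption: only the start needs shifting; the end stays the same
    because the inclusive end at 1-based position N equals the exclusive
    end at 0-based position N.
    """
    if start < 1 or end < start:
        raise ValueError("invalid interval: %d-%d" % (start, end))
    return start - 1, end


# An exon covering bases 66346..66509 in 1-based, inclusive coordinates.
converted = gtf_to_zero_based(66346, 66509)
print(converted)
```

The interval length is then simply `end - start`, which is one reason 0-based half-open coordinates are convenient in a database.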

By default, the gene_id and transcript_id are extracted from the attributes field into separate columns. If -f/--attributes-as-columns is set, all fields in the attributes will be split into separate columns.
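
The default behaviour can be pictured with the sketch below, which splits a gtf attributes string into key/value pairs and pulls out gene_id and transcript_id. This is not the script's own parser: the functions `parse_attributes` and `split_ids` are hypothetical, and the sketch assumes that attribute values contain no semicolons.

```python
def parse_attributes(field):
    """Parse 'key "value"; key "value";' into a dictionary."""
    attributes = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # The key is everything up to the first space; the rest is the value.
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def split_ids(field):
    """Return gene_id, transcript_id and the remaining attributes."""
    attributes = parse_attributes(field)
    gene_id = attributes.pop("gene_id", None)
    transcript_id = attributes.pop("transcript_id", None)
    return gene_id, transcript_id, attributes


field = ('gene_id "ENSG00000225373"; transcript_id "ENST00000592209"; '
         'exon_number "1"; gene_biotype "pseudogene";')
gene_id, transcript_id, rest = split_ids(field)
print(gene_id, transcript_id, rest)
```

With -f/--attributes-as-columns, each key left in `rest` would correspond to its own column rather than staying in a single attributes column.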

The script also implements the reverse operation, converting a tab-separated table into a gtf formatted file.

When using the -m/--output-map option, the script will output a table mapping gene identifiers to transcripts or peptides.

USING GFF3 FILE: The script can also convert gff3 formatted files to tsv files when the options --is-gff3 and --attributes-as-columns are specified. Currently only the full GFF3 to tsv conversion is implemented. The script could be improved further to output the attributes only, i.e. --output-only-attributes.

Usage

Example:

cgat gtf2tsv < in.gtf

contig	source	feature	start	end	score	strand	frame	gene_id	transcript_id	attributes
chr19	processed_transcript	exon	66345	66509	.	-	.	ENSG00000225373	ENST00000592209	exon_number "1"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00001701708"
chr19	processed_transcript	exon	60520	60747	.	-	.	ENSG00000225373	ENST00000592209	exon_number "2"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002735807"
chr19	processed_transcript	exon	60104	60162	.	-	.	ENSG00000225373	ENST00000592209	exon_number "3"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002846866"

To build a map between gene and transcript identifiers, type:

cgat gtf2tsv --output-map=transcript2gene < in.gtf

transcript_id	gene_id
ENST00000269812	ENSG00000141934
ENST00000318050	ENSG00000176695
ENST00000327790	ENSG00000141934
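
A table like this can be read back for downstream use. The sketch below groups the transcripts of each gene, assuming the two-column layout with a header row shown above. The function name `read_transcript2gene` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_transcript2gene(handle):
    """Return a dictionary mapping each gene_id to its transcript_ids."""
    reader = csv.DictReader(handle, delimiter="\t")
    genes = defaultdict(list)
    for row in reader:
        genes[row["gene_id"]].append(row["transcript_id"])
    return dict(genes)


# The three rows from the example output above.
example = io.StringIO(
    "transcript_id\tgene_id\n"
    "ENST00000269812\tENSG00000141934\n"
    "ENST00000318050\tENSG00000176695\n"
    "ENST00000327790\tENSG00000141934\n"
)
genes = read_transcript2gene(example)
print(genes)
```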

To run the script to convert a gff3 formatted file to tsv, type:

zcat file.gff3.gz | cgat gtf2tsv --is-gff3 --attributes-as-columns > outfile.tsv

Type:

cgat gtf2tsv --help

for command line help.

Command line options

usage: gtf2tsv [-h] [--version] [-o] [-f] [--is-gff3] [-i]
               [-m {transcript2gene,peptide2gene,peptide2transcript}]
               [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
               [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]