gtf2tsv.py - convert gtf file to a tab-separated table
Tags: Genomics Genesets
Purpose
Convert a gtf formatted file to a tab-separated table. The difference from a plain gtf formatted file is that column headers are added, which can be useful when importing the gene models into a database.
Note that coordinates are converted to 0-based open/closed notation (all on the forward strand).
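The coordinate conversion can be sketched as follows. This is a minimal illustration, assuming the input uses the usual GTF convention of 1-based, inclusive coordinates and that the output is 0-based with an exclusive end. The function name `gtf_to_zero_based` is hypothetical and is not taken from the script.

```python
def gtf_to_zero_based(start, end):
    """Convert 1-based inclusive coordinates to 0-based half-open ones.

    Assumption: the input follows the standard GTF convention. This helper
    is illustrative only and is not part of gtf2tsv.py.
    """
    # Only the start shifts by one: the inclusive 1-based end has the same
    # numeric value as the exclusive 0-based end.
    return start - 1, end
```

Under these assumptions, a made-up feature spanning positions 66346 to 66509 in the gtf file would be reported with start 66345 and end 66509 in the table.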
By default, the gene_id and transcript_id are extracted from the attributes field into separate columns. If -f/--attributes-as-columns is set, all fields in the attributes will be split into separate columns.
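To make the two modes concrete, here is a rough sketch of how a gtf attributes field might be split. It assumes attributes are `key "value"` pairs separated by semicolons and that values do not themselves contain semicolons. The helper names `parse_attributes` and `split_ids` are hypothetical and do not come from the script.

```python
def parse_attributes(field):
    """Parse a gtf attributes string into a dict (illustrative helper)."""
    attributes = {}
    for entry in field.strip().split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The key is everything up to the first space, the rest is the value.
        key, _, value = entry.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def split_ids(field):
    """Mimic the default mode: pull out gene_id and transcript_id and
    keep the remaining attributes as a single string."""
    attributes = parse_attributes(field)
    gene_id = attributes.pop("gene_id", "")
    transcript_id = attributes.pop("transcript_id", "")
    rest = "; ".join('%s "%s"' % (k, v) for k, v in attributes.items())
    return gene_id, transcript_id, rest
```

With `parse_attributes`, every key becomes a candidate column, which corresponds to the behaviour described for -f/--attributes-as-columns. `split_ids` corresponds to the default, where only the two identifiers get their own columns.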
The script also implements the reverse operation, converting a tab-separated table into a gtf formatted file.
When using the -m/--output-map option, the script will output a table mapping gene identifiers to transcripts or peptides.
USING GFF3 FILE: The script can also convert gff3 formatted files to tsv files when specifying the options --is-gff3 and --attributes-as-columns. Currently only the full GFF3 to tsv conversion is implemented. A possible further improvement would be to output the attributes only, for example with an option such as --output-only-attributes.
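GFF3 attributes differ from gtf attributes: they are written as `key=value` pairs separated by semicolons, with reserved characters percent-encoded. The sketch below shows one way such a field could be turned into columns. The function name `parse_gff3_attributes` is hypothetical, and the code is not the script's implementation.

```python
from urllib.parse import unquote


def parse_gff3_attributes(field):
    """Parse a GFF3 attributes string into a dict (illustrative helper)."""
    attributes = {}
    for entry in field.strip().split(";"):
        if not entry:
            continue
        # Split on the first "=" only, then undo the percent-encoding.
        key, _, value = entry.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes
```

Each key of the returned dict would then become one column of the output table.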
Usage
Example:
cgat gtf2tsv < in.gtf
contig | source | feature | start | end | score | strand | frame | gene_id | transcript_id | attributes
chr19 | processed_transcript | exon | 66345 | 66509 | . | | . | ENSG00000225373 | ENST00000592209 | exon_number "1"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00001701708"
chr19 | processed_transcript | exon | 60520 | 60747 | . | | . | ENSG00000225373 | ENST00000592209 | exon_number "2"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002735807"
chr19 | processed_transcript | exon | 60104 | 60162 | . | | . | ENSG00000225373 | ENST00000592209 | exon_number "3"; gene_name "AC008993.5"; gene_biotype "pseudogene"; transcript_name "AC008993.5-002"; exon_id "ENSE00002846866"
To build a map between gene and transcript identifiers, type:
cgat gtf2tsv --output-map=transcript2gene < in.gtf
transcript_id | gene_id
ENST00000269812 | ENSG00000141934
ENST00000318050 | ENSG00000176695
ENST00000327790 | ENSG00000141934
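The mapping shown above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the script's own code: it assumes tab-separated gtf lines with the attributes in the ninth column, and the function name `build_transcript2gene` is hypothetical.

```python
import re

# \b keeps the patterns from matching inside longer keys such as ref_gene_id.
GENE_ID = re.compile(r'\bgene_id "([^"]*)"')
TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]*)"')


def build_transcript2gene(lines):
    """Yield unique (transcript_id, gene_id) pairs in order of appearance."""
    seen = set()
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        gene = GENE_ID.search(columns[8])
        transcript = TRANSCRIPT_ID.search(columns[8])
        if gene and transcript:
            pair = (transcript.group(1), gene.group(1))
            if pair not in seen:
                seen.add(pair)
                yield pair
```

Because a transcript usually spans several exon lines, the `seen` set ensures that each transcript and gene pair is reported only once.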
To run the script to convert a gff3 formatted file to tsv, type:
zcat file.gff3.gz | cgat gtf2tsv --is-gff3 --attributes-as-columns > outfile.tsv
Type:
cgat gtf2tsv --help
for command line help.
Command line options
usage: gtf2tsv [-h] [--version] [-o] [-f] [--is-gff3] [-i]
[-m {transcript2gene,peptide2gene,peptide2transcript}]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]