gtf2fasta.py - annotate genomic bases from a gene set

Tags: Genomics Genesets Sequences GTF FASTA Transformation

Purpose

This script can be used for a quick-and-dirty annotation of variants in a genome. It is most appropriately used in exploratory analyses of the effect of variants/alleles.

For a better prediction of variant effects in coding sequences, see <no title> and <no title>.

If you wish to convert GTF intervals into FASTA sequences, use gff2fasta.py.

This script takes a GTF-formatted file from ENSEMBL and annotates each base in the genome according to its function. Both strands are multiplexed into a single sequence: lower-case characters refer to the forward strand and upper-case characters refer to the reverse strand.

The codes and their meaning are:

code	description
a	first codon position within a complete codon
b	second codon position within a complete codon
c	third codon position within a complete codon
d	coding base, but in multiple frames or strands
e	non-coding base in exon
f	frame-shifted base
g	intergenic base
i	intronic base
l	base in other RNA
m	base in miRNA
n	base in snRNA
o	base in snoRNA
r	base in rRNA (both genomic and mitochondrial)
p	base in pseudogene (including transcribed, unprocessed and processed)
q	base in retrotransposon
s	base within a splice signal (GT/AG)
t	base in tRNA (both genomic and mitochondrial)
u	base in 5’ UTR
v	base in 3’ UTR
x	ambiguous base with multiple functions
y	unknown base
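
As a minimal sketch of how the codes above might be decoded downstream, the following helper maps a single annotation character to a strand and a description. The dictionary is copied from the table; the function name `describe_base` is hypothetical and not part of gtf2fasta.py.

```python
# Code letters and descriptions copied from the table above.
ANNOTATION_CODES = {
    "a": "first codon position within a complete codon",
    "b": "second codon position within a complete codon",
    "c": "third codon position within a complete codon",
    "d": "coding base, but in multiple frames or strands",
    "e": "non-coding base in exon",
    "f": "frame-shifted base",
    "g": "intergenic base",
    "i": "intronic base",
    "l": "base in other RNA",
    "m": "base in miRNA",
    "n": "base in snRNA",
    "o": "base in snoRNA",
    "r": "base in rRNA",
    "p": "base in pseudogene",
    "q": "base in retrotransposon",
    "s": "base within a splice signal (GT/AG)",
    "t": "base in tRNA",
    "u": "base in 5' UTR",
    "v": "base in 3' UTR",
    "x": "ambiguous base with multiple functions",
    "y": "unknown base",
}


def describe_base(symbol):
    """Return (strand, description) for one annotation character.

    Lower-case characters are read as forward strand ("+"),
    upper-case characters as reverse strand ("-").
    """
    code = symbol.lower()
    if len(symbol) != 1 or code not in ANNOTATION_CODES:
        raise ValueError("unknown annotation code: %r" % symbol)
    strand = "+" if symbol.islower() else "-"
    return strand, ANNOTATION_CODES[code]


print(describe_base("a"))
print(describe_base("I"))
```
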

Output files

The annotated genome is output on stdout.
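
The annotated output can be summarised with a few lines of Python. The sketch below assumes the output is FASTA-formatted (header lines starting with ">" followed by lines of annotation characters); the helper name `count_codes` and the example record are made up for illustration.

```python
import io
from collections import Counter


def count_codes(handle):
    """Tally annotation characters in a FASTA-like stream, skipping headers."""
    counts = Counter()
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        counts.update(line)  # count each character; case encodes the strand
    return counts


# Invented example record, not real gtf2fasta.py output.
example = io.StringIO(">chr1\ngggabcabcsiii\nGGGCBAvv\n")
counts = count_codes(example)
for code, n in sorted(counts.items()):
    print(code, n)
```

To run this on a real file, pass an open file handle (for example `open("hg19.annotated")`) instead of the `io.StringIO` object.
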

The script creates the following additional output files:

counts

Counts for each annotation

junctions

Splice junctions. This is a tab-separated table linking residues that are joined via features. The coordinates are forward/reverse coordinates.

The columns are:

contig: the contig
strand: direction of linkage
end: last base of exon in direction of strand
start: first base of exon in direction of strand
frame: frame of the base at the second coordinate (for coding sequences)
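
A minimal sketch of reading the junctions table follows. It assumes the columns appear in the order listed above, that coordinates are integers, and that comment lines (starting with "#") or a header line (first field "contig") may be present. The `Junction` type and `read_junctions` function are hypothetical names, and the example row is invented.

```python
from typing import Iterable, Iterator, NamedTuple


class Junction(NamedTuple):
    contig: str
    strand: str
    end: int    # last base of exon in direction of strand
    start: int  # first base of next exon in direction of strand
    frame: str  # kept as text, since non-coding rows may not hold a number


def read_junctions(lines: Iterable[str]) -> Iterator[Junction]:
    """Yield one Junction per data row of a tab-separated junctions table."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        if fields[0] == "contig":
            continue  # skip a header row, if there is one
        contig, strand, end, start, frame = fields[:5]
        yield Junction(contig, strand, int(end), int(start), frame)


rows = [
    "contig\tstrand\tend\tstart\tframe\n",
    "chr1\t+\t1000\t2000\t0\n",
]
junctions = list(read_junctions(rows))
print(junctions)
```
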

Known problems

The stop-codon is part of the UTR. This has the following effects:

On the mitochondrial chromosome, the stop-codon might be used for ncRNA transcripts and thus the base is recorded as ambiguous.

On the mitochondrial chromosome, alternative transcripts might read through a stop-codon (RNA editing). The codon itself will be recorded as ambiguous.

Usage

For example:

zcat hg19.gtf.gz | python gtf2fasta.py --genome-file=hg19 > hg19.annotated

Type:

python gtf2fasta.py --help

for command line help.

Command line options

--genome-file: required option; filename for the genome fasta file
--ignore-missing: transcripts on contigs not in the genome file will be ignored
--min-intron-length: intronic bases in introns shorter than the specified length will be marked “unknown”
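
When the script is driven from Python, the options above can be assembled into an argument list. This is only a sketch: the `build_command` helper is hypothetical, and the value 30 for `--min-intron-length` is an arbitrary example, not a recommended setting.

```python
def build_command(genome_file, ignore_missing=False, min_intron_length=None):
    """Build an argument list for gtf2fasta.py from the documented options."""
    command = ["python", "gtf2fasta.py", "--genome-file=%s" % genome_file]
    if ignore_missing:
        # skip transcripts on contigs that are absent from the genome file
        command.append("--ignore-missing")
    if min_intron_length is not None:
        # bases in shorter introns are marked "unknown"
        command.append("--min-intron-length=%d" % min_intron_length)
    return command


command = build_command("hg19", ignore_missing=True, min_intron_length=30)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run`, with the GTF data supplied on standard input and the annotated genome captured from standard output.
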

usage: gtf2fasta [-h] [--version] [-g GENOME_FILE] [-i]
                 [--min-intron-length MIN_INTRON_LENGTH] [-m {full}]
                 [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                 [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                 [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]