gtf2fasta.py - annotate genomic bases from a gene set

Tags

Genomics Genesets Sequences GTF FASTA Transformation

Purpose

This script can be used for a quick-and-dirty annotation of variants in a genome. It is most appropriately used in exploratory analyses of the effect of variants/alleles.

For a better prediction of variant effects in coding sequences, see <no title> and <no title>.

If you wish to convert gtf intervals into fasta sequences, use gff2fasta.py.

This script takes a GTF-formatted file from ENSEMBL and annotates each base in the genome according to its function. The script multiplexes both strands, with lower-case characters referring to the forward strand and upper-case characters referring to the reverse strand.

The codes and their meaning are:

code  description
a     first codon position within a complete codon
b     second codon position within a complete codon
c     third codon position within a complete codon
d     coding base, but in multiple frames or strands
e     non-coding base in exon
f     frame-shifted base
g     intergenic base
i     intronic base
l     base in other RNA
m     base in miRNA
n     base in snRNA
o     base in snoRNA
r     base in rRNA (both genomic and mitochondrial)
p     base in pseudogene (including transcribed, unprocessed and processed)
q     base in retrotransposon
s     base within a splice signal (GT/AG)
t     base in tRNA (both genomic and mitochondrial)
u     base in 5’ UTR
v     base in 3’ UTR
x     ambiguous base with multiple functions
y     unknown base
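
As a rough illustration of how the codes and the strand multiplexing fit together, the following sketch decodes a single annotation character into a strand and a description. The `CODES` dictionary and the `decode` function are hypothetical helpers written for this example; they are not part of gtf2fasta.py. The descriptions are copied from the table above.

```python
# Hypothetical lookup table; descriptions copied from the code table above.
CODES = {
    "a": "first codon position within a complete codon",
    "b": "second codon position within a complete codon",
    "c": "third codon position within a complete codon",
    "d": "coding base, but in multiple frames or strands",
    "e": "non-coding base in exon",
    "f": "frame-shifted base",
    "g": "intergenic base",
    "i": "intronic base",
    "l": "base in other RNA",
    "m": "base in miRNA",
    "n": "base in snRNA",
    "o": "base in snoRNA",
    "r": "base in rRNA",
    "p": "base in pseudogene",
    "q": "base in retrotransposon",
    "s": "base within a splice signal (GT/AG)",
    "t": "base in tRNA",
    "u": "base in 5' UTR",
    "v": "base in 3' UTR",
    "x": "ambiguous base with multiple functions",
    "y": "unknown base",
}


def decode(symbol):
    """Return (strand, description) for one annotation character.

    Lower-case characters refer to the forward strand,
    upper-case characters to the reverse strand.
    """
    key = symbol.lower()
    if key not in CODES:
        raise ValueError(f"unknown annotation code: {symbol!r}")
    strand = "forward" if symbol.islower() else "reverse"
    return strand, CODES[key]
```

For example, `decode("i")` reports an intronic base on the forward strand, while `decode("I")` reports the same annotation on the reverse strand.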

Output files

The annotated genome is output on stdout.
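
A quick variant lookup on the annotated genome might look like the sketch below. It assumes that the output is FASTA-formatted (one header line per contig followed by lines of annotation codes) and that positions are 0-based; both are assumptions made for this example, and `read_fasta` and `annotation_at` are hypothetical helpers rather than functions provided by the script.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping contig -> sequence."""
    sequences = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def annotation_at(sequences, contig, position):
    """Return the annotation code at a 0-based position (assumed convention)."""
    return sequences[contig][position]


# Toy annotated genome, invented purely for illustration.
toy = [">chr1", "ggggabcabc", "ssiiii", ">chr2", "GGGG"]
annotated = read_fasta(toy)
code = annotation_at(annotated, "chr1", 4)
```

With a real run, the lines would come from the file written by the script, for instance `open("hg19.annotated")`.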

The script creates the following additional output files:

counts

Counts for each annotation

junctions

Splice junctions. This is a tab-separated table linking residues that are joined via features. The coordinates are forward/reverse coordinates.

The columns are:

contig

the contig

strand

direction of linkage

end

last base of exon in direction of strand

start

first base of exon in direction of strand

frame

frame of the base at the second coordinate (for coding sequences)
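
The junctions table could be read along the following lines. This is a minimal sketch that assumes the columns appear in the order listed above, that the coordinates are integers, and that a header line starting with `contig` or comment lines starting with `#` may be present; none of these details are confirmed by the description. `Junction` and `read_junctions` are hypothetical names.

```python
import csv
from collections import namedtuple

# Field order follows the column list above (assumed to match the file).
Junction = namedtuple("Junction", ["contig", "strand", "end", "start", "frame"])


def read_junctions(lines):
    """Yield Junction records from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        # Skip blank lines, comments and an optional header line.
        if not row or row[0].startswith("#") or row[0] == "contig":
            continue
        contig, strand, end, start, frame = row[:5]
        # The frame is kept as text because its encoding for
        # non-coding junctions is not specified here.
        yield Junction(contig, strand, int(end), int(start), frame)


example = [
    "contig\tstrand\tend\tstart\tframe",
    "chr1\t+\t100\t200\t0",
]
junctions = list(read_junctions(example))
```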

Known problems

The stop-codon is part of the UTR. This has the following effects:

  • On the mitochondrial chromosome, the stop-codon might be used for ncRNA transcripts and thus the base is recorded as ambiguous.

  • On the mitochondrial chromosome, alternative transcripts might read through a stop-codon (RNA editing). The codon itself will be recorded as ambiguous.

Usage

For example:

zcat hg19.gtf.gz | python gtf2fasta.py --genome-file=hg19 > hg19.annotated

Type:

python gtf2fasta.py --help

for command line help.

Command line options

--genome-file

required option. filename for genome fasta file

--ignore-missing

transcripts on contigs not in the genome file will be ignored

--min-intron-length

intronic bases in introns less than specified length will be marked “unknown”
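
The effect of --min-intron-length can be pictured with the small sketch below. It is not the script's implementation: `intron_code` is a hypothetical function, and half-open interval coordinates are assumed. It only shows the rule that bases in introns shorter than the threshold receive the "unknown" code (y) instead of the intronic code (i), with upper case used for the reverse strand.

```python
def intron_code(intron_start, intron_end, min_intron_length, forward=True):
    """Return the annotation code for bases of one intron.

    Coordinates are assumed to be half-open: [intron_start, intron_end).
    """
    length = intron_end - intron_start
    # Introns shorter than the threshold are marked as unknown.
    code = "y" if length < min_intron_length else "i"
    return code if forward else code.upper()
```

For instance, with a threshold of 30, an intron spanning positions 100 to 120 would be marked as unknown, whereas one spanning 100 to 500 would be marked as intronic.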

usage: gtf2fasta [-h] [--version] [-g GENOME_FILE] [-i]
                 [--min-intron-length MIN_INTRON_LENGTH] [-m {full}]
                 [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                 [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
                 [-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]