gtf2fasta.py - annotate genomic bases from a gene set
- Tags
Genomics Genesets Sequences GTF FASTA Transformation
Purpose
This script can be used for a quick-and-dirty annotation of variants in a genome. It is most appropriately used in exploratory analyses of the effect of variants/alleles.
For a better prediction of variant effects in coding sequences, see <no title> and <no title>.
If you wish to convert GTF intervals into FASTA sequences, use gff2fasta.py.
This script takes a GTF-formatted file from ENSEMBL and annotates each base in the genome according to its function. Both strands are multiplexed into a single output: lower-case characters refer to the forward strand and upper-case characters refer to the reverse strand.
The codes and their meaning are:
| code | description |
|------|-------------|
| a | first codon position within a complete codon |
| b | second codon position within a complete codon |
| c | third codon position within a complete codon |
| d | coding base, but in multiple frames or strands |
| e | non-coding base in exon |
| f | frame-shifted base |
| g | intergenic base |
| i | intronic base |
| l | base in other RNA |
| m | base in miRNA |
| n | base in snRNA |
| o | base in snoRNA |
| r | base in rRNA (both genomic and mitochondrial) |
| p | base in pseudogene (including transcribed, unprocessed and processed) |
| q | base in retrotransposon |
| s | base within a splice signal (GT/AG) |
| t | base in tRNA (both genomic and mitochondrial) |
| u | base in 5’ UTR |
| v | base in 3’ UTR |
| x | ambiguous base with multiple functions |
| y | unknown base |
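As an illustration of how the codes and the strand convention fit together, the following minimal Python sketch decodes a single annotation character. The names `CODES` and `decode` are hypothetical helpers and not part of gtf2fasta.py; the descriptions are copied from the table, and the case rule (lower-case forward, upper-case reverse) is taken from the text above.

```python
# Descriptions copied from the code table; the dictionary itself is illustrative.
CODES = {
    "a": "first codon position within a complete codon",
    "b": "second codon position within a complete codon",
    "c": "third codon position within a complete codon",
    "d": "coding base, but in multiple frames or strands",
    "e": "non-coding base in exon",
    "f": "frame-shifted base",
    "g": "intergenic base",
    "i": "intronic base",
    "l": "base in other RNA",
    "m": "base in miRNA",
    "n": "base in snRNA",
    "o": "base in snoRNA",
    "r": "base in rRNA",
    "p": "base in pseudogene",
    "q": "base in retrotransposon",
    "s": "base within a splice signal (GT/AG)",
    "t": "base in tRNA",
    "u": "base in 5' UTR",
    "v": "base in 3' UTR",
    "x": "ambiguous base with multiple functions",
    "y": "unknown base",
}


def decode(symbol):
    """Return (strand, description) for one annotation character.

    Lower-case is read as the forward strand ("+") and upper-case as the
    reverse strand ("-"), following the convention described in the text.
    """
    description = CODES.get(symbol.lower())
    if description is None:
        raise ValueError(f"unrecognised annotation code: {symbol!r}")
    strand = "+" if symbol.islower() else "-"
    return strand, description
```

For example, `decode("I")` would be read as an intronic base on the reverse strand, while any character outside the table raises a `ValueError`.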
Output files
The annotated genome is output on stdout.
The script creates the following additional output files:
- counts
  Counts for each annotation.
- junctions
  Splice junctions. This is a tab-separated table linking residues that are joined via features. The coordinates are forward/reverse coordinates.
  The columns are:
  - contig
    the contig
  - strand
    direction of linkage
  - end
    last base of exon in direction of strand
  - start
    first base of exon in direction of strand
  - frame
    frame of the base at the second coordinate (for coding sequences)
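The junctions table can be read with a few lines of Python. This is a sketch under stated assumptions: the column order follows the list above, comment lines start with `#`, an optional header row begins with `contig`, and the frame is kept as a string because its exact representation is not specified here. `Junction` and `read_junctions` are hypothetical names, and the sample row in the usage note is made up.

```python
import csv
import io
from typing import NamedTuple, Optional


class Junction(NamedTuple):
    contig: str
    strand: str
    end: int              # last base of the exon in direction of strand
    start: int            # first base of the next exon in direction of strand
    frame: Optional[str]  # kept as text; None if the column is empty or absent


def read_junctions(handle):
    """Yield Junction records from a tab-separated junctions table."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines and an optional header row (assumed).
        if not row or row[0].startswith("#") or row[0] == "contig":
            continue
        contig, strand, end, start = row[:4]
        frame = row[4] if len(row) > 4 and row[4] != "" else None
        yield Junction(contig, strand, int(end), int(start), frame)
```

A call such as `list(read_junctions(io.StringIO("chr1\t+\t1000\t2000\t0\n")))` yields one record linking position 1000 to position 2000 on the forward strand.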
Known problems
The stop codon is part of the UTR. This has the following effects:
- On the mitochondrial chromosome, the stop codon might be used for ncRNA transcripts and thus the base is recorded as ambiguous.
- On the mitochondrial chromosome, alternative transcripts might read through a stop codon (RNA editing). The codon itself will be recorded as ambiguous.
Usage
For example:
zcat hg19.gtf.gz | python gtf2fasta.py --genome-file=hg19 > hg19.annotated
Type:
python gtf2fasta.py --help
for command line help.
Command line options
--genome-file
Required. Filename of the genome FASTA file.
--ignore-missing
Transcripts on contigs that are not in the genome file are ignored.
--min-intron-length
Intronic bases in introns shorter than the specified length are marked “unknown”.
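The effect of --min-intron-length can be pictured with the following sketch. It is not the script's implementation: `label_introns` is a hypothetical helper, the coordinates are assumed to be half-open (start, end) pairs, the comparison is assumed to be a strict "less than", and "unknown" is assumed to correspond to code `y` from the code table.

```python
def label_introns(introns, min_intron_length):
    """Assign an annotation code to each intron interval.

    introns: iterable of (start, end) pairs, assumed half-open.
    Introns shorter than min_intron_length get "y" (unknown base),
    all others get "i" (intronic base).
    """
    labelled = []
    for start, end in introns:
        length = end - start
        code = "y" if length < min_intron_length else "i"
        labelled.append((start, end, code))
    return labelled
```

With a threshold of 50, a 30-base intron would be labelled `y` and a 300-base intron `i`.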
usage: gtf2fasta [-h] [--version] [-g GENOME_FILE] [-i]
[--min-intron-length MIN_INTRON_LENGTH] [-m {full}]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-P OUTPUT_FILENAME_PATTERN]
[-F] [-I STDIN] [-L STDLOG] [-E STDERR] [-S STDOUT]