gtf2gff.py - convert a transcript set to genomic features¶
- Tags
Genomics Genesets Intervals Transformation GTF GFF
Purpose¶
This scripts converts a transcript set in a gtf formatted file into a set of features in a gff formatted file.
In other words, a gene set (gtf), which constitutes a hierarchical set of annotations, will be converted into a non-hierarchical list of genomic segments.
Various methods can be used to do the conversion (see command line
argument --method
):
- exons
annotate exons. Exonic segments are classified according to the transcript structure.
- genome/full
annotate genome with gene set. Genomic segments are labeled
intronic
,intergenic
, etc. This annotation aggregates the information of multiple genes such that each annotation is either valid or ambiguous.- genes
annotate genome using the information on a gene-by-gene basis. Multiple overlapping annotations will be created for each transcript. Redundant annotations will be merged.
- great-domains
regulatory domains using the basal+extended model according to GREAT.
- promotors
declare promoter regions. These segments might be overlapping. A promotor is the region x kb upstream of a transcription start site. The option
--promotor-size
sets the region width.- regulons
declare regulatory regions. Regulatory regions contain the region x kb of upstream and downstream of a transciption start site. The options
--upstream-extension
and-downstream
set the region width.- tts-regulons
declare tts regulatory regions. tts-regulatory regions contain the region x kb of upstream and downstream of a transciption termination site. The options
--upstream-extension
and-downstream
set the region width.- territories
build gene territories around full length genes.
- tss-territories
build gene territories around transcription start sites.
In a simple setting, assume we have the two genes below, the first with a single transcript on the positive strand, the second on the negative strand:
Gene A Gene B
|---| |---| |---|
>>>> >>>> <<<< <<<< <<<<
Genome (simplified result without UTRs and flanks)
exon exon exon exon exon
..---><--><-><--><---------<--><-><--><-><--><-----...
intergenic intron intergenic intron intron intergenic
Territories
Gene A Gene B
<---------------------><------------------------------>
TSS-Territories
Gene A Gene B
<--------> <----------->
Promotors
<----> <---->
Genome¶
If --method=genome
, the gene set is used to annotate the complete genome.
Note
The gtf file has to be sorted first by contig and then by position.
A segment in the genome will either be covered by:
- cds
a coding exon (also: CDS, start_codon).
- utr
a UTR (also: stop_codon)
- 5flank, 3flank, flank
an upstream/downstream segment of defined size. If the intergenic region is too small to accomodate a flank, the regions is just ‘flank’.
- intergenic
intergenic region.
- 5telomeric, 3telomeric
telomeric region (before/after first/last gene).
- intronic
intronic region. An intron has a minimum size of 30 bases.
- frameshift
frameshift. Introns of less than 4 residues length
- ambiguous
in case of overlapping genes, regions are designated ambiguous
- unknown
unknown are
intronic
regions that are less than the minimum size of an intron (default: 30) and larger than the size of frameshift (default:4). These could be either genuine small introns or they could be artefactual arising from collapsing the exons within a gene model.
All segments are annotated by their closest gene. Intergenic regions are annotated with their two neighbouring genes. The upstream gene is listed in the attribute gene_id, the downstream one is listed in the attribute downstream_gene_id.
Genes¶
If --method=genes
, the gene set is used to annotate the complete genome.
Note
The gtf file has to be sorted by gene.
A segment in the genome will be annotated as:
- cds
a coding exon
- utr5, utr3
a 5’ or 3’ utr
- exon
an exon. Exons are further classified into first, middle and last exons.
- intronic
an intronic region. Intronic regions are further divided into first, middle, last.
- upstream, downstream
upstream/downstream regions in 5 intervals of a total of 1kb (see option –flank-size to increase the total size).
Territories¶
If --method=territories
, the gene set is used to define gene
territories. Territories are segments around genes and are
non-overlapping. Exons in a gene are merged and the resulting the
region is enlarged by –radius. Overlapping territories are divided at
the midpoint between the two genes. The maximum extent of a territory
is limited by the option --territory-extension
Note
The gtf file has to be sorted first by contig and then by position.
Note
Genes should already have been merged (gtf2gtf –merge-transcripts)
TSSTerritories¶
If --method=tss-territories
, the gene set is used to define gene
territories. Instead of the full gene length as in
Territories, only the tss is used to define a
territory. Territories are segments around genes and are
non-overlapping. Overlapping territories are divided at the midpoint
between the two genes. The maximum extent of a territory is limited by
the option --territory-extension
.
Note
The gtf file has to be sorted first by contig and then by position.
Note
Genes should already have been merged (gtf2gtf –merge-transcripts)
The domain definitions corresponds to the nearest gene
rule in GREAT.
GREAT-Domains¶
Define GREAT regulatory domains. Each TSS in a gene is associated with a basal region. The basal region is then extended upstream to the basal region of the closest gene, but at most to –radius. In the case of overlapping genes, the extension is towards the next non-overlapping gene.
This is the “basal plus extension” rule in GREAT. Commonly used are 5+1 with 1 Mb extension. To achieve this, use for example:
cgat gtf2gff --genome-file=hg19 --method=great-domains --upstream-extension=5000 --downstream-extension=1000 --territory-extension=1000000 < in.gtf > out.gff
If there are a multiple TSS in a transcript, the basal region extends from the first to the last TSS plus the upstream/downstream flank.
Exons¶
If --method=exons
, exons are annotated by their dispensibility.
Note
The gtf file should be sorted by genes
For each exon, the following additional fields are added to the gtf file:
- ntranscripts
number of transcripts
- nused
number of transcripts using this exon
- positions
positions of exon within transcripts. This is a
,
separated list of tuplespos:total
. For example,1:10,5:8
indicates an exon that appears in first position in a ten exon transcript and fifth position in an eight exon transcript. The position is according to the direction of transcription.
Note
overlapping but non-identical exons, for example due to internal
splice sites, are listed as separate exons. Thus the output is not
fully flat as some segments could be overlapping (see output
variable noverlapping
in the log file).
The following example uses an ENSEMBL gene set:: (needs genome-file to run)
gunzip < Mus_musculus.NCBIM37.55.gtf.gz | awk ‘$3 == “CDS”’ | python gtf2gff.py –method=exons –restrict-source=protein_coding
Promoters¶
If --method=promotors
, putative promotor regions are output. A
promoter is a pre-defined segment upstream of the transcription start
site. As the actual start site is usually not known, the start of the
first exon within a transcript is used as a proxy. A gene can have
several promotors associated with it, but overlapping promotor regions
of the same gene will be merged. A promoter can extend into an
adjacent upstream gene.
The --restrict-source
option determines which GTF entries are
output. The default is to output all entries but the user can choose
from protein_coding, pseudogene or lncRNA.
The size of the promotor region can be specified by the command line
argument --promotor-size
.
Regulons¶
If --method=regulons
, putative regulon regions are output. This is similar
to a promotor
, but the region extends both upstream and downstream from
the transcription start site.
The --restrict-source
option determines which GTF entries are
output. The default is to output all entries but the user can choose
from protein_coding, pseudogene or lncRNA.
The size of the promotor region can be specified by the command line
argument --upstream-extension
and --downstream-extension
If --method=tts-regulons
, regulons will be defined around the
transcription termination site.
Usage¶
Type:
cgat gtf2gff --method=genome --genome-file=hg19 < geneset.gtf > annotations.gff
For command line help:
cgat gtf2gff --help
Command line options¶
usage: gtf2gff [-h] [--version] [-g GENOME_FILE] [-i]
[-s {protein_coding,pseudogene,lncRNA}]
[-m {full,genome,exons,promotors,tts,regulons,tts-regulons,genes,territories,tss-territories,great-domains}]
[-r RADIUS] [-f FLANK] [--flank-increment-size INCREMENT]
[-p PROMOTOR] [-u UPSTREAM] [-d DOWNSTREAM]
[--gene-detail {introns+exons,exons,introns}]
[--merge-overlapping-promotors]
[--min-intron-length MIN_INTRON_LENGTH] [--is-unsorted]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]
gtf2gff: error: argument -?: expected one argument