gtf2gff.py - convert a transcript set to genomic features

Tags

Genomics Genesets Intervals Transformation GTF GFF

Purpose

This scripts converts a transcript set in a gtf formatted file into a set of features in a gff formatted file.

In other words, a gene set (gtf), which constitutes a hierarchical set of annotations, will be converted into a non-hierarchical list of genomic segments.

Various methods can be used to do the conversion (see command line argument --method):

exons

annotate exons. Exonic segments are classified according to the transcript structure.

genome/full

annotate genome with gene set. Genomic segments are labeled intronic, intergenic, etc. This annotation aggregates the information of multiple genes such that each annotation is either valid or ambiguous.

genes

annotate genome using the information on a gene-by-gene basis. Multiple overlapping annotations will be created for each transcript. Redundant annotations will be merged.

great-domains

regulatory domains using the basal+extended model according to GREAT.

promotors

declare promoter regions. These segments might be overlapping. A promotor is the region x kb upstream of a transcription start site. The option --promotor-size sets the region width.

regulons

declare regulatory regions. Regulatory regions contain the region x kb of upstream and downstream of a transciption start site. The options --upstream-extension and -downstream set the region width.

tts-regulons

declare tts regulatory regions. tts-regulatory regions contain the region x kb of upstream and downstream of a transciption termination site. The options --upstream-extension and -downstream set the region width.

territories

build gene territories around full length genes.

tss-territories

build gene territories around transcription start sites.

In a simple setting, assume we have the two genes below, the first with a single transcript on the positive strand, the second on the negative strand:

       Gene A                    Gene B
        |---|                 |---|  |---|
     >>>>   >>>>           <<<<   <<<<   <<<<

Genome (simplified result without UTRs and flanks)

       exon   exon         exon   exon   exon
..---><--><-><--><---------<--><-><--><-><--><-----...
intergenic intron  intergenic  intron intron intergenic

Territories

     Gene A                    Gene B
<---------------------><------------------------------>

TSS-Territories

     Gene A                    Gene B
<-------->            <----------->

Promotors

<---->              <---->

Genome

If --method=genome, the gene set is used to annotate the complete genome.

Note

The gtf file has to be sorted first by contig and then by position.

A segment in the genome will either be covered by:

cds

a coding exon (also: CDS, start_codon).

utr

a UTR (also: stop_codon)

5flank, 3flank, flank

an upstream/downstream segment of defined size. If the intergenic region is too small to accomodate a flank, the regions is just ‘flank’.

intergenic

intergenic region.

5telomeric, 3telomeric

telomeric region (before/after first/last gene).

intronic

intronic region. An intron has a minimum size of 30 bases.

frameshift

frameshift. Introns of less than 4 residues length

ambiguous

in case of overlapping genes, regions are designated ambiguous

unknown

unknown are intronic regions that are less than the minimum size of an intron (default: 30) and larger than the size of frameshift (default:4). These could be either genuine small introns or they could be artefactual arising from collapsing the exons within a gene model.

All segments are annotated by their closest gene. Intergenic regions are annotated with their two neighbouring genes. The upstream gene is listed in the attribute gene_id, the downstream one is listed in the attribute downstream_gene_id.

Genes

If --method=genes, the gene set is used to annotate the complete genome.

Note

The gtf file has to be sorted by gene.

A segment in the genome will be annotated as:

cds

a coding exon

utr5, utr3

a 5’ or 3’ utr

exon

an exon. Exons are further classified into first, middle and last exons.

intronic

an intronic region. Intronic regions are further divided into first, middle, last.

upstream, downstream

upstream/downstream regions in 5 intervals of a total of 1kb (see option –flank-size to increase the total size).

Territories

If --method=territories, the gene set is used to define gene territories. Territories are segments around genes and are non-overlapping. Exons in a gene are merged and the resulting the region is enlarged by –radius. Overlapping territories are divided at the midpoint between the two genes. The maximum extent of a territory is limited by the option --territory-extension

Note

The gtf file has to be sorted first by contig and then by position.

Note

Genes should already have been merged (gtf2gtf –merge-transcripts)

TSSTerritories

If --method=tss-territories, the gene set is used to define gene territories. Instead of the full gene length as in Territories, only the tss is used to define a territory. Territories are segments around genes and are non-overlapping. Overlapping territories are divided at the midpoint between the two genes. The maximum extent of a territory is limited by the option --territory-extension.

Note

The gtf file has to be sorted first by contig and then by position.

Note

Genes should already have been merged (gtf2gtf –merge-transcripts)

The domain definitions corresponds to the nearest gene rule in GREAT.

GREAT-Domains

Define GREAT regulatory domains. Each TSS in a gene is associated with a basal region. The basal region is then extended upstream to the basal region of the closest gene, but at most to –radius. In the case of overlapping genes, the extension is towards the next non-overlapping gene.

This is the “basal plus extension” rule in GREAT. Commonly used are 5+1 with 1 Mb extension. To achieve this, use for example:

cgat gtf2gff    --genome-file=hg19    --method=great-domains    --upstream-extension=5000    --downstream-extension=1000    --territory-extension=1000000    < in.gtf > out.gff

If there are a multiple TSS in a transcript, the basal region extends from the first to the last TSS plus the upstream/downstream flank.

Exons

If --method=exons, exons are annotated by their dispensibility.

Note

The gtf file should be sorted by genes

For each exon, the following additional fields are added to the gtf file:

ntranscripts

number of transcripts

nused

number of transcripts using this exon

positions

positions of exon within transcripts. This is a , separated list of tuples pos:total. For example, 1:10,5:8 indicates an exon that appears in first position in a ten exon transcript and fifth position in an eight exon transcript. The position is according to the direction of transcription.

Note

overlapping but non-identical exons, for example due to internal splice sites, are listed as separate exons. Thus the output is not fully flat as some segments could be overlapping (see output variable noverlapping in the log file).

The following example uses an ENSEMBL gene set:: (needs genome-file to run)

gunzip < Mus_musculus.NCBIM37.55.gtf.gz | awk ‘$3 == “CDS”’ | python gtf2gff.py –method=exons –restrict-source=protein_coding

Promoters

If --method=promotors, putative promotor regions are output. A promoter is a pre-defined segment upstream of the transcription start site. As the actual start site is usually not known, the start of the first exon within a transcript is used as a proxy. A gene can have several promotors associated with it, but overlapping promotor regions of the same gene will be merged. A promoter can extend into an adjacent upstream gene.

The --restrict-source option determines which GTF entries are output. The default is to output all entries but the user can choose from protein_coding, pseudogene or lncRNA.

The size of the promotor region can be specified by the command line argument --promotor-size.

Regulons

If --method=regulons, putative regulon regions are output. This is similar to a promotor, but the region extends both upstream and downstream from the transcription start site.

The --restrict-source option determines which GTF entries are output. The default is to output all entries but the user can choose from protein_coding, pseudogene or lncRNA.

The size of the promotor region can be specified by the command line argument --upstream-extension and --downstream-extension

If --method=tts-regulons, regulons will be defined around the transcription termination site.

Usage

Type:

cgat gtf2gff --method=genome --genome-file=hg19 < geneset.gtf > annotations.gff

For command line help:

cgat gtf2gff --help

Command line options

usage: gtf2gff [-h] [--version] [-g GENOME_FILE] [-i]
               [-s {protein_coding,pseudogene,lncRNA}]
               [-m {full,genome,exons,promotors,tts,regulons,tts-regulons,genes,territories,tss-territories,great-domains}]
               [-r RADIUS] [-f FLANK] [--flank-increment-size INCREMENT]
               [-p PROMOTOR] [-u UPSTREAM] [-d DOWNSTREAM]
               [--gene-detail {introns+exons,exons,introns}]
               [--merge-overlapping-promotors]
               [--min-intron-length MIN_INTRON_LENGTH] [--is-unsorted]
               [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
               [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]
gtf2gff: error: argument -?: expected one argument