gtf2gtf.py - manipulate transcript models

Tags

Genomics Genesets GTF Manipulation

Purpose

This script reads a gene set in gtf format from stdin, applies some transformation, and outputs a new gene set in gtf format to stdout. The transformation is chosen by the --method command line option.

Transformations available for use in this script can broadly be classified into four categories:

  1. sorting gene sets

  2. manipulating gene models

  3. filtering gene sets

  4. setting/resetting fields within a gtf file

Further options for working with gtf files are available in gff2gff.py, which can be run with the option --is-gtf.

Sorting gene sets

sort

Sorts entries in gtf file by one or more fields

option             order in which fields are sorted
gene               gene_id, contig, start
gene+transcript    gene_id, transcript_id, contig, start
contig+gene        contig, gene_id, transcript_id, start
transcript         transcript_id, contig, start
position           contig, start
position+gene      contig( gene_id, start )
gene+position      gene_id, contig, start
gene+exon          gene_id, exon_id

N.B. position+gene sorts by gene_id, start, then subsequently sorts flattened gene lists by contig, start
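The single-pass sort orders listed above amount to sorting on a tuple of fields. The following is a minimal sketch of that idea, not the script's own code: the records are hypothetical, simplified dictionaries rather than parsed gtf lines, and only the orders that map directly to a key tuple are included.

```python
from operator import itemgetter

# Hypothetical, simplified records; a real gtf line has nine columns.
records = [
    {"gene_id": "g2", "transcript_id": "t3", "contig": "chr1", "start": 500},
    {"gene_id": "g1", "transcript_id": "t2", "contig": "chr2", "start": 100},
    {"gene_id": "g1", "transcript_id": "t1", "contig": "chr1", "start": 300},
    {"gene_id": "g1", "transcript_id": "t1", "contig": "chr1", "start": 100},
]

# Key fields for each sort order, copied from the list above.
SORT_KEYS = {
    "gene": ("gene_id", "contig", "start"),
    "gene+transcript": ("gene_id", "transcript_id", "contig", "start"),
    "contig+gene": ("contig", "gene_id", "transcript_id", "start"),
    "transcript": ("transcript_id", "contig", "start"),
    "position": ("contig", "start"),
}


def sort_records(records, order):
    """Return a new list sorted by the key tuple for the given order."""
    return sorted(records, key=itemgetter(*SORT_KEYS[order]))


for record in sort_records(records, "gene"):
    print(record["gene_id"], record["contig"], record["start"])
```

Sorting by gene in this way places all features of a gene next to each other, which is what the gene-model methods described later require of their input.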

Manipulating gene-models

Options that can be used to alter the features represented in a gtf file. Only one method can be specified at once.

Input gtfs need to be sorted so that features for a gene or transcript appear consecutively within the file. This can be achieved using --method=sort.

genes-to-unique-chunks

Divides the complete length of a gene into chunks that represent ranges of bases that are all present in the same set of transcripts. E.g. for two overlapping exons, one entry will be output representing the overlap and a separate entry for each of the sequences present in only one exon. Ranges that lie between the first TSS and last TTS but are not present in any transcript (i.e. merged introns) are also output. Useful for DEXSeq-like splicing analysis.
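The chunking idea can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: exons are hypothetical half-open (start, end) tuples on a single contig and strand, and adjacent chunks that happen to share the same transcript set are not merged.

```python
def unique_chunks(transcripts):
    """Split a gene into ranges covered by a constant set of transcripts.

    transcripts: dict mapping transcript_id to a list of (start, end) exons.
    Returns a list of (start, end, [transcript_ids]) tuples. An empty list
    of ids marks a range that is intronic in every transcript.
    """
    # Every exon boundary is a potential chunk boundary.
    breakpoints = sorted(
        {pos for exons in transcripts.values() for exon in exons for pos in exon}
    )
    chunks = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        # No boundary lies strictly inside (start, end), so a transcript
        # covers either the whole range or none of it.
        members = sorted(
            tid
            for tid, exons in transcripts.items()
            if any(s <= start and end <= e for s, e in exons)
        )
        chunks.append((start, end, members))
    return chunks


gene = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(150, 250), (300, 400)],
}
for chunk in unique_chunks(gene):
    print(chunk)
```

For the two overlapping first exons, the sketch reports the shared range once and each transcript-specific range separately, followed by the range that is intronic in both transcripts.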

find-retained-introns

Finds intervals within a transcript that represent retained introns. Here a retained intron is considered to be an intron in one transcript that is entirely contained within an exon of another. The retained intron will be assigned to the transcript with the containing exon. Where multiple, overlapping introns are contained within a single exon of a transcript, the union of the introns will be output. Thus, when considering an individual transcript, outputs will be non-overlapping. However, overlapping, or even identical, features can be output if they belong to different transcripts.
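A minimal sketch of this definition is given below. It assumes hypothetical half-open (start, end) exon tuples for the transcripts of one gene and is not taken from the script itself.

```python
def introns_of(exons):
    """Return the gaps between consecutive exons of one transcript."""
    exons = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if prev_end < next_start
    ]


def union(intervals):
    """Merge overlapping or abutting intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def retained_introns(transcripts):
    """Map each transcript to introns of other transcripts that lie
    entirely within one of its exons."""
    result = {}
    for tid, exons in transcripts.items():
        contained = []
        for other, other_exons in transcripts.items():
            if other == tid:
                continue
            for i_start, i_end in introns_of(other_exons):
                if any(s <= i_start and i_end <= e for s, e in exons):
                    contained.append((i_start, i_end))
        if contained:
            # Overlapping introns within one exon are reported as a union.
            result[tid] = union(contained)
    return result


gene = {
    "t1": [(100, 500)],
    "t2": [(100, 200), (300, 500)],
    "t3": [(100, 250), (350, 500)],
}
print(retained_introns(gene))
```

In this example the introns of t2 and t3 overlap and both fall inside the single exon of t1, so one merged interval is assigned to t1.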

merge-exons

Merges overlapping exons for all transcripts of a gene, outputting the merged exons. Can be used in conjunction with --merge-exons-distance to set the minimum distance that may appear between two exons before they are merged. If --mark-utr is set, the UTR regions will be output separately.
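Merging exons is a standard interval-merge sweep. The sketch below is illustrative only: exons are hypothetical half-open (start, end) tuples, and the `distance` parameter is an assumed reading of the distance option in which two exons are merged when the gap between them is at most `distance` bases.

```python
def merge_exons(exons, distance=0):
    """Merge exons that overlap or lie within `distance` bases of each other.

    Assumption: a gap of exactly `distance` bases is still merged. With the
    default of 0, only overlapping or directly abutting exons are merged.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start - merged[-1][1] <= distance:
            # Extend the previous merged exon.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


exons = [(310, 400), (100, 200), (150, 300)]
print(merge_exons(exons))               # overlapping exons only
print(merge_exons(exons, distance=10))  # also bridge gaps of up to 10 bases
```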

merge-transcripts

Merges all transcripts of a gene. The output contains a single interval that spans the original gene (both introns and exons). If --with-utr is set, the output interval will also contain UTR.

merge-genes

Merges genes that have overlapping exons, outputting a single gene_id and transcript_id for all exons of overlapping genes. The input needs to be sorted by transcript. (Does not merge intervals on different strands.)

join-exons

Joins together all exons of a transcript, outputting a single interval that spans the original transcript (both introns and exons). Input needs to be sorted by transcript.

intersect-transcripts

Finds regions representing the intersection of all transcripts of a gene. Output will contain intervals spanning only those bases covered by all transcripts. If --with-utr is set, the UTR will also be included in the intersect. This method only uses exon or CDS features.
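The intersection of all transcripts can be sketched by repeatedly intersecting sorted interval lists. This is an illustration rather than the script's code. It assumes hypothetical half-open (start, end) exon tuples, at least one transcript per gene, and non-overlapping exons within each transcript.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted, non-overlapping interval lists with a
    two-pointer sweep."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def intersect_transcripts(transcripts):
    """Return the bases covered by every transcript of a gene."""
    return reduce(intersect_two, (sorted(exons) for exons in transcripts.values()))


gene = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(150, 250), (300, 350)],
    "t3": [(120, 180), (320, 400)],
}
print(intersect_transcripts(gene))
```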

merge-introns

Outputs a single interval that spans the region between the start of the first intron and the end of the last intron. Single-exon genes will not be output. The input needs to be sorted by gene.

exons2introns

Merges overlapping introns for all transcripts of a gene, outputting the merged introns. Use --intron-min-length to ignore merged introns below a specified length. Use --intron-border to specify a number of residues to remove at either end of output introns (residues are removed prior to filtering on size when used in conjunction with --intron-min-length).
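The order of operations described above (merge, trim the border, then filter on length) can be sketched as follows. The function and parameter names are hypothetical and merely mirror the option names; exons are assumed to be half-open (start, end) tuples.

```python
def exons_to_introns(transcripts, intron_border=0, intron_min_length=0):
    """Derive merged introns for a gene from the exons of its transcripts."""
    # Collect the introns of every transcript.
    introns = []
    for exons in transcripts.values():
        exons = sorted(exons)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
            if prev_end < next_start:
                introns.append((prev_end, next_start))

    # Merge overlapping introns across transcripts.
    merged = []
    for start, end in sorted(introns):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Trim the border first, then filter on the remaining length.
    result = []
    for start, end in merged:
        start, end = start + intron_border, end - intron_border
        if end > start and end - start >= intron_min_length:
            result.append((start, end))
    return result


gene = {
    "t1": [(100, 200), (300, 400), (450, 500)],
    "t2": [(100, 250), (350, 400)],
}
print(exons_to_introns(gene))
print(exons_to_introns(gene, intron_border=10, intron_min_length=50))
```

Because trimming happens before filtering, the second intron (50 bases before trimming, 30 after) is dropped in the second call even though its untrimmed length meets the threshold.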

transcripts2genes

Clusters transcripts into genes by exon overlap, ignoring any gene_ids in the gtf file. May be used in conjunction with reset-strand.

The option --permit-duplicates may be specified in order to allow gene-ids to be duplicated within the input gtf file (i.e. for the same gene-id to appear non-consecutively within the input file). However, this option currently only works for merge-exons, merge-transcripts, merge-introns, and intersect-transcripts. It DOES NOT work for merge-genes, join-exons, or exons2introns.
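Clustering by exon overlap can be illustrated with a small union-find structure. This is a sketch under simplifying assumptions, not the script's algorithm: all transcripts are taken to lie on one contig and strand, exons are hypothetical half-open (start, end) tuples, and every pair of transcripts is compared, which is quadratic in the number of transcripts.

```python
def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_transcripts(transcripts):
    """Group transcript ids whose exons overlap, directly or transitively."""
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = sorted(transcripts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if any(overlaps(ea, eb) for ea in transcripts[a] for eb in transcripts[b]):
                parent[find(b)] = find(a)

    clusters = {}
    for tid in ids:
        clusters.setdefault(find(tid), []).append(tid)
    return sorted(clusters.values())


transcripts = {
    "t1": [(100, 200)],
    "t2": [(150, 250), (400, 500)],
    "t3": [(450, 600)],
    "t4": [(1000, 1100)],
}
print(cluster_transcripts(transcripts))
```

Here t1 and t3 do not overlap each other, but both overlap t2, so all three end up in the same cluster.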

Filtering gene sets

Options that can be used to filter gtf files. For further detail see command line options.

Input gtfs need to be sorted so that features for a gene or transcript appear consecutively within the file. This can be achieved using --method=sort --sort-order.

filter

When filtering on the basis of ‘gene-id’ or ‘transcript-id’, a filename containing ids to be removed may be provided using --map-tsv-file. Alternatively, a random subsample of genes/transcripts may be retained using --sample-size. Use --min-exons-length in conjunction with --sample-size to specify a minimum length for genes/transcripts to be retained. Use --ignore-strand to set strand to ‘.’ in output.

Other filter options include longest-gene, longest-transcript, or representative-transcript.

When filtering on the basis of gene-id, transcript-id or longest-gene, --invert-filter may be used to invert the selection.

remove-overlapping

Given a second gff-formatted file (--gff-file), removes any transcripts that intersect intervals in the supplied file. (Does not account for strand.)

remove-duplicates

Removes duplicate features from the gtf file. The type of feature to be removed is set by the option --duplicate-feature. Setting --duplicate-feature to ‘gene’, ‘transcript’, or ‘coordinates’ will remove any interval for which non-consecutive occurrences of the specified term appear in the input gtf file. Setting it to ‘ucsc’ will remove any interval for which the transcript-id contains ‘_dup’.
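Detecting non-consecutive occurrences of an id can be sketched with itertools.groupby. The records below are hypothetical dictionaries, and the sketch assumes that every record carrying a duplicated id is dropped, which is one reading of the description above rather than confirmed behaviour of the script.

```python
from itertools import groupby


def non_consecutive_ids(ids):
    """Return ids that occur in more than one separate run."""
    seen, duplicated = set(), set()
    # groupby yields one key per run of identical consecutive values.
    for key, _ in groupby(ids):
        if key in seen:
            duplicated.add(key)
        seen.add(key)
    return duplicated


def remove_duplicates(records, field):
    """Drop every record whose `field` value appears non-consecutively."""
    duplicated = non_consecutive_ids(record[field] for record in records)
    return [record for record in records if record[field] not in duplicated]


records = [
    {"transcript_id": "t1", "start": 100},
    {"transcript_id": "t1", "start": 300},
    {"transcript_id": "t2", "start": 500},
    {"transcript_id": "t1", "start": 900},  # t1 reappears after t2
    {"transcript_id": "t3", "start": 1200},
]
for record in remove_duplicates(records, "transcript_id"):
    print(record)
```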

Setting fields

Options for altering fields within gtf.

rename-genes

When a mapping file is provided using --map-tsv-file, renames the gene_id to the one supplied. Outputs a gtf file with the field renamed. Any entry in the input gtf not appearing in the mapping file is discarded.
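The rename-and-discard behaviour can be illustrated as below. The layout of the mapping file is an assumption made for this sketch (two tab-separated columns, old id then new id, no header), and the records are hypothetical dictionaries rather than parsed gtf lines.

```python
import csv
import io


def read_id_map(handle):
    """Read a two-column, tab-separated map of old id to new id.

    Assumed layout for illustration only; the real file format may differ.
    """
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def rename_genes(records, id_map):
    """Rename gene_id via the map; records without a mapping are dropped."""
    renamed = []
    for record in records:
        new_id = id_map.get(record["gene_id"])
        if new_id is None:
            continue  # not in the mapping file: discarded
        renamed.append({**record, "gene_id": new_id})
    return renamed


id_map = read_id_map(io.StringIO("ENSG1\tgeneA\nENSG2\tgeneB\n"))
records = [
    {"gene_id": "ENSG1", "start": 100},
    {"gene_id": "ENSG3", "start": 200},  # no mapping, will be dropped
    {"gene_id": "ENSG2", "start": 300},
]
print(rename_genes(records, id_map))
```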

rename-transcripts

As rename-genes, but renames the transcript_id.

add-protein-id

Takes a map of transcript_id to protein_id from a tsv file (see option --map-tsv-file) and appends the protein_id provided to the attributes field. Any entry with a transcript_id not appearing in the tsv file is discarded.

renumber-genes

Renumber genes from 1 using the pattern provided in --pattern-identifier.

renumber-transcripts

Renumber transcripts from 1 using the pattern provided in --pattern-identifier.
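The pattern is a printf-style template in which %i is replaced by a running number. A minimal sketch of the renumbering idea follows. It assumes that numbers are assigned in order of first appearance, which is not confirmed by the text above.

```python
def renumber(ids, pattern="MERGED_%i"):
    """Replace each distinct id with the pattern filled in with 1, 2, 3, ...

    Assumption: numbering follows the order in which ids first appear.
    """
    mapping = {}
    renamed = []
    for old_id in ids:
        if old_id not in mapping:
            mapping[old_id] = pattern % (len(mapping) + 1)
        renamed.append(mapping[old_id])
    return renamed


print(renumber(["ENST2", "ENST2", "ENST1", "ENST2"]))
print(renumber(["a", "b"], pattern="TR%06i"))
```

A zero-padded pattern such as the hypothetical "TR%06i" keeps the resulting identifiers in the same order when sorted as text.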

unset-genes

Renumber genes from 1 using the pattern provided in --pattern-identifier. Transcripts with the same gene-id in the input gtf file will have different gene-ids in the output gtf file.

set-transcript-to-gene

Will set the transcript-id to the gene-id for each feature.

set-gene-to-transcript

Will set the gene-id to the transcript-id for each feature.

set-protein-to-transcript

Will append the transcript_id to the attributes field as ‘protein_id’.

set-score-to-distance

Will reset the score field (field 6) of each feature in the input gtf to be the distance from the transcription start site to the start of the feature. (Assumes the input file is sorted by transcript-id.)

set-gene_biotype-to-source

Sets the gene_biotype attribute from the source column. Will only set if biotype attribute is not present in the current record.

rename-duplicates

Renames duplicate gene_ids and transcript_ids by adding a numerical suffix.

set-source-to-transcript_biotype

Sets the source attribute to the transcript_biotype attribute. Will only set if transcript_biotype attribute is present in the current record.

Usage

The following example sorts the input gene set by gene (method=sort) so that it can be used as input for method=intersect-transcripts, which outputs the genomic regions within a gene that are covered by all transcripts of that gene. Finally, the resultant transcripts are renamed with the pattern “MERGED_%i”:

cgat gtf2gtf \
        --method=sort \
        --sort-order=gene \
| cgat gtf2gtf \
        --method=intersect-transcripts \
        --with-utr \
| cgat gtf2gtf \
        --method=renumber-transcripts \
        --pattern-identifier=MERGED_%i

Type:

cgat gtf2gtf --help

for command line options.

Command line Options

usage: gtf2gtf [-h] [--version] [--merge-exons-distance MERGE_EXONS_DISTANCE]
               [--pattern-identifier PATTERN]
               [--sort-order {gene,gene+transcript,transcript,position,contig+gene,position+gene,gene+position,gene+exon}]
               [--mark-utr] [--without-utr]
               [--filter-method {gene,transcript,longest-gene,longest-transcript,representative-transcript,proteincoding,lincrna}]
               [-a tsv] [--gff-file GFF] [--invert-filter]
               [--sample-size SAMPLE_SIZE]
               [--intron-min-length INTRON_MIN_LENGTH]
               [--min-exons-length MIN_EXONS_LENGTH]
               [--intron-border INTRON_BORDER] [--ignore-strand]
               [--permit-duplicates]
               [--duplicate-feature {gene,transcript,both,ucsc,coordinates}]
               [--use-gene-id]
               [-m {add-protein-id,exons2introns,filter,find-retained-introns,genes-to-unique-chunks,intersect-transcripts,join-exons,merge-exons,merge-transcripts,merge-genes,merge-introns,remove-overlapping,remove-duplicates,rename-genes,rename-transcripts,rename-duplicates,renumber-genes,renumber-transcripts,set-transcript-to-gene,set-gene-to-transcript,set-protein-to-transcript,set-score-to-distance,set-gene_biotype-to-source,set-source-to-transcript_biotype,sort,transcript2genes,unset-genes}]
               [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
               [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]