gtf2gtf.py - manipulate transcript models
- Tags
Genomics Genesets GTF Manipulation
Purpose
This script reads a gene set in gtf format from stdin, applies some
transformation, and outputs a new gene set in gtf format to stdout.
The transformation is chosen by the --method
command line option.
Transformations available for use in this script can broadly be classified into four categories:
sorting gene sets
manipulating gene models
filtering gene sets
setting/resetting fields within a gtf file
Further options for working with gtf files are available in gff2gff.py, which can be run with the option --is-gtf.
Sorting gene sets
sort
Sorts entries in gtf file by one or more fields
option            order in which fields are sorted
gene              gene_id, contig, start
gene+transcript   gene_id, transcript_id, contig, start
contig+gene       contig, gene_id, transcript_id, start
transcript        transcript_id, contig, start
position          contig, start
position+gene     contig( gene_id, start )
gene+position     gene_id, contig, start
gene+exon         gene_id, exon_id
N.B. position+gene sorts by gene_id, start, then subsequently sorts flattened gene lists by contig, start
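The sort orders in the table can be read as tuple sort keys. The following is a minimal Python sketch of that idea, not the code used by gtf2gtf.py: the Feature record and the SORT_KEYS mapping are hypothetical names introduced for illustration, and only the orders whose key is a plain field list are covered.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical, stripped-down stand-in for one gtf line."""
    contig: str
    start: int
    gene_id: str
    transcript_id: str


# One key function per sort order, following the field lists in the table.
SORT_KEYS = {
    "gene": lambda f: (f.gene_id, f.contig, f.start),
    "gene+transcript": lambda f: (f.gene_id, f.transcript_id, f.contig, f.start),
    "contig+gene": lambda f: (f.contig, f.gene_id, f.transcript_id, f.start),
    "transcript": lambda f: (f.transcript_id, f.contig, f.start),
    "position": lambda f: (f.contig, f.start),
}


def sort_features(features, order):
    """Return the features sorted according to the named sort order."""
    return sorted(features, key=SORT_KEYS[order])


features = [
    Feature("chr2", 500, "geneB", "tB1"),
    Feature("chr1", 300, "geneA", "tA2"),
    Feature("chr1", 100, "geneA", "tA1"),
    Feature("chr1", 200, "geneC", "tC1"),
]
by_gene = sort_features(features, "gene")
by_position = sort_features(features, "position")
```

Sorting by gene keeps both geneA features next to each other, which is the property the gene-model methods rely on; sorting by position interleaves geneC between them.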
Manipulating gene-models
Options that can be used to alter the features represented in a gtf file. Only one method can be specified at once.
Input gtfs need to be sorted so that features for a gene or transcript
appear consecutively within the file. This can be achieved using --method=sort.
genes-to-unique-chunks
Divide the complete length of a gene up into chunks that represent ranges of bases that are all present in the same set of transcripts. E.g. for two overlapping exons, one entry will be output representing the overlap and a separate entry for each of the sequences present in only one exon. Ranges which are between the first TSS and last TTS but not present in any transcript (i.e. merged introns) are also output. Useful for DEXSeq-like splicing analysis.
find-retained-introns
Finds intervals within a transcript that represent retained introns. Here a retained intron is considered to be an intron in one transcript that is entirely contained within the exon of another. The retained intron will be assigned to the transcript with the containing exon. Where multiple, overlapping introns are contained within a single exon of a transcript, the union of the introns will be output. Thus, when considering an individual transcript, outputs will be non-overlapping. However, overlapping, or even identical, features can be output if they belong to different transcripts.
merge-exons
Merges overlapping exons for all transcripts of a gene, outputting the merged exons. Can be used in conjunction with --merge-exons-distance to set the minimum distance that may appear between two exons before they are merged. If --mark-utr is set, the UTR regions will be output separately.
merge-transcripts
Merges all transcripts of a gene. The output contains a single interval that spans the original gene (both introns and exons). If --with-utr is set, the output interval will also contain UTR.
merge-genes
Merges genes that have overlapping exons, outputting a single gene_id and transcript_id for all exons of overlapping genes. The input needs to be sorted by transcript. (Does not merge intervals on different strands.)
join-exons
Joins together all exons of a transcript, outputting a single interval that spans the original transcript (both introns and exons). Input needs to be sorted by transcript.
intersect-transcripts
Finds regions representing the intersection of all transcripts of a gene. Output will contain intervals spanning only those bases covered by all transcripts. If --with-utr is set, the UTR will also be included in the intersect. This method only uses exon or CDS features.
merge-introns
Outputs a single interval that spans the region between the start of the first intron and the end of the last intron. Single-exon genes will not be output. The input needs to be sorted by gene.
exons2introns
Merges overlapping introns for all transcripts of a gene, outputting the merged introns. Use --intron-min-length to ignore merged introns below a specified length. Use --intron-border to specify a number of residues to remove at either end of output introns (residues are removed prior to filtering on size when used in conjunction with --intron-min-length).
transcripts2genes
Cluster transcripts into genes by exon overlap, ignoring any gene_ids in the gtf file. May be used in conjunction with reset-strand.
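The interval arithmetic behind merge-exons and exons2introns can be sketched in a few lines of Python. This is an illustration only, not the implementation in gtf2gtf.py. It assumes 0-based half-open (start, end) coordinates, reads --merge-exons-distance as the largest gap that is still bridged (the max_gap parameter), and all function names are hypothetical.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge intervals that overlap, touch, or lie within max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def transcript_introns(exons):
    """Gaps between consecutive exons of one transcript."""
    exons = sorted(exons)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def merged_introns(transcripts, border=0, min_length=0):
    """Union of introns over all transcripts, trimmed first, then size-filtered."""
    introns = [intron for exons in transcripts.values()
               for intron in transcript_introns(exons)]
    trimmed = [(s + border, e - border) for s, e in merge_intervals(introns)]
    return [(s, e) for s, e in trimmed if e > s and e - s >= min_length]


# Toy gene with two transcripts (made-up coordinates).
gene = {
    "t1": [(100, 200), (300, 400), (500, 600)],
    "t2": [(100, 250), (350, 600)],
}
all_exons = [exon for exons in gene.values() for exon in exons]

exons_merged = merge_intervals(all_exons)
exons_bridged = merge_intervals(all_exons, max_gap=50)
introns_merged = merged_introns(gene)
introns_filtered = merged_introns(gene, border=10, min_length=100)
```

In the toy gene, the merged introns (200, 350) and (400, 500) are trimmed by 10 bases at each end before the length filter is applied, so the second intron (80 bases after trimming) is dropped when the minimum length is 100.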
The option --permit-duplicates may be specified in order to
allow gene-ids to be duplicated within the input gtf file
(i.e. for the same gene-id to appear non-consecutively within the
input file). However, this option currently only works for
merge-exons, merge-transcripts, merge-introns, and
intersect-transcripts. It DOES NOT work for merge-genes,
join-exons, or exons2introns.
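To make "non-consecutively" concrete, the sketch below reports identifiers that reappear after a different identifier has intervened. It is a hypothetical helper for checking an input before running a gene-model method, not part of gtf2gtf.py, and it works on a plain list of gene_ids in file order.

```python
def non_consecutive_ids(gene_ids):
    """Return ids that occur in more than one separate run, in order of detection."""
    finished = set()    # ids whose first run has already ended
    duplicated = []
    previous = None
    for gene_id in gene_ids:
        if gene_id != previous:
            if gene_id in finished and gene_id not in duplicated:
                duplicated.append(gene_id)
            if previous is not None:
                finished.add(previous)
            previous = gene_id
    return duplicated


unsorted_ids = ["g1", "g1", "g2", "g1", "g3", "g2"]
sorted_ids = sorted(unsorted_ids)

problems_before = non_consecutive_ids(unsorted_ids)
problems_after = non_consecutive_ids(sorted_ids)
```

An empty result means every gene_id forms a single consecutive block, which is the condition the methods above expect unless duplicates are explicitly permitted.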
Filtering gene sets
Options that can be used to filter gtf files. For further detail see command line options.
Input gtfs need to be sorted so that features for a gene or transcript
appear consecutively within the file. This can be achieved using --method=sort --sort-order.
filter
When filtering on the basis of ‘gene-id’ or ‘transcript-id’, a filename containing ids to be removed may be provided using --map-tsv-file. Alternatively, a random subsample of genes/transcripts may be retained using --sample-size. Use --min-exons-length in conjunction with --sample-size to specify a minimum length for genes/transcripts to be retained. Use --ignore-strand to set strand to ‘.’ in output.
Other filter options include longest-gene, longest-transcript, or representative-transcript. When filtering on the basis of gene-id, transcript-id or longest-gene, --invert-filter may be used to invert the selection.
remove-overlapping
Given a second gff-formatted file (--gff-file), removes any overlapping features: any transcripts that intersect intervals in the supplied file are removed. (Does not account for strand.)
remove-duplicates
Remove duplicate features from gtf file. The type of feature to be removed is set by the option --duplicate-feature. Setting --duplicate-feature to ‘gene’, ‘transcript’, or ‘coordinates’ will remove any interval for which non-consecutive occurrences of the specified term appear in the input gtf file. Setting it to ‘ucsc’ will remove any interval for which the transcript-id contains ‘_dup’.
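As an illustration of the remove-overlapping rule (a whole transcript is dropped as soon as any of its features intersects an interval in the second file, regardless of strand), here is a small Python sketch. It is not the script's implementation: it uses a simple quadratic scan, assumes half-open (contig, start, end) tuples, and the function names are made up for the example.

```python
def overlaps(a, b):
    """True if two (contig, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_overlapping(transcripts, mask):
    """Keep only transcripts none of whose features overlap a mask interval.

    transcripts maps transcript_id to a list of (contig, start, end) features;
    mask is a list of (contig, start, end) intervals. Strand is ignored.
    """
    return {
        transcript_id: features
        for transcript_id, features in transcripts.items()
        if not any(overlaps(f, m) for f in features for m in mask)
    }


transcripts = {
    "t1": [("chr1", 100, 200), ("chr1", 300, 400)],
    "t2": [("chr1", 500, 600)],
    "t3": [("chr2", 100, 200)],
}
mask = [("chr1", 350, 360), ("chr2", 200, 300)]

kept = remove_overlapping(transcripts, mask)
```

Here t1 is removed because its second exon contains a mask interval, while t3 is kept because its feature merely abuts the mask interval on chr2 under half-open coordinates.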
Setting fields
Options for altering fields within gtf.
rename-genes
With a mapping file provided using --map-tsv-file, renames the gene_id to the one supplied. Outputs a gtf file with the field renamed. Any entry in the input gtf not appearing in the mapping file is discarded.
rename-transcripts
As rename-genes, but renames the transcript_id.
add-protein-id
Takes a map of transcript_id to protein_id from a tsv file (see option --map-tsv-file) and appends the protein_id provided to the attributes field. Any entry with a transcript_id not appearing in the tsv file is discarded.
renumber-genes
Renumber genes from 1 using the pattern provided in --pattern-identifier.
renumber-transcripts
Renumber transcripts from 1 using the pattern provided in --pattern-identifier.
unset-genes
Renumber genes from 1 using the pattern provided in --pattern-identifier. Transcripts with the same gene-id in the input gtf file will have different gene-ids in the output gtf file.
set-transcript-to-gene
Will set the transcript-id to the gene-id for each feature.
set-gene-to-transcript
Will set the gene-id to the transcript-id for each feature.
set-protein-to-transcript
Will append transcript_id to the attributes field as ‘protein_id’.
set-score-to-distance
Will reset the score field (field 6) of each feature in input gtf to be the distance from transcription start site to the start of the feature. (Assumes input file is sorted by transcript-id)
set-gene_biotype-to-source
Sets the gene_biotype attribute from the source column. Will only set if biotype attribute is not present in the current record.
rename-duplicates
Rename duplicate gene_ids and transcript_ids by addition of a numerical suffix.
set-source-to-transcript_biotype
Sets the source attribute to the transcript_biotype attribute. Will only set if the transcript_biotype attribute is present in the current record.
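The renumbering methods take a printf-style pattern such as MERGED_%i. The sketch below shows how such a pattern could be expanded with Python's % operator, assigning numbers from 1 in order of first appearance. The numbering order and the renumber helper are assumptions made for illustration; the script itself may assign numbers differently.

```python
def renumber(ids, pattern="MERGED_%i"):
    """Map each distinct id to pattern % n, with n counting from 1."""
    mapping = {}
    renamed = []
    for old_id in ids:
        if old_id not in mapping:
            mapping[old_id] = pattern % (len(mapping) + 1)
        renamed.append(mapping[old_id])
    return renamed


gene_ids = ["ENSG2", "ENSG2", "ENSG1", "ENSG2"]
new_default = renumber(gene_ids)
new_padded = renumber(gene_ids, pattern="GENE_%05i")
```

Every occurrence of the same input id receives the same new id, so features of one gene stay grouped after renumbering; a zero-padded pattern such as GENE_%05i keeps the new ids sortable as text.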
Usage
The following example sorts the input gene set by gene
(method=sort) so that it can be used as input for
method=intersect-transcripts, which outputs the genomic
regions within a gene that are covered by all transcripts of that gene.
Finally, the resultant transcripts are renamed with the pattern
“MERGED_%i”:
cgat gtf2gtf --method=sort --sort-order=gene \
| cgat gtf2gtf --method=intersect-transcripts --with-utr \
| cgat gtf2gtf --method=renumber-transcripts --pattern-identifier=MERGED_%i
Type:
cgat gtf2gtf --help
for command line options.
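To show what the intersect-transcripts step of the pipeline above computes, here is a minimal Python sketch that intersects the exon intervals of all transcripts of one gene. It is not taken from gtf2gtf.py: coordinates are assumed to be half-open (start, end) pairs, each transcript's exons are assumed not to overlap one another, and the function names are hypothetical.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def intersect_transcripts(transcripts):
    """Bases covered by every transcript; expects at least one transcript."""
    return reduce(intersect_two, (sorted(exons) for exons in transcripts))


gene_transcripts = [
    [(100, 200), (300, 400)],
    [(150, 350)],
    [(120, 180), (320, 500)],
]
shared = intersect_transcripts(gene_transcripts)
```

Only the bases present in all three toy transcripts survive, giving two short shared intervals; these are the kind of regions that the final step of the pipeline would then rename.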
Command line Options
usage: gtf2gtf [-h] [--version] [--merge-exons-distance MERGE_EXONS_DISTANCE]
[--pattern-identifier PATTERN]
[--sort-order {gene,gene+transcript,transcript,position,contig+gene,position+gene,gene+position,gene+exon}]
[--mark-utr] [--without-utr]
[--filter-method {gene,transcript,longest-gene,longest-transcript,representative-transcript,proteincoding,lincrna}]
[-a tsv] [--gff-file GFF] [--invert-filter]
[--sample-size SAMPLE_SIZE]
[--intron-min-length INTRON_MIN_LENGTH]
[--min-exons-length MIN_EXONS_LENGTH]
[--intron-border INTRON_BORDER] [--ignore-strand]
[--permit-duplicates]
[--duplicate-feature {gene,transcript,both,ucsc,coordinates}]
[--use-gene-id]
[-m {add-protein-id,exons2introns,filter,find-retained-introns,genes-to-unique-chunks,intersect-transcripts,join-exons,merge-exons,merge-transcripts,merge-genes,merge-introns,remove-overlapping,remove-duplicates,rename-genes,rename-transcripts,rename-duplicates,renumber-genes,renumber-transcripts,set-transcript-to-gene,set-gene-to-transcript,set-protein-to-transcript,set-score-to-distance,set-gene_biotype-to-source,set-source-to-transcript_biotype,sort,transcript2genes,unset-genes}]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]