gtf2gtf.py - manipulate transcript models
Tags: Genomics Genesets GTF Manipulation
Purpose
This script reads a gene set in gtf format from stdin, applies some
transformation, and outputs a new gene set in gtf format to stdout.
The transformation is chosen by the --method command line option.
Transformations available for use in this script can broadly be classified into four categories:
- sorting gene sets
- manipulating gene models
- filtering gene sets
- setting/resetting fields within a gtf file
Further options for working with gtf files are available in gff2gff.py, which can be run with the option --is-gtf.
Sorting gene sets
sort
Sorts entries in gtf file by one or more fields

option            order in which fields are sorted
gene              gene_id, contig, start
gene+transcript   gene_id, transcript_id, contig, start
contig+gene       contig, gene_id, transcript_id, start
transcript        transcript_id, contig, start
position          contig, start
position+gene     contig( gene_id, start )
gene+position     gene_id, contig, start
gene+exon         gene_id, exon_id

N.B. position+gene sorts by gene_id, start, then subsequently sorts flattened gene lists by contig, start
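As a rough illustration of how the sort orders listed above translate into sort keys, the following Python sketch sorts a handful of made-up records. The record layout, the `SORT_KEYS` mapping and the `sort_records` helper are assumptions made for this example; they are not taken from the gtf2gtf source.

```python
from operator import itemgetter

# Hypothetical minimal records; a real gtf line carries nine fields.
records = [
    {"gene_id": "g2", "transcript_id": "t3", "contig": "chr1", "start": 500},
    {"gene_id": "g1", "transcript_id": "t2", "contig": "chr2", "start": 100},
    {"gene_id": "g1", "transcript_id": "t1", "contig": "chr2", "start": 300},
    {"gene_id": "g2", "transcript_id": "t3", "contig": "chr1", "start": 200},
]

# Key fields for a subset of the documented sort orders.
SORT_KEYS = {
    "gene": ("gene_id", "contig", "start"),
    "gene+transcript": ("gene_id", "transcript_id", "contig", "start"),
    "transcript": ("transcript_id", "contig", "start"),
    "position": ("contig", "start"),
}


def sort_records(records, order):
    """Return the records sorted by the key fields of the given sort order."""
    return sorted(records, key=itemgetter(*SORT_KEYS[order]))


by_gene = sort_records(records, "gene")
by_position = sort_records(records, "position")
print([r["start"] for r in by_gene])      # [100, 300, 200, 500]
print([r["start"] for r in by_position])  # [200, 500, 100, 300]
```

Sorting by gene keeps all features of a gene together, which is the property the manipulation and filtering methods rely on.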
Manipulating gene-models
Options that can be used to alter the features represented in a gtf file. Only one method can be specified at once.
Input gtfs need to be sorted so that features for a gene or transcript
appear consecutively within the file. This can be achieved using
--method=sort.
genes-to-unique-chunks
Divides the complete length of a gene into chunks that represent ranges of bases that are all present in the same set of transcripts. E.g. for two overlapping exons, one entry will be output representing the overlap and a separate entry for each of the sequences present in only one exon. Ranges that lie between the first TSS and last TTS but are not present in any transcript (i.e. merged introns) are also output. Useful for DEXSeq-like splicing analysis.
find-retained-introns
Finds intervals within a transcript that represent retained introns. Here a retained intron is considered to be an intron in one transcript that is entirely contained within an exon of another. The retained intron will be assigned to the transcript with the containing exon. Where multiple, overlapping introns are contained within a single exon of a transcript, the union of the introns will be output. Thus, when considering an individual transcript, outputs will be non-overlapping. However, overlapping or even identical features can be output if they belong to different transcripts.
merge-exons
Merges overlapping exons for all transcripts of a gene, outputting the merged exons. Can be used in conjunction with --merge-exons-distance to set the minimum distance that may appear between two exons before they are merged. If --mark-utr is set, the UTR regions will be output separately.
merge-transcripts
Merges all transcripts of a gene. Output contains a single interval that spans the original gene (both introns and exons). If --with-utr is set, the output interval will also contain UTR.
merge-genes
Merges genes that have overlapping exons, outputting a single gene_id and transcript_id for all exons of overlapping genes. The input needs to be sorted by transcript. (Does not merge intervals on different strands.)
join-exons
Joins together all exons of a transcript, outputting a single interval that spans the original transcript (both introns and exons). Input needs to be sorted by transcript.
intersect-transcripts
Finds regions representing the intersection of all transcripts of a gene. Output will contain intervals spanning only those bases covered by all transcripts. If --with-utr is set, the UTR will also be included in the intersect. This method only uses exon or CDS features.
merge-introns
Outputs a single interval that spans the region between the start of the first intron and the end of the last intron. Single-exon genes will not be output. The input needs to be sorted by gene.
exons2introns
Merges overlapping introns for all transcripts of a gene, outputting the merged introns. Use --intron-min-length to ignore merged introns below a specified length. Use --intron-border to specify a number of residues to remove at either end of output introns (residues are removed prior to filtering on size when used in conjunction with --intron-min-length).
transcripts2genes
Clusters transcripts into genes by exon overlap, ignoring any gene_ids in the gtf file. May be used in conjunction with reset-strand.
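To make the merge-exons idea concrete, here is a minimal Python sketch of merging the exons of one gene. It assumes half-open (start, end) intervals on a single contig and strand, and it assumes that two exons are merged when the gap between them is at most `min_distance`; the function name and these conventions are illustrative only, and the script's exact handling of abutting exons and of the distance threshold may differ.

```python
def merge_exons(exons, min_distance=0):
    """Merge intervals that overlap or are separated by at most min_distance.

    exons: iterable of half-open (start, end) tuples on one contig/strand.
    Note: with min_distance=0, abutting intervals are merged in this sketch.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start - merged[-1][1] <= min_distance:
            # Extend the previous block instead of opening a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


# Exons pooled from all transcripts of a hypothetical gene.
exons = [(100, 200), (150, 300), (310, 400), (1000, 1100)]
print(merge_exons(exons))                   # [(100, 300), (310, 400), (1000, 1100)]
print(merge_exons(exons, min_distance=10))  # [(100, 400), (1000, 1100)]
```

Sorting first means each exon only has to be compared with the most recently merged block, so the merge is a single pass.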
The option --permit-duplicates may be specified in order to
allow gene-ids to be duplicated within the input gtf file
(i.e. for the same gene-id to appear non-consecutively within the
input file). However, this option currently only works for
merge-exons, merge-transcripts, merge-introns, and
intersect-transcripts. It DOES NOT work for merge-genes,
join-exons, or exons2introns.
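Whether an input file meets the "consecutive" requirement can be checked before running a method. The sketch below is a hypothetical helper, not part of gtf2gtf; it takes the gene_id of each line in file order and reports the ids that reappear after a different id has been seen.

```python
def non_consecutive_ids(ids):
    """Return ids that occur in more than one separate run, in order found."""
    seen = set()
    previous = None
    duplicated = []
    for identifier in ids:
        if identifier != previous:
            # A new run starts; if the id was seen before, it is split up.
            if identifier in seen and identifier not in duplicated:
                duplicated.append(identifier)
            seen.add(identifier)
            previous = identifier
    return duplicated


print(non_consecutive_ids(["g1", "g1", "g2", "g1", "g3", "g2"]))  # ['g1', 'g2']
print(non_consecutive_ids(["g1", "g1", "g2"]))                    # []
```

An empty result indicates that every gene_id forms a single consecutive block; otherwise the file needs sorting first.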
Filtering gene sets
Options that can be used to filter gtf files. For further detail see command line options.
Input gtfs need to be sorted so that features for a gene or transcript
appear consecutively within the file. This can be achieved using
--method=sort --sort-order.
filter
When filtering on the basis of ‘gene-id’ or ‘transcript-id’, a file containing the ids to be removed may be provided using --map-tsv-file. Alternatively, a random subsample of genes/transcripts may be retained using --sample-size. Use --min-exons-length in conjunction with --sample-size to specify a minimum length for genes/transcripts to be retained. Use --ignore-strand to set strand to ‘.’ in output.
Other filter options include longest-gene, longest-transcript, or representative-transcript.
When filtering on the basis of gene-id, transcript-id or longest-gene, --invert-filter may be used to invert the selection.
remove-overlapping
Given a second gff formatted file (--gff-file), removes any overlapping features. Any transcripts that intersect intervals in the supplied file are removed. (Does not account for strand.)
remove-duplicates
Removes duplicate features from the gtf file. The type of feature to be removed is set by the option --duplicate-feature. Setting --duplicate-feature to ‘gene’, ‘transcript’, or ‘coordinates’ will remove any interval for which non-consecutive occurrences of the specified term appear in the input gtf file. Setting it to ‘ucsc’ will remove any interval for which the transcript-id contains ‘_dup’.
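The remove-overlapping behaviour can be pictured with a short Python sketch. It assumes half-open (contig, start, end) intervals, ignores strand as the description states, and drops a transcript as soon as any of its exons overlaps an interval from the second file. Testing exons rather than the whole transcript span is an assumption of this sketch, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two half-open (contig, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_overlapping(transcripts, mask):
    """Keep only transcripts with no exon overlapping any mask interval.

    transcripts: dict of transcript_id -> list of (contig, start, end) exons
    mask: list of (contig, start, end) intervals from the second file
    """
    kept = {}
    for transcript_id, exons in transcripts.items():
        if not any(overlaps(exon, m) for exon in exons for m in mask):
            kept[transcript_id] = exons
    return kept


transcripts = {
    "t1": [("chr1", 100, 200), ("chr1", 300, 400)],
    "t2": [("chr1", 1000, 1100)],
    "t3": [("chr2", 150, 250)],
}
mask = [("chr1", 350, 360), ("chr2", 300, 400)]
print(sorted(remove_overlapping(transcripts, mask)))  # ['t2', 't3']
```

The nested scan is quadratic; for real gene sets an interval index would be the usual choice, but the filtering logic stays the same.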
Setting fields
Options for altering fields within gtf.
rename-genes
With a mapping file provided using --map-tsv-file, renames the gene_id to the one supplied. Outputs a gtf file with the field renamed. Any entry in the input gtf not appearing in the mapping file is discarded.
rename-transcripts
As rename-genes, but renames the transcript_id.
add-protein-id
Takes a map of transcript_id to protein_id from a tsv file (see option --map-tsv-file) and appends the protein_id provided to the attributes field. Any entry with a transcript_id not appearing in the tsv file is discarded.
renumber-genes
Renumber genes from 1 using the pattern provided in --pattern-identifier.
renumber-transcripts
Renumber transcripts from 1 using the pattern provided in --pattern-identifier.
unset-genes
Renumber genes from 1 using the pattern provided in --pattern-identifier. Transcripts with the same gene-id in the input gtf file will have different gene-ids in the output gtf file.
set-transcript-to-gene
Will set the transcript-id to the gene-id for each feature.
set-gene-to-transcript
Will set the gene-id to the transcript-id for each feature.
set-protein-to-transcript
Will append transcript_id to the attributes field as ‘protein_id’.
set-score-to-distance
Will reset the score field (field 6) of each feature in the input gtf to be the distance from the transcription start site to the start of the feature. (Assumes input file is sorted by transcript-id.)
set-gene_biotype-to-source
Sets the gene_biotype attribute from the source column. Will only set if the biotype attribute is not present in the current record.
rename-duplicates
Rename duplicate gene_ids and transcript_ids by addition of a numerical suffix.
set-source-to-transcript_biotype
Sets the source attribute to the transcript_biotype attribute. Will only set if the transcript_biotype attribute is present in the current record.
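The renumbering methods can be illustrated with a few lines of Python. This sketch assumes that the value of --pattern-identifier is applied as a printf-style format string (as the "%i" placeholder suggests) and that numbers are assigned in order of first appearance; the `renumber` helper is hypothetical and not taken from the script.

```python
def renumber(ids, pattern):
    """Map each distinct id to pattern % n, numbering from 1 by first appearance."""
    mapping = {}
    renamed = []
    for old in ids:
        if old not in mapping:
            mapping[old] = pattern % (len(mapping) + 1)
        renamed.append(mapping[old])
    return renamed


# transcript_id column of four hypothetical gtf lines
print(renumber(["ENST2", "ENST2", "ENST1", "ENST2"], "MERGED_%i"))
# ['MERGED_1', 'MERGED_1', 'MERGED_2', 'MERGED_1']
```

Keeping the old-to-new mapping in a dictionary means that all features of one transcript receive the same new identifier, even when they are not adjacent.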
Usage
The following example sorts the input gene set by gene
(method=sort) so that it can be used as input for
method=intersect-transcripts, which outputs the genomic
regions within a gene that are covered by all transcripts of that gene.
Finally, the resultant transcripts are renamed with the pattern
“MERGED_%i”:
    cgat gtf2gtf \
        --method=sort \
        --sort-order=gene \
    | cgat gtf2gtf \
        --method=intersect-transcripts \
        --with-utr \
    | cgat gtf2gtf \
        --method=renumber-transcripts \
        --pattern-identifier=MERGED_%i
Type:
cgat gtf2gtf --help
for command line options.
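For readers who want a picture of what the intersect-transcripts step in this pipeline computes, the following Python sketch intersects the exon sets of all transcripts of one gene. It assumes half-open (start, end) intervals that do not overlap within a transcript; the function names are illustrative and the script itself may be implemented differently.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def intersect_transcripts(transcripts):
    """Return the bases covered by every transcript in the list."""
    return reduce(intersect_two, (sorted(t) for t in transcripts))


# Exons of three hypothetical transcripts of the same gene.
t1 = [(100, 200), (300, 400)]
t2 = [(150, 250), (300, 350)]
t3 = [(120, 180), (320, 400)]
print(intersect_transcripts([t1, t2, t3]))  # [(150, 180), (320, 350)]
```

Only bases present in all three transcripts survive, which matches the description of the method as keeping the regions covered by all transcripts of a gene.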
Command line Options
usage: gtf2gtf [-h] [--version] [--merge-exons-distance MERGE_EXONS_DISTANCE]
[--pattern-identifier PATTERN]
[--sort-order {gene,gene+transcript,transcript,position,contig+gene,position+gene,gene+position,gene+exon}]
[--mark-utr] [--without-utr]
[--filter-method {gene,transcript,longest-gene,longest-transcript,representative-transcript,proteincoding,lincrna}]
[-a tsv] [--gff-file GFF] [--invert-filter]
[--sample-size SAMPLE_SIZE]
[--intron-min-length INTRON_MIN_LENGTH]
[--min-exons-length MIN_EXONS_LENGTH]
[--intron-border INTRON_BORDER] [--ignore-strand]
[--permit-duplicates]
[--duplicate-feature {gene,transcript,both,ucsc,coordinates}]
[--use-gene-id]
[-m {add-protein-id,exons2introns,filter,find-retained-introns,genes-to-unique-chunks,intersect-transcripts,join-exons,merge-exons,merge-transcripts,merge-genes,merge-introns,remove-overlapping,remove-duplicates,rename-genes,rename-transcripts,rename-duplicates,renumber-genes,renumber-transcripts,set-transcript-to-gene,set-gene-to-transcript,set-protein-to-transcript,set-score-to-distance,set-gene_biotype-to-source,set-source-to-transcript_biotype,sort,transcript2genes,unset-genes}]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]