gff2gff.py - manipulate gff files
- Tags
Genomics Intervals GFF Manipulation
Purpose
This script reads a gff formatted file, applies a transformation and outputs the new intervals in gff format. The transformation is chosen with the --method option. Below is a list of available transformations:
complement-groups
output the complement intervals for the features in the file, for example to output introns from exons. The option --group-field sets the field/attribute to group by, e.g. gene_id, transcript_id, feature, source.
combine-groups
combine all features in a group into a single interval. The option --group-field sets the field/attribute to group by, see also complement-groups.
to-forward-coordinates
translate all features to forward coordinates.
to-forward-strand
convert to forward strand.
add-upstream-flank/add-downstream-flank/add-flank
add an upstream/downstream flanking segment to the first/last exon of a group. The amount added is given through the options --extension-upstream and --extension-downstream. If --flank-method is extend, the first/last exon will be extended, otherwise a new feature will be added.
crop
crop features according to features in a separate gff file. If a feature falls in the middle of another, two entries will be output.
crop-unique
remove non-unique features from gff file.
merge-features
merge consecutive features.
join-features
group consecutive features.
filter-range
extract features overlapping a chromosomal range. The range can be set by the --filter-range option.
sanitize
reconcile chromosome names between ENSEMBL/UCSC or with an indexed genomic fasta file (see index_fasta.py - Index fasta formatted files). Raises an exception if an unknown contig is found (unless --skip-missing is set). The method to sanitize is specified by --sanitize-method. Options for --sanitize-method include “ucsc”, “ensembl”, “genome”. A pattern of contigs to remove can be given in the option --contig-pattern. If --sanitize-method is set to ucsc or ensembl, the option --assembly-report is required to allow for accurate mapping between UCSC and Ensembl. If a contig is not found in the assembly report, its name is forced into the desired convention, either by removing or prepending chr; this is useful for gff files with custom contigs. The Assembly Report can be found on the NCBI assembly page under the link “Download the full sequence report”. If --sanitize-method is set to genome, the genome file has to be provided via the option --genome-file or --contigs-tsv-file.
skip-missing
skip entries on missing contigs. This prevents an exception from being raised.
filename-agp
agp file to map coordinates from contigs to scaffolds
rename-chr
Renames chromosome names. Source and target names are supplied as a file with two columns. Examples are available at: https://github.com/dpryan79/ChromosomeMappings Note that unmapped chromosomes are dropped from the output file.
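To make the complement-groups transformation more concrete, the following is a minimal Python sketch of the idea: features are grouped by a key (for example transcript_id) and the gaps between consecutive intervals in each group are reported, which turns exons into introns. It is not the code used by gff2gff.py; the function name complement_groups and the (group, start, end) tuple layout with half-open coordinates are assumptions made for illustration.

```python
from collections import defaultdict


def complement_groups(features):
    """Return the gaps between consecutive intervals within each group.

    `features` is an iterable of (group, start, end) tuples with
    half-open coordinates. This layout is an assumption for the sketch,
    not the record type used by gff2gff.py.
    """
    by_group = defaultdict(list)
    for group, start, end in features:
        by_group[group].append((start, end))

    gaps = []
    for group, intervals in by_group.items():
        intervals.sort()
        last_end = intervals[0][1]
        for start, end in intervals[1:]:
            # A gap exists only if the next interval starts after
            # everything seen so far has ended.
            if start > last_end:
                gaps.append((group, last_end, start))
            last_end = max(last_end, end)
    return gaps


# Three exons of one transcript and a single-exon transcript.
exons = [
    ("tx1", 100, 200),
    ("tx1", 300, 400),
    ("tx1", 550, 600),
    ("tx2", 50, 80),
]
introns = complement_groups(exons)
print(introns)
```

Grouping by a different key changes the result: grouping by gene rather than transcript would report only the regions not covered by any exon of that gene.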
Usage
For many downstream applications it is helpful to make sure that a gff formatted file contains only features on placed chromosomes.
As an example, to sanitise hg38 chromosome names and remove chromosomes matching the regular expression patterns “chrUn”, “_alt” or “_random”, use the following:
cat in.gff | gff2gff.py --method=sanitize --sanitize-method=ucsc \
    --assembly-report=/path/to/file --skip-missing \
    | gff2gff.py --remove-contigs="chrUn,_random,_alt" > gff.out
The --skip-missing option prevents an exception being raised if entries are found on missing chromosomes.
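The two ideas in this example, forcing a contig name into a naming convention by removing or prepending chr and dropping contigs that match a pattern, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of gff2gff.py: the helper names force_convention and is_placed are hypothetical, the assembly-report lookup is omitted, and the contig names are only examples in the UCSC hg38 style.

```python
import re


def force_convention(contig, convention):
    """Force a contig name into 'ucsc' or 'ensembl' style by prepending
    or removing a leading 'chr'. Illustrative only: no assembly report
    is consulted here."""
    if convention == "ucsc":
        return contig if contig.startswith("chr") else "chr" + contig
    if convention == "ensembl":
        return contig[3:] if contig.startswith("chr") else contig
    raise ValueError("unknown convention: %s" % convention)


# The three patterns from the example, combined into one regular expression.
UNPLACED = re.compile("chrUn|_random|_alt")


def is_placed(contig):
    """Return True if the contig name matches none of the unwanted patterns."""
    return UNPLACED.search(contig) is None


contigs = [
    "chr1",
    "chrUn_KI270302v1",
    "chr1_KI270706v1_random",
    "chr5_GL339449v2_alt",
    "chrX",
]
placed = [c for c in contigs if is_placed(c)]
print(placed)
print(force_convention("scaffold_12", "ucsc"))
print(force_convention("chr2", "ensembl"))
```

A plain prefix rule like this cannot handle names that differ by more than the chr prefix between conventions, which is why the assembly report is needed for an accurate mapping.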
Another example, to rename UCSC chromosomes to ENSEMBL:
cat ucsc.gff | gff2gff.py --method=rename-chr \
    --rename-chr-file=ucsc2ensembl.txt > ensembl.gff
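The behaviour described for rename-chr, a two-column mapping with unmapped chromosomes dropped, can be sketched as follows. This is a hedged illustration rather than the script's implementation: the function names load_mapping and rename_contigs are hypothetical, the mapping file is assumed to be tab-separated, and passing comment lines through unchanged is an assumption of this sketch.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column, tab-separated file into a source -> target dict."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def rename_contigs(lines, mapping):
    """Yield GFF lines with the first column renamed.

    Lines on contigs missing from `mapping` are dropped. Comment lines
    are passed through unchanged (an assumption of this sketch).
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if fields[0] in mapping:
            fields[0] = mapping[fields[0]]
            yield "\t".join(fields)


# In-memory stand-in for a mapping file such as ucsc2ensembl.txt.
mapping = load_mapping(io.StringIO("chr1\t1\nchrM\tMT\n"))

gff = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";',
    'chrUn_x\tsrc\texon\t5\t50\t.\t+\t.\tgene_id "g2";',
    'chrM\tsrc\texon\t10\t90\t.\t-\t.\tgene_id "g3";',
]
renamed = list(rename_contigs(gff, mapping))
for line in renamed:
    print(line)
```

Because unmapped contigs are silently dropped, comparing the number of input and output features is a cheap check that the mapping file covers the chromosomes of interest.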
Type:
cgat gff2gff --help
for command line help.
Command line options
usage: gff2gff [-h] [--version]
[-m {add-flank,add-upstream-flank,add-downstream-flank,crop,crop-unique,complement-groups,combine-groups,filter-range,join-features,merge-features,sanitize,to-forward-coordinates,to-forward-strand,rename-chr}]
[--ignore-strand] [--is-gtf] [-c INPUT_FILENAME_CONTIGS]
[--agp-file INPUT_FILENAME_AGP] [-g GENOME_FILE]
[--crop-gff-file FILENAME_CROP_GFF] [--group-field GROUP_FIELD]
[--filter-range FILTER_RANGE]
[--sanitize-method {ucsc,ensembl,genome}]
[--flank-method {add,extend}] [--skip-missing]
[--contig-pattern CONTIG_PATTERN]
[--assembly-report ASSEMBLY_REPORT]
[--assembly-report-hasids ASSEMBLY_REPORT_HASIDS]
[--assembly-report-ucsccol ASSEMBLY_REPORT_UCSCCOL]
[--assembly-report-ensemblcol ASSEMBLY_REPORT_ENSEMBLCOL]
[--assembly-extras ASSEMBLY_EXTRAS]
[--extension-upstream EXTENSION_UPSTREAM]
[--extension-downstream EXTENSION_DOWNSTREAM]
[--min-distance MIN_DISTANCE] [--max-distance MAX_DISTANCE]
[--min-features MIN_FEATURES] [--max-features MAX_FEATURES]
[--rename-chr-file RENAME_CHR_FILE] [--timeit TIMEIT_FILE]
[--timeit-name TIMEIT_NAME] [--timeit-header]
[--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]