gff2gff.py - manipulate gff files

Tags

Genomics Intervals GFF Manipulation

Purpose

This script reads a gff formatted file, applies a transformation and outputs the new intervals in gff format. The transformation is chosen with the --method option. Below is a list of available transformations:

complement-groups

output the complement intervals for the features in the file, for example to output introns from exons. The option --group-field sets the field/attribute to group by, e.g. gene_id, transcript_id, feature, source.
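The idea of taking the complement within a group can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script's implementation: the function name `complement_intervals` is hypothetical, and intervals are assumed to be 0-based, half-open `(start, end)` tuples belonging to a single group.

```python
def complement_intervals(intervals):
    """Return the gaps between the intervals of one group.

    intervals: iterable of (start, end) tuples; assumed 0-based, half-open.
    Overlapping or touching intervals produce no gap.
    """
    gaps = []
    last_end = None
    for start, end in sorted(intervals):
        if last_end is not None and start > last_end:
            gaps.append((last_end, start))
        last_end = end if last_end is None else max(last_end, end)
    return gaps


# Hypothetical exons of one transcript; the gaps correspond to introns.
exons = [(100, 200), (300, 400), (550, 600)]
print(complement_intervals(exons))  # [(200, 300), (400, 550)]
```

Grouping by a different field (for example gene_id instead of transcript_id) would change which intervals are passed in together, and therefore which gaps are reported.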

combine-groups

combine all features in a group into a single interval. The option --group-field sets the field/attribute to group by, see also complement-groups.

to-forward-coordinates

translate all features to forward coordinates.

to-forward-strand

convert to forward strand

add-upstream-flank/add-downstream-flank/add-flank

add an upstream/downstream flanking segment to first/last exon of a group. The amount added is given through the options --extension-upstream and --extension-downstream. If --flank-method is extend, the first/last exon will be extended, otherwise a new feature will be added.
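The difference between adding a new flank feature and extending the first exon can be illustrated with a small sketch. It is not the script's code: the function name `add_upstream_flank`, the `(start, end)` tuple representation (0-based, half-open) and the clipping at coordinate 0 are assumptions made for the example. The sketch also assumes that "upstream" follows the strand, so that on the reverse strand the flank is placed after the exon with the highest coordinates.

```python
def add_upstream_flank(exons, strand, extension, method="add"):
    """Add an upstream flank to the first exon of a group.

    exons: list of (start, end) tuples; assumed 0-based, half-open.
    method: "add" emits the flank as a new interval,
            "extend" enlarges the first exon instead.
    """
    exons = sorted(exons)
    if strand == "+":
        start, end = exons[0]
        flank_start = max(0, start - extension)  # do not run off the contig start
        if method == "extend":
            return [(flank_start, end)] + exons[1:]
        return [(flank_start, start)] + exons
    # Reverse strand: the first exon in transcript order has the highest coordinates.
    start, end = exons[-1]
    if method == "extend":
        return exons[:-1] + [(start, end + extension)]
    return exons + [(end, end + extension)]


exons = [(100, 200), (300, 400)]
print(add_upstream_flank(exons, "+", 50, method="add"))
print(add_upstream_flank(exons, "+", 50, method="extend"))
```

A real implementation would also need to clip the flank at the contig end, which requires contig sizes and is left out here.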

crop

crop features according to features in a separate gff file. If a feature falls in the middle of another, two entries will be output.
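Why cropping can yield two entries is easiest to see on a single pair of intervals. The sketch below is an illustration, not the script's implementation: the function name `crop_interval` is hypothetical, intervals are assumed to be 0-based, half-open `(start, end)` tuples, and it assumes that cropping means removing the part of a feature that overlaps the cropping interval.

```python
def crop_interval(feature, mask):
    """Remove the part of `feature` that overlaps `mask`.

    Both arguments are (start, end) tuples; assumed 0-based, half-open.
    Returns zero, one or two remaining pieces.
    """
    f_start, f_end = feature
    m_start, m_end = mask
    pieces = []
    if m_start > f_start:
        # Part of the feature lies to the left of the mask.
        pieces.append((f_start, min(f_end, m_start)))
    if m_end < f_end:
        # Part of the feature lies to the right of the mask.
        pieces.append((max(f_start, m_end), f_end))
    return pieces


# The mask falls in the middle of the feature: two entries remain.
print(crop_interval((100, 200), (120, 150)))  # [(100, 120), (150, 200)]
```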

crop-unique

remove non-unique features from gff file.

merge-features

merge consecutive features.

join-features

group consecutive features.

filter-range

extract features overlapping a chromosomal range. The range can be set by the --filter-range option.

sanitize

reconcile chromosome names between ENSEMBL/UCSC or with an indexed genomic fasta file (see index_fasta.py - Index fasta formatted files). Raises an exception if an unknown contig is found (unless --skip-missing is set). The method to sanitize is specified by --sanitize-method; options are “ucsc”, “ensembl” and “genome”. A pattern of contigs to remove can be given in the option --contig-pattern. If --sanitize-method is set to ucsc or ensembl, the option --assembly-report is required to allow for accurate mapping between UCSC and Ensembl. Contig names that are not found in the assembly report are forced into the desired convention, either by removing or prepending chr; this is useful for gff files with custom contigs. The Assembly Report can be found on the NCBI assembly page under the link “Download the full sequence report”. If --sanitize-method is set to genome, the genome file has to be provided via the option --genome-file or --contigs-tsv-file.
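The lookup-then-fallback behaviour described for the ucsc and ensembl methods can be sketched as follows. This is a simplified illustration, not the script's code: the function names are hypothetical, and `mapping` stands in for a dictionary that would be built from the assembly report.

```python
def force_contig_name(contig, convention):
    """Force a contig name into a convention by adding or removing 'chr'."""
    if convention == "ucsc":
        return contig if contig.startswith("chr") else "chr" + contig
    if convention == "ensembl":
        return contig[3:] if contig.startswith("chr") else contig
    raise ValueError("unknown convention: %s" % convention)


def sanitize_contig(contig, mapping, convention):
    """Prefer the assembly-report mapping, fall back to forcing the name."""
    if contig in mapping:
        return mapping[contig]
    return force_contig_name(contig, convention)


# Hypothetical mapping, standing in for one parsed from an assembly report.
ensembl_to_ucsc = {"1": "chr1", "MT": "chrM"}
print(sanitize_contig("MT", ensembl_to_ucsc, "ucsc"))        # chrM
print(sanitize_contig("myContig", ensembl_to_ucsc, "ucsc"))  # chrmyContig
```

The mitochondrial genome shows why a report-based mapping is more accurate than string manipulation alone: prepending chr to MT gives chrMT, whereas the UCSC name is chrM.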

skip-missing

skip entries on missing contigs. This prevents an exception from being raised.

filename-agp

agp file to map coordinates from contigs to scaffolds

rename-chr

Renames chromosome names. Source and target names are supplied as a file with two columns. Examples are available at: https://github.com/dpryan79/ChromosomeMappings Note that unmapped chromosomes are dropped from the output file.
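The renaming step, including the dropping of unmapped chromosomes, can be sketched in Python as below. This is a minimal illustration rather than the script's implementation: the function names are hypothetical, the mapping file is assumed to be tab-separated, and rows with an empty target column are treated as unmapped.

```python
import io


def load_mapping(handle):
    """Read a two-column (source, target) mapping; assumed tab-separated."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_contigs(gff_lines, mapping):
    """Yield gff lines with column 1 renamed; unmapped contigs are dropped."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line  # keep comment and header lines unchanged
            continue
        fields = line.split("\t")
        new_name = mapping.get(fields[0])
        if new_name is None:
            continue  # contig has no target name: drop the record
        fields[0] = new_name
        yield "\t".join(fields)


# Hypothetical in-memory stand-ins for the mapping file and the gff input.
mapping = load_mapping(io.StringIO("chr1\t1\nchrM\tMT\nchrUn_x\t\n"))
gff = [
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g1";\n',
    'chrUn_x\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g2";\n',
]
for record in rename_contigs(gff, mapping):
    print(record, end="")
```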

Usage

For many downstream applications it is helpful to make sure that a gff formatted file contains only features on placed chromosomes.

As an example, to sanitise hg38 chromosome names and remove chromosomes matching the regular expression patterns “chrUn”, “_alt” or “_random”, use the following:

cat in.gff \
  | gff2gff.py --method=sanitize --sanitize-method=ucsc \
               --assembly-report=/path/to/file --skip-missing \
  | gff2gff.py --remove-contigs="chrUn,_random,_alt" > gff.out

The “--skip-missing” option prevents an exception from being raised if entries are found on missing chromosomes.

As another example, to rename UCSC chromosome names to ENSEMBL names, use:

cat ucsc.gff \
  | gff2gff.py --method=rename-chr \
               --rename-chr-file=ucsc2ensembl.txt > ensembl.gff

Type:

cgat gff2gff --help

for command line help.

Command line options

usage: gff2gff [-h] [--version]
               [-m {add-flank,add-upstream-flank,add-downstream-flank,crop,crop-unique,complement-groups,combine-groups,filter-range,join-features,merge-features,sanitize,to-forward-coordinates,to-forward-strand,rename-chr}]
               [--ignore-strand] [--is-gtf] [-c INPUT_FILENAME_CONTIGS]
               [--agp-file INPUT_FILENAME_AGP] [-g GENOME_FILE]
               [--crop-gff-file FILENAME_CROP_GFF] [--group-field GROUP_FIELD]
               [--filter-range FILTER_RANGE]
               [--sanitize-method {ucsc,ensembl,genome}]
               [--flank-method {add,extend}] [--skip-missing]
               [--contig-pattern CONTIG_PATTERN]
               [--assembly-report ASSEMBLY_REPORT]
               [--assembly-report-hasids ASSEMBLY_REPORT_HASIDS]
               [--assembly-report-ucsccol ASSEMBLY_REPORT_UCSCCOL]
               [--assembly-report-ensemblcol ASSEMBLY_REPORT_ENSEMBLCOL]
               [--assembly-extras ASSEMBLY_EXTRAS]
               [--extension-upstream EXTENSION_UPSTREAM]
               [--extension-downstream EXTENSION_DOWNSTREAM]
               [--min-distance MIN_DISTANCE] [--max-distance MAX_DISTANCE]
               [--min-features MIN_FEATURES] [--max-features MAX_FEATURES]
               [--rename-chr-file RENAME_CHR_FILE] [--timeit TIMEIT_FILE]
               [--timeit-name TIMEIT_NAME] [--timeit-header]
               [--random-seed RANDOM_SEED] [-v LOGLEVEL]
               [--log-config-filename LOG_CONFIG_FILENAME]
               [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
               [-E STDERR] [-S STDOUT]