gff2fasta.py - output sequences from genomic features

Tags

Genomics Intervals Sequences GFF Fasta Transformation

Purpose

This script outputs the genomic sequences for intervals within a gff or gtf formatted file.

The output can optionally be masked and filtered.

Usage

If you want to convert a features.gff file with intervals information into a fasta file containing the sequence of each interval, use this script as follows:

python gff2fasta.py --genome-file=hg19 < features.gff > features.fasta
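As a rough illustration of what this conversion involves, the sketch below extracts the sequence of a single interval from an in-memory genome. It is not the script's implementation: the `genome` dictionary, the `interval_sequence` helper and the toy record are hypothetical. It assumes standard GFF coordinates (1-based, inclusive) and that features on the minus strand are reverse-complemented.

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
# Assumes GFF coordinates are 1-based and inclusive.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def interval_sequence(genome, contig, start, end, strand="+"):
    """Return the sequence of one interval from a dict of contig -> sequence."""
    seq = genome[contig][start - 1:end]
    if strand == "-":
        # Reverse-complement features on the minus strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy genome and a single tab-separated GFF record (both made up).
genome = {"chr1": "ACGTACGTAA"}
gff_line = 'chr1\ttest\texon\t2\t5\t.\t-\t.\tgene_id "g1"'
fields = gff_line.split("\t")
seq = interval_sequence(
    genome, fields[0], int(fields[3]), int(fields[4]), fields[6]
)
print(f">{fields[0]}:{fields[3]}-{fields[4]}({fields[6]})")
print(seq)
```

The header format printed here is only an example; the real script names entries according to its own options.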

The input can also be a gtf formatted file. In that case, use the --is-gtf option:

python gff2fasta.py --genome-file=hg19 --is-gtf < features.gtf > features.fasta

If you want to add a polyA tail onto each transcript you can use the extend options:

python gff2fasta.py --genome-file=hg19 --is-gtf --extend-at=3 --extend-by=125 --extend-with=A < features.gtf > features.fasta
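The effect of combining --extend-at, --extend-by and --extend-with can be pictured with the following minimal sketch. The `extend_sequence` helper is hypothetical and covers only the --extend-with case, where the padding is a fixed string repeated --extend-by times. It assumes the sequence is already in 5' to 3' orientation and that '3only' and '5only' return just the padding; the script itself may behave differently in detail.

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
def extend_sequence(seq, extend_at, extend_by, extend_with):
    """Pad `seq` with `extend_with` repeated `extend_by` times."""
    pad = extend_with * extend_by
    if extend_at == "none":
        return seq
    if extend_at == "3":
        return seq + pad
    if extend_at == "5":
        return pad + seq
    if extend_at == "both":
        return pad + seq + pad
    if extend_at in ("3only", "5only"):
        # Assumption: only the extension itself is returned.
        return pad
    raise ValueError(f"unknown extend_at value: {extend_at}")


# A short polyA tail of five bases added to the 3' end.
print(extend_sequence("ACGT", "3", 5, "A"))
```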

If you want to merge the sequences of adjacent features that share attributes, use --merge-overlapping:

python gff2fasta.py --genome-file=hg19 --merge-overlapping < features.gff > features.fasta
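Conceptually, merging collapses overlapping or directly adjacent intervals into one before the sequence is extracted. The sketch below shows that idea for intervals on a single contig and strand. The `merge_intervals` helper is hypothetical, uses 1-based inclusive coordinates, and ignores the attribute comparison that the script performs.

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) pairs, 1-based inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_intervals([(1, 10), (11, 20), (30, 40), (35, 50)]))
```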

It is possible to filter the output by setting a maximum or minimum number of nucleotides in the resulting fasta sequence with --max-length or --min-interval-length respectively:

python gff2fasta.py --genome-file=hg19 --max-length=100 < features.gff > features.fasta
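The length filter amounts to a simple bounds check on each output sequence, as in the sketch below. The `passes_length_filter` helper is hypothetical, and treating both bounds as inclusive is an assumption rather than documented behaviour.

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
def passes_length_filter(seq, min_length=0, max_length=None):
    """Return True if len(seq) lies within the (assumed inclusive) bounds."""
    if len(seq) < min_length:
        return False
    if max_length is not None and len(seq) > max_length:
        return False
    return True


sequences = {"short": "ACGT", "long": "ACGT" * 50}
kept = {
    name: seq
    for name, seq in sequences.items()
    if passes_length_filter(seq, max_length=100)
}
print(sorted(kept))
```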

You can also filter the output by feature name with the --feature option:

python gff2fasta.py --genome-file=hg19 --feature=exon < features.gff > features.fasta

Low-complexity regions can be masked with the --masker option, and the intervals listed in a gff formatted file can be masked with --maskregions-bed-file:

python gff2fasta.py --genome-file=hg19 --masker=dust --maskregions-bed-file=intervals.gff < features.gff > features.fasta

where --masker can take the following values: dust, dustmasker, softmask, and none.
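The interval-based masking enabled by --maskregions-bed-file, and its --remove-masked-regions variant, can be sketched as follows. This does not reproduce the dust or dustmasker algorithms. The `mask_regions` helper is hypothetical, takes regions as 0-based half-open offsets into the sequence, and assumes that masked bases are replaced with `N`, which may not match what the script emits.

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
def mask_regions(seq, regions, mask_char="N", remove=False):
    """Mask (or remove) the given (start, end) regions of `seq`.

    Regions are 0-based, half-open offsets into `seq`.
    """
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            # An empty string drops the base when the parts are joined.
            chars[i] = "" if remove else mask_char
    return "".join(chars)


print(mask_regions("ACGTACGT", [(2, 5)]))
print(mask_regions("ACGTACGT", [(2, 5)], remove=True))
```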

Options

--is-gtf

Tells the script to expect a gtf format file

--genome-file

Path to the Fasta file of the genome build to use

--merge-overlapping

Merge features in gtf/gff file that are adjacent and share attributes

--feature

Filter on a gff feature such as exon or CDS

--maskregions-bed-file

Mask sequences in intervals in gff file

--remove-masked-regions

Remove sequences in intervals in gff file rather than masking them

--min-interval-length

Minimum output sequence length

--max-length

Maximum output sequence length

--extend-at

Extend sequence at 3’, 5’ or both ends. Optionally ‘3only’ or ‘5only’ will return only the 3’ or 5’ extended sequence

--extend-by

Used in conjunction with --extend-at, the number of nucleotides to extend by

--extend-with

Optional. Used in conjunction with --extend-at and --extend-by. Instead of extending by the genomic sequence, extend by this string repeated n times, where n is --extend-by

--masker

Masker type to use: dust, dustmasker, softmask or none

--fold-at

Fold the fasta sequence every n bases

--fasta-name-attribute

Use this attribute to name the fasta entries
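Folding, as offered by --fold-at, wraps each sequence onto lines of at most n bases. A minimal sketch of that wrapping, using a hypothetical `fold` helper rather than the script's own code:

```python
# Hypothetical sketch, not the gff2fasta.py implementation.
def fold(seq, width):
    """Break `seq` into lines of at most `width` characters."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


print(fold("ACGTACGTAC", 4))
```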

Command line options

usage: gff2fasta [-h] [--is-gtf] [-g GENOME_FILE] [-m] [-e FEATURE] [-f gff]
                 [--remove-masked-regions] [--min-interval-length MIN_LENGTH]
                 [--max-length MAX_LENGTH]
                 [--extend-at {none,3,5,both,3only,5only}]
                 [--header-attributes] [--extend-by EXTEND_BY]
                 [--extend-with EXTEND_WITH]
                 [--masker {dust,dustmasker,softmask,none}]
                 [--fold-at FOLD_AT] [--fasta-name-attribute NAMING_ATTRIBUTE]
                 [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                 [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                 [-E STDERR] [-S STDOUT]