gff2fasta.py - output sequences from genomic features

Tags: Genomics Intervals Sequences GFF Fasta Transformation

Purpose

This script outputs the genomic sequences for intervals within a gff or gtf formatted file.

The output can optionally be masked and filtered.

Usage

If you want to convert a features.gff file containing interval information into a fasta file containing the sequence of each interval, use this script as follows:

python gff2fasta.py --genome-file=hg19 < features.gff > features.fasta
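The core idea of this conversion can be sketched in a few lines of Python. This is only an illustration, not the script's actual implementation: the helper names are hypothetical, a small in-memory dictionary stands in for the indexed genome file, and reverse-complementing minus-strand features is assumed here as the usual convention. GFF coordinates are 1-based and inclusive, so they must be shifted when slicing a Python string.

```python
# Illustrative sketch only; function names and the toy genome are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gff_interval_sequence(gff_line, genome):
    """Return the sequence for one tab-separated GFF line.

    genome is a dict mapping contig name to sequence (a stand-in for
    an indexed genome file).
    """
    fields = gff_line.rstrip("\n").split("\t")
    contig, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    # GFF is 1-based, inclusive; Python slices are 0-based, half-open.
    seq = genome[contig][start - 1:end]
    # Assumption: minus-strand features are reverse-complemented.
    return reverse_complement(seq) if strand == "-" else seq


genome = {"chr1": "ACGTACGTAA"}
line = 'chr1\ttest\texon\t2\t5\t.\t-\t.\tgene_id "g1"'
print(gff_interval_sequence(line, genome))
```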

The input can also be a gtf formatted file. In that case, use the --is-gtf option:

python gff2fasta.py --genome-file=hg19 --is-gtf < features.gtf > features.fasta

If you want to add a polyA tail onto each transcript you can use the extend options:

python gff2fasta.py --genome-file=hg19 --is-gtf --extend-at=3 --extend-by=125 --extend-with=A < features.gtf > features.fasta
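As a rough sketch of what extending with a fixed string means, the following Python function appends or prepends a filler repeated a given number of times. The function name and its handling of the "3", "5" and "both" values are assumptions made for illustration; the script's real logic may differ.

```python
def extend_sequence(seq, extend_at, extend_by, extend_with):
    """Extend seq with extend_with repeated extend_by times.

    Hypothetical helper: only the "3", "5" and "both" modes are sketched.
    """
    filler = extend_with * extend_by
    if extend_at == "3":
        return seq + filler
    if extend_at == "5":
        return filler + seq
    if extend_at == "both":
        return filler + seq + filler
    # Any other value leaves the sequence unchanged in this sketch.
    return seq


# A 125-base polyA tail on the 3' end of a toy transcript.
tailed = extend_sequence("ACGTACGT", "3", 125, "A")
print(len(tailed))
```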

If you want to merge adjacent features that share attributes into a single sequence, use --merge-overlapping:

python gff2fasta.py --genome-file=hg19 --merge-overlapping < features.gff > features.fasta
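One plausible reading of merging is shown below: intervals on the same contig and strand that overlap or directly abut are collapsed into one. This is a minimal sketch under that assumption; the tuple layout and function name are hypothetical, and the script's own rule for which attributes must match is not reproduced here.

```python
def merge_adjacent(intervals):
    """Merge overlapping or abutting intervals.

    intervals is a list of (contig, strand, start, end) tuples using
    1-based inclusive coordinates, as in GFF.
    """
    merged = []
    for contig, strand, start, end in sorted(intervals):
        if merged:
            last = merged[-1]
            same_group = last[0] == contig and last[1] == strand
            # start <= last end + 1 covers both overlap and direct adjacency.
            if same_group and start <= last[3] + 1:
                merged[-1] = (contig, strand, last[2], max(last[3], end))
                continue
        merged.append((contig, strand, start, end))
    return merged


features = [
    ("chr1", "+", 11, 20),
    ("chr1", "+", 1, 10),
    ("chr1", "+", 30, 40),
    ("chr1", "-", 35, 50),
]
print(merge_adjacent(features))
```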

It is possible to filter the output by setting a maximum or minimum number of nucleotides in the resulting fasta sequence with --max-length or --min-interval-length respectively:

python gff2fasta.py --genome-file=hg19 --max-length=100 < features.gff > features.fasta
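The length filter amounts to a simple bounds check on each output sequence. The sketch below assumes that both limits are inclusive and that a value of 0 means "no limit"; both are assumptions for illustration, and the function name is hypothetical.

```python
def within_length_limits(sequence, min_length=0, max_length=0):
    """Return True if the sequence length is within the requested limits.

    A limit of 0 is treated as "not set" in this sketch.
    """
    n = len(sequence)
    if min_length and n < min_length:
        return False
    if max_length and n > max_length:
        return False
    return True


records = [("short", "ACGT"), ("long", "ACGT" * 50)]
kept = [name for name, seq in records if within_length_limits(seq, max_length=100)]
print(kept)
```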

Or you can also filter the output by feature name with the --feature option:

python gff2fasta.py --genome-file=hg19 --feature=exon < features.gff > features.fasta

Low-complexity regions can be masked with the --masker option, and intervals listed in a given gff formatted file can be masked with --maskregions-bed-file:

python gff2fasta.py --genome-file=hg19 --masker=dust --maskregions-bed-file=intervals.gff < features.gff > features.fasta

where --masker can take the following values: dust, dustmasker, softmask, and none.

Options

--is-gtf: Tells the script to expect a gtf format file
--genome-file: PATH to Fasta file of genome build to use
--merge-overlapping: Merge features in gtf/gff file that are adjacent and share attributes
--feature: Filter on a gff feature such as exon or CDS
--maskregions-bed-file: Mask sequences in intervals in gff file
--remove-masked-regions: Remove sequences in intervals in gff file rather than masking them
--min-interval-length: Minimum output sequence length
--max-length: Maximum output sequence length
--extend-at: Extend sequence at the 3’ end, the 5’ end, or both ends. Optionally, ‘3only’ or ‘5only’ will return only the 3’ or 5’ extended sequence
--extend-by: Used in conjunction with --extend-at, the number of nucleotides to extend by
--extend-with: Optional. Used in conjunction with --extend-at and --extend-by. Instead of extending by the genomic sequence, extend by this string repeated n times, where n is --extend-by
--masker: Masker type to use: dust, dustmasker, softmask or none
--fold-at: Fold the fasta sequence every n bases
--fasta-name-attribute: Use this attribute to name the fasta entries
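
Folding, as described for --fold-at, means breaking each sequence into lines of at most n bases. A minimal Python sketch of that operation, with a hypothetical function name and not taken from the script itself, looks like this:

```python
def fold(sequence, width):
    """Split a sequence into lines of at most width characters."""
    return "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )


print(">example")
print(fold("ACGTACGTAC", 4))
```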

Command line options

usage: gff2fasta [-h] [--is-gtf] [-g GENOME_FILE] [-m] [-e FEATURE] [-f gff]
                 [--remove-masked-regions] [--min-interval-length MIN_LENGTH]
                 [--max-length MAX_LENGTH]
                 [--extend-at {none,3,5,both,3only,5only}]
                 [--header-attributes] [--extend-by EXTEND_BY]
                 [--extend-with EXTEND_WITH]
                 [--masker {dust,dustmasker,softmask,none}]
                 [--fold-at FOLD_AT] [--fasta-name-attribute NAMING_ATTRIBUTE]
                 [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                 [--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                 [--log-config-filename LOG_CONFIG_FILENAME]
                 [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                 [-E STDERR] [-S STDOUT]