gff2fasta.py - output sequences from genomic features¶
- Tags
Genomics Intervals Sequences GFF Fasta Transformation
Purpose¶
This script outputs the genomic sequences for intervals within a gff or gtf formatted file.
The output can optionally be masked and filtered.
Usage¶
If you want to convert a features.gff file with interval information
into a fasta file containing the sequence of each interval, use this
script as follows:
python gff2fasta.py --genome-file=hg19 < features.gff > features.fasta
The input can also be a gtf formatted file. In that case, use the --is-gtf option:
python gff2fasta.py --genome-file=hg19 --is-gtf < features.gtf > features.fasta
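To make the conversion concrete, here is a minimal sketch of what "the sequence of each interval" means for a single gff or gtf line. It is not the script's implementation: the function name `interval_sequence` and the in-memory `genome` dictionary are hypothetical, and reverse-complementing minus-strand features is an assumption about the expected behaviour.

```python
# Illustrative only: a toy genome held in a dict rather than an indexed fasta file.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def interval_sequence(genome, line):
    """Return (contig, start, end, strand, sequence) for one gff/gtf line."""
    fields = line.rstrip("\n").split("\t")
    contig, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    # gff/gtf coordinates are 1-based and inclusive;
    # Python slices are 0-based and half-open.
    seq = genome[contig][start - 1:end]
    if strand == "-":
        # Assumption: minus-strand features are reported reverse-complemented.
        seq = seq.translate(COMPLEMENT)[::-1]
    return contig, start, end, strand, seq


genome = {"chr1": "ACGTACGTAAGGCCTT"}
plus = interval_sequence(genome, "chr1\ttest\texon\t3\t8\t.\t+\t.\tID=e1")
minus = interval_sequence(genome, "chr1\ttest\texon\t3\t8\t.\t-\t.\tID=e2")
print(plus[4])   # GTACGT
print(minus[4])  # ACGTAC
```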
If you want to add a polyA tail onto each transcript you can use the extend options:
python gff2fasta.py --genome-file=hg19 --is-gtf --extend-at=3 --extend-by=125 --extend-with=A < features.gtf > features.fasta
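The effect of extending with a fixed string can be sketched as follows. This only illustrates the description of --extend-with (the string repeated --extend-by times); the helper name `extend_three_prime` is hypothetical and is not part of the script.

```python
def extend_three_prime(sequence, extend_by, extend_with):
    """Append `extend_with` repeated `extend_by` times to the 3' end."""
    return sequence + extend_with * extend_by


# Mirrors --extend-at=3 --extend-by=125 --extend-with=A on a toy transcript.
tailed = extend_three_prime("ACGT", 125, "A")
print(len(tailed))  # 129
```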
If you want to merge the sequences of features that are adjacent and share attributes, use --merge-overlapping:
python gff2fasta.py --genome-file=hg19 --merge-overlapping < features.gff > features.fasta
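As a rough sketch of interval merging, the function below joins overlapping or directly adjacent intervals on one contig. It assumes 1-based inclusive coordinates and ignores the attribute comparison that the script applies; `merge_intervals` is a hypothetical name used only for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) pairs.

    Coordinates are assumed to be 1-based and inclusive, so (1, 20)
    and (21, 30) are adjacent and are merged into (1, 30).
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_intervals([(5, 20), (1, 10), (21, 30), (40, 50)])
print(merged)  # [(1, 30), (40, 50)]
```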
It is possible to filter the output by selecting a minimum or maximum number
of nucleotides in the resultant fasta sequence with --min-interval-length
or --max-length respectively:
python gff2fasta.py --genome-file=hg19 --max-length=100 < features.gff > features.fasta
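The length filter amounts to a simple bounds check on each output sequence. The sketch below assumes both bounds are inclusive, which the text above does not state; `passes_length_filter` is a hypothetical helper, not a function of the script.

```python
def passes_length_filter(sequence, min_length=0, max_length=None):
    """Return True if the sequence length lies within the given bounds.

    Assumption: both bounds are inclusive; max_length=None means no upper limit.
    """
    n = len(sequence)
    if n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True


sequences = {"short": "ACGT", "long": "A" * 150}
kept = [name for name, seq in sequences.items()
        if passes_length_filter(seq, max_length=100)]
print(kept)  # ['short']
```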
You can also filter the output by feature name with the --feature option:
python gff2fasta.py --genome-file=hg19 --feature=exon < features.gff > features.fasta
On the other hand, low-complexity regions can be masked with the --masker
option and a given gff formatted file:
python gff2fasta.py --genome-file=hg19 --masker=dust --maskregions-bed-file=intervals.gff < features.gff > features.fasta
where --masker can take the following values: dust, dustmasker, and softmask.
Options¶
--is-gtf
Tells the script to expect a gtf format file
--genome-file
PATH to Fasta file of genome build to use
--merge-overlapping
Merge features in gtf/gff file that are adjacent and share attributes
--feature
Filter on a gff feature such as exon or CDS
--maskregions-bed-file
Mask sequences in intervals in gff file
--remove-masked-regions
Remove sequences in intervals in gff file rather than masking them
--min-interval-length
Minimum output sequence length
--max-length
Maximum output sequence length
--extend-at
Extend the sequence at the 3’ end, the 5’ end, or both ends. Optionally, ‘3only’ or ‘5only’ will return only the 3’ or 5’ extended sequence
--extend-by
Used in conjunction with --extend-at, the number of nucleotides to extend by
--extend-with
Optional. Used in conjunction with --extend-at and --extend-by. Instead of extending by the genomic sequence, extend by this string repeated n times, where n is --extend-by
--masker
Masker type to use: dust, dustmasker, softmask or none
--fold-at
Fold the fasta sequence every n bases
--naming-attribute
Use this attribute to name the fasta entries
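As an illustration of what --fold-at does to the output, the sketch below wraps a sequence every n bases. The helper name `fold` is hypothetical and the script's own implementation may differ.

```python
def fold(sequence, width):
    """Break a sequence into lines of at most `width` bases."""
    return "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )


folded = fold("ACGTACGTAC", 4)
print(folded)
# ACGT
# ACGT
# AC
```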
Command line options¶
usage: gff2fasta [-h] [--is-gtf] [-g GENOME_FILE] [-m] [-e FEATURE] [-f gff]
[--remove-masked-regions] [--min-interval-length MIN_LENGTH]
[--max-length MAX_LENGTH]
[--extend-at {none,3,5,both,3only,5only}]
[--header-attributes] [--extend-by EXTEND_BY]
[--extend-with EXTEND_WITH]
[--masker {dust,dustmasker,softmask,none}]
[--fold-at FOLD_AT] [--fasta-name-attribute NAMING_ATTRIBUTE]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]