gff2fasta.py - output sequences from genomic features
- Tags
Genomics Intervals Sequences GFF Fasta Transformation
Purpose
This script outputs the genomic sequences for intervals within a GFF- or GTF-formatted file.
The output can optionally be masked and filtered.
Usage
If you want to convert a features.gff file with interval information
into a fasta file containing the sequence of each interval, use this
script as follows:
python gff2fasta.py --genome-file=hg19 < features.gff > features.fasta
The input can also be a gtf formatted file. In that case, use the
--is-gtf option:
python gff2fasta.py --genome-file=hg19 --is-gtf < features.gtf > features.fasta
If you want to add a polyA tail onto each transcript you can use the extend options:
python gff2fasta.py --genome-file=hg19 --is-gtf --extend-at=3 --extend-by=125 --extend-with=A < features.gtf > features.fasta
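As a rough illustration of what the extend options describe, the sketch below appends a repeated filler string to a sequence that is already oriented 5' to 3'. The function name extend_sequence and its arguments are hypothetical and only mirror the option names; this is not the script's actual implementation, and strand handling is left out.

```python
def extend_sequence(seq, extend_at, extend_by, extend_with):
    """Extend a 5'->3' oriented sequence with a repeated filler string.

    Hypothetical helper: mirrors --extend-at, --extend-by and --extend-with,
    but is only a sketch of the described behaviour.
    """
    filler = extend_with * extend_by  # filler string repeated extend_by times
    if extend_at == "none":
        return seq
    if extend_at == "3":
        return seq + filler
    if extend_at == "5":
        return filler + seq
    if extend_at == "both":
        return filler + seq + filler
    raise ValueError("unsupported extend_at value: %r" % extend_at)


# Add a short polyA tail to the 3' end of a toy transcript.
print(extend_sequence("ACGT", "3", 5, "A"))
```

With --extend-by=125 and --extend-with=A, the same idea yields a tail of 125 adenines on each transcript.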
If you want to merge the sequences of adjacent features that share
attributes, use --merge-overlapping:
python gff2fasta.py --genome-file=hg19 --merge-overlapping < features.gff > features.fasta
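The general idea of merging can be sketched as follows. This is a minimal illustration that assumes half-open (start, end) coordinates on a single contig and strand; it ignores the attribute comparison the script performs, and merge_intervals is a hypothetical name rather than a function from the script.

```python
def merge_intervals(intervals):
    """Merge intervals that overlap or abut.

    Assumes half-open (start, end) tuples on one contig and strand.
    Attribute matching, which the script also considers, is not modelled.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: grow the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_intervals([(10, 20), (15, 30), (30, 40), (50, 60)]))
```

Each merged interval would then be written out as a single fasta entry instead of one entry per input feature.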
It is possible to filter the output by setting a minimum or maximum number
of nucleotides in the resulting fasta sequence with --min-interval-length or
--max-length respectively:
python gff2fasta.py --genome-file=hg19 --max-length=100 < features.gff > features.fasta
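The length filter amounts to a simple bounds check on each output sequence. In the sketch below the function name passes_length_filter is hypothetical, and treating both bounds as inclusive is an assumption; the script's exact boundary behaviour is not stated here.

```python
def passes_length_filter(seq, min_length=0, max_length=None):
    """Return True if the sequence length lies within the given bounds.

    Assumption: both bounds are inclusive. max_length=None means no upper limit.
    """
    n = len(seq)
    if n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True


sequences = {"short": "ACGT", "long": "ACGT" * 50}
kept = [name for name, seq in sequences.items()
        if passes_length_filter(seq, max_length=100)]
print(kept)
```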
You can also filter the output by feature name with the --feature
option:
python gff2fasta.py --genome-file=hg19 --feature=exon < features.gff > features.fasta
On the other hand, low-complexity regions can be masked with the --masker
option and a given gff formatted file:
python gff2fasta.py --genome-file=hg19 --masker=dust --maskregions-bed-file=intervals.gff < features.gff > features.fasta
where --masker can take the following values: dust, dustmasker,
and softmask.
Options
--is-gtf  Tells the script to expect a GTF-formatted file
--genome-file  Path to the fasta file of the genome build to use
--merge-overlapping  Merge features in the gtf/gff file that are adjacent and share attributes
--method=filter --filter-method  Filter on a gff feature such as exon or CDS
--maskregions-bed-file  Mask sequences in intervals in the gff file
--remove-masked-regions  Remove sequences in intervals in the gff file rather than masking them
--min-interval-length  Minimum output sequence length
--max-length  Maximum output sequence length
--extend-at  Extend the sequence at the 3', 5' or both ends. Optionally, '3only' or '5only' will return only the 3' or 5' extended sequence
--extend-by  Used in conjunction with --extend-at, the number of nucleotides to extend by
--extend-with  Optional. Used in conjunction with --extend-at and --extend-by. Instead of extending by the genomic sequence, extend by this string repeated n times, where n is --extend-by
--masker  Masker type to use: dust, dustmasker, softmask or none
--fold-at  Fold the fasta sequence every n bases
--naming-attribute  Use this attribute to name the fasta entries
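Folding, as described for --fold-at, means breaking each sequence into fixed-width lines. A minimal sketch, assuming the last line simply holds the remainder; the name fold_sequence is hypothetical and not taken from the script:

```python
def fold_sequence(seq, width):
    """Split a sequence into lines of at most `width` bases."""
    if width <= 0:
        raise ValueError("width must be positive")
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Fold a 10-base sequence every 4 bases.
print(fold_sequence("ACGTACGTAC", 4))
```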
Command line options
usage: gff2fasta [-h] [--is-gtf] [-g GENOME_FILE] [-m] [-e FEATURE] [-f gff]
[--remove-masked-regions] [--min-interval-length MIN_LENGTH]
[--max-length MAX_LENGTH]
[--extend-at {none,3,5,both,3only,5only}]
[--header-attributes] [--extend-by EXTEND_BY]
[--extend-with EXTEND_WITH]
[--masker {dust,dustmasker,softmask,none}]
[--fold-at FOLD_AT] [--fasta-name-attribute NAMING_ATTRIBUTE]
[--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
[--timeit-header] [--random-seed RANDOM_SEED] [-v LOGLEVEL]
[--log-config-filename LOG_CONFIG_FILENAME]
[--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
[-E STDERR] [-S STDOUT]