fasta2variants.py - create sequence variants from a set of sequences

Tags

Genomics Sequences Variants Protein FASTA Transformation

Purpose

This script reads a collection of sequences in FASTA format and outputs a table of possible variants. For each position in a protein sequence, it reports the number of variants.

If the input sequences are nucleotide coding sequences (CDS), a weight is output for each variant, indicating the number of times that variant can arise from single nucleotide changes.
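To make the idea of a weight concrete, here is a minimal sketch of how single nucleotide changes in one codon could be tallied by the amino acid they produce. It is an illustration only, not the script's actual implementation: the function name codon_variant_weights is hypothetical, the standard genetic code is assumed, and synonymous changes and stop codons (*) are kept in the tally, which may differ from what fasta2variants.py reports.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the script may support other codes; this sketch does not).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_variant_weights(codon):
    """Hypothetical helper: tally the nine single-nucleotide changes of a
    codon by the amino acid each change encodes."""
    codon = codon.upper()
    weights = Counter()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            mutant = codon[:position] + base + codon[position + 1:]
            weights[CODON_TABLE[mutant]] += 1
    return weights
```

Under these assumptions, codon_variant_weights("ATG") counts isoleucine three times, leucine twice, and valine, threonine, lysine and arginine once each, so a Met-to-Ile variant would carry a weight of 3.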

Usage

Example:

python fasta2variants.py -I CCDS_nucleotide.current.fna.gz -L CDS.log -S CDS.output -c

This reads the CDS file given with -I, writes the log to CDS.log (-L) and the output table to CDS.output (-S), and, because of the -c option, counts variants based on single nucleotide changes.

Type:

python fasta2variants.py --help

for command line help.

Compressed (.gz) files and the common FASTA file extensions (.fasta, .fna) are accepted. If the -c option is specified and the input is not a CDS sequence, the script raises an error ("length of sequence '<input_file>' is not a multiple of 3").
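The input handling described above can be sketched as follows. This is a hedged illustration rather than the script's own code: open_fasta, iter_fasta and check_cds are hypothetical names, compression is detected from the .gz suffix alone, and the error text only mimics the message quoted above.

```python
import gzip


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the identifier.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_cds(name, sequence):
    """Reject sequences that cannot be split into whole codons."""
    if len(sequence) % 3 != 0:
        raise ValueError(
            "length of sequence '%s' is not a multiple of 3" % name
        )
```

A caller would typically loop over iter_fasta(open_fasta(path)) and call check_cds on each record before counting codon-level variants.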

Command line options

usage: fasta2variants [-h] [--version] [-c] [--timeit TIMEIT_FILE]
                      [--timeit-name TIMEIT_NAME] [--timeit-header]
                      [--random-seed RANDOM_SEED] [-v LOGLEVEL]
                      [--log-config-filename LOG_CONFIG_FILENAME]
                      [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                      [-E STDERR] [-S STDOUT]