fasta2kmercontent.py

Tags

Genomics Sequences FASTA Summary

Purpose

This script reads a fasta file from stdin and counts the occurrences of every kmer (k-nucleotide) in each contig in the file. The output is a tab-delimited table of kmer counts:

     contig1  contig2  contig3  contig4
n1
n2
n3

where each row n is a kmer and each column is a fasta entry (contig).

The user specifies the kmer length to count. Note that the number of possible kmers grows as 4^k (256 for k = 4), so the longer the kmer, the longer the script will take to run and the larger the output table will be.

Note that the order of the output will not necessarily match the order of the input.
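The counting step described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the function name count_kmers is hypothetical, and the choices to upper-case the sequence and to skip windows containing characters other than A, C, G and T are assumptions.

```python
from itertools import product


def count_kmers(sequence, k):
    """Count overlapping kmers of length k in one sequence.

    Assumptions (not taken from the script): the sequence is upper-cased,
    and windows containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    # Start every possible kmer at zero so absent kmers still get a row.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGT", 2)
print(counts["AC"], counts["CG"], counts["GT"], counts["TA"])  # 2 2 2 1
```

Enumerating all 4^k kmers up front means kmers that never occur are still reported with a count of 0, which matches the zero rows in the example output.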

Usage

Example:

zcat in.fasta.gz | head::

 >NODE_1_length_120_cov_4.233333
 TCACGAGCACCGCTATTATCAGCAACTTTTAAGCGACTTTCTTGTTGAATCATTTCAATT
 GTCTCCTTTTAGTTTTATTAGATAATAACAGCTTCTTCCACAACTTCTACAAGACGGAAG
 CGTTTTGTAGCTGAAAGTGGGCGAGTTTCCATGATACGAAcgatATCGCC

 >NODE_3_length_51_cov_33.000000
 CGAGTTTCCATGATACGAAcgatATCGCCTTCTTTAGCAACGTTGTTTTCGTCATGTGCT
 TTATATTTTTTAGAATAGTTGATACGTTTACCATAGACTGG

zcat in.fasta.gz | python fasta2kmercontent.py \
                   --kmer-size 4 \
                   > tetranucleotide_counts.tsv

head tetranucleotide_counts.tsv::

  kmer NODE_228_length_74_cov_506.432434 NODE_167_length_57_cov_138.438599
  GTAC 0                                 0
  TGCT 0                                 0
  GTAA 2                                 0
  CGAA 1                                 1
  AAAT 1                                 0
  CGAC 0                                 0

In this example, the occurrences of each four-nucleotide combination are counted for each contig in in.fasta.gz.
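For downstream work, a table like this can be loaded with the Python standard library. The sketch below assumes the file is tab-delimited with a first header cell followed by one column per contig, as in the example output; the helper name read_kmer_table is hypothetical, and the sample rows are copied from the output shown above.

```python
import csv
import io

# A few rows from the example output, written with tab separators.
example = (
    "kmer\tNODE_228_length_74_cov_506.432434\tNODE_167_length_57_cov_138.438599\n"
    "GTAA\t2\t0\n"
    "CGAA\t1\t1\n"
)


def read_kmer_table(handle):
    """Return {contig: {kmer: count}} from a tab-delimited kmer table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    contigs = header[1:]  # first column holds the kmer itself
    table = {contig: {} for contig in contigs}
    for row in reader:
        kmer, values = row[0], row[1:]
        for contig, value in zip(contigs, values):
            table[contig][kmer] = int(value)
    return table


table = read_kmer_table(io.StringIO(example))
print(table["NODE_228_length_74_cov_506.432434"]["GTAA"])  # 2
```

To read a real file, pass an open file handle instead of the io.StringIO object. For a table of proportions, int(value) would need to become float(value).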

Alternative example:

zcat in.fasta.gz | python fasta2kmercontent.py \
                   --kmer-size 4 \
                   --output-proportion \
                   > tetranucleotide_proportions.tsv

In this example, for each contig in in.fasta.gz we return the proportion of each four-base combination out of the total tetranucleotide occurrences. --output-proportion overrides the count output.
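The proportion for a kmer is its count divided by the total number of kmer occurrences in the same contig. A minimal sketch of that arithmetic follows; the function name to_proportions is hypothetical, and returning 0.0 for a contig with no counted kmers is an assumption rather than the script's documented behaviour.

```python
def to_proportions(counts):
    """Convert {kmer: count} for one contig into {kmer: proportion}."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: a contig shorter than k yields all-zero proportions.
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


props = to_proportions({"AA": 3, "AC": 1, "CA": 0})
print(props["AA"], props["AC"], props["CA"])  # 0.75 0.25 0.0
```

Because each contig is normalised by its own total, proportions are comparable between contigs of different lengths, which raw counts are not.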

Options

Two options control the behaviour of fasta2kmercontent.py: --kmer-size and --output-proportion.

--kmer-size::

The kmer length to count in the input fasta file.

--output-proportion::

Output values are proportions rather than absolute counts.

Type:

python fasta2kmercontent.py --help

for command line help.

Command line options

usage: fasta2kmercontent [-h] [--version] [-k KMER] [-p]
                         [--timeit TIMEIT_FILE] [--timeit-name TIMEIT_NAME]
                         [--timeit-header] [--random-seed RANDOM_SEED]
                         [-v LOGLEVEL]
                         [--log-config-filename LOG_CONFIG_FILENAME]
                         [--tracing {function}] [-? ?] [-I STDIN] [-L STDLOG]
                         [-E STDERR] [-S STDOUT]