SequenceProperties.py - Computing metrics of nucleotide sequences

This module provides methods for extracting and reporting properties of nucleotide sequences, such as composition and length.

Each class implements the algorithm for computing a set of properties and stores the most recent result for output. Thus, processing is a two-step procedure:

from SequenceProperties import SequencePropertiesLength
from SequenceProperties import SequencePropertiesNA

counters = [SequencePropertiesLength(), SequencePropertiesNA()]

# output column headers; sum() needs [] as start value to concatenate lists
headers = sum([c.getHeaders() for c in counters], [])
print("\t".join(headers))

# sequences: any iterable of sequence strings
for sequence in sequences:
    # load sequence in each counter
    for c in counters:
        c.loadSequence(sequence)
    # output results
    print("\t".join(map(str, counters)))

This design makes it possible to compute multiple properties in a single pass over an input file and to output a single multi-column table.
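The two-step pattern above can be illustrated with a stand-alone sketch. The classes `LengthCounter` and `GCCounter` and the helper `make_table` are hypothetical stand-ins, not part of this module; they only mimic the `loadSequence`, `getHeaders` and string-conversion behaviour that the example relies on.

```python
class LengthCounter(object):
    """Hypothetical stand-in for a property class (not the module's code)."""

    def __init__(self):
        self.length = 0

    def loadSequence(self, sequence, seqtype="na"):
        # step 1: compute and store the latest result
        self.length = len(sequence)

    def getHeaders(self):
        return ["length"]

    def __str__(self):
        # step 2: report the stored result
        return str(self.length)


class GCCounter(object):
    """Hypothetical stand-in counting G and C residues."""

    def __init__(self):
        self.ngc = 0

    def loadSequence(self, sequence, seqtype="na"):
        s = sequence.upper()
        self.ngc = s.count("G") + s.count("C")

    def getHeaders(self):
        return ["nGC"]

    def __str__(self):
        return str(self.ngc)


def make_table(sequences, counters):
    """Build a tab-separated table, one row per sequence."""
    rows = ["\t".join(sum([c.getHeaders() for c in counters], []))]
    for sequence in sequences:
        for c in counters:
            c.loadSequence(sequence)
        rows.append("\t".join(map(str, counters)))
    return rows
```

Because every counter keeps only its latest result, each row has to be written before the next sequence is loaded.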

Note

While useful and in working order, the design of the classes is cumbersome.

Reference

class SequenceProperties.SequenceProperties

Bases: object

Base class.

This class is the base class for SequenceProperty objects. Derived classes need to overload most of its methods.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesSequence

Bases: SequenceProperties.SequenceProperties

Add properties: the actual sequence.

sequence

The sequence

This class outputs the actual sequence supplied.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesHid

Bases: SequenceProperties.SequenceProperties

Add properties: a hash of sequence

hid

Hash identifier of a sequence

The hash is computed using the md5 algorithm and the resulting byte sequence is then translated into printable characters.
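A minimal sketch of such an identifier follows. It assumes base64 as the translation into printable characters; the module may use a different translation, and the function name `hash_identifier` is hypothetical.

```python
import base64
import hashlib


def hash_identifier(sequence):
    """Return a printable md5-based identifier for a sequence.

    Assumption: base64 is used to make the digest printable; the
    module's own translation may differ.
    """
    digest = hashlib.md5(sequence.encode("ascii")).digest()  # 16 raw bytes
    return base64.b64encode(digest).decode("ascii")
```

Identical sequences always map to the same identifier, which makes the value usable for detecting duplicate sequences.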

loadSequence(sequence, seqtype='na')

load hid sequence properties from a sequence.

class SequenceProperties.SequencePropertiesLength

Bases: SequenceProperties.SequenceProperties

Add properties: sequence length and number of codons

length

Sequence length

ncodons

Length in codons

The number of codons is 0 for an amino-acid sequence.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesNA(reference_usage=[])

Bases: SequenceProperties.SequenceProperties

Add properties: nucleotide composition

nUnk

Number of unknown residues

nA, nC, nG, nT, nGC, nAT

Nucleotide counts

pA, pC, pG, pT, pGC, pAT

Nucleotide frequencies
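The attributes above could be computed roughly as follows. This is an illustrative sketch, not the module's implementation; in particular, it assumes that frequencies are taken relative to the known residues only, and the function name `nucleotide_composition` is hypothetical.

```python
def nucleotide_composition(sequence):
    """Count A, C, G, T and unknown residues, then derive frequencies."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    known = sum(counts.values())
    result = {"n" + base: n for base, n in counts.items()}
    result["nUnk"] = len(seq) - known
    result["nGC"] = counts["G"] + counts["C"]
    result["nAT"] = counts["A"] + counts["T"]
    # Assumption: frequencies are relative to the known residues only
    for key in ("A", "C", "G", "T", "GC", "AT"):
        result["p" + key] = result["n" + key] / float(known) if known else 0.0
    return result
```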

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesDN(reference_usage=[])

Bases: SequenceProperties.SequenceProperties

Add Properties : dinucleotide counts

nAA, nAC, …

Dinucleotide counts

mCountsOthers

Unknown dinucleotides
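A sketch of dinucleotide counting is shown below. It assumes overlapping windows that advance one base at a time, which may or may not match the module; `dinucleotide_counts` is a hypothetical name.

```python
from itertools import product


def dinucleotide_counts(sequence):
    """Return (counts for the 16 known dinucleotides, count of all others)."""
    seq = sequence.upper()
    counts = dict.fromkeys(("".join(p) for p in product("ACGT", repeat=2)), 0)
    others = 0
    # Assumption: overlapping windows, stepping one base at a time
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
        else:
            others += 1
    return counts, others
```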

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCpg(reference_usage=[])

Bases: SequenceProperties.SequencePropertiesNA, SequenceProperties.SequencePropertiesDN

Add Properties : CpG density and observed / expected.

CpG_count

Number of CpG in sequence

CpG_density

CpG density, number of CpG divided by 2 * sequence length

CpG_ObsExp

Ratio of observed to expected number of CpG. The expected number is calculated as the product nC * nG. The ratio is normalized by the sequence length and is set to 0 if there is no C or G in the sequence.
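The observed/expected ratio can be sketched as below. The sketch assumes that "normalized by the sequence length" means multiplying the ratio by the length, as in the commonly used CpG observed/expected formula; the module's exact arithmetic may differ, and `cpg_obs_exp` is a hypothetical name.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio: CpG / (nC * nG) * length."""
    seq = sequence.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself
    # Return 0 when the expected count would be zero (avoids division by zero)
    if n_c == 0 or n_g == 0:
        return 0.0
    return float(cpg_count) / (n_c * n_g) * len(seq)
```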

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesGaps(gap_chars='xXnN', *args, **kwargs)

Bases: SequenceProperties.SequenceProperties

Add Properties : number of gaps in a sequence

Gaps are identified by the characters in gap_chars (by default X and N, in upper or lower case).

ngaps

Number of gap characters in sequence

nseq_regions

Number of ungapped regions

ngap_regions

Number of gapped regions
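The three gap attributes can be derived by collapsing the sequence into alternating runs of gap and non-gap characters. The following is a sketch under that assumption, not the module's code; `gap_properties` is a hypothetical name.

```python
from itertools import groupby


def gap_properties(sequence, gap_chars="xXnN"):
    """Return (ngaps, nseq_regions, ngap_regions) for a sequence."""
    ngaps = sum(1 for c in sequence if c in gap_chars)
    # Collapse the sequence into runs of gap / non-gap characters
    runs = [is_gap for is_gap, _ in groupby(sequence, key=lambda c: c in gap_chars)]
    ngap_regions = sum(1 for is_gap in runs if is_gap)
    nseq_regions = len(runs) - ngap_regions
    return ngaps, nseq_regions, ngap_regions
```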

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

addProperties(other)

add properties.

class SequenceProperties.SequencePropertiesDegeneracy

Bases: SequenceProperties.SequencePropertiesLength

Add properties : codon degeneracy

nstops

Number of stop codons

nsites1d

Number of non-degenerate sites

nsites2d, nsites3d, nsites4d

Number of 2-fold, 3-fold, and 4-fold degenerate sites

ngc

Number of positions containing either G or C

ngc3

Number of 3rd codon positions containing G or C

ngc3

Number of non-degenerate 3rd codon positions containing G or C

n2gc3, n3gc3, n4gc3

Number of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C

pgc

Percentage of positions containing either G or C

pgc3

Percentage of 3rd codon positions containing G or C

pgc3

Percentage of non-degenerate 3rd codon positions containing G or C

p2gc3, p3gc3, p4gc3

Percentage of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C

The degeneracies for amino acids are:

2: MW are non-degenerate.
9: EDKNQHCYF are 2-fold degenerate.
1: I is 3-fold degenerate.
5: VGATP are 4-fold degenerate.
3: RLS are 2-fold and 4-fold degenerate.
   Depending on the first two nucleotides of the codon, their codons
   are counted as 2-fold or 4-fold degenerate. This is encoded
   in the file Genomics.py.

The number of degenerate sites is computed across all codon positions.
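As a small illustration of the ngc and ngc3 attributes described above, the sketch below counts G/C residues over all positions and over 3rd codon positions only. It does not classify sites by degeneracy, it assumes that an incomplete trailing codon is ignored, and `gc_counts` is a hypothetical name.

```python
def gc_counts(sequence):
    """Return (ngc, ngc3): G/C at all positions and at 3rd codon positions."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # assumption: drop an incomplete last codon
    ngc = sum(1 for base in seq[:usable] if base in "GC")
    ngc3 = sum(1 for base in seq[2:usable:3] if base in "GC")
    return ngc, ngc3
```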

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

updateProperties()

update fields from counts.

class SequenceProperties.SequencePropertiesAA(reference_usage=[])

Bases: SequenceProperties.SequenceProperties

Add Properties : amino acid composition of nucleotide sequence.

The codons in the nucleotide sequence are translated into amino acids before counting. The length of the nucleotide sequence must be a multiple of 3.
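The translate-then-count step might look like the sketch below. It uses the standard genetic code and assumes that codons containing unknown bases are reported as "X" and that a length that is not a multiple of 3 raises an error; the module may handle both cases differently. All names are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_counts(sequence):
    """Translate codons and count the resulting amino acids."""
    seq = sequence.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Assumption: codons with unknown bases are reported as "X"
    return Counter(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3))
```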

nA, nC, nD, …

Amino acid counts.

pA, pC, pD, …

Amino acid frequencies.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

getHeaders()

Return list of data headers

class SequenceProperties.SequencePropertiesAminoAcids(reference_usage=[])

Bases: SequenceProperties.SequenceProperties

Add Properties : amino acid composition

nA, nC, nD, …

Amino acid counts.

pA, pC, pD, …

Amino acid frequencies.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCodons

Bases: SequenceProperties.SequencePropertiesLength

Add Properties : codon frequencies

nAAA, nAAC, …

Codon counts

pAAA, pAAC, …

Codon frequencies

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCodonUsage

Bases: SequenceProperties.SequencePropertiesCodons

Add properties : Codon usage

The codon frequency is the number of times a particular codon is used for a particular amino acid, divided by the number of times that amino acid appears in the sequence. A frequency of 1.0 means that this codon is always used to encode its amino acid, while a frequency of 0.5 means it is used 50% of the time.
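The ratio described above can be sketched as follows, using the standard genetic code. The sketch assumes that an incomplete trailing codon is ignored and that codons of amino acids absent from the sequence get a frequency of 0; `codon_usage` and the table names are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(sequence):
    """Return {"rXXX": codon count / count of the encoded amino acid}."""
    seq = sequence.upper()
    codons = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    aa_totals = Counter()
    for codon, n in codons.items():
        if codon in CODON_TABLE:
            aa_totals[CODON_TABLE[codon]] += n
    return {
        "r" + codon: codons[codon] / float(aa_totals[aa]) if aa_totals[aa] else 0.0
        for codon, aa in CODON_TABLE.items()
    }
```

For example, in the sequence GCTGCTGCC alanine is encoded twice by GCT and once by GCC, giving frequencies of 2/3 and 1/3.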

rAAA, rAAC, …

Codon frequencies.

addProperties(other)

add properties.

class SequenceProperties.SequencePropertiesCodonTranslator

Bases: SequenceProperties.SequencePropertiesCodonUsage

Add properties : codon sequence is translated into frequencies.

tsequence

Comma-separated list of codon frequencies. The frequencies are given as percentages.

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesBias(reference_usage=[], pseudocounts=0)

Bases: SequenceProperties.SequencePropertiesCodons

Add properties : bias measures of codon sequence.

This class outputs metrics showing how biased the codon usage in a particular sequence is compared to a reference codon usage. The reference codon usage is given as a dictionary of codon frequencies and multiple dictionaries can be given to compute the bias against multiple codon usages.

entropy

Entropy of the sequence.

ml0, ml1, …

Message length of sequence compared to reference codon usages.

relml0, relml1, …

Relative message length of sequence compared to reference codon usages. The relative message length is the message length divided by the number of codons.

relentropy0, relentropy1, …

Relative entropy of sequence compared to reference codon usages. Also called conditional entropy or encoding cost.

kl0, kl1, …

Kullback-Leibler Divergence (relative entropy) of sequence compared to reference codon usages.

Parameters
  • reference_usage (list) – A list of codon frequency tables. The bias will be computed against each.

  • pseudocounts (int) – Pseudo-counts to add

getMessageLength(usage)

return message length of a sequence in terms of its reference usage.

getEntropy(usage=None)

return entropy of a source in terms of a reference usage. Also called conditional entropy or encoding cost.

Note that here I compute the sum over 20 entropies, one for each amino acid.

If usage is not given, calculate the entropy of the sequence itself.

getKL(usage)

return Kullback-Leibler Divergence (relative entropy) of sequences with respect to reference codon usage.
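For orientation, the Kullback-Leibler divergence between observed and reference codon frequencies can be sketched as below. This is a simplified illustration and not the module's method: it treats all codons as one distribution rather than summing per amino acid, it uses base-2 logarithms, and it assumes the reference gives a non-zero frequency for every observed codon. Both function names are hypothetical.

```python
import math
from collections import Counter


def codon_frequencies(sequence):
    """Return observed codon frequencies; an incomplete last codon is ignored."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = float(sum(counts.values()))
    return {codon: n / total for codon, n in counts.items()}


def kl_divergence(observed, reference):
    """Sum of p * log2(p / q) over the codons observed in the sequence.

    Assumption: reference[codon] > 0 for every observed codon.
    """
    return sum(p * math.log(p / reference[codon], 2) for codon, p in observed.items())
```

The divergence is 0 when the sequence uses codons exactly as the reference does and grows as the usage departs from it.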

class SequenceProperties.SequencePropertiesCounts(alphabet)

Bases: SequenceProperties.SequenceProperties

Add Properties : Residue counts against an arbitrary alphabet

nUnk

Number of unknown residues

nA, nB, …

Character counts

pA, pB, …

Character frequencies

Parameters

alphabet (string) – List of characters in alphabet

addProperties(other)

add properties.

loadSequence(sequence, seqtype='na')

load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesEntropy(alphabet, pseudocounts=0)

Bases: SequenceProperties.SequencePropertiesCounts

Add properties : Entropy of a sequence

entropy

Entropy of the sequence

Parameters
  • alphabet (string) – List of characters in alphabet

  • pseudocounts (int) – Pseudo-counts to add

addProperties(other)

add properties.

getEntropy(usage=None)

return entropy of a source in terms of a reference usage.

Also called conditional entropy or encoding cost.
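A sketch of the plain entropy of a sequence over an alphabet, with optional pseudo-counts, is given below. It assumes base-2 logarithms and that characters outside the alphabet are ignored; the module's implementation may differ, and `sequence_entropy` is a hypothetical name.

```python
import math


def sequence_entropy(sequence, alphabet, pseudocounts=0):
    """Shannon entropy (in bits) of the character distribution of a sequence."""
    # Every character of the alphabet starts with the pseudo-count
    counts = {char: pseudocounts for char in alphabet}
    for char in sequence:
        if char in counts:  # assumption: characters outside the alphabet are ignored
            counts[char] += 1
    total = float(sum(counts.values()))
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log(p, 2)
    return entropy
```

With pseudo-counts greater than 0, characters that never occur still contribute, so a homopolymer no longer has an entropy of exactly 0.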