SequenceProperties.py - Computing metrics of nucleotide sequences¶

This module provides classes for computing and reporting properties of nucleotide sequences, such as composition and length.

Each class implements the algorithm for one property and stores the latest result for output. Processing is therefore a two-step procedure: load a sequence into each counter, then output the stored results:

from SequenceProperties import SequencePropertiesLength
from SequenceProperties import SequencePropertiesNA

counters = [SequencePropertiesLength(), SequencePropertiesNA()]

# output column headers (flatten the per-counter header lists)
headers = sum([c.getHeaders() for c in counters], [])
print("\t".join(headers))

# sequences is any iterable of nucleotide sequence strings
for sequence in sequences:
    # load sequence in each counter
    for c in counters:
        c.loadSequence(sequence)
    # output results, one tab-separated row per sequence
    print("\t".join(map(str, counters)))

This design makes it possible to compute multiple properties while iterating only once over an input file, and to output a single, multi-column table.
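
The two-step pattern can be illustrated without the module itself. The sketch below uses two hypothetical stand-in classes, `LengthCounter` and `GCCounter` (neither is part of SequenceProperties), that mimic the interface described above: `loadSequence()` stores the latest result, `getHeaders()` names the columns, and `str()` renders the stored values. The helper `make_table` is likewise an assumption made for illustration.

```python
class LengthCounter(object):
    """Hypothetical stand-in for a property class: sequence length."""

    def __init__(self):
        self.length = 0

    def loadSequence(self, sequence, seqtype="na"):
        # step 1: compute and store the latest result
        self.length = len(sequence)

    def getHeaders(self):
        return ["length"]

    def __str__(self):
        # step 2: report the stored result
        return "%i" % self.length


class GCCounter(object):
    """Hypothetical stand-in: number of G or C residues."""

    def __init__(self):
        self.ngc = 0

    def loadSequence(self, sequence, seqtype="na"):
        self.ngc = sum(1 for base in sequence.upper() if base in "GC")

    def getHeaders(self):
        return ["nGC"]

    def __str__(self):
        return "%i" % self.ngc


def make_table(sequences, counters):
    """Return table rows (header first) from a single pass over sequences."""
    rows = ["\t".join(sum([c.getHeaders() for c in counters], []))]
    for sequence in sequences:
        for c in counters:
            c.loadSequence(sequence)
        rows.append("\t".join(map(str, counters)))
    return rows


rows = make_table(["ACGT", "GGCCA"], [LengthCounter(), GCCounter()])
print("\n".join(rows))
```

Each sequence is visited once, and adding a further column only requires appending another counter to the list.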

Note

While useful and in working order, the design of the classes is cumbersome.

Reference¶

class SequenceProperties.SequenceProperties¶

Bases: object

Base class.

This class is the base class for SequenceProperty objects. Derived classes need to overload most of its methods.

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesSequence¶

Bases: SequenceProperties.SequenceProperties

Add properties: the actual sequence.

sequence: The sequence

This class outputs the actual sequence supplied.

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesHid¶

Bases: SequenceProperties.SequenceProperties

Add properties: a hash of sequence

hid: Hash identifier of a sequence

The hash is computed using the md5 algorithm and the resulting byte sequence is then translated into printable characters.
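
A minimal sketch of such a hash identifier, using only the standard library. The function name `hash_identifier` and the use of base64 to obtain printable characters are assumptions; the module may translate the md5 bytes differently.

```python
import base64
import hashlib


def hash_identifier(sequence):
    """Return a printable identifier derived from the md5 digest of a sequence."""
    digest = hashlib.md5(sequence.encode("ascii")).digest()  # 16 raw bytes
    # Assumption: base64 makes the bytes printable; the module may use
    # another translation. Trailing '=' padding is dropped.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


hid = hash_identifier("ACGTACGT")
print(hid)
```

Identical sequences always map to the same identifier, which makes the value usable as a key for detecting duplicates.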

loadSequence(sequence, seqtype='na')¶: load hid sequence properties from a sequence.

class SequenceProperties.SequencePropertiesLength¶

Bases: SequenceProperties.SequenceProperties

Add properties: sequence length and number of codons

length: Sequence length
ncodons: Length in codons

The number of codons is 0 for an amino-acid sequence.

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesNA(reference_usage=[])¶

Bases: SequenceProperties.SequenceProperties

Add properties: nucleotide composition

nUnk: Number of unknown residues
nA, nC, nG, nT, nGC, nAT: Nucleotide counts
pA, pC, pG, pT, pGC, pAT: Nucleotide frequencies
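
The columns above could be derived as in the following sketch. The function `nucleotide_composition` is hypothetical, and it assumes that frequencies are computed relative to the known residues only, which may differ from the class's behaviour.

```python
from collections import Counter


def nucleotide_composition(sequence):
    """Return counts and frequencies of A, C, G, T for a sequence."""
    counts = Counter(sequence.upper())
    result = {"n" + base: counts.get(base, 0) for base in "ACGT"}
    known = sum(result.values())
    # everything that is not A, C, G or T counts as unknown
    result["nUnk"] = len(sequence) - known
    result["nGC"] = result["nG"] + result["nC"]
    result["nAT"] = result["nA"] + result["nT"]
    # Assumption: frequencies are relative to known residues only.
    for key in ("A", "C", "G", "T", "GC", "AT"):
        result["p" + key] = float(result["n" + key]) / known if known else 0.0
    return result


comp = nucleotide_composition("ACGGTN")
print(comp["nGC"], comp["nUnk"], comp["pGC"])
```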

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesDN(reference_usage=[])¶

Bases: SequenceProperties.SequenceProperties

Add Properties : dinucleotide counts

nAA, nAC, …: Dinucleotide counts
mCountsOthers: Unknown dinucleotides
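
As an illustration, dinucleotides can be counted with a sliding window. The sketch assumes overlapping windows (step 1) and treats any pair containing a character outside ACGT as unknown; both are assumptions, and `dinucleotide_counts` is a hypothetical helper.

```python
from itertools import product


def dinucleotide_counts(sequence):
    """Count overlapping dinucleotides; pairs outside ACGT go to 'others'."""
    counts = {"".join(pair): 0 for pair in product("ACGT", repeat=2)}
    others = 0
    sequence = sequence.upper()
    # Assumption: sliding window of width 2 and step 1 (overlapping pairs).
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
        else:
            others += 1
    return counts, others


counts, others = dinucleotide_counts("ACGCGN")
print(counts["CG"], others)
```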

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCpg(reference_usage=[])¶

Bases: SequenceProperties.SequencePropertiesNA, SequenceProperties.SequencePropertiesDN

Add Properties : CpG density and observed / expected.

CpG_count: Number of CpG in sequence
CpG_density: CpG density, number of CpG divided by 2 * sequence length
CpG_ObsExp: Ratio of observed to expected number of CpG. The latter is calculated as the product of nC * nG. The ratio is normalized by the sequence length. Set to 0 if no C or G in sequence.
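
The observed/expected ratio described above can be sketched as follows. The formula used here, (number of CpG * length) / (nC * nG), is the common formulation and matches the description, but the exact normalization in the class is an assumption; `cpg_obs_exp` is a hypothetical helper.

```python
def cpg_obs_exp(sequence):
    """Return the observed/expected CpG ratio of a nucleotide sequence."""
    sequence = sequence.upper()
    n_c = sequence.count("C")
    n_g = sequence.count("G")
    n_cpg = sequence.count("CG")  # "CG" cannot overlap itself
    if n_c == 0 or n_g == 0:
        # no C or no G: the expected count is zero, report 0
        return 0.0
    # Assumption: observed count scaled by length over the product nC * nG.
    return float(n_cpg) * len(sequence) / (n_c * n_g)


ratio = cpg_obs_exp("ACGCGT")
print(ratio)
```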

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesGaps(gap_chars='xXnN', *args, **kwargs)¶

Bases: SequenceProperties.SequenceProperties

Add Properties : number of gaps in a sequence

Gaps are identified by unknown characters ([XN])

ngaps: Number of gap characters in sequence
nseq_regions: Number of ungapped regions
ngap_regions: Number of gapped regions
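
One way to obtain these three numbers is to split the sequence into runs of gap and non-gap characters. The sketch below is a hypothetical helper (`gap_statistics`), not the class's implementation; it reuses the default `gap_chars` value from the signature above.

```python
from itertools import groupby


def gap_statistics(sequence, gap_chars="xXnN"):
    """Return (ngaps, nseq_regions, ngap_regions) for a sequence."""
    ngaps = 0
    nseq_regions = 0
    ngap_regions = 0
    # groupby yields maximal runs of consecutive gap / non-gap characters
    for is_gap, run in groupby(sequence, key=lambda char: char in gap_chars):
        if is_gap:
            ngap_regions += 1
            ngaps += sum(1 for _ in run)
        else:
            nseq_regions += 1
    return ngaps, nseq_regions, ngap_regions


stats = gap_statistics("ACNNNGTXA")
print(stats)
```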

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

addProperties(other)¶: add properties.

class SequenceProperties.SequencePropertiesDegeneracy¶

Bases: SequenceProperties.SequencePropertiesLength

Add properties : codon degeneracy

nstops: Number of stop codons
nsites1d: Number of non-degenerate sites
nsites2d, nsites3d, nsites4d: Number of 2-fold, 3-fold, 4-fold degenerate sites
ngc: Number of positions containing either G or C
ngc3: Number of 3rd codon position containing G or C
ngc3: Number of non-degenerate 3rd codon position containing G or C
n2gc3, n3gc3, n4gc3: Number of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C
pgc: Percentage of positions containing either G or C
pgc3: Percentage of 3rd codon position containing G or C
pgc3: Percentage of non-degenerate 3rd codon position containing G or C
p2gc3, p3gc3, p4gc3: Percentage of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C

The degeneracies for amino acids are:

2: MW are non-degenerate.
9: EDKNQHCYF are 2-fold degenerate.
1: I is 3-fold degenerate.
5: VGATP are 4-fold degenerate.
3: RLS are 2-fold and 4-fold degenerate.
   Depending on the first two bases of the codon, the codon is
   counted as a 2-fold or a 4-fold degenerate codon. This is
   encoded in the file Genomics.py.

The number of degenerate sites is computed across all codon positions.
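
The grouping of amino acids listed above can be checked against the standard genetic code by counting the codons per amino acid. The sketch below is independent of Genomics.py and does not reproduce its encoding; the names `CODON_TABLE` and `groups` are introduced for illustration only.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# number of synonymous codons per amino acid, stop codons excluded
codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# group amino acids by their number of synonymous codons
groups = defaultdict(set)
for aa, n in codons_per_aa.items():
    groups[n].add(aa)

for n in sorted(groups):
    print(n, "".join(sorted(groups[n])))
```

R, L and S each have six codons: a block of four that share the first two bases plus a block of two, which is why their codons are counted as either 2-fold or 4-fold degenerate.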

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

updateProperties()¶: update fields from counts.

class SequenceProperties.SequencePropertiesAA(reference_usage=[])¶

Bases: SequenceProperties.SequenceProperties

Add Properties : amino acid composition of nucleotide sequence.

The codons in the nucleotide sequence are translated into amino acids before counting. The length of the nucleotide sequence must be a multiple of 3.

nA, nC, nD, …: Amino acid counts.
pA, pC, pD, …: Amino acid frequencies.
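
A minimal sketch of translate-then-count, assuming the standard genetic code. The helper `amino_acid_composition` is hypothetical, and mapping codons with unknown bases to "X" is an assumption rather than documented behaviour.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def amino_acid_composition(sequence):
    """Translate a nucleotide sequence and return amino acid counts and frequencies."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    # Assumption: codons containing unknown bases are translated to "X".
    counts = Counter(CODON_TABLE.get(codon, "X") for codon in codons)
    total = float(len(codons))
    freqs = {aa: n / total for aa, n in counts.items()}
    return counts, freqs


counts, freqs = amino_acid_composition("ATGGCCGCCTGG")
print(counts["A"], freqs["A"])
```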

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

getHeaders()¶: Return list of data headers

class SequenceProperties.SequencePropertiesAminoAcids(reference_usage=[])¶

Bases: SequenceProperties.SequenceProperties

Add Properties : amino acid composition

nA, nC, nD, …: Amino acid counts.
pA, pC, pD, …: Amino acid frequencies.

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCodons¶

Bases: SequenceProperties.SequencePropertiesLength

Add Properties : codon frequencies

nAAA, nAAC, …: Codon counts
pAAA, pAAC, …: Codon frequencies

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesCodonUsage¶

Bases: SequenceProperties.SequencePropertiesCodons

Add properties : Codon usage

The codon frequency is the number of times a particular codon is used for a particular amino acid, divided by the number of times that amino acid appears in the sequence. A frequency of 1.0 means that this particular codon is always used to encode its amino acid, while a frequency of 0.5 means it is used 50% of the time.

rAAA, rAAC, …: Codon frequencies.
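
The ratio can be sketched as below, assuming the standard genetic code. The helper `codon_usage` is hypothetical; skipping codons that are not in the table and ignoring an incomplete trailing codon are assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def codon_usage(sequence):
    """Return {"r" + codon: usage ratio} for the codons seen in a sequence."""
    sequence = sequence.upper()
    # Assumption: an incomplete trailing codon is ignored.
    end = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, end, 3)]
    # Assumption: codons with unknown bases are skipped.
    codon_counts = Counter(c for c in codons if c in CODON_TABLE)
    # how often each amino acid is encoded in the sequence
    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[CODON_TABLE[codon]] += n
    return {
        "r" + codon: float(n) / aa_counts[CODON_TABLE[codon]]
        for codon, n in codon_counts.items()
    }


usage = codon_usage("GCCGCCGCTATG")
print(usage["rGCC"], usage["rGCT"], usage["rATG"])
```

In this example alanine is encoded twice by GCC and once by GCT, so the ratios are 2/3 and 1/3, while ATG is the only codon for methionine and gets 1.0.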

addProperties(other)¶: add properties.

class SequenceProperties.SequencePropertiesCodonTranslator¶

Bases: SequenceProperties.SequencePropertiesCodonUsage

Add properties : codon sequence is translated into frequencies.

tsequence: comma separated list of codon frequencies. The frequencies are in percentages.

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesBias(reference_usage=[], pseudocounts=0)¶

Bases: SequenceProperties.SequencePropertiesCodons

Add properties : bias measures of codon sequence.

This class outputs metrics showing how biased the codon usage in a particular sequence is compared to a reference codon usage. The reference codon usage is given as a dictionary of codon frequencies and multiple dictionaries can be given to compute the bias against multiple codon usages.

entropy: Entropy of the sequence.
ml0, ml1, …: Message length of sequence compared to reference codon usages.
relml0, relml1, …: Relative message length of sequence compared to reference codon usages. The relative message length is the message length divided by the number of codons.
relentropy0, relentropy1, …: Relative entropy of sequence compared to reference codon usages. Also called conditional entropy or encoding cost.
kl0, kl1, …: Kullback-Leibler Divergence (relative entropy) of sequence compared to reference codon usages.

Parameters

reference_usage (list) – A list of codon frequency tables. The bias will be computed against each.
pseudocounts (int) – Pseudo-counts to add

getMessageLength(usage)¶: return message length of a sequence in terms of its reference usage.

getEntropy(usage=None)¶

return entropy of a source in terms of a reference usage. Also called conditional entropy or encoding cost.

Note that here I compute the sum over 20 entropies, one for each amino acid.

If usage is not given, calculate the entropy of the sequence itself.

getKL(usage)¶: return Kullback-Leibler Divergence (relative entropy) of sequences with respect to reference codon usage.
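
For orientation, the textbook definitions of message length and Kullback-Leibler divergence can be sketched as follows. These are generic formulas in bits (log base 2) over all codons; whether the class uses the same log base, or conditions on the amino acid as the note on getEntropy suggests, is an assumption. The helper functions and the reference table `reference` are hypothetical.

```python
import math
from collections import Counter


def message_length(codons, usage):
    """Bits needed to encode the codons under a reference codon usage."""
    return -sum(math.log(usage[codon], 2) for codon in codons)


def kullback_leibler(codons, usage):
    """KL divergence (bits) of observed codon frequencies from a reference."""
    counts = Counter(codons)
    total = float(len(codons))
    return sum(
        (n / total) * math.log((n / total) / usage[codon], 2)
        for codon, n in counts.items()
    )


# Hypothetical reference usage over three alanine codons.
reference = {"GCC": 0.5, "GCT": 0.25, "GCA": 0.25}

matching = ["GCC", "GCC", "GCT", "GCA"]   # same frequencies as the reference
biased = ["GCC", "GCC", "GCC", "GCC"]     # uses a single codon only

ml = message_length(matching, reference)
relml = ml / len(matching)                # relative message length
kl_matching = kullback_leibler(matching, reference)
kl_biased = kullback_leibler(biased, reference)
print(ml, relml, kl_matching, kl_biased)
```

A sequence whose codon frequencies match the reference has a divergence of zero; the more its usage departs from the reference, the larger the divergence.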

class SequenceProperties.SequencePropertiesCounts(alphabet)¶

Bases: SequenceProperties.SequenceProperties

Add Properties : Residue counts against arbitrary alphabet

nUnk: Number of unknown residues
nA, nB, …: Character counts
pA, pB, …: Character frequencies

Parameters: alphabet (string) – List of characters in alphabet

addProperties(other)¶: add properties.

loadSequence(sequence, seqtype='na')¶: load sequence properties from a sequence.

class SequenceProperties.SequencePropertiesEntropy(alphabet, pseudocounts=0)¶

Bases: SequenceProperties.SequencePropertiesCounts

Add properties : Entropy of a sequence

entropy: Entropy of the sequence

Parameters

alphabet (string) – List of characters in alphabet
pseudocounts (int) – Pseudo-counts to add
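
A sketch of how the two parameters could enter the computation, using Shannon entropy in bits. The helper `sequence_entropy` is hypothetical; adding the pseudo-count to every character of the alphabet and ignoring characters outside the alphabet are assumptions.

```python
import math


def sequence_entropy(sequence, alphabet, pseudocounts=0):
    """Return the Shannon entropy (bits) of a sequence over an alphabet."""
    # Assumption: the pseudo-count is added once to every alphabet character.
    counts = {char: pseudocounts for char in alphabet}
    for char in sequence:
        if char in counts:  # characters outside the alphabet are ignored
            counts[char] += 1
    total = float(sum(counts.values()))
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log(p, 2)
    return entropy


uniform = sequence_entropy("ACGT", "ACGT")
single = sequence_entropy("AAAA", "ACGT")
smoothed = sequence_entropy("AAAA", "ACGT", pseudocounts=1)
print(uniform, single, smoothed)
```

With pseudo-counts, a sequence made of a single character no longer has an entropy of exactly zero, because the unseen characters receive a small non-zero frequency.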

addProperties(other)¶: add properties.

getEntropy(usage=None)¶

return entropy of a source in terms of a reference usage.

Also called conditional entropy or encoding cost.