SequenceProperties.py - Computing metrics of nucleotide sequences
This module provides methods for extracting and reporting sequence properties of nucleotide sequences such as the composition, length, etc.
Each class implements the algorithm for computing one property and stores the latest result for output. Processing is therefore a two-step procedure: load a sequence into each counter, then output the stored results:
    from SequenceProperties import SequencePropertiesLength
    from SequenceProperties import SequencePropertiesNA

    counters = [SequencePropertiesLength(), SequencePropertiesNA()]

    # output column headers: concatenate the header lists of all counters
    headers = sum((c.getHeaders() for c in counters), [])
    print(" ".join(headers))

    # sequences: any iterable of sequence strings
    for sequence in sequences:
        # load sequence in each counter
        for c in counters:
            c.loadSequence(sequence)
        # output results
        print(" ".join(map(str, counters)))
This design makes it possible to compute multiple properties while iterating only once over an input file and to output a single, multi-column table.
Note
While useful and in working order, the design of the classes is cumbersome.
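The two-step pattern described above can be sketched without the module itself. The classes `ToyLength` and `ToyGC` and the helper `tabulate` below are hypothetical stand-ins that only mimic the interface used in the example (`loadSequence`, `getHeaders`, `str()`); they are not part of SequenceProperties.py.

```python
# Hypothetical stand-ins mimicking the counter interface; not the real classes.
class ToyLength(object):
    def __init__(self):
        self.length = 0

    def loadSequence(self, sequence, seqtype="na"):
        # step 1: compute and store the latest result
        self.length = len(sequence)

    def getHeaders(self):
        return ["length"]

    def __str__(self):
        # step 2: render the stored result as an output column
        return str(self.length)


class ToyGC(object):
    def __init__(self):
        self.ngc = 0

    def loadSequence(self, sequence, seqtype="na"):
        self.ngc = sum(1 for base in sequence.upper() if base in "GC")

    def getHeaders(self):
        return ["nGC"]

    def __str__(self):
        return str(self.ngc)


def tabulate(sequences, counters):
    """Return a header row followed by one row per sequence."""
    rows = [sum((c.getHeaders() for c in counters), [])]
    for sequence in sequences:
        for c in counters:
            c.loadSequence(sequence)
        rows.append([str(c) for c in counters])
    return rows


for row in tabulate(["ACGT", "GGCCA"], [ToyLength(), ToyGC()]):
    print("\t".join(row))
```

Because each counter keeps only its latest result, the input is read once and every counter contributes its own columns to the same row.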
Reference
-
class
SequenceProperties.
SequenceProperties
¶ Bases:
object
Base class.
This class is the base class for SequenceProperty objects. Derived classes need to overload most of its methods.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesSequence
¶ Bases:
SequenceProperties.SequenceProperties
Add properties: the actual sequence.
- sequence
The sequence
This class outputs the actual sequence supplied.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesHid
¶ Bases:
SequenceProperties.SequenceProperties
Add properties: a hash of sequence
- hid
Hash identifier of a sequence
The hash is computed using the md5 algorithm and the resulting byte sequence is then translated into printable characters.
-
loadSequence
(sequence, seqtype='na')¶ load hid sequence properties from a sequence.
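As a rough illustration of the hashing step described for `hid`, the sketch below computes an md5 digest and maps the raw bytes to printable characters. The function name `hash_identifier` is hypothetical, and base64 is assumed as the byte-to-text mapping; the module may use a different translation.

```python
import base64
import hashlib


def hash_identifier(sequence):
    """Return a printable identifier derived from the md5 digest of a sequence."""
    # md5 yields 16 raw bytes, most of which are not printable
    digest = hashlib.md5(sequence.encode("ascii")).digest()
    # ASSUMPTION: base64 is used here as the mapping to printable characters
    return base64.b64encode(digest).decode("ascii").rstrip("=")


print(hash_identifier("ACGTACGT"))
```

Identical sequences always yield the same identifier, which makes the value usable as a key for detecting duplicates.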
-
class
SequenceProperties.
SequencePropertiesLength
¶ Bases:
SequenceProperties.SequenceProperties
Add properties: sequence length and number of codons
- length
Sequence length
- ncodons
Length in codons
The number of codons is 0 for an amino-acid sequence.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesNA
(reference_usage=[])¶ Bases:
SequenceProperties.SequenceProperties
Add properties: nucleotide composition
- nUnk
Number of unknown residues
- nA, nC, nG, nT, nGC, nAT
Nucleotide counts
- pA, pC, pG, pT, pGC, pAT
Nucleotide frequencies
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
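The nucleotide counts and frequencies listed above could be derived roughly as follows. The function `nucleotide_composition` is a hypothetical helper, and the choice to compute frequencies relative to the total sequence length (including unknown residues) is an assumption, not something the documentation states.

```python
from collections import Counter


def nucleotide_composition(sequence):
    """Return counts (n*) and frequencies (p*) keyed like the documented fields."""
    counts = Counter(sequence.upper())
    result = {"n" + base: counts.get(base, 0) for base in "ACGT"}
    result["nGC"] = result["nG"] + result["nC"]
    result["nAT"] = result["nA"] + result["nT"]
    # everything that is not A, C, G or T counts as unknown
    result["nUnk"] = len(sequence) - result["nGC"] - result["nAT"]
    # ASSUMPTION: frequencies are relative to the full sequence length
    total = float(len(sequence))
    for key in ("A", "C", "G", "T", "GC", "AT"):
        result["p" + key] = result["n" + key] / total if total else 0.0
    return result


print(nucleotide_composition("AACGN"))
```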
-
class
SequenceProperties.
SequencePropertiesDN
(reference_usage=[])¶ Bases:
SequenceProperties.SequenceProperties
Add Properties : dinucleotide counts
- nAA, nAC, …
Dinucleotide counts
- mCountsOthers
Unknown dinucleotides
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesCpg
(reference_usage=[])¶ Bases:
SequenceProperties.SequencePropertiesNA, SequenceProperties.SequencePropertiesDN
Add Properties : CpG density and observed / expected.
- CpG_count
Number of CpG in sequence
- CpG_density
CpG density, number of CpG divided by 2 * sequence length
- CpG_ObsExp
Ratio of observed to expected number of CpG. The latter is calculated as the product of nC * nG. The ratio is normalized by the sequence length. Set to 0 if there is no C or G in the sequence.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
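One common reading of the observed/expected description above is (CpG count × sequence length) / (nC × nG). The sketch below follows that reading; the function name `cpg_obs_exp` is hypothetical and the exact normalization used by the class is an assumption.

```python
def cpg_obs_exp(sequence):
    """Return (CpG count, observed/expected ratio) for a nucleotide sequence."""
    seq = sequence.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    # "CG" occurrences cannot overlap, so str.count finds all of them
    n_cpg = seq.count("CG")
    if n_c == 0 or n_g == 0:
        # documented behaviour: ratio is 0 without any C or G
        return n_cpg, 0.0
    # ASSUMPTION: expected count is nC * nG / length
    return n_cpg, float(n_cpg) * len(seq) / (n_c * n_g)


print(cpg_obs_exp("ACGTCG"))
```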
-
class
SequenceProperties.
SequencePropertiesGaps
(gap_chars='xXnN', *args, **kwargs)¶ Bases:
SequenceProperties.SequenceProperties
Add Properties : number of gaps in a sequence
Gaps are identified by unknown characters ([XN]).
- ngaps
Number of gap characters in sequence
- nseq_regions
Number of ungapped regions
- ngap_regions
Number of gapped regions
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
addProperties
(other)¶ add properties.
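The three gap statistics can be sketched with a regular expression that treats every run of gap characters as one gapped region. The helper `gap_summary` is hypothetical; only the default `gap_chars` value is taken from the class signature above.

```python
import re


def gap_summary(sequence, gap_chars="xXnN"):
    """Count gap characters, gapped regions and ungapped regions."""
    gap_run = re.compile("[%s]+" % re.escape(gap_chars))
    gap_regions = gap_run.findall(sequence)
    # splitting on gap runs leaves the ungapped stretches; drop empty edges
    seq_regions = [part for part in gap_run.split(sequence) if part]
    return {
        "ngaps": sum(len(region) for region in gap_regions),
        "nseq_regions": len(seq_regions),
        "ngap_regions": len(gap_regions),
    }


print(gap_summary("ACNNGTXA"))
```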
-
class
SequenceProperties.
SequencePropertiesDegeneracy
¶ Bases:
SequenceProperties.SequencePropertiesLength
Add properties : codon degeneracy
- nstops
Number of stop codons
- nsites1d
Number of non-degenerate sites
- nsites2d, nsites3d, nsites4d
Number of 2-fold, 3-fold, 4-fold degenerate sites
- ngc
Number of positions containing either G or C
- ngc3
Number of 3rd codon position containing G or C
- ngc3
Number of non-degenerate 3rd codon position containing G or C
- n2gc3, n3gc3, n4gc3
Number of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C
- pgc
Percentage of positions containing either G or C
- pgc3
Percentage of 3rd codon position containing G or C
- pgc3
Percentage of non-degenerate 3rd codon position containing G or C
- p2gc3, p3gc3, p4gc3
Percentage of 2-fold, 3-fold, 4-fold degenerate 3rd codon positions containing G or C
The degeneracies for amino acids are:
2: MW are non-degenerate.
9: EDKNQHCYF are 2-fold degenerate.
1: I is 3-fold degenerate.
5: VGATP are 4-fold degenerate.
3: RLS are 2-fold and 4-fold degenerate. Depending on the first two codon positions, the codons are counted as 2-fold or 4-fold degenerate codons.
This is encoded in the file Genomics.py.
The number of degenerate sites is computed across all codon positions.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
updateProperties
()¶ update fields from counts.
-
class
SequenceProperties.
SequencePropertiesAA
(reference_usage=[])¶ Bases:
SequenceProperties.SequenceProperties
Add Properties : amino acid composition of nucleotide sequence.
The codons in the nucleotide sequence are translated into amino acids before counting. The length of the nucleotide sequence must be a multiple of 3.
- nA, nC, nD, …
Amino acid counts.
- pA, pC, pD, …
Amino acid frequencies.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
getHeaders
()¶ Return list of data headers
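Translating codons before counting can be sketched as follows. The helper `amino_acid_counts` and the table constants are hypothetical; the standard genetic code is assumed, stop codons are reported as `*`, and codons containing unknown bases are reported as `X`.

```python
from collections import Counter

# Standard genetic code in the conventional TCAG ordering (an assumption:
# the module may support other translation tables).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_counts(sequence):
    """Translate a nucleotide sequence codon by codon and count amino acids."""
    seq = sequence.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return Counter(CODON_TABLE.get(codon, "X") for codon in codons)


print(amino_acid_counts("ATGGCCTAA"))
```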
-
class
SequenceProperties.
SequencePropertiesAminoAcids
(reference_usage=[])¶ Bases:
SequenceProperties.SequenceProperties
Add Properties : amino acid composition
- nA, nC, nD, …
Amino acid counts.
- pA, pC, pD, …
Amino acid frequencies.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesCodons
¶ Bases:
SequenceProperties.SequencePropertiesLength
Add Properties : codon frequencies
- nAAA, nAAC, …
Codon counts
- pAAA, pAAC, …
Codon frequencies
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesCodonUsage
¶ Bases:
SequenceProperties.SequencePropertiesCodons
Add properties : Codon usage
The codon frequency is the number of times a particular codon is used for a particular amino acid, divided by the number of times that amino acid appears in the sequence. A frequency of 1.0 means that this particular codon is always used to encode its amino acid, while a frequency of 0.5 means it is used 50% of the time.
- rAAA, rAAC, …
Codon frequencies.
-
addProperties
(other)¶ add properties.
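The per-amino-acid ratio described for this class can be sketched as below. The helper `codon_usage` is hypothetical, the standard genetic code is assumed, and incomplete trailing codons or codons with unknown bases are simply skipped, which may differ from the module's behaviour.

```python
from collections import Counter

# Standard genetic code in TCAG ordering (an assumption).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(sequence):
    """Return {codon: count(codon) / count(amino acid encoded by codon)}."""
    seq = sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codon_counts = Counter(c for c in codons if c in CODON_TABLE)
    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[CODON_TABLE[codon]] += n
    return {
        codon: float(n) / aa_counts[CODON_TABLE[codon]]
        for codon, n in codon_counts.items()
    }


print(codon_usage("CTGCTGTTA"))
```

In the example, leucine is encoded twice by CTG and once by TTA, so the ratios are 2/3 and 1/3.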
-
class
SequenceProperties.
SequencePropertiesCodonTranslator
¶ Bases:
SequenceProperties.SequencePropertiesCodonUsage
Add properties : codon sequence is translated into frequencies.
- tsequence
Comma-separated list of codon frequencies. The frequencies are given as percentages.
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesBias
(reference_usage=[], pseudocounts=0)¶ Bases:
SequenceProperties.SequencePropertiesCodons
Add properties : bias measures of codon sequence.
This class outputs metrics showing how biased the codon usage in a particular sequence is compared to a reference codon usage. The reference codon usage is given as a dictionary of codon frequencies and multiple dictionaries can be given to compute the bias against multiple codon usages.
- entropy
Entropy of the sequence.
- ml0, ml1, …
Message length of sequence compared to reference codon usages.
- relml0, relml1, …
Relative message length of sequence compared to reference codon usages. The relative message length is the message length divided by the number of codons.
- relentropy0, relentropy1, …
Relative entropy of sequence compared to reference codon usages. Also called conditional entropy or encoding cost.
- kl0, kl1, …
Kullback-Leibler Divergence (relative entropy) of sequence compared to reference codon usages.
- Parameters
reference_usage (list) – A list of codon frequency tables. The bias will be computed against each.
pseudocounts (int) – Pseudo-counts to add
-
getMessageLength
(usage)¶ return message length of a sequence in terms of its reference usage.
-
getEntropy
(usage=None)¶ return entropy of a source in terms of a reference usage. Also called conditional entropy or encoding cost.
Note that this method computes the sum over 20 entropies, one for each amino acid.
If usage is not given, the entropy of the sequence itself is calculated.
-
getKL
(usage)¶ return Kullback-Leibler Divergence (relative entropy) of sequences with respect to reference codon usage.
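To make the bias measures more concrete, the sketch below computes a message length and a Kullback-Leibler divergence from codon counts and one reference usage table. All function names are hypothetical. The use of base-2 logarithms and of a single distribution over all codons are assumptions; in particular, the per-amino-acid summation mentioned for `getEntropy` is not reproduced here.

```python
import math


def codon_frequencies(counts, pseudocounts=0):
    """Turn codon counts into frequencies, optionally adding pseudo-counts."""
    total = float(sum(counts.values()) + pseudocounts * len(counts))
    return {codon: (n + pseudocounts) / total for codon, n in counts.items()}


def message_length(counts, usage):
    """Bits needed to encode the observed codons under a reference usage."""
    # ASSUMPTION: log base 2; every observed codon has non-zero reference usage
    return -sum(n * math.log(usage[codon], 2)
                for codon, n in counts.items() if n > 0)


def kl_divergence(observed, usage):
    """Kullback-Leibler divergence of observed from reference frequencies."""
    return sum(p * math.log(p / usage[codon], 2)
               for codon, p in observed.items() if p > 0)


counts = {"AAA": 2, "AAG": 2}
reference = {"AAA": 0.25, "AAG": 0.75}
observed = codon_frequencies(counts)
print(message_length(counts, reference))
print(message_length(counts, reference) / sum(counts.values()))  # relative
print(kl_divergence(observed, reference))
```

A divergence of 0 means the sequence uses codons exactly as the reference does; larger values indicate stronger bias relative to that reference.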
-
class
SequenceProperties.
SequencePropertiesCounts
(alphabet)¶ Bases:
SequenceProperties.SequenceProperties
Add Properties : Residue counts against arbitrary alphabet
- nUnk
Number of unknown residues
- nA, nB, …
Character counts
- pA, pB, …
Character frequencies
- Parameters
alphabet (string) – List of characters in alphabet
-
addProperties
(other)¶ add properties.
-
loadSequence
(sequence, seqtype='na')¶ load sequence properties from a sequence.
-
class
SequenceProperties.
SequencePropertiesEntropy
(alphabet, pseudocounts=0)¶ Bases:
SequenceProperties.SequencePropertiesCounts
Add properties : Entropy of a sequence
- entropy
Entropy of the sequence
- Parameters
alphabet (string) – List of characters in alphabet
pseudocounts (int) – Pseudo-counts to add
-
addProperties
(other)¶ add properties.
-
getEntropy
(usage=None)¶ return entropy of a source in terms of a reference usage.
Also called conditional entropy or encoding cost.
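A minimal sketch of a sequence entropy over an arbitrary alphabet, with optional pseudo-counts, is given below. The function `sequence_entropy` is hypothetical; base-2 logarithms and the rule that characters outside the alphabet are ignored are assumptions rather than documented behaviour.

```python
import math
from collections import Counter


def sequence_entropy(sequence, alphabet, pseudocounts=0):
    """Shannon entropy (in bits) of the character distribution of a sequence."""
    # ASSUMPTION: characters outside the alphabet do not contribute
    counts = Counter(char for char in sequence if char in alphabet)
    total = float(sum(counts[char] for char in alphabet)
                  + pseudocounts * len(alphabet))
    if total == 0:
        return 0.0
    entropy = 0.0
    for char in alphabet:
        p = (counts[char] + pseudocounts) / total
        if p > 0:
            entropy -= p * math.log(p, 2)
    return entropy


print(sequence_entropy("ACGT", "ACGT"))
print(sequence_entropy("AAAA", "ACGT", pseudocounts=1))
```

Pseudo-counts keep unseen characters from having zero probability, which raises the entropy of short or low-complexity sequences slightly.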