Genomics.py - Tools for working with genomic data¶
- Tags
Python
Reference¶
-
Genomics.
parse_region_string
(s)¶ parse a genomic region string.
Returns tuple of contig, start, end. Missing values are None.
-
Genomics.
reverse_complement
(s)¶ reverse complement a sequence.
>>> complement("ACATACATACTA") 'TAGTATGTATGT'
- Returns
- Return type
string
-
Genomics.
GetHID
(sequence)¶ returns a hash value for a sequence.
The hash value is computed using md5 and converted into printable characters.
>>> GetHID("ACATACATACTA") 'trcPGx9VNT36XMlG0XvUBQ'
- Returns
- Return type
A hash value
-
Genomics.
String2Location
(s)¶ convert a string to location information.
>>> String2Location("chr1:12:15") ('chr1', '+', 12, 15)
- Returns
contig (string)
strand (string)
start (int)
end (int)
-
Genomics.
readContigSizes
(infile)¶ read sizes of contigs from file.
-
Genomics.
forceForwardCoordinates
(start, end, strand, length)¶ return forward coordinates.
If strand is negative, the coordinates in a and b will be converted. If they are on the positive strand, they will be returned as is.
-
Genomics.
CountGeneFeatures
(first_position, alignment, genomic_sequence=None, border_stop_codon=0, stop_codons=('TAG', 'TAA', 'TGA'))¶ calculate number of genomic features in a peptide to genome alignment.
Note that codons can be split, for example:
S 0 2 5 0 2 I 0 17541 3 0 2 S 1 2 5 0 2 I 0 27979 3 0 2 S 1 2
- Parameters
first_position (int) – Start of alignment on genome.
alignment (string) – Alignment in CIGAR format, for example from exonerate_.
genomic_sequence (string) – Genomic sequence for alignment
border_stop_codon (int) – Number of codons that are ignored at the edges of match regions. border_stop_codon should be divisible by three.
stop_codons (list) – List of stop codons
- Returns
nintrons (int) – Number of introns
nframeshifts (int) – Number of frameshifts in aligment.
ngaps (int) – Number of gaps in aligment.
nsplit (int) – Number of codons split by introns in alignment.
nstopcodons (int) – Number of stop codons in alignment.
disruptions (list) – List of disruptions in the prediction. Each disruption is a tuple of ( “stop|frameshift”, position in protein, position in cds, position on genomic sequence).
-
Genomics.
Alignment2String
(alignment)¶ convert a tuple alignment to an alignment string.
-
Genomics.
String2Alignment
(source)¶ convert an alignment string to a tuple alignment.
-
Genomics.
GetAlignmentLength
(alignment)¶ return Alignment length
-
Genomics.
Alignment2ExonBoundaries
(alignment, query_from=0, sbjct_from=0, add_stop_codon=1)¶ extract exon coordinates from a peptide2genome alignment.
- Parameters
- Returns
exons – A list of exons. Each exon is a tuple of (query_from, query_pos, frame, sbjct_from, sbjct_pos, ali)
- Return type
list
-
Genomics.
RemoveFrameShiftsFromAlignment
(row_ali, col_ali, gap_char='-')¶ remove frame shifts in an alignment.
Frameshifts are gaps are 1, 2, 4, or 5 residues long.
>>> RemoveFrameShiftsFromAlignment("ABC-EFG", "AB-DEFG") ('ABEFG', 'ABEFG')
- Parameters
row_ali (string) – Alignment string of row.
col_ali (string) – Alignment string of column.
gap_char (string) – Gap character to identify aligments.
- Returns
new_row_ali (string) – New alignment string for row
new_col_ali (string) – New aligment string for column
-
Genomics.
MaskStopCodons
(sequence, stop_codons=('TAG', 'TAA', 'TGA'))¶ mask stop codons in a sequence.
Stop codons are masked with
NNN
.- Parameters
sequence (string) – Nucleotide sequence to mask.
stop_codons (string) – List of known stop codons.
- Returns
masked_sequence
- Return type
string
-
Genomics.
Alignment2DNA
(alignment, query_from=0, sbjct_from=0)¶ convert a peptide2genome alignment to a nucleotide2nucleotide alignment.
Instead of peptide coordinates, the alignment will be in codon coordinates.
- Parameters
- Returns
alignment – The alignment as an alignlib.AlignmentVector object.
- Return type
-
Genomics.
encodeGenotype
(code)¶ encode genotypes like GG, GA into a one-letter code. The returned code is lower case if code[0] < code[1], otherwise it is uppercase.
-
Genomics.
decodeGenotype
(code)¶ decode single letter genotypes like m, M into two letters. This is the reverse operation to
encodeGenotype()
.
-
Genomics.
resolveAmbiguousNA
(code)¶ resolve ambiguous nucleic acid letters.
-
Genomics.
resolveReverseAmbiguousNA
(genotype)¶ map a genotype to a single letter amino acid amiguous code, for example, CT -> Y.
-
Genomics.
GetMapAA2Codons
()¶ returns a map of amino acids to codons
No stop codons. .
-
Genomics.
MapCodon2AA
(codon, is_seleno=False, ignore_n=True)¶ map a codon to an amino acid using the standard translation tables
The mapping returns gaps as gaps and will return an amino acid for incomplete codons if there is unambiguous mapping.
If
is_seleno
is set, the codon is translated for a selenoprotein.If
ignore_n
is set, codons withn
are returned as?
in order to distinguish them from stop codons.Amino acids are returned as upper-case letters.
-
Genomics.
Alignment2PeptideAlignment
(alignment, query_from=0, sbjct_from=0, genomic_sequence=None)¶ convert a Peptide2DNA aligment to a Peptide2Peptide alignment.
How to handle frameshifts?
-
Genomics.
translate
(sequence, is_seleno=False, prefer_lowercase=True, ignore_n=False)¶ convert DNA sequence to a peptide sequence
If
is_seleno
is set, “TGA” codons are treated as encoding for selenocysteine.If
ignore_n
is set, codons withn
are returned as?
in order to distinguish them from stop codons.
-
Genomics.
TranslateDNA2Protein
(*args, **kwargs)¶ convert a DNA sequence to a peptide sequence. keep case.
deprecated - use
translate()
instead.
-
Genomics.
Alignment2CDNA
(alignment, query_from=0, sbjct_from=0, genome=None, remove_frameshifts=0)¶ build cDNA sequence from genomic fragment and return alignment of query to it.
-
Genomics.
Exons2Alignment
(exons)¶ build an cigar alignment string from a list of exons.
-
Genomics.
AlignmentProtein2CDNA
(src, exons1=None, exons2=None)¶ convert a peptide alignment to a nucleotide alignment.
multiplies coordinates with 3. Insert introns.
Note: alignment starts at 1
-
Genomics.
GetDegenerateSites
(seq1, seq2, degeneracy=4, position=3)¶ returns two new sequenes containing only degenerate sites.
Only unmutated positions are counted.
-
class
Genomics.
SequencePairInfo
¶ Bases:
object
the first characters are ACGT.
-
getGCContent
()¶ return GC content.
-
-
class
Genomics.
SequencePairInfoCodons
¶ Bases:
Genomics.SequencePairInfo
the first characters are ACGT.
-
Genomics.
AlignedPair2SubstitutionMatrix
(seq1, seq2, alphabet)¶ given a pair of sequences, calculate a substitution matrix for the given alphabet.
-
Genomics.
CalculatePairIndices
(seq1, seq2, gap_char='-', with_codons=False)¶ returns number of idential and transitions/transversions substitutions in the alignment.
If with-codons = True, synonymous and nonsynonymous changes will be recorded as well. The routine assumes no frame-shifts and will count more than one change as non-synonymous.
-
Genomics.
makeSubstitutionMatrix
(type='EMBOSS')¶ make alignator with DNA substitution matrix.
EMBOSS style matrix: identity = 5 mismatch = -4 gop = -16 gep = -4
ClustalW style matrix: match = 1 mismatch = 0 gop = -10 gep = -0.1
-
Genomics.
CalculateRCSUValuesFromCounts
(counts, pseudo_counts=0)¶ calculate RCSU values for codons.
RCSU = relative frequency / uniform frequency
-
Genomics.
CalculateCodonFrequenciesFromCounts
(counts, pseudo_counts=0)¶ calculate codon frequencies from codon counts per amino acid. pseudo_counts are added if desired.
-
Genomics.
CalculateCAIWeightsFromCounts
(counts, pseudo_counts=0)¶ calculate CAI weights from codon counts. pseudo_counts are added if desired.
-
Genomics.
IsJunk
(contig)¶ returns true, if contigs is likely to be junk.
This is done by name matching. Junk contigs contain either one of the following:
random, unknown, chrU, chU.
-
Genomics.
CountCodons
(sequence)¶ count the codons in a sequence.
-
Genomics.
GetUniformCodonUsage
()¶ get list of frequencies for codons expected for uniform codon usage.
-
Genomics.
GetBiasedCodonUsage
(bias=1.0)¶ get list of frequencies for codons according to some bias.
The first codon for each aa is the most biased, all others are less biased.
The ratio determines the relative bias between the first and all other codons. 0.0 is no bias, 1.0 is complete bias.
-
Genomics.
convertStrand
(strand)¶ convert various strand notations into [+-.].
-
Genomics.
GetIntronType
(sequence, both_strands=False)¶ return intron type for an intronic sequence.
If both_strands is True, both strands are checked.
-
Genomics.
printPrettyAlignment
(seq1, *args)¶ print a pretty alignment.
-
Genomics.
ReadPeptideSequences
(infile, filter=None, as_array=False, regex_identifier=None)¶ read peptide sequence from fasta infile.
-
Genomics.
ParseFasta2Hash
(infile, filter=None, regex_identifier=None)¶ read fasta formatted sequences file and build a hash.
Keys are all characters before the first whitespace in the description line.
Previously, if the key contained a “:”, everything before the “:” was removed. This is not true any more.
Use array for higher space efficiency.
If regex_identifier is given, this is used to extract the identifier from the fasta description line.