Stats.py - statistical utility functions

Tags

Python

Code

Stats.getSignificance(pvalue, thresholds=[0.05, 0.01, 0.001])

return a symbolic representation (a "cartoon") of the significance of a p-value, based on the given thresholds.
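A minimal sketch of one common convention, assuming the cartoon is a string of asterisks with one star per threshold passed. The helper name significance_stars is hypothetical, and the actual return value of getSignificance may differ.

```python
def significance_stars(pvalue, thresholds=(0.05, 0.01, 0.001)):
    """Return one '*' for each threshold that pvalue does not exceed.

    Assumption: thresholds are ordered from least to most stringent.
    """
    stars = 0
    for threshold in thresholds:
        if pvalue > threshold:
            break
        stars += 1
    return "*" * stars
```

Under this convention a p-value of 0.03 maps to a single star and a p-value above 0.05 maps to an empty string.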

class Stats.Result

Bases: object

allow both member and dictionary access.

Stats.doLogLikelihoodTest(complex_ll, complex_np, simple_ll, simple_np, significance_threshold=0.05)

perform a log-likelihood ratio test between a complex model (complex_ll, complex_np) and a simple model (simple_ll, simple_np).
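The following sketch shows the textbook likelihood ratio test, assuming that complex_ll and simple_ll are log-likelihoods and complex_np and simple_np are parameter counts of nested models. The helper names chi2_sf and likelihood_ratio_test are hypothetical, and the real function may compute or report its result differently.

```python
import math


def chi2_sf(x, df):
    """Survival function of the chi-squared distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases (df = 1 and df = 2).
    if df % 2 == 1:
        q, nu = math.erfc(math.sqrt(half)), 1
    else:
        q, nu = math.exp(-half), 2
    # Recurrence: Q(nu + 2) = Q(nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


def likelihood_ratio_test(complex_ll, complex_np, simple_ll, simple_np):
    """Return (statistic, degrees of freedom, p-value)."""
    statistic = 2.0 * (complex_ll - simple_ll)
    df = complex_np - simple_np
    if df < 1:
        raise ValueError("the complex model must have more parameters")
    return statistic, df, chi2_sf(statistic, df)
```

The statistic is twice the difference in log-likelihood, compared against a chi-squared distribution whose degrees of freedom equal the difference in parameter counts.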

Stats.doBinomialTest(p, sample_size, observed, significance_threshold=0.05)

perform a binomial test.

Given are p, the probability under the NULL hypothesis, the sample_size, and the number of observed counts.
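As an illustration, the exact upper-tail probability of a binomial test can be sketched as below. The text above does not say which tail (or tails) doBinomialTest evaluates, so the one-sided choice and the helper name binomial_upper_tail are assumptions.

```python
import math


def binomial_upper_tail(p, sample_size, observed):
    """P(X >= observed) for X ~ Binomial(sample_size, p)."""
    return sum(
        math.comb(sample_size, k) * p ** k * (1.0 - p) ** (sample_size - k)
        for k in range(observed, sample_size + 1)
    )
```

The result would then be compared against significance_threshold to decide whether the observation is significant.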

Stats.doChiSquaredTest(matrix, significance_threshold=0.05)

perform chi-squared test on a matrix.

The observed/expected values are in rows, the categories are in columns, for example:

set        protein_coding   intronic   intergenic
observed   92               90         194
expected   91               10         15

If there are only two categories (one degree of freedom), the Yates correction is applied: the value 0.5 is subtracted from the absolute value of each difference (observed - expected).

The test throws an exception if

1. one or more expected values are less than 1 (it does not matter what the observed values are)

2. more than one-fifth of the expected values are less than 5
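The rules above can be sketched as follows. This is an illustration only: the helper name chi_squared_statistic is hypothetical, it takes the observed and expected rows as two separate lists rather than a matrix, and it returns only the statistic and the degrees of freedom.

```python
def chi_squared_statistic(observed, expected):
    """Return (chi-squared statistic, degrees of freedom)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected differ in length")
    # Rule 1: no expected value may be less than 1.
    if any(e < 1 for e in expected):
        raise ValueError("an expected value is less than 1")
    # Rule 2: at most one-fifth of the expected values may be less than 5.
    n_small = sum(1 for e in expected if e < 5)
    if n_small > len(expected) / 5.0:
        raise ValueError("more than one-fifth of expected values are less than 5")
    # Yates correction only for two categories (one degree of freedom).
    correction = 0.5 if len(observed) == 2 else 0.0
    statistic = sum(
        (abs(o - e) - correction) ** 2 / e for o, e in zip(observed, expected)
    )
    return statistic, len(observed) - 1
```

For the three-category example table, no correction is applied and the degrees of freedom are 2; for two categories such as observed [60, 40] against expected [50, 50], each absolute difference of 10 is reduced to 9.5 before squaring.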

Stats.doPearsonChiSquaredTest(p, sample_size, observed, significance_threshold=0.05)

perform a Pearson chi-squared test.

Given are p, the probability under the NULL hypothesis, the sample_size, and the number of observed counts.

For large sample sizes, this test is a continuous approximation to the binomial test.
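A minimal sketch of the calculation, assuming the two categories are "observed" and "not observed" with expected counts p * sample_size and (1 - p) * sample_size. The helper name pearson_chi_squared is hypothetical.

```python
import math


def pearson_chi_squared(p, sample_size, observed):
    """Return (chi-squared statistic, p-value) with one degree of freedom."""
    expected_hits = p * sample_size
    expected_misses = (1.0 - p) * sample_size
    misses = sample_size - observed
    statistic = (
        (observed - expected_hits) ** 2 / expected_hits
        + (misses - expected_misses) ** 2 / expected_misses
    )
    # With one degree of freedom the chi-squared survival function
    # reduces to the complementary error function.
    pvalue = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, pvalue
```

With p = 0.5, a sample size of 100 and 60 observed counts, both categories deviate by 10 from an expectation of 50, giving a statistic of 4.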

class Stats.DistributionalParameters(values=None, format='%6.4f', mode='float')

Bases: object

a collection of distributional parameters. Available properties are:

mMean, mMedian, mMin, mMax, mSampleStd, mSum, mCounts

This class is deprecated - use Summary instead.

updateProperties(values)

update properties.

If values is a vector of strings, each entry will be converted to float. Entries that cannot be converted are ignored.

getZScore(value)

return the z-score for value.
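A z-score expresses how many standard deviations a value lies from the mean. The sketch below assumes the sample standard deviation (compare the mSampleStd property) is used; the standalone helper z_score is hypothetical and is not the method itself.

```python
import statistics


def z_score(value, values):
    """Return (value - mean) / sample standard deviation of values."""
    mean = statistics.mean(values)
    sample_std = statistics.stdev(values)  # uses the n - 1 denominator
    return (value - mean) / sample_std
```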

setFormat(format)

set number format.

getHeaders()

returns header of column separated values.

getHeader()

returns header of column separated values.

class Stats.Summary(values=None, format='%6.4f', mode='float', allow_empty=True)

Bases: Stats.Result

a collection of distributional parameters. Available properties are:

mean, median, min, max, samplestd, sum, counts

getHeaders()

returns header of column separated values.

getHeader()

returns header of column separated values.

Stats.doFDRPython(pvalues, vlambda=None, pi0_method='smoother', fdr_level=None, robust=False, smooth_df=3, smooth_log_pi0=False, pi0=None, plot=False)

modeled after code taken from http://genomics.princeton.edu/storeylab/qvalue/linux.html.

I did not like the error handling, so I translated most of it to Python.

compute the FDR following the method of Storey et al. (2002).

class Stats.CorrelationTest(s_result=None, method=None)

Bases: object

coefficient is r, not r squared

Stats.filterMasked(xvals, yvals, missing=('na', 'Nan', None, ''), dtype=float)

convert xvals and yvals to numpy arrays, skipping pairs with one or more missing values.
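The pairwise filtering can be sketched as below. To stay dependency-free, this illustration returns plain lists rather than numpy arrays, and the helper name filter_masked is hypothetical.

```python
def filter_masked(xvals, yvals, missing=("na", "Nan", None, ""), dtype=float):
    """Keep only pairs where neither value is a missing marker."""
    xs, ys = [], []
    for x, y in zip(xvals, yvals):
        if x in missing or y in missing:
            continue  # drop the whole pair if either side is missing
        xs.append(dtype(x))
        ys.append(dtype(y))
    return xs, ys
```

Dropping pairs rather than single values keeps the two outputs aligned, which matters for paired analyses such as correlation.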

Stats.doCorrelationTest(xvals, yvals)

compute correlation between x and y.

Raises a ValueError if there are not enough observations.
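As an illustration, a Pearson correlation coefficient with a guard on the number of observations might look like the sketch below. The choice of Pearson's r, the min_observations parameter and its default, and the helper name pearson_r are all assumptions; the actual function may use a different method or threshold.

```python
import math


def pearson_r(xvals, yvals, min_observations=3):
    """Return Pearson's r; raise ValueError if there are too few pairs."""
    if len(xvals) != len(yvals):
        raise ValueError("xvals and yvals differ in length")
    n = len(xvals)
    if n < min_observations:
        raise ValueError("not enough observations")
    mean_x = sum(xvals) / n
    mean_y = sum(yvals) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xvals, yvals))
    sxx = sum((x - mean_x) ** 2 for x in xvals)
    syy = sum((y - mean_y) ** 2 for y in yvals)
    # Note: this divides by zero if either input has zero variance.
    return sxy / math.sqrt(sxx * syy)
```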

Stats.getPooledVariance(data)

return pooled variance from a list of tuples (sample_size, variance).
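A minimal sketch of the standard pooled-variance formula, in which each sample variance is weighted by its degrees of freedom (sample_size - 1). Whether getPooledVariance uses exactly this weighting is an assumption, and the helper name pooled_variance is hypothetical.

```python
def pooled_variance(data):
    """data: list of (sample_size, variance) tuples."""
    numerator = sum((n - 1) * variance for n, variance in data)
    denominator = sum(n - 1 for n, _ in data)
    if denominator <= 0:
        raise ValueError("need at least one sample with more than one observation")
    return numerator / denominator
```

For two samples of sizes 10 and 20 with variances 4.0 and 9.0, this gives (9 * 4.0 + 19 * 9.0) / 28.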

Stats.computeROC(values)

return a ROC curve for values. Values is a sorted list of (value, bool) pairs.

Deprecated - use getPerformance instead

returns a list of (FPR,TPR) tuples.

class Stats.PairedTTest(statistic, pvalue)

Bases: tuple

Create new instance of PairedTTest(statistic, pvalue)

property pvalue

Alias for field number 1

property statistic

Alias for field number 0

Stats.doPairedTTest(vals1, vals2)

perform paired t-test.

vals1 and vals2 need to contain the same number of elements.
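The paired t statistic is the mean of the pairwise differences divided by its standard error. The sketch below computes only the statistic and the degrees of freedom, not the p-value; the helper name paired_t_statistic is hypothetical.

```python
import math
import statistics


def paired_t_statistic(vals1, vals2):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(vals1) != len(vals2):
        raise ValueError("vals1 and vals2 must have the same length")
    differences = [a - b for a, b in zip(vals1, vals2)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    std_err = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / std_err, n - 1
```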

Stats.doWelchsTTest(n1, mean1, std1, n2, mean2, std2, alpha=0.05)

Welch's approximate t-test for the difference of two means of heteroscedastic populations.

This function performs a two-tailed test.

see PMID: 12016052

Parameters
  • n1 (int) – number of variates in sample 1

  • n2 (int) – number of variates in sample 2

  • mean1 (float) – mean of sample 1

  • mean2 (float) – mean of sample 2

  • std1 (float) – standard deviation of sample 1

  • std2 (float) – standard deviation of sample 2

returns a WelchTTest
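For orientation, the Welch statistic and the Welch-Satterthwaite degrees of freedom can be computed from the summary values as sketched below. The helper name welch_t is hypothetical, the p-value step is omitted, and the fields of the actual WelchTTest result are not shown here.

```python
import math


def welch_t(n1, mean1, std1, n2, mean2, std2):
    """Return (t statistic, approximate degrees of freedom)."""
    var_term1 = std1 ** 2 / n1  # squared standard error of mean 1
    var_term2 = std2 ** 2 / n2  # squared standard error of mean 2
    statistic = (mean1 - mean2) / math.sqrt(var_term1 + var_term2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_term1 + var_term2) ** 2 / (
        var_term1 ** 2 / (n1 - 1) + var_term2 ** 2 / (n2 - 1)
    )
    return statistic, df
```

Unlike the pooled-variance t-test, the degrees of freedom are generally not an integer.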

Stats.getAreaUnderCurve(xvalues, yvalues)

compute area under curve from a set of discrete x,y coordinates using trapezoids.

This is only as accurate as the density of points.
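The trapezoid rule sums, for each pair of neighbouring points, the interval width times the average of the two y values. A minimal sketch, with the hypothetical helper name area_under_curve:

```python
def area_under_curve(xvalues, yvalues):
    """Integrate y over x with the trapezoid rule."""
    if len(xvalues) != len(yvalues):
        raise ValueError("xvalues and yvalues differ in length")
    area = 0.0
    for i in range(1, len(xvalues)):
        width = xvalues[i] - xvalues[i - 1]
        area += width * (yvalues[i] + yvalues[i - 1]) / 2.0
    return area
```

Because each segment is treated as a straight line, curvature between points is ignored, which is why the accuracy depends on how densely the curve is sampled.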

Stats.getSensitivityRecall(values)

return sensitivity/selectivity.

Values is a sorted list of (value, bool) pairs.

Deprecated - use getPerformance instead

class Stats.ROCResult(value, pred, tp, fp, tn, fn, tpr, fpr, tnr, fnr, rtpr, rfnr)

Bases: tuple

Create new instance of ROCResult(value, pred, tp, fp, tn, fn, tpr, fpr, tnr, fnr, rtpr, rfnr)

property fn

Alias for field number 5

property fnr

Alias for field number 9

property fp

Alias for field number 3

property fpr

Alias for field number 7

property pred

Alias for field number 1

property rfnr

Alias for field number 11

property rtpr

Alias for field number 10

property tn

Alias for field number 4

property tnr

Alias for field number 8

property tp

Alias for field number 2

property tpr

Alias for field number 6

property value

Alias for field number 0

Stats.getPerformance(values, skip_redundant=True, false_negatives=False, bin_by_value=True, monotonous=False, multiple=False, increasing=True, total_positives=None, total_false_negatives=None)

compute performance estimates for a list of (score, flag) tuples in values.

Values is a sorted list of (value, bool) pairs.

If the option false_negatives is set, the input is +/- or 1/0 for a true positive or false negative, respectively.

TP: true positives
FP: false positives
TPR: true positive rate = true positives / predicted
P: predicted
FPR: false positive rate = false positives / predicted
value: value
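The quantities listed above can be accumulated in a single pass over the sorted list. The sketch below follows the definitions exactly as stated here (rates relative to the number of predictions so far) and ignores all optional arguments of getPerformance; the helper name cumulative_performance and the dictionary output are assumptions.

```python
def cumulative_performance(values):
    """values: list of (score, flag) pairs, already sorted by score."""
    rows = []
    true_positives = 0
    for predicted, (score, flag) in enumerate(values, start=1):
        if flag:
            true_positives += 1
        false_positives = predicted - true_positives
        rows.append({
            "value": score,
            "P": predicted,
            "TP": true_positives,
            "FP": false_positives,
            "TPR": true_positives / predicted,
            "FPR": false_positives / predicted,
        })
    return rows
```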

Stats.doMannWhitneyUTest(xvals, yvals)

apply the Mann-Whitney U test to test for the difference of medians.

Stats.adjustPValues(pvalues, method='fdr', n=None)

returns an array of adjusted p-values.

Reimplementation of the p.adjust function in R.

p: numeric vector of p-values (possibly with ‘NA’s). Any other R object is coerced by ‘as.numeric’.

method: correction method. Valid values are:

n: number of comparisons, must be at least ‘length(p)’; only set this (to non-default) when you know what you are doing

For more information, see the documentation of the p.adjust method in R.
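In R's p.adjust, the method 'fdr' is the Benjamini-Hochberg procedure. A minimal pure-Python sketch of that procedure follows; it assumes n equals the number of p-values and does not handle missing values, and the helper name bh_adjust is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvalues)
    # Visit p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

The running minimum enforces that adjusted values never decrease as the raw p-values increase, and caps every adjusted value at 1.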

Stats.savitzky_golay(y, window_size, order, deriv=0, rate=1)

Smooth (and optionally differentiate) data with a Savitzky-Golay filter. The Savitzky-Golay filter removes high-frequency noise from data. It has the advantage of preserving the original shape and features of the signal better than other types of filtering approaches, such as moving-average techniques.

Parameters
  • y (array_like, shape (N,)) – the values of the time history of the signal.

  • window_size (int) – the length of the window. Must be an odd integer number.

  • order (int) – the order of the polynomial used in the filtering. Must be less than window_size - 1.

  • deriv (int) – the order of the derivative to compute (default = 0 means only smoothing)

Returns

ys – the smoothed signal (or its n-th derivative).

Return type

ndarray, shape (N)

Notes

The Savitzky-Golay filter is a type of low-pass filter, particularly suited for smoothing noisy data. The main idea behind this approach is to fit, for each point, a high-order polynomial by least squares over an odd-sized window centered at that point.

Examples

    import numpy as np
    import matplotlib.pyplot as plt

    # noisy Gaussian test signal
    t = np.linspace(-4, 4, 500)
    y = np.exp(-t**2) + np.random.normal(0, 0.05, t.shape)

    # smooth with a 31-point window and a 4th-order polynomial
    ysg = savitzky_golay(y, window_size=31, order=4)

    plt.plot(t, y, label='Noisy signal')
    plt.plot(t, np.exp(-t**2), 'k', lw=1.5, label='Original signal')
    plt.plot(t, ysg, 'r', label='Filtered signal')
    plt.legend()
    plt.show()

References

[1] A. Savitzky, M. J. E. Golay, Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry, 1964, 36 (8), pp 1627-1639.

[2] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press. ISBN-13: 9780521880688.