Stats.py - statistical utility functions
- Tags
Python
Code
-
Stats.
getSignificance
(pvalue, thresholds=[0.05, 0.01, 0.001])¶ return a cartoon of the significance of a p-value.
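A minimal sketch of what such a significance "cartoon" could look like. The symbols returned by getSignificance are not stated here, so the asterisks and the helper name significance_stars are assumptions made for illustration only.

```python
def significance_stars(pvalue, thresholds=(0.05, 0.01, 0.001)):
    """Hypothetical helper: one '*' per threshold that pvalue falls below."""
    # Assumption: stricter thresholds add more asterisks.
    return "*" * sum(1 for threshold in thresholds if pvalue < threshold)


for p in (0.2, 0.03, 0.005, 0.0005):
    print(p, repr(significance_stars(p)))
```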
-
Stats.
doLogLikelihoodTest
(complex_ll, complex_np, simple_ll, simple_np, significance_threshold=0.05)¶ perform a log-likelihood test between two models, a complex one and a simple one.
-
Stats.
doBinomialTest
(p, sample_size, observed, significance_threshold=0.05)¶ perform a binomial test.
Given are p, the probability under the null hypothesis, the sample_size, and the number of observed counts.
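As a rough illustration of an exact binomial test with these three inputs, the sketch below sums the upper tail of the binomial distribution. Whether doBinomialTest is one-sided or two-sided is not stated here, so the one-sided tail and the helper name binomial_upper_tail are assumptions.

```python
from math import comb


def binomial_upper_tail(p, sample_size, observed):
    """Hypothetical helper: P(X >= observed) for X ~ Binomial(sample_size, p)."""
    return sum(
        comb(sample_size, k) * p**k * (1 - p) ** (sample_size - k)
        for k in range(observed, sample_size + 1)
    )


# 9 successes out of 10 trials under a null probability of 0.5
pvalue = binomial_upper_tail(p=0.5, sample_size=10, observed=9)
print(pvalue, pvalue < 0.05)
```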
-
Stats.
doChiSquaredTest
(matrix, significance_threshold=0.05)¶ perform chi-squared test on a matrix.
The observed/expected values are in rows, the categories are in columns, for example:
set       protein_coding  intronic  intergenic
observed  92              90        194
expected  91              10        15
If there are only two categories (one degree of freedom), the Yates correction is applied: for each entry, 0.5 is subtracted from the absolute value of the difference (observed - expected).
The test throws an exception if
1. one or more expected categories are less than 1 (it does not matter what the observed values are)
2. more than one-fifth of expected categories are less than 5
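The rules above can be sketched as follows. This is not the library's implementation: the helper name chi_squared_statistic is hypothetical, it takes two plain lists rather than a matrix, and it returns only the statistic and the degrees of freedom, leaving out the p-value.

```python
def chi_squared_statistic(observed, expected):
    """Hypothetical helper: chi-squared statistic with the checks described above."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e < 1 for e in expected):
        raise ValueError("one or more expected categories are less than 1")
    n_small = sum(1 for e in expected if e < 5)
    if n_small > len(expected) / 5:
        raise ValueError("more than one-fifth of expected categories are less than 5")

    df = len(observed) - 1
    # Yates correction only for two categories (one degree of freedom)
    correction = 0.5 if df == 1 else 0.0
    statistic = sum(
        (abs(o - e) - correction) ** 2 / e for o, e in zip(observed, expected)
    )
    return statistic, df


# values from the example table: three categories, so no Yates correction
stat, df = chi_squared_statistic([92, 90, 194], [91, 10, 15])
print(round(stat, 2), df)
```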
-
Stats.
doPearsonChiSquaredTest
(p, sample_size, observed, significance_threshold=0.05)¶ perform a Pearson chi-squared test.
Given are p, the probability under the null hypothesis, the sample_size, and the number of observed counts.
For large sample sizes, this test is a continuous approximation to the binomial test.
-
class
Stats.
DistributionalParameters
(values=None, format='%6.4f', mode='float')¶ Bases:
object
a collection of distributional parameters. Available properties are:
mMean, mMedian, mMin, mMax, mSampleStd, mSum, mCounts
This class is deprecated - use Summary instead.
-
updateProperties
(values)¶ update properties.
If values is a vector of strings, each entry will be converted to float. Entries that cannot be converted are ignored.
-
getZScore
(value)¶ return zscore for value.
-
setFormat
(format)¶ set number format.
-
getHeaders
()¶ returns header of column separated values.
-
getHeader
()¶ returns header of column separated values.
-
-
class
Stats.
Summary
(values=None, format='%6.4f', mode='float', allow_empty=True)¶ Bases:
Stats.Result
a collection of distributional parameters. Available properties are:
mean, median, min, max, samplestd, sum, counts
-
getHeaders
()¶ returns header of column separated values.
-
getHeader
()¶ returns header of column separated values.
-
-
Stats.
doFDRPython
(pvalues, vlambda=None, pi0_method='smoother', fdr_level=None, robust=False, smooth_df=3, smooth_log_pi0=False, pi0=None, plot=False)¶ modeled after code taken from http://genomics.princeton.edu/storeylab/qvalue/linux.html.
I did not like the error handling, so I translated most of it to Python.
Compute FDR after method by Storey et al. (2002).
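A simplified sketch of the q-value idea, assuming a single fixed lambda for the pi0 estimate. The 'smoother' option, the robust variant and the plotting in doFDRPython are not reproduced, and the helper name storey_qvalues is hypothetical.

```python
def storey_qvalues(pvalues, vlambda=0.5):
    """Hypothetical helper: q-values with pi0 estimated at one lambda."""
    m = len(pvalues)
    # pi0: estimated proportion of true null hypotheses
    pi0 = min(1.0, sum(1 for p in pvalues if p > vlambda) / (m * (1.0 - vlambda)))
    if pi0 <= 0:
        raise ValueError("estimated pi0 <= 0, try a different lambda")

    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so that q-values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return pi0, qvalues


pi0, qvalues = storey_qvalues([0.001, 0.01, 0.02, 0.6, 0.9, 0.03, 0.04, 0.8])
print(pi0, [round(q, 4) for q in qvalues])
```

With pi0 fixed at 1 the same loop reduces to the Benjamini-Hochberg adjustment.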
-
class
Stats.
CorrelationTest
(s_result=None, method=None)¶ Bases:
object
coefficient is r, not r squared
-
Stats.
filterMasked
(xvals, yvals, missing=('na', 'Nan', None, ''), dtype=<class 'float'>)¶ convert xvals and yvals to numpy array skipping pairs with one or more missing values.
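The pairwise filtering can be illustrated with plain lists. This sketch returns Python lists rather than numpy arrays to stay dependency-free, and the helper name filter_masked is hypothetical.

```python
def filter_masked(xvals, yvals, missing=("na", "Nan", None, ""), dtype=float):
    """Hypothetical helper: drop every pair in which x or y is missing."""
    kept = [
        (dtype(x), dtype(y))
        for x, y in zip(xvals, yvals)
        if x not in missing and y not in missing
    ]
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return xs, ys


xs, ys = filter_masked(["1", "na", "3", "4"], ["2", "5", None, "8"])
print(xs, ys)
```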
-
Stats.
doCorrelationTest
(xvals, yvals)¶ compute correlation between x and y.
Raises a ValueError if there are not enough observations.
-
Stats.
getPooledVariance
(data)¶ return pooled variance from a list of tuples (sample_size, variance).
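For reference, the textbook pooled variance weights each sample variance by its degrees of freedom. The sketch below assumes getPooledVariance follows that standard formula; the helper name pooled_variance is hypothetical.

```python
def pooled_variance(data):
    """Hypothetical helper: data is a list of (sample_size, variance) tuples."""
    numerator = sum((n - 1) * variance for n, variance in data)
    denominator = sum(n - 1 for n, _ in data)
    return numerator / denominator


print(pooled_variance([(10, 4.0), (20, 9.0)]))
```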
-
Stats.
computeROC
(values)¶ return a roc curve for values. Values is a sorted list of (value, bool) pairs.
Deprecated - use getPerformance instead
returns a list of (FPR,TPR) tuples.
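A minimal sketch of how (FPR, TPR) points can be derived from a sorted list of (value, bool) pairs. It assumes the list is ordered from the most to the least confident prediction and does not merge tied values; the helper name roc_points is hypothetical.

```python
def roc_points(values):
    """Hypothetical helper: one (FPR, TPR) point per entry of values."""
    positives = sum(1 for _, flag in values if flag)
    negatives = len(values) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")

    tp = fp = 0
    points = []
    for _, flag in values:
        if flag:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


points = roc_points(
    [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
)
print(points)
```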
-
class
Stats.
PairedTTest
(statistic, pvalue)¶ Bases:
tuple
Create new instance of PairedTTest(statistic, pvalue)
-
property
pvalue
¶ Alias for field number 1
-
property
statistic
¶ Alias for field number 0
-
-
Stats.
doPairedTTest
(vals1, vals2)¶ perform paired t-test.
vals1 and vals2 need to contain the same number of elements.
-
Stats.
doWelchsTTest
(n1, mean1, std1, n2, mean2, std2, alpha=0.05)¶ Welch's approximate t-test for the difference of two means of heteroscedastic populations.
This function performs a two-tailed test.
see PMID: 12016052
- Parameters
n1 (int) – number of variates in sample 1
n2 (int) – number of variates in sample 2
mean1 (float) – mean of sample 1
mean2 (float) – mean of sample 2
std1 (float) – standard deviation of sample 1
std2 (float) – standard deviation of sample 2
returns a WelchTTest
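The statistic behind Welch's test can be computed from the summary values alone. The sketch below uses the standard formulas for the t statistic and the Welch-Satterthwaite degrees of freedom; it stops short of the p-value, and the helper name welch_statistic and its return value are assumptions, not the fields of WelchTTest.

```python
import math


def welch_statistic(n1, mean1, std1, n2, mean2, std2):
    """Hypothetical helper: return (t, degrees_of_freedom) for Welch's test."""
    v1 = std1**2 / n1  # squared standard error of mean 1
    v2 = std2**2 / n2  # squared standard error of mean 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


t, df = welch_statistic(n1=10, mean1=5.0, std1=2.0, n2=12, mean2=3.0, std2=1.0)
print(round(t, 3), round(df, 2))
```

With equal sample sizes and equal standard deviations, the degrees of freedom reduce to 2 * (n - 1), the value used by the classical two-sample t-test.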
-
Stats.
getAreaUnderCurve
(xvalues, yvalues)¶ compute area under curve from a set of discrete x,y coordinates using trapezoids.
This is only as accurate as the density of points.
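A short sketch of the trapezoid rule, which also shows how the estimate improves with denser points. The helper name area_under_curve is hypothetical, and the x values are assumed to be sorted.

```python
def area_under_curve(xvalues, yvalues):
    """Hypothetical helper: trapezoid rule over sorted x,y coordinates."""
    if len(xvalues) != len(yvalues):
        raise ValueError("xvalues and yvalues must have the same length")
    area = 0.0
    for i in range(1, len(xvalues)):
        width = xvalues[i] - xvalues[i - 1]
        area += width * (yvalues[i] + yvalues[i - 1]) / 2.0
    return area


# y = x**2 on [0, 1]; the exact area is 1/3
coarse_x = [0.0, 0.5, 1.0]
fine_x = [i / 100 for i in range(101)]
coarse = area_under_curve(coarse_x, [x**2 for x in coarse_x])
fine = area_under_curve(fine_x, [x**2 for x in fine_x])
print(coarse, fine)
```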
-
Stats.
getSensitivityRecall
(values)¶ return sensitivity/selectivity.
Values is a sorted list of (value, bool) pairs.
Deprecated - use getPerformance instead
-
class
Stats.
ROCResult
(value, pred, tp, fp, tn, fn, tpr, fpr, tnr, fnr, rtpr, rfnr)¶ Bases:
tuple
Create new instance of ROCResult(value, pred, tp, fp, tn, fn, tpr, fpr, tnr, fnr, rtpr, rfnr)
-
property
fn
¶ Alias for field number 5
-
property
fnr
¶ Alias for field number 9
-
property
fp
¶ Alias for field number 3
-
property
fpr
¶ Alias for field number 7
-
property
pred
¶ Alias for field number 1
-
property
rfnr
¶ Alias for field number 11
-
property
rtpr
¶ Alias for field number 10
-
property
tn
¶ Alias for field number 4
-
property
tnr
¶ Alias for field number 8
-
property
tp
¶ Alias for field number 2
-
property
tpr
¶ Alias for field number 6
-
property
value
¶ Alias for field number 0
-
-
Stats.
getPerformance
(values, skip_redundant=True, false_negatives=False, bin_by_value=True, monotonous=False, multiple=False, increasing=True, total_positives=None, total_false_negatives=None)¶ compute performance estimates for a list of (score, flag) tuples in values.
Values is a sorted list of (value, bool) pairs.
If the option false_negatives is set, the input is +/- or 1/0 for a true positive or false negative, respectively.
TP: true positives
FP: false positives
TPR: true positive rate = true positives / predicted
P: predicted
FPR: false positive rate = false positives / predicted
value: value
-
Stats.
doMannWhitneyUTest
(xvals, yvals)¶ apply the Mann-Whitney U test to test for the difference of medians.
-
Stats.
adjustPValues
(pvalues, method='fdr', n=None)¶ returns an array of adjusted pvalues
Reimplementation of p.adjust in the R package.
p: numeric vector of p-values (possibly with ‘NA’s). Any other R object is coerced by ‘as.numeric’.
method: correction method. Valid values are:
n: number of comparisons, must be at least ‘length(p)’; only set this (to non-default) when you know what you are doing
For more information, see the documentation of the p.adjust method in R.
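In R's p.adjust, 'fdr' is an alias for the Benjamini-Hochberg method. The sketch below shows that adjustment on a plain list, on the assumption that adjustPValues mirrors it; the helper name adjust_bh is hypothetical, and NA handling and the n argument are left out.

```python
def adjust_bh(pvalues):
    """Hypothetical helper: Benjamini-Hochberg adjusted p-values, input order kept."""
    n = len(pvalues)
    # indices from the largest to the smallest p-value
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = adjust_bh([0.001, 0.5, 0.04])
print([round(p, 4) for p in adjusted])
```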
-
Stats.
savitzky_golay
(y, window_size, order, deriv=0, rate=1)¶ Smooth (and optionally differentiate) data with a Savitzky-Golay filter. The Savitzky-Golay filter removes high-frequency noise from data. It has the advantage of preserving the original shape and features of the signal better than other types of filtering approaches, such as moving average techniques.
- Parameters
y (array_like, shape (N,)) – the values of the time history of the signal.
window_size (int) – the length of the window. Must be an odd integer number.
order (int) – the order of the polynomial used in the filtering. Must be less than window_size - 1.
deriv (int) – the order of the derivative to compute (default = 0 means only smoothing)
- Returns
ys – the smoothed signal (or its n-th derivative).
- Return type
ndarray, shape (N)
Notes
The Savitzky-Golay filter is a type of low-pass filter, particularly suited for smoothing noisy data. The main idea behind this approach is to make, for each point, a least-squares fit with a polynomial of high order over an odd-sized window centered at the point.
Examples
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(-4, 4, 500)
y = np.exp(-t**2) + np.random.normal(0, 0.05, t.shape)
ysg = savitzky_golay(y, window_size=31, order=4)
plt.plot(t, y, label='Noisy signal')
plt.plot(t, np.exp(-t**2), 'k', lw=1.5, label='Original signal')
plt.plot(t, ysg, 'r', label='Filtered signal')
plt.legend()
plt.show()
References
[1] A. Savitzky, M. J. E. Golay, Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry, 1964, 36 (8), pp 1627-1639.
[2] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, ISBN-13: 9780521880688.