K-Sample
Non-parametric K-Sample Test
class hyppo.ksample.KSample(indep_test, compute_distance=<function euclidean>, bias=False)

Class for calculating the k-sample test statistic and p-value.
A k-sample test tests equality in distribution among groups. Groups can be of different sizes, but generally have the same dimensionality. Few non-parametric k-sample tests exist; this version leverages the implemented independence tests to test this equality of distribution.
Parameters:

indep_test : {"CCA", "Dcorr", "HHG", "RV", "Hsic", "MGC"}
    A string corresponding to the desired independence test from hyppo.independence. This is not case sensitive.

compute_distance : callable, optional (default: euclidean)
    A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x), where x is the data matrix for which pairwise distances are calculated (see the sketch after this parameter list).

bias : bool, optional (default: False)
    Whether or not to use the biased or unbiased test statistics. Only applies to Dcorr and Hsic.
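For instance, a custom distance callable might look like the minimal sketch below. The cityblock metric and the helper name are illustrative assumptions, not part of hyppo; only the compute_distance(x) calling convention comes from the parameter description above.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> from hyppo.ksample import KSample
>>> def cityblock_distance(x):
...     # pairwise L1 distances, matching the compute_distance(x) form above
...     return cdist(x, x, metric="cityblock")
>>> x = np.random.uniform(size=(30, 2))
>>> y = np.random.uniform(size=(40, 2))
>>> stat, pvalue = KSample("Dcorr", compute_distance=cityblock_distance).test(x, y)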
Notes

The ideas behind this test can be found in an upcoming paper.
The k-sample testing problem can be thought of as a generalization of the two-sample testing problem. Define \(\{ u_i \stackrel{iid}{\sim} F_U,\ i = 1, ..., n \}\) and \(\{ v_j \stackrel{iid}{\sim} F_V,\ j = 1, ..., m \}\) as two groups of samples drawn from different distributions with the same dimensionality. Then the problem that we are testing is,
\[\begin{split}H_0: F_U &= F_V \\ H_A: F_U &\neq F_V\end{split}\]

The closely related independence testing problem can be generalized similarly: given a set of paired data \(\{\left(x_i, y_i \right) \stackrel{iid}{\sim} F_{XY}, \ i = 1, ..., N\}\), the problem that we are testing is,

\[\begin{split}H_0: F_{XY} &= F_X F_Y \\ H_A: F_{XY} &\neq F_X F_Y\end{split}\]

By manipulating the inputs of the k-sample test, we can create concatenated versions of the inputs and a corresponding label matrix, which are necessarily paired. Then, any nonparametric independence test can be performed on this data.
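A rough sketch of this reduction, assuming a simple one-hot label matrix, is given below; it illustrates the idea rather than hyppo's internal implementation, and the sample data are arbitrary.

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> u = np.random.normal(loc=0.0, size=(50, 3))
>>> v = np.random.normal(loc=0.5, size=(60, 3))
>>> data = np.concatenate([u, v], axis=0)  # stack the two groups
>>> labels = np.concatenate([np.tile([1, 0], (50, 1)),
...                          np.tile([0, 1], (60, 1))]).astype(float)
>>> # testing independence of data and labels tests equality of distribution
>>> stat, pvalue = Dcorr().test(data, labels)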
test(*args, reps=1000, workers=1, auto=True)

Calculates the k-sample test statistic and p-value.
Parameters:

*args : ndarrays
    Variable length input data matrices. All inputs must have the same number of dimensions. That is, the shapes must be (n, p) and (m, p), where n and m are the numbers of samples and p is the number of dimensions. Alternatively, inputs can be distance matrices, where the shapes must all be (n, n).
reps : int, optional (default: 1000)
    The number of replications used to estimate the null distribution when the permutation test is used to calculate the p-value.

workers : int, optional (default: 1)
    The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto : bool, optional (default: True)
    If True and the sample size is greater than 20, a fast chi-squared approximation is used instead of the permutation test; the parameters reps and workers are irrelevant in this case. Only applies to Dcorr and Hsic.
Returns:

stat : float
    The computed k-sample statistic.

pvalue : float
    The computed k-sample p-value.
Examples
>>> import numpy as np
>>> from hyppo.ksample import KSample
>>> x = np.arange(7)
>>> y = x
>>> z = np.arange(10)
>>> stat, pvalue = KSample("Dcorr").test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'-0.136, 1.0'
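The permutation-related parameters of test() can be exercised explicitly, as in the sketch below with arbitrary illustrative data; auto=False forces the permutation null so that reps and workers take effect, and any number of groups can be passed.

>>> import numpy as np
>>> from hyppo.ksample import KSample
>>> x = np.random.normal(loc=0.0, size=(100, 2))
>>> y = np.random.normal(loc=0.0, size=(80, 2))
>>> z = np.random.normal(loc=1.0, size=(120, 2))
>>> stat, pvalue = KSample("Dcorr").test(x, y, z, reps=100, workers=-1, auto=False)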