pandera.api.hypotheses.Hypothesis.two_sample_ttest
- classmethod Hypothesis.two_sample_ttest(sample1, sample2, groupby=None, relationship='equal', alpha=0.01, equal_var=True, nan_policy='propagate', **kwargs)
Calculate a t-test for the means of two samples.
Perform a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
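As a rough illustration of what the equal-variance assumption changes, the sketch below computes the test statistic and degrees of freedom by hand, once with a pooled variance and once with Welch's formula. The function names `pooled_t` and `welch_t` and the sample values are hypothetical, chosen for this sketch only; they are not part of pandera. The p-value step is left out because it needs the t distribution, which a library such as SciPy provides.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    n1, n2 = len(a), len(b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    # Per-sample variance of the mean; no pooling across samples.
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Illustrative (hypothetical) samples.
group_a = [8.1, 7.0]
group_b = [5.2, 5.1, 4.0]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

When both samples have the same size the two statistics coincide and only the degrees of freedom differ, so the choice matters most for unbalanced groups with unequal spread.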
- Parameters:
sample1 (str) – The first sample group to test. For Column and SeriesSchema hypotheses, refers to the level in the groupby column. For DataFrameSchema hypotheses, refers to a column in the DataFrame.
sample2 (str) – The second sample group to test. For Column and SeriesSchema hypotheses, refers to the level in the groupby column. For DataFrameSchema hypotheses, refers to a column in the DataFrame.
groupby (Union[str, List[str], Callable, None]) – If a string or list of strings is provided, then these columns are used to group the Column Series by groupby. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupby. The function has access to the entire dataframe, but the Column.name is selected from this DataFrameGroupby object so that a SeriesGroupBy object is passed into fn.
Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]
Where specific groups can be obtained from the input dict.
relationship (str) – Represents what relationship conditions are imposed on the hypothesis test. Available relationships are: "greater_than", "less_than", "not_equal", and "equal". For example, group1 greater_than group2 specifies an alternative hypothesis that the mean of group1 is greater than the mean of group2, relative to a null hypothesis that they are equal.
alpha (float) – (Default value = 0.01) The significance level; the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.01 indicates a 1% risk of concluding that a difference exists when there is no actual difference.
equal_var (bool) – (Default value = True) If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch's t-test, which does not assume equal population variance.
nan_policy (str) – Defines how to handle input that contains nan, one of {"propagate", "raise", "omit"}, (Default value = "propagate"). For more details see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
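To make the interaction between relationship and alpha concrete, here is a minimal sketch of one conventional decision rule applied to a test statistic and its two-sided p-value. The function `hypothesis_passes` is hypothetical and the rule is an assumption made for illustration; this page does not show how pandera evaluates the result internally, and the actual implementation may differ.

```python
def hypothesis_passes(statistic, two_sided_pvalue, relationship, alpha=0.01):
    """Return True when the test result supports the stated relationship.

    Assumed rule (illustrative only, not taken from pandera's source).
    """
    if relationship == "greater_than":
        # One-sided test: halve the two-sided p-value and require that
        # the statistic points in the claimed direction.
        return statistic > 0 and two_sided_pvalue / 2 < alpha
    if relationship == "less_than":
        return statistic < 0 and two_sided_pvalue / 2 < alpha
    if relationship == "not_equal":
        # Two-sided test: reject the null of equal means.
        return two_sided_pvalue < alpha
    if relationship == "equal":
        # Pass only when the null of equal means is not rejected.
        return two_sided_pvalue >= alpha
    raise ValueError(f"unknown relationship: {relationship!r}")


# Hypothetical test result: positive statistic, two-sided p-value of 0.02.
print(hypothesis_passes(4.3, 0.02, "greater_than", alpha=0.05))
print(hypothesis_passes(4.3, 0.02, "greater_than", alpha=0.01))
```

Under this rule the same test result can pass at alpha=0.05 and fail at alpha=0.01, which is why the default of 0.01 is a fairly strict threshold for small samples.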
- Example:
Use the built-in class method to do a two-sample t-test.
>>> import pandas as pd
>>> import pandera as pa
>>>
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(
...         float, [
...             pa.Hypothesis.two_sample_ttest(
...                 sample1="A",
...                 sample2="B",
...                 groupby="group",
...                 relationship="greater_than",
...                 alpha=0.05,
...                 equal_var=True),
...     ]),
...     "group": pa.Column(str)
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B
- Return type: