pandera.api.hypotheses.Hypothesis
- class pandera.api.hypotheses.Hypothesis(test, samples=None, groupby=None, relationship='equal', alpha=None, test_kwargs=None, relationship_kwargs=None, name=None, error=None, raise_warning=False, n_failure_cases=None, title=None, description=None, statistics=None, strategy=None, **check_kwargs)
Special type of Check that defines hypothesis tests on data.
Perform a hypothesis test on a Series or DataFrame.
- Parameters
test (Callable) – The hypothesis test function. It should take one or more arrays as positional arguments and return a test statistic and a p-value. The arrays passed into the test function are determined by the samples argument.
samples (Union[str, List[str], None]) – For Column or SeriesSchema hypotheses, this refers to the group keys in the groupby column(s) used to group the Series into a dict of Series. The samples column(s) are passed into the test function as positional arguments.
For DataFrame-level hypotheses, samples refers to a column or multiple columns to pass into the test function. The samples column(s) are passed into the test function as positional arguments.
groupby (Union[str, List[str], Callable, None]) – If a string or list of strings is provided, these columns are used to group the Column Series. If a callable is passed, the expected signature is DataFrame -> DataFrameGroupBy. The function has access to the entire dataframe, but Column.name is selected from this DataFrameGroupBy object so that a SeriesGroupBy object is passed into the hypothesis_check function.
Specifying this argument changes the fn signature to: dict[str|tuple[str], Series] -> bool|pd.Series[bool]
where specific groups can be obtained from the input dict.
relationship (Union[str, Callable]) – The relationship condition imposed on the hypothesis test. A function or lambda function can be supplied.
Available built-in relationships are: “greater_than”, “less_than”, “not_equal” or “equal”, where “equal” is the null hypothesis.
If callable, the function should have the signature (stat: float, pvalue: float, **kwargs), where stat is the hypothesis test statistic, pvalue assesses statistical significance, and **kwargs are other arguments supplied via the relationship_kwargs argument.
Default is “equal” for the null hypothesis.
alpha (Optional[float]) – Significance level, if applicable to the hypothesis check.
test_kwargs (dict) – Keyword arguments to be supplied to the test.
relationship_kwargs (dict) – Keyword arguments to be supplied to the relationship function, e.g. alpha could be used to specify a threshold in a t-test.
raise_warning (bool) – If True, raise a SchemaWarning instead of a SchemaError when this check fails, so that no exception is thrown. Use this option carefully, in cases where a failing check is informational and shouldn’t stop execution of the program.
n_failure_cases (Optional[int]) – Report the first n unique failure cases. If None, report all failure cases.
title (Optional[str]) – A human-readable label for the check.
description (Optional[str]) – An arbitrary textual description of the check.
statistics (Optional[Dict[str, Any]]) – kwargs to pass into the check function. These values are serialized and represent the constraints of the checks.
strategy (Optional[SearchStrategy]) – A hypothesis strategy, used for implementing data synthesis strategies for this check.
check_kwargs – Keyword arguments to pass into check_fn.
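The contract between test, relationship, and their keyword arguments can be illustrated without pandera. What follows is a minimal sketch, not pandera's implementation: fake_test and evaluate are hypothetical names, the p-value is a fixed placeholder, and the order of calls (test first, then relationship) is an assumption drawn from the parameter descriptions above. The relationship body uses the one-sided condition stat > 0 and pvalue / 2 < alpha.

```python
from typing import Callable, Optional, Sequence, Tuple


def fake_test(
    a: Sequence[float], b: Sequence[float], scale: float = 1.0
) -> Tuple[float, float]:
    """Hypothetical stand-in for a real test such as scipy.stats.ttest_ind.

    Returns (statistic, p-value). The p-value is a fixed placeholder,
    not a real significance calculation.
    """
    stat = scale * (sum(a) / len(a) - sum(b) / len(b))
    pvalue = 0.04  # placeholder value, for illustration only
    return stat, pvalue


def greater_than(stat: float, pvalue: float, alpha: float = 0.1) -> bool:
    """Custom relationship: one-sided 'first sample is greater' condition."""
    return stat > 0 and pvalue / 2 < alpha


def evaluate(
    test: Callable[..., Tuple[float, float]],
    relationship: Callable[..., bool],
    arrays: Sequence[Sequence[float]],
    test_kwargs: Optional[dict] = None,
    relationship_kwargs: Optional[dict] = None,
) -> bool:
    """Assumed flow: run the test, then hand its outputs to the relationship."""
    stat, pvalue = test(*arrays, **(test_kwargs or {}))
    return relationship(stat, pvalue, **(relationship_kwargs or {}))


arrays = ([8.1, 7.0], [5.2, 5.1, 4.0])

# Uses the relationship's own default alpha of 0.1.
passes_default = evaluate(fake_test, greater_than, arrays)

# relationship_kwargs overrides alpha with a stricter threshold.
passes_strict = evaluate(
    fake_test, greater_than, arrays, relationship_kwargs={"alpha": 0.01}
)

print(passes_default, passes_strict)
```

Keeping the threshold in relationship_kwargs rather than hard-coding it in the callable lets the same relationship function be reused with different significance levels.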
- Examples
Define a two-sample hypothesis test using scipy.
>>> import pandas as pd
>>> import pandera as pa
>>>
>>> from scipy import stats
>>>
>>> schema = pa.DataFrameSchema({
...     "height_in_feet": pa.Column(float, [
...         pa.Hypothesis(
...             test=stats.ttest_ind,
...             samples=["A", "B"],
...             groupby="group",
...             # assert that the mean height of group "A" is greater
...             # than that of group "B"
...             relationship=lambda stat, pvalue, alpha=0.1: (
...                 stat > 0 and pvalue / 2 < alpha
...             ),
...             # set alpha criterion to 5%
...             relationship_kwargs={"alpha": 0.05}
...         )
...     ]),
...     "group": pa.Column(str),
... })
>>> df = (
...     pd.DataFrame({
...         "height_in_feet": [8.1, 7, 5.2, 5.1, 4],
...         "group": ["A", "A", "B", "B", "B"]
...     })
... )
>>> schema.validate(df)[["height_in_feet", "group"]]
   height_in_feet group
0             8.1     A
1             7.0     A
2             5.2     B
3             5.1     B
4             4.0     B
See here for more usage details.
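To make the data flow in the example above more concrete, here is a plain-Python sketch of how groupby and samples might combine: values are grouped into a dict keyed by group label, and the entries named in samples are passed to the test as positional arguments. This is an illustration under stated assumptions, not pandera's internals. group_values and permutation_test are hypothetical helpers, and an exact permutation test is swapped in for stats.ttest_ind so that the sketch needs only the standard library.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def group_values(
    values: Sequence[float], keys: Sequence[str]
) -> Dict[str, List[float]]:
    """Group values by key, analogous to the dict of Series built from groupby."""
    groups: Dict[str, List[float]] = {}
    for value, key in zip(values, keys):
        groups.setdefault(key, []).append(value)
    return groups


def permutation_test(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float]:
    """Exact two-sided permutation test on the difference of means.

    Stand-in for scipy.stats.ttest_ind; returns (statistic, p-value).
    """
    pooled = list(a) + list(b)
    observed = mean(a) - mean(b)
    n_extreme = 0
    n_total = 0
    # Enumerate every way of assigning len(a) pooled values to the first sample.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        left = [pooled[i] for i in idx]
        right = [v for i, v in enumerate(pooled) if i not in chosen]
        if abs(mean(left) - mean(right)) >= abs(observed):
            n_extreme += 1
        n_total += 1
    return observed, n_extreme / n_total


heights = [8.1, 7.0, 5.2, 5.1, 4.0]
labels = ["A", "A", "B", "B", "B"]
samples = ["A", "B"]  # group keys, in the order the test receives them

groups = group_values(heights, labels)
stat, pvalue = permutation_test(*(groups[key] for key in samples))

print(groups)
print(stat, pvalue)
```

The order of samples matters here: swapping it to ["B", "A"] flips the sign of the statistic, which changes the outcome of any one-sided relationship such as stat > 0.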
Attributes
RELATIONSHIPS
BACKEND_REGISTRY
CHECK_FUNCTION_REGISTRY
REGISTERED_CUSTOM_CHECKS
Methods
__init__ – Perform a hypothesis test on a Series or DataFrame.
one_sample_ttest – Calculate a t-test for the mean of one sample.
two_sample_ttest – Calculate a t-test for the means of two samples.
__call__ – Validate pandas DataFrame or Series.