DataFrame Schemas¶
The DataFrameSchema class enables the specification of a schema that verifies the columns and index of a pandas DataFrame object. A DataFrameSchema consists of Column objects and, if applicable, an Index.
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index
schema = DataFrameSchema(
    {
        "column1": Column(int),
        "column2": Column(float, Check(lambda s: s < -1.2)),
        # you can provide a list of validators
        "column3": Column(str, [
            Check(lambda s: s.str.startswith("value")),
            Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
        ]),
    },
    index=Index(int),
    strict=True,
    coerce=True,
)
You can refer to DataFrame Models to see how to define dataframe schemas using the alternative pydantic/dataclass-style syntax.
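For illustration, a sketch of roughly the same schema in the class-based style; this assumes a recent pandera version where DataFrameModel and the Field API are available, and it omits the string-splitting check, which would require a custom check method:
import pandera as pa
from pandera.typing import Index, Series

class Model(pa.DataFrameModel):
    # a sketch of the class-based equivalent of the schema above
    column1: Series[int]
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value")
    idx: Index[int]

    class Config:
        strict = True
        coerce = True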
Column Validation¶
A Column specifies the properties of a column in a dataframe object. It can optionally be verified for its data type, null values, or duplicate values. The column can be coerced into the specified type, and the required parameter controls whether or not the column is allowed to be missing.
Similarly to pandas, the data type can be specified as one of the following (each form is shown in the sketch below):
- a string alias, as long as it is recognized by pandas.
- a python type: int, float, bool, str.
- a pandas extension type: it can be an instance (e.g. pd.CategoricalDtype(["a", "b"])) or a class (e.g. pandas.CategoricalDtype) if it can be initialized with default values.
- a pandera DataType: it can also be an instance or a class.
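A minimal sketch of these alternatives, assuming only pandas and pandera are installed:
import pandas as pd
import pandera as pa

# equivalent ways of specifying a column's data type
pa.Column("int64")                          # string alias recognized by pandas
pa.Column(int)                              # python builtin type
pa.Column(pd.CategoricalDtype(["a", "b"]))  # pandas extension type instance
pa.Column(pd.CategoricalDtype)              # pandas extension type class
pa.Column(pa.Int64)                         # pandera DataType class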
Important
You can learn more about how data type validation works in Data Type Validation.
Column checks allow the DataFrame's values to be checked against a user-provided function. Check objects also support grouping by a different column so that the user can make assertions about subsets of the column of interest, as in the sketch below.
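For example, a minimal sketch of a grouped check; the column and group names here are illustrative. With groupby, the check function receives a dictionary mapping each group key to the corresponding subset of the validated column:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "height": pa.Column(
        float,
        # assert that the mean height of group "A" exceeds that of group "B"
        pa.Check(lambda g: g["A"].mean() > g["B"].mean(), groupby="group"),
    ),
    "group": pa.Column(str, pa.Check.isin(["A", "B"])),
})

df = pd.DataFrame({
    "height": [5.6, 6.4, 4.0, 7.1],
    "group": ["A", "A", "B", "A"],
})
schema.validate(df)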
Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format. See Hypothesis Testing for more details.
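As a hedged sketch of a column hypothesis (the built-in two-sample t-test requires scipy to be installed; column and group names are illustrative):
import pandera as pa
from pandera import Column, DataFrameSchema, Hypothesis

schema = DataFrameSchema({
    "height": Column(
        float,
        # test that group "A" heights are greater than group "B" heights
        Hypothesis.two_sample_ttest(
            sample1="A",
            sample2="B",
            groupby="group",
            relationship="greater_than",
            alpha=0.05,
        ),
    ),
    "group": Column(str),
})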
Null Values in Columns¶
By default, SeriesSchema/Column objects assume that values are not nullable. In order to accept null values, you need to explicitly specify nullable=True, or else you'll get an error.
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({"column1": [5, 1, np.nan]})
non_null_schema = DataFrameSchema({
    "column1": Column(float, Check(lambda x: x > 0))
})

try:
    non_null_schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
non-nullable series 'column1' contains null values:
2 NaN
Name: column1, dtype: float64
Setting nullable=True allows for null values in the corresponding column.
null_schema = DataFrameSchema({
    "column1": Column(float, Check(lambda x: x > 0), nullable=True)
})
null_schema.validate(df)
   column1
0      5.0
1      1.0
2      NaN
To learn more about how the nullable check interacts with data type checks, see Data Type Validation.
Coercing Types on Columns¶
If you specify Column(dtype, ..., coerce=True) as part of the DataFrameSchema definition, calling schema.validate will first coerce the column into the specified dtype before applying validation checks.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column1": [1, 2, 3]})
schema = DataFrameSchema({"column1": Column(str, coerce=True)})
validated_df = schema.validate(df)
assert isinstance(validated_df.column1.iloc[0], str)
Note
Note the special case of integer columns not supporting nan values. In this case, schema.validate will complain if coerce == True and null values are allowed in the column.
df = pd.DataFrame({"column1": [1., 2., 3, np.nan]})
schema = DataFrameSchema({
    "column1": Column(int, coerce=True, nullable=True)
})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
Error while coercing 'column1' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
   index  failure_case
0      3           NaN
The best way to handle this case is to simply specify the column as a Float or Object.
schema_object = DataFrameSchema({
    "column1": Column(object, coerce=True, nullable=True)
})
schema_float = DataFrameSchema({
    "column1": Column(float, coerce=True, nullable=True)
})
print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)
column1 object
dtype: object
column1 float64
dtype: object
If you want to coerce all of the columns specified in the DataFrameSchema, you can specify the coerce argument with DataFrameSchema(..., coerce=True). Note that this will have the effect of overriding any coerce=False arguments specified at the Column or Index level, as in the sketch below.
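A minimal sketch of schema-level coercion; the column names are illustrative:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema

# schema-level coerce=True coerces every column in the schema,
# overriding the coerce=False set on column1
schema = DataFrameSchema(
    {
        "column1": Column(str, coerce=False),
        "column2": Column(float),
    },
    coerce=True,
)

df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["1.0", "2.0", "3.0"]})
validated = schema.validate(df)
assert validated["column1"].dtype == object  # integers coerced to str
assert validated["column2"].dtype == float   # strings coerced to float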
Required Columns¶
By default all columns specified in the schema are required, meaning
that if a column is missing in the input DataFrame an exception will be
thrown. If you want to make a column optional, specify required=False
in the column constructor:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column2": ["hello", "pandera"]})
schema = DataFrameSchema({
    "column1": Column(int, required=False),
    "column2": Column(str)
})
schema.validate(df)
   column2
0    hello
1  pandera
Since required=True by default, missing columns would raise an error:
schema = DataFrameSchema({
    "column1": Column(int),
    "column2": Column(str),
})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
column 'column1' not in dataframe. Columns in dataframe: ['column2']
Stand-alone Column Validation¶
In addition to being used in the context of a DataFrameSchema, Column objects can also be used to validate columns in a dataframe on their own:
import pandas as pd
import pandera as pa
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
})
column1_schema = pa.Column(int, name="column1")
column2_schema = pa.Column(str, name="column2")
# pass the dataframe as an argument to the Column object callable
df = column1_schema(df)
validated_df = column2_schema(df)
# or explicitly use the validate method
df = column1_schema.validate(df)
validated_df = column2_schema.validate(df)
# use the DataFrame.pipe method to validate two columns
df.pipe(column1_schema).pipe(column2_schema)
   column1 column2
0        1       a
1        2       b
2        3       c
For multi-column use cases, the DataFrameSchema is still recommended, but if you have one or a small number of columns to verify, using Column objects by themselves is appropriate.
Column Regex Pattern Matching¶
In the case that your dataframe has multiple columns that share common statistical properties, you might want to specify a regex pattern that matches a set of meaningfully grouped columns that have str names.
import numpy as np
import pandas as pd
import pandera as pa
categories = ["A", "B", "C"]
np.random.seed(100)
dataframe = pd.DataFrame({
    "cat_var_1": np.random.choice(categories, size=100),
    "cat_var_2": np.random.choice(categories, size=100),
    "num_var_1": np.random.uniform(0, 10, size=100),
    "num_var_2": np.random.uniform(20, 30, size=100),
})

schema = pa.DataFrameSchema({
    "num_var_.+": pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    "cat_var_.+": pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})

schema.validate(dataframe).head()
  cat_var_1 cat_var_2  num_var_1  num_var_2
0         A         A   6.804147  24.743304
1         A         C   3.684308  22.774633
2         A         C   5.911288  28.416588
3         C         A   4.790627  21.951250
4         C         B   4.504166  28.563142
You can also regex pattern match on pd.MultiIndex columns:
np.random.seed(100)
dataframe = pd.DataFrame({
    ("cat_var_1", "y1"): np.random.choice(categories, size=100),
    ("cat_var_2", "y2"): np.random.choice(categories, size=100),
    ("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
    ("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
})

schema = pa.DataFrameSchema({
    ("num_var_.+", "x.+"): pa.Column(
        float,
        checks=pa.Check.greater_than_or_equal_to(0),
        regex=True,
    ),
    ("cat_var_.+", "y.+"): pa.Column(
        pa.Category,
        checks=pa.Check.isin(categories),
        coerce=True,
        regex=True,
    ),
})
schema.validate(dataframe).head()
  cat_var_1 cat_var_2 num_var_1 num_var_2
         y1        y2        x1        x2
0         A         A  6.804147  4.743304
1         A         C  3.684308  2.774633
2         A         C  5.911288  8.416588
3         C         A  4.790627  1.951250
4         C         B  4.504166  8.563142
Handling Dataframe Columns not in the Schema¶
By default, columns that aren't specified in the schema aren't checked. If you want to check that the DataFrame only contains columns in the schema, specify strict=True:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
schema = DataFrameSchema(
    {"column1": Column(int)},
    strict=True)

df = pd.DataFrame({"column2": [1, 2, 3]})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
column 'column2' not in DataFrameSchema {'column1': <Schema Column(name=column1, type=DataType(int64))>}
Alternatively, if your DataFrame contains columns that are not in the schema, and you would like these to be dropped on validation, you can specify strict='filter'.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
df = pd.DataFrame({"column1": ["drop", "me"], "column2": ["keep", "me"]})
schema = DataFrameSchema({"column2": Column(str)}, strict='filter')
schema.validate(df)
  column2
0    keep
1      me
Validating the order of the columns¶
For some applications the order of the columns is important. For example:
- If you want to use selection by position instead of the more common selection by label.
- Machine learning: many ML libraries will cast a DataFrame to numpy arrays, for which order becomes crucial.
To validate the order of the DataFrame columns, specify ordered=True:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={"a": pa.Column(int), "b": pa.Column(int)}, ordered=True
)
df = pd.DataFrame({"b": [1], "a": [1]})

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
column 'b' out-of-order
Validating the joint uniqueness of columns¶
In some cases you might want to ensure that a group of columns are unique:
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
columns '('a', 'c')' not unique:
   a  c
0  1  3
1  1  3
To control how unique errors are reported, the report_duplicates argument accepts:
- exclude_first: (default) report all duplicates except the first occurrence
- exclude_last: report all duplicates except the last occurrence
- all: report all duplicates
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
    report_duplicates="exclude_first",
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
columns '('a', 'c')' not unique:
   a  c
1  1  3
Adding missing columns¶
When loading raw data into a form that's ready for data processing, it's often useful to have guarantees that the columns specified in the schema are present, even if they're missing from the raw data. This is where it's useful to specify add_missing_columns=True in your schema definition.
When you call schema.validate(data), the schema will add any missing columns to the dataframe, defaulting to the default value if supplied at the column level, or to NaN if the column is nullable.
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={
        "a": pa.Column(int),
        "b": pa.Column(int, default=1),
        "c": pa.Column(float, nullable=True),
    },
    add_missing_columns=True,
    coerce=True,
)
df = pd.DataFrame({"a": [1, 2, 3]})
schema.validate(df)
   a  b   c
0  1  1 NaN
1  2  1 NaN
2  3  1 NaN
Index Validation¶
You can also specify an Index in the DataFrameSchema.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Index, Check
schema = DataFrameSchema(
    columns={"a": Column(int)},
    index=Index(
        str,
        Check(lambda x: x.str.startswith("index_"))))

df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["index_1", "index_2", "index_3"])

schema.validate(df)
         a
index_1  1
index_2  2
index_3  3
In the case that the DataFrame index doesn't pass the Check:
df = pd.DataFrame(
    data={"a": [1, 2, 3]},
    index=["foo1", "foo2", "foo3"]
)

try:
    schema.validate(df)
except pa.errors.SchemaError as exc:
    print(exc)
Index 'None' failed element-wise validator number 0: <Check <lambda>> failure cases: foo1, foo2, foo3
MultiIndex Validation¶
pandera also supports multi-index column and index validation.
MultiIndex Columns¶
Specifying multi-index columns follows the pandas syntax of specifying tuples for each level in the index hierarchy:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Index
schema = DataFrameSchema({
    ("foo", "bar"): Column(int),
    ("foo", "baz"): Column(str)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

schema.validate(df)
  foo
  bar baz
0   1   a
1   2   b
2   3   c
MultiIndex Indexes¶
The MultiIndex class allows you to define multi-index indexes by composing a list of pandera.Index objects.
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={"column1": pa.Column(int)},
    index=pa.MultiIndex([
        pa.Index(str,
            pa.Check(lambda s: s.isin(["foo", "bar"])),
            name="index0"),
        pa.Index(int, name="index1"),
    ])
)

df = pd.DataFrame(
    data={"column1": [1, 2, 3]},
    index=pd.MultiIndex.from_arrays(
        [["foo", "bar", "foo"], [0, 1, 2]],
        names=["index0", "index1"],
    )
)

schema.validate(df)
               column1
index0 index1
foo    0             1
bar    1             2
foo    2             3
Get Pandas Data Types¶
Pandas provides a dtype parameter for casting a dataframe to a specific dtype schema. DataFrameSchema provides a dtypes property which returns a dictionary whose keys are column names and values are DataType objects.
Some examples of where this can be provided to pandas are:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema(
    columns={
        "column1": pa.Column(int),
        "column2": pa.Column(pa.Category),
        "column3": pa.Column(bool)
    },
)

df = (
    pd.DataFrame.from_dict(
        {
            "a": {"column1": 1, "column2": "valueA", "column3": True},
            "b": {"column1": 1, "column2": "valueB", "column3": True},
        },
        orient="index",
    )
    .astype({col: str(dtype) for col, dtype in schema.dtypes.items()})
    .sort_index(axis=1)
)

schema.validate(df)
   column1 column2  column3
a        1  valueA     True
b        1  valueB     True
DataFrameSchema Transformations¶
Once you've defined a schema, you can then make modifications to it, both on the schema level (such as adding or removing columns and setting or resetting the index) and on the column level (such as changing the data type or checks).
This is useful for re-using schema objects in a data pipeline when additional computation has been done on a dataframe, where the column objects may have changed or perhaps where additional checks may be required.
import pandas as pd
import pandera as pa
data = pd.DataFrame({"col1": range(1, 6)})
schema = pa.DataFrameSchema(
    columns={"col1": pa.Column(int, pa.Check(lambda s: s >= 0))},
    strict=True)

transformed_schema = schema.add_columns({
    "col2": pa.Column(str, pa.Check(lambda s: s == "value")),
    "col3": pa.Column(float, pa.Check(lambda x: x == 0.0)),
})

# validate original data
data = schema.validate(data)

# transformation
transformed_data = data.assign(col2="value", col3=0.0)

# validate transformed data
transformed_schema.validate(transformed_data)
   col1   col2  col3
0     1  value   0.0
1     2  value   0.0
2     3  value   0.0
3     4  value   0.0
4     5  value   0.0
Similarly, if you want dropped columns to be explicitly validated in a data pipeline:
import pandera as pa
schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(int, pa.Check(lambda s: s >= 0)),
        "col2": pa.Column(str, pa.Check(lambda x: x <= 0)),
        "col3": pa.Column(object, pa.Check(lambda x: x == 0)),
    },
    strict=True,
)

schema.remove_columns(["col2", "col3"])
<Schema DataFrameSchema(columns={'col1': <Schema Column(name=col1, type=DataType(int64))>}, checks=[], parsers=[], index=None, coerce=False, dtype=None, strict=True, name=None, ordered=False, unique_column_names=False, metadata=None, add_missing_columns=False)>
If during the course of a data pipeline one of your columns is moved into the index, you can simply update the initial input schema using the set_index() method to create a schema for the pipeline output.
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index
schema = DataFrameSchema(
    {
        "column1": Column(int),
        "column2": Column(float)
    },
    index=Index(int, name="column3"),
    strict=True,
    coerce=True,
)

schema.set_index(["column1"], append=True)
<Schema DataFrameSchema(columns={'column2': <Schema Column(name=column2, type=DataType(float64))>}, checks=[], parsers=[], index=<Schema MultiIndex(indexes=[<Schema Index(name=column3, type=DataType(int64))>, <Schema Index(name=column1, type=DataType(int64))>], coerce=False, strict=False, name=None, ordered=True)>, coerce=True, dtype=None, strict=True, name=None, ordered=False, unique_column_names=False, metadata=None, add_missing_columns=False)>
The available methods for altering the schema are:
- add_columns()
- remove_columns()
- update_columns()
- rename_columns()
- set_index()
- reset_index()
A short sketch of update_columns() and rename_columns() follows this list.
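As a hedged sketch, applied to the schema defined above (the updated dtype, check, and new column name are illustrative):
# each method returns a new schema, leaving the original unchanged
updated_schema = schema.update_columns(
    {"column2": {"dtype": pa.Int64, "checks": pa.Check.ge(0)}}
)
renamed_schema = updated_schema.rename_columns({"column2": "col2"})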