Mypy
new in 0.8.0
Pandera integrates with mypy to provide static type-linting of dataframes, relying on pandas-stubs for typing information.
pip install pandera[mypy]
Then enable the plugin in your mypy.ini or setup.cfg file:
[mypy]
plugins = pandera.mypy
Note
Mypy static type-linting is supported only for pandas dataframes.
Warning
This functionality is experimental 🧪. Since the pandas-stubs type stub annotations don’t always match the official pandas effort to support type annotations, installing the pandera[mypy] extra may yield false positives in your pandas code, many of which are documented in tests/mypy/modules.
We encourage beta users to file an issue if they find any false positives or negatives being reported by mypy.
A list of such issues can be found here.
In the example below, we define a few schemas to see how type-linting with pandera works.
from typing import cast

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    id: Series[int]
    name: Series[str]

class SchemaOut(pa.SchemaModel):
    age: Series[int]

class AnotherSchema(pa.SchemaModel):
    id: Series[int]
    first_name: Series[str]
The mypy linter will complain if the output type of the function body doesn’t match the function’s return signature.
def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";  # noqa
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa

def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30)  # mypy error
    # error: Incompatible return value type (got "pandas.core.frame.DataFrame",
    # expected "pandera.typing.pandas.DataFrame[SchemaOut]")  [return-value]
It’ll also complain if the input type doesn’t match the expected input type. Note that we’re using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the SchemaModel type variable on initialization.
schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})
fn(schema_df) # mypy okay
fn(pandas_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame"; # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
fn(another_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]" [arg-type]
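The last error above follows from how mypy treats parametrized generic types in general: two parametrizations of the same generic class with different type arguments are distinct, incompatible types. The sketch below illustrates this with standard-library typing only. `TypedFrame`, `SchemaA`, `SchemaB`, and `consume` are hypothetical names made up for this illustration; this is an analogy, not pandera's implementation.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class SchemaA:
    """Hypothetical stand-in for one schema class."""


class SchemaB:
    """Hypothetical stand-in for a different schema class."""


class TypedFrame(Generic[T]):
    """Hypothetical container parametrized by a schema type."""

    def __init__(self, rows: list) -> None:
        self.rows = rows


def consume(frame: TypedFrame[SchemaA]) -> int:
    """Accept only frames parametrized with SchemaA, as far as mypy is concerned."""
    return len(frame.rows)


frame_a = TypedFrame[SchemaA]([{"id": 1}])
frame_b = TypedFrame[SchemaB]([{"id": 1}])

count_a = consume(frame_a)  # accepted by mypy
count_b = consume(frame_b)  # rejected by mypy (arg-type), but runs without error
```

The type parameter is only checked statically here: Python does not enforce it when the function is called, which is why the second call still executes.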
To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay
Note
If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it’s already been initialized with the DataFrame[Schema](**data) syntax.
Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type.
def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay
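Keep in mind that typing.cast() is only a hint to the type checker. The short sketch below uses the standard library alone to show that cast() hands back its argument unchanged, without converting or validating it. The variable names are made up for this illustration.

```python
from typing import List, cast

raw = {"id": [1], "name": ["foo"]}

# cast() tells the type checker to treat `raw` as a List[int],
# but at runtime it simply returns the very same object.
as_list = cast(List[int], raw)

print(as_list is raw)          # True
print(type(as_list).__name__)  # dict
```

Because nothing is checked at runtime, a cast on its own gives no guarantee that the dataframe actually conforms to the schema named in the annotation.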
Limitations
An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there’s no way for mypy to know whether a mutated instance of a SchemaModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.
Consider the examples below:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay

@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    out = df.assign(age=30).pipe(DataFrame[SchemaOut])
    out.drop(["age"], axis=1, inplace=True)
    return out  # okay for mypy, pandera raises error

@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(foo=30).iloc[:3]  # okay for mypy, pandera raises error
Even though the outputs of these functions are incorrect, mypy doesn’t catch the error during static type-linting, but pandera will raise a SchemaError or SchemaErrors exception at runtime, depending on whether you’re doing lazy validation or not.
@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(
        DataFrame[SchemaOut], df
    )  # okay for mypy, pandera raises error