Mypy
new in 0.8.0
Pandera integrates with mypy to provide static type-linting of dataframes, relying on pandas-stubs for typing information.
pip install pandera[mypy]
Then enable the plugin in your mypy.ini or setup.cfg file:
[mypy]
plugins = pandera.mypy
Note
Mypy static type-linting is supported for only pandas dataframes.
Warning
This functionality is experimental 🧪. Since the pandas-stubs type stub annotations don't always match the official pandas effort to support type annotations, installing the pandera[mypy] extra may yield false positives in your pandas code, many of which are documented in tests/mypy/modules in the pandera repository.
We encourage you to file an issue if you find any false positives or negatives being reported by mypy. A list of such issues can be found in the pandera issue tracker. We'll most likely have to escalate these to the official pandas-stubs issues.
Also, be aware that the latest pandas-stubs versions only support Python 3.8+. So, if you are using Python 3.7, you will not face an error when installing this package, but pip will install an older version of pandas-stubs with outdated type annotations.
In the example below, we define a few schemas to see how type-linting with pandera works.
from typing import cast

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    id: Series[int]
    name: Series[str]


class SchemaOut(pa.DataFrameModel):
    age: Series[int]


class AnotherSchema(pa.DataFrameModel):
    id: Series[int]
    first_name: Series[str]
The mypy linter will complain if the output type of the function body doesn't match the function's return signature.
def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_pipe_incorrect_type(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[AnotherSchema])  # mypy error
    # error: Argument 1 to "pipe" of "NDFrame" has incompatible type "Type[DataFrame[Any]]";  # noqa
    # expected "Union[Callable[..., DataFrame[SchemaOut]], Tuple[Callable[..., DataFrame[SchemaOut]], str]]"  [arg-type]  # noqa


def fn_assign_copy(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30)  # mypy error
    # error: Incompatible return value type (got "pandas.core.frame.DataFrame",
    # expected "pandera.typing.pandas.DataFrame[SchemaOut]")  [return-value]
It'll also complain if the input type doesn't match the expected input type.
Note that we're using the pandera.typing.pandas.DataFrame generic type to define dataframes that are validated against the DataFrameModel type variable on initialization.
schema_df = DataFrame[Schema]({"id": [1], "name": ["foo"]})
pandas_df = pd.DataFrame({"id": [1], "name": ["foo"]})
another_df = DataFrame[AnotherSchema]({"id": [1], "first_name": ["foo"]})
fn(schema_df) # mypy okay
fn(pandas_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "pandas.core.frame.DataFrame"; # noqa
# expected "pandera.typing.pandas.DataFrame[Schema]" [arg-type]
fn(another_df) # mypy error
# error: Argument 1 to "fn" has incompatible type "DataFrame[AnotherSchema]";
# expected "DataFrame[Schema]" [arg-type]
To make mypy happy with respect to the return type, you can either initialize a dataframe of the expected type:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay
Note
If you use the approach above with the check_types() decorator, pandera will do its best not to validate the dataframe twice if it's already been initialized with the DataFrame[Schema](**data) syntax.
Or use typing.cast() to indicate to mypy that the return value of the function is of the correct type.
def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay
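Keep in mind that typing.cast() only informs the type checker: at runtime it returns its argument unchanged and performs no validation. The following is a minimal standard-library sketch of that behavior. The Columns alias and the plain dict are hypothetical stand-ins for a dataframe type and are not part of pandera.

```python
from typing import Dict, List, cast

# Hypothetical stand-in for a dataframe type: column name -> list of values.
Columns = Dict[str, List[int]]

raw = {"id": [1, 2, 3]}  # note: there is no "age" column here

# cast() tells the type checker to treat `raw` as `Columns`...
casted = cast(Columns, raw)

# ...but at runtime it is the very same object; nothing is checked or changed.
same_object = casted is raw
has_age = "age" in casted

print(same_object, has_age)
```

Because the cast is purely a static annotation, any guarantee about the actual contents has to come from runtime validation.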
Limitations
An important caveat to static type-linting with pandera dataframe types is that, since pandas dataframes are mutable objects, there's no way for mypy to know whether a mutated instance of a DataFrameModel-typed dataframe has the correct contents. Fortunately, we can simply rely on the check_types() decorator to verify that the output dataframe is valid.
Consider the examples below:
def fn_pipe_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay


def fn_cast_dataframe(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return cast(DataFrame[SchemaOut], df.assign(age=30))  # mypy okay


@pa.check_types
def fn_mutate_inplace(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    out = df.assign(age=30).pipe(DataFrame[SchemaOut])
    out.drop(columns="age", inplace=True)
    return out  # okay for mypy, pandera raises error


@pa.check_types
def fn_assign_and_get_index(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(foo=30).iloc[:3]  # mypy error
Even though the outputs of these functions are incorrect, mypy doesn't catch the error during static type-linting, but pandera will raise a SchemaError or SchemaErrors exception at runtime, depending on whether you're doing lazy validation or not.
@pa.check_types
def fn_cast_dataframe_invalid(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
return cast(