Dropping Invalid Rows
New in version 0.16.0
If you wish to use the validation step to remove invalid data, you can pass the drop_invalid_rows=True argument to the schema object on creation. On schema.validate(), if a data-level check fails, the row that caused the failure is removed from the dataframe that is returned.
drop_invalid_rows prevents data-level schema errors from being raised and instead removes the rows that cause the failures.
This functionality is available on DataFrameSchema, SeriesSchema, Column, as well as DataFrameModel schemas.
Dropping invalid rows with DataFrameSchema:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema

# Integer data, so the int dtype is satisfied and only the check can fail.
df = pd.DataFrame({"counter": [1, 2, 3]})
schema = DataFrameSchema(
    {"counter": Column(int, checks=[Check(lambda x: x >= 3)])},
    drop_invalid_rows=True,
)
# Rows where counter < 3 are dropped from the returned dataframe.
schema.validate(df, lazy=True)
Dropping invalid rows with SeriesSchema:
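import pandas as pd
import pandera as pa
from pandera import Check, SeriesSchema

# Integer data, so the int dtype is satisfied and only the check can fail.
series = pd.Series([1, 2, 3])
schema = SeriesSchema(
    int,
    checks=[Check(lambda x: x >= 3)],
    drop_invalid_rows=True,
)
# Elements below 3 are dropped from the returned series.
schema.validate(series, lazy=True)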
Dropping invalid rows with Column:
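import pandas as pd
import pandera as pa
from pandera import Check, Column

# Integer data, so the int dtype is satisfied and only the check can fail.
df = pd.DataFrame({"counter": [1, 2, 3]})
schema = Column(
    int,
    name="counter",
    drop_invalid_rows=True,
    checks=[Check(lambda x: x >= 3)],
)
# The column schema validates the "counter" column of the dataframe.
schema.validate(df, lazy=True)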
Dropping invalid rows with DataFrameModel:
import pandas as pd
import pandera as pa
from pandera import Check, DataFrameModel, Field

class MySchema(DataFrameModel):
    counter: int = Field(in_range={"min_value": 3, "max_value": 5})

    class Config:
        drop_invalid_rows = True

# Rows whose counter falls outside the 3 to 5 range are dropped.
MySchema.validate(
    pd.DataFrame({"counter": [1, 2, 3, 4, 5, 6]}), lazy=True
)
Note
In order to use drop_invalid_rows=True, lazy=True must be passed to schema.validate(). Lazy Validation enables all schema errors to be collected and raised together, meaning all invalid rows can be dropped together. This provides a clear API for ensuring the validated dataframe contains only valid data.