Data Format Conversion#
new in 0.9.0
The class-based API provides configuration options for converting data to/from
supported serialization formats in the context of
check_types()
-decorated functions.
Note
Currently, pandera.typing.pandas.DataFrame
is the only data
type that supports this feature.
Consider this simple example:
import pandera as pa
from pandera.typing import DataFrame, Series
class InSchema(pa.DataFrameModel):
str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
int_col: Series[int]
class OutSchema(InSchema):
float_col: pa.typing.Series[float]
@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[OutSchema]:
return df.assign(float_col=1.1)
With the schema type annotations and
check_types()
decorator, the transform
function validates DataFrame inputs and outputs according to the InSchema
and OutSchema
definitions.
But what if your input data is serialized in parquet format, and you want to
read it into memory, validate the DataFrame, and then pass it to a downstream
function for further analysis? Similarly, what if you want the output of
transform
to be a list of dictionary records instead of a pandas DataFrame?
The to/from_format
Configuration Options#
To easily fulfill the use cases described above, you can implement the
read/write logic by hand, or you can configure schemas to do so. We can first
define a subclass of InSchema
with additional configuration so that our
transform
function can read data directly from parquet files or buffers:
class InSchemaParquet(InSchema):
class Config:
from_format = "parquet"
Then, we define subclass of OutSchema
to specify that transform
should output a list of dictionaries representing the rows of the output
dataframe.
class OutSchemaDict(OutSchema):
class Config:
to_format = "dict"
to_format_kwargs = {"orient": "records"}
Note that the {to/from}_format_kwargs
configuration option should be
supplied with a dictionary of key-word arguments to be passed into the
respective pandas {to/from}_format
method.
Finally, we redefine our transform
function:
@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
return df.assign(float_col=1.1)
We can test this out using a buffer to store the parquet file.
Note
A string or path-like object representing the filepath to a parquet file
would also be a valid input to transform
.
import io
import json
buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)
dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))
[
{
"str_col": "a",
"int_col": 0,
"float_col": 1.1
},
{
"str_col": "b",
"int_col": 1,
"float_col": 1.1
},
{
"str_col": "c",
"int_col": 2,
"float_col": 1.1
}
]
Custom Converters with Callables#
In addition to specifying a literal string argument for from_format
a
generic callable that returns a pandas dataframe can be passed. For example,
pd.read_excel
, pd.read_sql
, or pd.read_gbq
. Depending on the function
passed, some of the kwargs arguments may be required rather than optional in
from_format_kwargs
(pd.read_sql
requires a connection object).
A callable can also be an argument for the to_format
parameter, with the
additional, optional, to_format_buffer
parameter. Some pandas dataframe writing
methods, such as pd.to_pickle
, have a required path argument, that must be
either a string file path or a bytes object. An example for writing data to a
pickle file would be:
import tempfile
def custom_to_pickle(data, *args, **kwargs):
return data.to_pickle(*args, **kwargs)
def custom_to_pickle_buffer():
"""Create a named temporary file handle to write the pickle file."""
return tempfile.NamedTemporaryFile()
class OutSchemaPickleCallable(OutSchema):
class Config:
to_format = custom_to_pickle
# If provided, the output of this function will be supplied as
# the first positional argument to the ``to_format`` function.
to_format_buffer = custom_to_pickle_buffer
In this example, we use a custom_to_pickle_buffer
function as the
to_format_buffer
property, which returns a tempfile.NamedTemporaryFile()
.
This will be supplied as a positional argument to the custom_to_pickle
function.
The full set of configuration options are:
Format |
Argument |
---|---|
dict |
“dict” |
csv |
“csv” |
json |
“json” |
feather |
“feather” |
parquet |
“parquet” |
pickle |
“pickle” |
Callable |
Callable |
Takeaway#
Data Format Conversion using the {to/from}_format
configuration option
can modify the behavior of check_types()
-decorated
functions to convert input data from a particular serialization format into
a dataframe. Additionally, you can convert the output data from a dataframe to
potentially another format.
This dovetails well with the FastAPI Integration for validating the inputs and outputs of app endpoints.