Supported DataFrame Libraries

Pandera started out as a pandas-specific dataframe validation library, and its core functionality will continue to support pandas. As pandera's adoption has grown, however, it has become clear that it can be a much more powerful tool if it also supports other dataframe-like formats.

DataFrame Library Support

Pandera supports validation of the following DataFrame libraries:

Pandas

Validate pandas dataframes. This is the original dataframe library supported by pandera.

Polars

Validate Polars dataframes, the blazingly fast dataframe library.

Pyspark SQL

A data processing library for large-scale data.

Validating Pandas-like DataFrames

Pandera provides multiple ways of scaling up data validation to pandas-like dataframes that don't fit into memory. Fortunately, pandera doesn't have to reinvent the wheel. Standing on the shoulders of giants, it integrates with the existing ecosystem of libraries that let you validate out-of-memory pandas-like dataframes. The following libraries are supported via pandera's pandas validation backend:

Dask

Apply pandera schemas to Dask dataframe partitions.

Modin

A pandas drop-in replacement, distributed using a Ray or Dask backend.

Pyspark Pandas

The pandas-like interface exposed by pyspark.

Domain-specific Data Validation

The pandas ecosystem includes libraries for domain-specific data manipulation, and by extension pandera can provide access to the data types, methods, and data container types specific to these libraries.

GeoPandas

An extension of pandas that adds geospatial data processing capabilities.

Alternative Acceleration Frameworks

Pandera works with other dataframe-agnostic libraries that allow for distributed dataframe validation:

Fugue

Apply pandera schemas to distributed dataframe partitions with Fugue.

Note

Don't see a library that you want supported? Check the GitHub issues to see if that library is on the roadmap. If it isn't, open a new issue to request support for it!