AetherFlow is a Python library that harnesses the power of Large Language Models (LLMs) to automate the transformation of Pandas DataFrames, making them conform to a predefined Pandera schema. This library provides an intelligent agent that analyzes schema discrepancies and applies a set of predefined tools to iteratively clean and reshape your data.
- Schema-Driven Transformation: Define your desired schema using Pandera, and let AetherFlow handle the rest.
- Instruction via Descriptions: Add a description to your DataFrameSchema and each Column to provide natural language instructions directly to the AI agent, guiding complex transformations.
- Intelligent AI Agent: An LLM-based agent analyzes DataFrame validation errors and decides the most appropriate action to take.
- Automated Error Resolution: Automatically identifies problems with column names, data types, and content, and applies fixes for them.
- Built-in Tools: Includes essential tools for DataFrame manipulation:
  - Modification: rename_column, cast_column, drop_column, fill_null_values, replace_values.
  - Cleaning & Formatting: regex_replace.
  - Inspection: get_dataframe_head, get_column_value_counts, get_schema_summary, get_descriptive_statistics.
- Iterative Process: The agent continues to iterate and apply changes until the DataFrame conforms to the schema or a maximum number of iterations is reached.
- Extensible: Easily extensible with new custom tools to handle more complex transformations.
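As a rough illustration of what tools like those listed above might do, here is a minimal pandas sketch. The function names come from the tool list, but the signatures, the `"datetime"` dtype shortcut, and the bodies are assumptions made for this example, not AetherFlow's actual implementations.

```python
import pandas as pd


def rename_column(df, old_name, new_name):
    """Return a copy of df with one column renamed (assumed signature)."""
    return df.rename(columns={old_name: new_name})


def regex_replace(df, column, pattern, replacement):
    """Return a copy of df with a regex substitution applied to a string column."""
    out = df.copy()
    out[column] = out[column].str.replace(pattern, replacement, regex=True)
    return out


def cast_column(df, column, dtype):
    """Return a copy of df with one column cast; "datetime" is a hypothetical shortcut."""
    out = df.copy()
    if dtype == "datetime":
        out[column] = pd.to_datetime(out[column], errors="coerce")
    else:
        out[column] = out[column].astype(dtype)
    return out


raw = pd.DataFrame({"reg_date": ["2023-01-15", "2023/02/20"]})

# Normalize the separator, rename to the target name, then cast to datetime.
cleaned = regex_replace(raw, "reg_date", "/", "-")
cleaned = rename_column(cleaned, "reg_date", "registration_date")
cleaned = cast_column(cleaned, "registration_date", "datetime")
print(cleaned.dtypes)
```

Each function returns a new DataFrame instead of mutating its input, which keeps every step easy to record and to retry if a later validation fails.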
To use AetherFlow, you'll need Python 3.9+ and the following packages:
pip install langgraph pandera pytest
You will also need to configure a Large Language Model compatible with LangChain (e.g., OpenAI, Anthropic, Google Gemini, etc.). Ensure your API keys are correctly set up as environment variables.
For example, with OpenAI:
pip install langchain-openai
export OPENAI_API_KEY="your_openai_api_key_here"
Here is an example of how to use DataTransformerAgent to clean and standardize a DataFrame:
from aetherflow.dataframe_agent import DataTransformerAgent
from langchain_openai import ChatOpenAI
import pandera.pandas as pa
import pandas as pd
# Configure the LLM
llm = ChatOpenAI(
    model="gpt-4.1-mini"
)
# Define the target Pandera schema
target_schema = pa.DataFrameSchema(
    {
        "full_name": pa.Column(pa.String, nullable=True, description="User's full name. Must be in UPPERCASE."),
        "email": pa.Column(pa.String, checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'), nullable=True, description="User's email address. Must be a valid format."),
        "registration_date": pa.Column(pa.DateTime, nullable=False, description="The date the user registered."),
        "is_active": pa.Column("boolean", nullable=True, description="Indicates if the user account is active.")
    },
    description="Schema for user data. All columns must match exactly. The goal is to clean and standardize the raw data.",
    strict=True,
    ordered=True
)
# Build the initial (raw) DataFrame
initial_df = pd.DataFrame({
    'id': [1, 2, '3', 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan'],
    'email_address': ['alice@example.com', 'bob@work.com', 'charlie@mail.net', 'diana@web.org', 'invalid-email'],
    'reg_date': ['2023-01-15', '2023/02/20', '2023-03-05', '2023-04-10', '2024-05-20'],
    'active': [True, 'False', 1, '0', 'true']
})
# Initialize the agent
transformer = DataTransformerAgent(llm, initial_df, target_schema, memory_depth=5, verbose=True)
# Execute the agent
transformed_df = transformer.execute(max_iterations=20)
# Print the transformed DataFrame
print(transformed_df.head())
  full_name              email  registration_date  is_active
0     Alice  alice@example.com         2023-01-15       True
1       Bob       bob@work.com         2020-01-01      False
2   Charlie   charlie@mail.net         2023-03-05       True
3     Diana      diana@web.org         2023-04-10      False
4     Ethan               <NA>         2024-05-20       True
AetherFlow is built using langgraph to create a state graph that orchestrates the transformation process:
- Initial State: The graph begins with the initial DataFrame, the target Schema, and an empty message history.
- "Agent" Node:
  - It validates the current DataFrame against the target Schema.
  - If validation is successful, the process ends.
  - If there are errors, the agent analyzes the discrepancies (missing/extra column names, incorrect data types, content validation errors).
  - It builds a detailed prompt for the LLM, including the action history, current errors, and the descriptions provided in the Pandera schema (for both the overall DataFrameSchema and each Column). These descriptions act as direct instructions for the LLM.
  - The LLM (interacting with LangChain Tools) decides which tools to invoke to solve the problems. It can address multiple problems in a single step.
- "Tool" Node:
  - Executes the tool selected by the LLM with the specified arguments.
  - Updates the DataFrame with the result of the operation.
  - Records the outcome (success or error) in the message history.
- Loop: After a tool is executed, control returns to the "Agent" node to re-evaluate the updated DataFrame and decide the next step. This cycle continues until the DataFrame is valid or the iteration limit is reached.
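The control flow described above can be sketched in plain Python. This is a simplified stand-in, not AetherFlow's code: it uses an ordinary loop instead of a langgraph state graph, a hand-written column check instead of Pandera validation, and a rule-based `decide` function in place of the LLM. All names here (`validate`, `decide`, `run`, `RENAMES`, `TOOLS`, and the tool argument names) are hypothetical.

```python
import pandas as pd

EXPECTED_COLUMNS = ["full_name", "is_active"]
RENAMES = {"name": "full_name", "active": "is_active"}  # assumed mapping for the demo

# Tool registry: each tool takes a DataFrame plus arguments and returns a new DataFrame.
TOOLS = {
    "rename_column": lambda df, old_name, new_name: df.rename(columns={old_name: new_name}),
    "drop_column": lambda df, column: df.drop(columns=[column]),
}


def validate(df):
    """Return a list of error strings; an empty list means the DataFrame is valid."""
    errors = [f"missing:{c}" for c in EXPECTED_COLUMNS if c not in df.columns]
    errors += [f"extra:{c}" for c in df.columns if c not in EXPECTED_COLUMNS]
    return errors


def decide(errors):
    """Rule-based stand-in for the LLM: choose one tool call for the first extra column."""
    for err in errors:
        kind, col = err.split(":", 1)
        if kind == "extra":
            if col in RENAMES:
                return "rename_column", {"old_name": col, "new_name": RENAMES[col]}
            return "drop_column", {"column": col}
    return None


def run(df, max_iterations=20):
    history = []
    for _ in range(max_iterations):
        errors = validate(df)  # "Agent" step: validate the current DataFrame
        if not errors:
            break  # valid, so the process ends
        action = decide(errors)
        if action is None:
            break  # nothing the stand-in knows how to fix
        name, args = action
        try:
            df = TOOLS[name](df, **args)  # "Tool" step: apply the chosen tool
            history.append((name, args, "success"))
        except Exception as exc:
            history.append((name, args, f"error: {exc}"))
    return df, history


raw = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "active": [True, False]})
result, history = run(raw)
print(list(result.columns))
for entry in history:
    print(entry)
```

Recording both successes and errors in the history mirrors the message history described above, so a later decision step can see which actions were already attempted.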
Contributions are welcome! If you have ideas for new tools, improvements to the agent's logic, or bug fixes, feel free to open a pull request or an issue.