AetherFlow is a Python library that uses an autonomous agent to transform Pandas DataFrames so that they conform to a Pandera schema. It analyzes validation errors and applies the necessary tools to fix issues, iterating until the DataFrame adheres to the schema's rules.

Jhonnyr97/AetherFlow

AetherFlow: LLM-Powered DataFrame Transformation

AetherFlow is a Python library that harnesses the power of Large Language Models (LLMs) to automate the transformation of Pandas DataFrames, making them conform to a predefined Pandera schema. This library provides an intelligent agent that analyzes schema discrepancies and applies a set of predefined tools to iteratively clean and reshape your data.

Features

  • Schema-Driven Transformation: Define your desired schema using Pandera, and let AetherFlow handle the rest.
  • Instruction via Descriptions: Add a description to your DataFrameSchema and to each Column to give the AI agent natural-language instructions that guide complex transformations.
  • Intelligent AI Agent: An LLM-based agent analyzes DataFrame validation errors and decides the most appropriate action to take.
  • Automated Error Resolution: Automatically identifies and suggests solutions for column names, data types, and content cleaning issues.
  • Built-in Tools: Includes essential tools for DataFrame manipulation:
    • Modification: rename_column, cast_column, drop_column, fill_null_values, replace_values.
    • Cleaning & Formatting: regex_replace.
    • Inspection: get_dataframe_head, get_column_value_counts, get_schema_summary, get_descriptive_statistics.
  • Iterative Process: The agent continues to iterate and apply changes until the DataFrame conforms to the schema or a maximum number of iterations is reached.
  • Extensible: Easily extensible with new custom tools to handle more complex transformations.
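As a rough illustration of the kind of operation the modification and cleaning tools perform, here is a minimal sketch in plain pandas. It assumes that tool names such as rename_column, fill_null_values, regex_replace and cast_column map onto the pandas calls shown; the library's actual implementations and signatures may differ.

```python
import pandas as pd

# Small sample frame; the values are illustrative only.
df = pd.DataFrame({
    "name": ["Alice", None, "Charlie"],
    "reg_date": ["2023-01-15", "2023/02/20", "2023-03-05"],
})

# rename_column (assumed equivalent): align a column name with the schema.
df = df.rename(columns={"name": "full_name"})

# fill_null_values (assumed equivalent): replace missing entries.
df["full_name"] = df["full_name"].fillna("UNKNOWN")

# regex_replace (assumed equivalent): normalise date separators to "-".
df["reg_date"] = df["reg_date"].str.replace(r"[/.]", "-", regex=True)

# cast_column (assumed equivalent): convert strings to datetimes.
df["reg_date"] = pd.to_datetime(df["reg_date"], format="%Y-%m-%d")

print(df.dtypes)
```

Each step is a small, single-purpose operation, which is what lets an agent chain them one validation error at a time.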

Installation

To use AetherFlow, you'll need Python 3.9+ and the following packages:

pip install langgraph pandera pytest

You will also need to configure a Large Language Model compatible with LangChain (e.g., OpenAI, Anthropic, Google Gemini, etc.). Ensure your API keys are correctly set up as environment variables.

For example, with OpenAI:

pip install langchain-openai
export OPENAI_API_KEY="your_openai_api_key_here"
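
A missing key usually only surfaces when the first LLM call fails, so it can help to check for it before creating the client. A minimal sketch; `require_env` is a hypothetical helper and is not part of AetherFlow or LangChain:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before creating the LLM client."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of a script fails fast with a readable message instead of an authentication error later on.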

Usage

Here is an example of how to use DataTransformerAgent to clean and standardize a DataFrame:

from aetherflow.dataframe_agent import DataTransformerAgent
from langchain_openai import ChatOpenAI
import pandera.pandas as pa
import pandas as pd
# Configure the LLM
llm = ChatOpenAI(
    model="gpt-4.1-mini"
)
# Define the target Pandera schema
target_schema = pa.DataFrameSchema(
    {
        "full_name": pa.Column(pa.String, nullable=True, description="User's full name. Must be in UPPERCASE."),
        "email": pa.Column(pa.String, checks=pa.Check.str_matches(r'^[^@]+@[^@]+\.[^@]+$'), nullable=True, description="User's email address. Must be a valid format."),
        "registration_date": pa.Column(pa.DateTime, nullable=False, description="The date the user registered."),
        "is_active": pa.Column("boolean", nullable=True, description="Indicates if the user account is active.")
    },
    description="Schema for user data. All columns must match exactly. The goal is to clean and standardize the raw data.",
    strict=True,
    ordered=True
)

# Build the raw input DataFrame
initial_df = pd.DataFrame({
    'id': [1, 2, '3', 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan'],
    'email_address': ['alice@example.com', 'bob@work.com', 'charlie@mail.net', 'diana@web.org', 'invalid-email'],
    'reg_date': ['2023-01-15', '2023/02/20', '2023-03-05', '2023-04-10', '2024-05-20'],
    'active': [True, 'False', 1, '0', 'true']
})

# Initialize the agent
transformer = DataTransformerAgent(llm, initial_df, target_schema, memory_depth=5, verbose=True)

# Execute the agent
transformed_df = transformer.execute(max_iterations=20)

# Print the transformed DataFrame
print(transformed_df.head())

Example output:

  full_name              email registration_date  is_active
0     Alice  alice@example.com        2023-01-15       True
1       Bob       bob@work.com        2020-01-01      False
2   Charlie   charlie@mail.net        2023-03-05       True
3     Diana      diana@web.org        2023-04-10      False
4     Ethan               <NA>        2024-05-20       True
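
In the output above, the last row's email is <NA>. The value 'invalid-email' contains no '@', so it cannot satisfy the schema's check, and because the column is nullable, replacing it with a null is one schema-compliant fix. The sketch below reproduces that reasoning with the standard-library `re` module rather than Pandera; which fix the agent actually chooses depends on the LLM.

```python
import re

# Same pattern as the `email` column check in the schema above.
EMAIL_PATTERN = re.compile(r"^[^@]+@[^@]+\.[^@]+$")

emails = [
    "alice@example.com",
    "bob@work.com",
    "charlie@mail.net",
    "diana@web.org",
    "invalid-email",
]

# Keep values that match the pattern; replace the rest with None,
# which a nullable column accepts.
cleaned = [e if EMAIL_PATTERN.match(e) else None for e in emails]
print(cleaned)
```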

How It Works

AetherFlow is built using langgraph to create a state graph that orchestrates the transformation process:

  1. Initial State: The graph begins with the initial DataFrame, the target Schema, and an empty message history.
  2. "Agent" Node:
    • It validates the current DataFrame against the target Schema.
    • If validation is successful, the process ends.
    • If there are errors, the agent analyzes the discrepancies (missing/extra column names, incorrect data types, content validation errors).
    • It builds a detailed prompt for the LLM, including the action history, current errors, and the descriptions provided in the Pandera schema (for both the overall DataFrameSchema and each Column). These descriptions act as direct instructions for the LLM.
    • The LLM (interacting with LangChain Tools) decides which tools to invoke to solve the problems. It can solve multiple problems simultaneously in a single step.
  3. "Tool" Node:
    • Executes the tool selected by the LLM with the specified arguments.
    • Updates the DataFrame with the result of the operation.
    • Records the outcome (success or error) in the message history.
  4. Loop: After a tool is executed, control returns to the "Agent" node to re-evaluate the updated DataFrame and decide the next step. This cycle continues until the DataFrame is valid or the iteration limit is reached.
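
The cycle described above can be sketched in plain Python. This is not AetherFlow's implementation and does not use langgraph: a dict of lists stands in for the DataFrame, `REQUIRED_COLUMNS` stands in for the Pandera schema, and `choose_action` is a rule-based placeholder for the LLM call. All names here are hypothetical.

```python
Table = dict  # column name -> list of values; stand-in for a DataFrame

REQUIRED_COLUMNS = ["full_name", "email"]  # stand-in for the target schema


def validate(table: Table) -> list:
    """Return a list of error strings; an empty list means the table is valid."""
    errors = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in table]
    errors += [f"unexpected column: {c}" for c in table if c not in REQUIRED_COLUMNS]
    return errors


def rename_column(table: Table, old: str, new: str) -> Table:
    return {(new if k == old else k): v for k, v in table.items()}


def drop_column(table: Table, name: str) -> Table:
    return {k: v for k, v in table.items() if k != name}


TOOLS = {"rename_column": rename_column, "drop_column": drop_column}


def choose_action(errors: list, table: Table, history: list):
    """Placeholder for the LLM: pick one tool call based on simple rules."""
    aliases = {"name": "full_name", "email_address": "email"}
    for col in table:
        if col in aliases and aliases[col] not in table:
            return "rename_column", {"old": col, "new": aliases[col]}
    for col in table:
        if col not in REQUIRED_COLUMNS:
            return "drop_column", {"name": col}
    return None  # nothing left that these rules can fix


def run(table: Table, max_iterations: int = 20):
    history = []
    for _ in range(max_iterations):
        errors = validate(table)              # "Agent" node: validate first
        if not errors:
            break                             # valid: the process ends
        action = choose_action(errors, table, history)
        if action is None:
            break
        tool_name, args = action
        try:                                  # "Tool" node: apply and record
            table = TOOLS[tool_name](table, **args)
            history.append(f"{tool_name}({args}): success")
        except Exception as exc:
            history.append(f"{tool_name}({args}): error: {exc}")
    return table, history


raw = {"id": [1, 2], "name": ["Alice", "Bob"], "email_address": ["a@x.io", "b@y.io"]}
result, log = run(raw)
print(list(result), len(log))
```

Re-validating at the top of every iteration means the stopping condition is always the schema itself, not the decision-maker's belief that it is done, and the iteration cap bounds the cost if no tool makes progress.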

Contributing

Contributions are welcome! If you have ideas for new tools, improvements to the agent's logic, or bug fixes, feel free to open a pull request or an issue.
