A template driven Python tool for generating brute-force SQL diff scripts to compare two different database tables or datasets in the same table. Perfect for data validation, migration testing, and ensuring data consistency between production and test environments.
The generated SQL is compatible with Snowflake and ANSI SQL databases and can be configured to use complex join keys with null-safe comparisons. The result can be TABLE (or TEMP TABLE) with status codes for each column and row-level differences:
- 0: ✅ Values match as same row are in each table and are either both null or equal
- 1: Values are different and not null in both tables
- 2: Column is NULL in before table only
- 3: Column is NULL in after table only
- 4: Row missing in before table (only shown as the
_row_statusmagic column) - 5: Row missing in after table (only shown as the
_row_statusmagic column)
The tool was built to help regression test data loading pipelines that insert only now rows.
In production environment you can modify the template to create a temp table on the snowflake web UI then download the results to Excel and filter on any column > 0 to see the differences.
- Schema-aware comparisons: Automatically handles different column structures between tables
- Flexible key matching: Supports single or multiple business key columns with null-safe comparisons
- Comprehensive diff reporting: Generates status codes for each column and row-level differences
- Snowflake/ANSI SQL compatible: Generates standard SQL that works with Snowflake and other databases
- DuckDB testing support: Includes unit tests using DuckDB for validation
- Clone the repository:
git clone <your-repo-url
cd checkatron- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate- Install the package in development mode:
pip install -e .💡 Pro Tip: In VSCode, you can click on any of these commands to run them directly in your terminal!
source venv/bin/activateWant to see Checkatron in action? Try the included samples:
Run the interactive example:
python samples/run_example.pyOr use the command line directly:
python -m checkatron.diffgen \
samples/example_before.csv \
samples/example_after.csv \
--keys samples/example_keys.csv \
--out samples/example_diff.sql💡 Pro Tip: In PyCharm/VSCode, you can click the run button on these commands to execute them directly!
See the samples/ folder for example files and detailed usage instructions.
Run the tests:
python -m pytest tests/ -vWant to test Checkatron with real Snowflake tables? Follow this step-by-step tutorial:
Copy and paste this into Snowflake:
-- Create the sample tables
CREATE OR REPLACE TABLE before_table (
ACCOUNT_ID NUMBER,
PORTFOLIO_NAME VARCHAR(50),
VALUATION_DATE DATE,
BALANCE NUMBER(15,2),
STATUS VARCHAR(20)
);
CREATE OR REPLACE TABLE after_table (
ACCOUNT_ID NUMBER,
PORTFOLIO_NAME VARCHAR(50),
VALUATION_DATE DATE,
BALANCE NUMBER(15,2),
STATUS VARCHAR(20),
NEW_COLUMN NUMBER(10,2)
);Copy and paste this into Snowflake:
-- Load before table data
INSERT INTO before_table VALUES
(1001, 'PORTFOLIO_A', '2024-01-01', 10000.00, 'ACTIVE'),
(1002, 'PORTFOLIO_B', '2024-01-01', 25000.50, 'ACTIVE'),
(1003, 'PORTFOLIO_A', '2024-01-01', 5000.75, 'SUSPENDED'),
(1004, 'PORTFOLIO_C', '2024-01-01', 15000.25, 'ACTIVE'),
(1005, 'PORTFOLIO_B', '2024-01-01', 30000.00, 'ACTIVE');
-- Load after table data (with differences)
INSERT INTO after_table VALUES
(1001, 'PORTFOLIO_A', '2024-01-01', 10000.00, 'ACTIVE', 100.00),
(1002, 'PORTFOLIO_B', '2024-01-01', 25000.50, 'ACTIVE', 200.00),
(1003, 'PORTFOLIO_A', '2024-01-01', 5000.75, 'ACTIVE', 150.00),
(1004, 'PORTFOLIO_C', '2024-01-01', 15000.25, 'ACTIVE', 300.00),
(1005, 'PORTFOLIO_B', '2024-01-01', 35000.00, 'ACTIVE', 400.00),
(1006, 'PORTFOLIO_D', '2024-01-01', 7500.00, 'ACTIVE', 500.00);Copy and paste this into Snowflake:
-- Check row counts
SELECT 'BEFORE TABLE' as table_name, COUNT(*) as row_count FROM before_table
UNION ALL
SELECT 'AFTER TABLE' as table_name, COUNT(*) as row_count FROM after_table;
-- Show sample data
SELECT 'BEFORE' as source, * FROM before_table ORDER BY ACCOUNT_ID
UNION ALL
SELECT 'AFTER' as source, * FROM after_table ORDER BY ACCOUNT_ID;Now export the table schemas to create your CSV files:
-
Export before_table schema:
DESCRIBE TABLE before_table;
- Click "Download" in Snowflake
- Save as
samples/example_before.csv
-
Export after_table schema:
DESCRIBE TABLE after_table;
- Click "Download" in Snowflake
- Save as
samples/example_after.csv
-
Create business keys file:
- Copy
samples/example_before.csv - Remove all rows except ACCOUNT_ID, PORTFOLIO_NAME, VALUATION_DATE
- Save as
samples/example_keys.csv
- Copy
Now use checkatron to generate the comparison SQL:
python -m checkatron.diffgen \
samples/example_before.csv \
samples/example_after.csv \
--keys samples/example_keys.csv \
--out samples/generated_diff.sqlCopy the contents of samples/generated_diff.sql and paste it into Snowflake.
Copy and paste this into Snowflake:
-- Summary of differences
SELECT
COUNT(*) as total_rows,
SUM(CASE WHEN _row_status = 0 THEN 1 ELSE 0 END) as matching_rows,
SUM(CASE WHEN _row_status > 0 THEN 1 ELSE 0 END) as different_rows,
SUM(CASE WHEN _row_status = 4 THEN 1 ELSE 0 END) as missing_in_before,
SUM(CASE WHEN _row_status = 5 THEN 1 ELSE 0 END) as missing_in_after
FROM diff_result;
-- Show rows with differences
SELECT * FROM diff_result WHERE _row_status > 0;- 6 total rows (5 from before + 1 new in after)
- 2 rows with differences (1003: status changed, 1005: balance changed)
- 1 new row (1006: missing in before table)
- NEW_COLUMN showing status 2 for all rows (NULL in before only)
SELECT * FROM diff_result;Then click "Download" in Snowflake and filter on any column > 0 to see differences.
Generate a SQL diff script comparing two tables:
python -m checkatron.diffgen \
before_schema.csv \
after_schema.csv \
--keys business_keys.csv \
--out diff_script.sqlbefore_csv: CSV file fromDESCRIBE TABLEcommand (before state)after_csv: CSV file fromDESCRIBE TABLEcommand (after state)--keys: CSV file containing only the business key columns--before_where: Optional WHERE clause for filtering before table--after_where: Optional WHERE clause for filtering after table--before_table: Override table name for before table--after_table: Override table name for after table--out: Output SQL file (default: diff.sql)
The tool expects CSV files in the format produced by Snowflake's DESCRIBE TABLE command:
name,type,kind,null?,default,primary key,unique key,check,expression,comment,policy name,privacy domain
ACCOUNT_ID,NUMBER,,,,,,,,,,
PORTFOLIO_NAME,VARCHAR,,,,,,,,,,
VALUATION_DATE,DATE,,,,,,,,,,
BALANCE,NUMBER,,,,,,,,,,
Create a CSV with the same header but only the columns that serve as business keys:
name,type,kind,null?,default,primary key,unique key,check,expression,comment,policy name,privacy domain
ACCOUNT_ID,NUMBER,,,,,,,,,,
PORTFOLIO_NAME,VARCHAR,,,,,,,,,,
VALUATION_DATE,DATE,,,,,,,,,,
The tool generates a temporary table with status codes for each column:
- 0: Values match
- 1: Values are different
- 2: Column is NULL in before table only
- 3: Column is NULL in after table only
- 4: Row missing in before table
- 5: Row missing in after table
python -m checkatron.diffgen \
prod_schema.my_table.csv \
test_schema.my_table.csv \
--keys business_keys.csv \
--out prod_vs_test_diff.sqlpython -m checkatron.diffgen \
my_table_2024_01_01.csv \
my_table_2024_01_02.csv \
--keys business_keys.csv \
--before_where "VALUATION_DATE = '2024-01-01'" \
--after_where "VALUATION_DATE = '2024-01-02'" \
--out daily_diff.sqlpython -m checkatron.diffgen \
schema_a.csv \
schema_b.csv \
--keys business_keys.csv \
--before_table "PROD.SCHEMA_A.TABLE_X" \
--after_table "TEST.SCHEMA_B.TABLE_Y" \
--out custom_names_diff.sql# Run all tests
python -m pytest tests/ -v
# Run specific test
python -m pytest tests/test_diffgen.py::test_simple_same_table -vcheckatron/
├── __init__.py # Package initialization
├── diffgen.py # Main SQL generation logic
└── templates/ # Jinja2 SQL templates
└── full_compare.sql.j2
samples/ # Example files and scripts
├── README.md # Samples documentation
├── example_before.csv # Sample before schema
├── example_after.csv # Sample after schema
├── example_keys.csv # Sample business keys
└── run_example.py # Interactive example script
tests/
├── __init__.py
└── test_diffgen.py # Unit tests using DuckDB
setup.py # Package configuration
requirements.txt # Dependencies
- New SQL Templates: Add Jinja2 templates in the
templates/directory - Additional Options: Extend the argument parser in
diffgen.py - New Test Cases: Add tests in
tests/test_diffgen.py
- Jinja2: Template engine for SQL generation
- pytest: Testing framework
- duckdb: In-memory database for testing
"No module named 'checkatron'"
# Make sure you've activated the virtual environment and installed the package
source venv/bin/activate && pip install -e ."Permission denied" on venv creation
# Try using python3 explicitly
python3 -m venv venv"Command not found: python"
# On macOS/Linux, try python3
python3 -m venv venv- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
[Your License Here]
For issues and questions, please open an issue on GitHub.