This project uses dbt with Databricks, a unified analytics platform, to transform and model e-commerce data (e.g., Jaffle Shop dataset) in the cloud. This README provides step-by-step instructions to migrate from the DuckDB version to Databricks, including setting up Databricks Free Edition and configuring dbt.
- Python: Version 3.8 or higher.
- pip: Python package manager.
- Operating System: Windows, macOS, or Linux.
- Terminal: Command-line interface (e.g., PowerShell on Windows, Terminal on macOS/Linux).
- Databricks Account: Free Edition account (no credit card required).
- Optional: Git for cloning the project repository.
- Visit databricks.com/try-databricks
- Sign up for Free Edition (no Azure/AWS/GCP account needed)
- Complete the registration process
- Access your Databricks workspace
Once in your workspace:
- Create a SQL Warehouse (if not automatically created):
  - Go to SQL Warehouses in the sidebar
  - Click Create SQL Warehouse
  - Choose Serverless (recommended for Free Edition)
  - Note the HTTP Path (you'll need this later)
- Get your connection details:
  - Host: Your workspace URL (e.g., your-workspace.cloud.databricks.com)
  - HTTP Path: Found in the SQL Warehouse connection details (e.g., /sql/1.0/warehouses/abc123def456)
  - Token: Create in User Settings → Developer → Access Tokens
```bash
mkdir dbt_databricks_project
cd dbt_databricks_project
python -m venv .venv
```

Activate the virtual environment:
- On Windows:

  ```
  .venv\Scripts\activate
  ```

- On macOS/Linux:

  ```
  source .venv/bin/activate
  ```
Install dbt Core and the Databricks adapter:

```bash
pip install dbt-core dbt-databricks
```

If starting fresh, initialize the project:

```bash
dbt init my_databricks_project
cd my_databricks_project
```

Create or update your `profiles.yml` file (in `~/.dbt/` or the project root):
```yaml
default:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: workspace
      schema: default
      host: <your-workspace-url>
      http_path: <your-http-path>
      token: <your-access-token>
```

Replace the placeholders:

- `<your-workspace-url>`: e.g., dbc-3b7b1a98-b6df.cloud.databricks.com
- `<your-http-path>`: e.g., /sql/1.0/warehouses/61c7488cd0db2884
- `<your-access-token>`: Your personal access token
Check that dbt and the Databricks adapter are installed:
```bash
dbt --version
```

Expected output includes:

```
Core:
  - installed: 1.9.5
Plugins:
  - databricks: 1.10.3
```

Test the connection:

```bash
dbt debug
```

You should see "Connection test: [OK]" in the output.
Before running dbt, create the raw tables in your Databricks workspace. Run this SQL in a Databricks notebook or SQL editor:
```sql
-- Create schema if it doesn't exist
CREATE SCHEMA IF NOT EXISTS workspace.main;

-- Create raw_customers table
CREATE OR REPLACE TABLE workspace.main.raw_customers (
    id STRING,
    name STRING
) USING DELTA;

-- Create raw_orders table
CREATE OR REPLACE TABLE workspace.main.raw_orders (
    id STRING,
    customer STRING,
    ordered_at TIMESTAMP,
    store_id STRING,
    subtotal DECIMAL(10,2),
    tax_paid DECIMAL(10,2),
    order_total DECIMAL(10,2)
) USING DELTA;

-- Create other raw tables (raw_items, raw_products, raw_stores, raw_suppliers)
-- [Full SQL provided in repository]

-- Insert sample data
INSERT INTO workspace.main.raw_customers VALUES
    ('1', 'John Doe'),
    ('2', 'Jane Smith'),
    ('3', 'Bob Johnson');

-- [Additional sample data inserts provided in repository]
```

Load seed data (if any), run models, and test:
```bash
dbt seed   # If you have seed files
dbt run    # Run transformations
dbt test   # Test data quality
```

Use the Databricks SQL editor or notebooks to explore:
```sql
-- View all schemas
SHOW SCHEMAS;

-- Preview bronze tables
SELECT * FROM workspace.bronze.bronze_customers LIMIT 5;

-- Check gold layer results
SELECT * FROM workspace.gold.gold_customer_summary LIMIT 10;

-- Data quality metrics
SELECT
    COUNT(*) AS total_customers,
    AVG(total_spend) AS avg_spend,
    MAX(total_orders) AS max_orders
FROM workspace.gold.gold_customer_summary;
```

- Seeds: CSV files (if any) in `seeds/`, loaded into Databricks
- Models: SQL files in `models/bronze/`, `models/silver/`, `models/gold/` for data transformation (see the sketch after this list)
- Database: Delta tables in the Databricks workspace
- Configuration:
  - `dbt_project.yml`: Defines schema structure and Delta Lake settings
  - `profiles.yml`: Configures the Databricks connection
  - `models/bronze/_sources.yml`: Defines source tables in `workspace.main`
  - `models/gold/schema.yml`: Defines model metadata and tests
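To make the layering concrete, here is a minimal sketch of what a bronze model might look like. The source name `jaffle_shop` and the column aliases are illustrative assumptions; the actual names come from `models/bronze/_sources.yml` and the repository's models:

```sql
-- models/bronze/bronze_customers.sql (illustrative sketch, not the repository's exact model)
-- Bronze layer: a light pass over the raw table (rename and cast only)
select
    cast(id as string) as customer_id,
    name               as customer_name
from {{ source('jaffle_shop', 'raw_customers') }}
```

dbt resolves `{{ source(...) }}` to the three-level name (e.g., `workspace.main.raw_customers`) based on the catalog and schema declared in `_sources.yml`.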
| Aspect | DuckDB | Databricks |
|---|---|---|
| Connection | Local file | Cloud workspace |
| Storage | Local disk | Delta Lake |
| Compute | Single node | Distributed |
| File Format | Native DuckDB | Delta tables |
| Schema Management | Simple | Three-level namespace (catalog.schema.table) |
| Collaboration | Local only | Real-time collaboration |
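One practical consequence of the namespace difference: every Databricks table lives under a catalog as well as a schema, so queries either use fully qualified names or set the context first. Using the same tables as the examples above:

```sql
-- Fully qualified: catalog.schema.table
SELECT * FROM workspace.main.raw_customers LIMIT 5;

-- Or set the context once, then use bare table names
USE CATALOG workspace;
USE SCHEMA main;
SELECT * FROM raw_customers LIMIT 5;
```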
- "Catalog not found": Ensure your catalog exists and update
databasefield in sources - "Table not found": Verify raw tables exist in Databricks with correct names
- "Authentication failed": Check your token and ensure it has necessary permissions
- "SQL Warehouse not available": Ensure your SQL Warehouse is running
- Tables created with
default_prefix: Add custom schema macro (see repository) - Source not found: Verify
_sources.ymlreferences correct catalog and schema
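On the `default_` prefix issue: out of the box, dbt builds schema names by appending a model's custom schema to the target schema (producing names like `default_bronze`). Overriding `generate_schema_name` is the standard dbt pattern for this; the following is a common variant, not necessarily the exact macro in the repository:

```sql
-- macros/generate_schema_name.sql
-- Use the custom schema name as-is instead of prefixing it with target.schema
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```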
- Use Serverless SQL Warehouses for development
- Consider partitioning for large tables (see the config sketch after this list)
- Enable auto-optimize for Delta tables
- Use clustering for frequently queried columns
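As a sketch of how these tips map onto dbt model configs with dbt-databricks (`partition_by` and `tblproperties` are standard dbt-databricks configs; the model name, column names, and the `silver_orders` ref are hypothetical):

```sql
-- models/gold/gold_customer_summary.sql (hypothetical config sketch)
{{ config(
    materialized = 'table',
    partition_by = 'order_month',                    -- partitioning for large tables
    tblproperties = {
        'delta.autoOptimize.optimizeWrite': 'true',  -- auto-optimize on write
        'delta.autoOptimize.autoCompact': 'true'
    }
) }}

select
    customer_id,
    date_trunc('month', ordered_at) as order_month,
    count(*)         as total_orders,
    sum(order_total) as total_spend
from {{ ref('silver_orders') }}   -- hypothetical silver model
group by 1, 2
```

For the clustering tip, recent dbt-databricks versions also support liquid clustering via a `liquid_clustered_by` model config.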
If migrating from the DuckDB version:
- Update `dbt_project.yml` to use the `databricks` profile
- Replace `dbt-duckdb` with `dbt-databricks` in dependencies
- Update source configuration for the three-level namespace
- Create raw tables in Databricks (SQL provided above)
- Your model SQL should work with minimal changes
- Databricks Free Edition Documentation
- dbt-databricks Documentation
- Databricks Learning - Free courses
- Delta Lake Documentation
For issues, share the output of `dbt debug`, `dbt run`, or Databricks SQL queries.