Analytics Engineering Framework - Data Model

Note: For a comprehensive installation guide of all the AEF repositories together, please look here.

Analytics engineers lay the foundation for others to organize, transform, and document data using software engineering principles. Providing easy to use data platforms that empower data practitioners to independently build data pipelines in a standardized and scalable way, and answer their own data-driven questions.

This repository is designed for managing your data models (BigQuery, BigLake, etc.) and deploying them from various sources (SQL files, Dataform repositories, etc.). Additionally, it facilitates defining and deploying Dataplex metadata (lakes, zones, assets, annotations, policy tags, etc.).

While usable independently, this tool is optimized as a component within a comprehensive Analytics Engineering Framework comprising:

Orchestration Framework: Maintained by Analytics Engineers to provide seamless, extensible orchestration and execution infrastructure.
(This repository) Data Model: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
Data Orchestration: Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
Data Transformation: Directly used by end data practitioners to define, store, and deploy data transformations.

Concepts

This reference Data Model management repository is your central hub for streamlined data model definition. It deploys dataform repositories and integrates with BigQuery metadata and Dataplex to enable data governance, discoverability, and access control. Here's what it offers:

Creates or references Dataform repositories form given third-party Git repositories.
Creates or references BigQuery datasets based on given parameters or the variables you have in the dataform settings files like dataform.json you have in Dataform repositories.
Creates or references GCS buckets containing DDLs.
Creates or references GCS data buckets.
Creates Dataplex lakes and zones and register data assets (GCS buckets or BigQuery Datasets) accordingly based on given parameters.
Defines tag templates, policy taxonomies, governance rules and annotations to be applied to data assets based on given parameter files.

Referenced Dataform Repositories

If your data model includes datasets referenced in Dataform repositories, you can choose to create those datasets as part of this data model deployment. To do so, set the Terraform variable create_dataform_repositories to true. This controls whether the Dataform scripts found in the repositories will be created alongside other Terraform resources. If set to false, Dataform repositories should be created as an additional step in your CI/CD pipeline.

If you decide to use this repository to create datasets referenced in your Dataform repositories, define those datasets in the dataform.json file using the following format:

{
  ...
  "vars": {
    "dataset_id_<DATASET_IDENTIFIER>":"<YOUR_DATASET_NAME>",
    "dataset_projectid_<DATASET_IDENTIFIER>":"<YOUR_DATASET_PROJECT>",
    "dataset_location_<DATASET_IDENTIFIER>":"<YOUR_DATASET_LOCATION>",
    ...
  }
}

Usage

1. Dataplex:

Familiarize yourself with this concepts.
Define metadata in the following .yaml files:
- Asset Annotations
- Dataplex Lakes, Zones, and Assets
- Policy Taxonomies and Tags
- Data Catalog Tag Templates

└── metadata
│   ├── annotations
│   │   └── annotations.yaml
│   ├── lakes
│   │   └── lakes.yaml
│   ├── policy_taxonomies
│   │   └── policy_taxonomies.yaml
│   └── tag_templates
│       └── tag_templates.yaml

2. Tracking your DDLs

Option 1 - Dataplex Auto-Discovery

Dataplex has been configured to auto-create BigQuery tables for data files present in the data_buckets.
- Some examples are present in the demo-data sub-folder here.
The discovery job runs once per hour, but you can override this by changing the trigger to a custom Cron job scheduled a few minutes from the current time.

Option 2 - Create a custom external table using SQL

Depending on your configuration on the run_ddls_in_buckets and ddl_buckets parameters, .sql files in the referenced DDL buckets will run (at tfe plan/apply time).
- Following sample-data example you could place your DDL files in your repo at the path aef-data-model/sample-data/gcs-files/<YOUR_DDL>.sql to keep track in your repository of the explicit DDLs in environment.
- Or just place your DLLs in your referenced DDL buckets, DDLs there will run if possible (at tfe plan/apply).

Option 3 - Use dataform to track your DDLs

Depending on your configuration on the compile_dataform_repositories and execute_dataform_repositories parameters, .sqlx files with ddl dataform execution tag will run (at tfe plan/apply time).
- Reference your dataform repositories.
- Add your DLLs to your dataform repositories, as the example here.

3. Terraform

While this repository can be used to keep track of your data model and metadata, the provided terraform code can be used to control deployment or just to reference it, so you can deploy it as another step in your CI/CD pipeline.

Define your terraform variables. It is recommended creating a .tfvars file.

name	description	type	required	default
include_metadata_in_tfe_deployment	Controls whether metadata is deployed alongside Terraform resources. If false Metadata can be deployed as a next step in a CICD pipeline.	bool	false	-
overwrite_metadata	Whether to overwrite existing Dataplex (Cortex Datamesh) metadata.	string	true	false
create_dataform_datasets	Controls whether the datasets found in the dataform.json files in the repositories will be created alongside Terraform resources. If false datasets should be created otherwise.	bool	false	-
create_ddl_buckets_datasets	Controls whether the datasets referenced in the GCS DDL buckets will be created alongside Terraform resources. If false datasets should be created otherwise.	bool	false	-
create_dataform_repositories	Controls whether the dataform scripts found in the repositories will be created alongside Terraform resources. If false dataform repositories should be created as an additional step in the CICD pipeline.	bool	false	-
compile_dataform_repositories	Controls whether the dataform scripts found in the repositories will be compiled alongside Terraform resources. If false dataform repositories should be compiled as an additional step in the CICD pipeline.	bool	false	-
execute_dataform_repositories	Controls whether the dataform scripts found in the repositories will be executed alongside Terraform resources. If false dataform repositories should be executed as an additional step in the CICD pipeline.	bool	false	-
domain	Your organization or domain name, organization if centralized data management, domain name if one repository for each data domain in a Data mesh environment.	string	true	-
project	Project where the the dataform repositories, the Dataplex metadata, and other resources will be created.	string	true	-
region	Region where the datasets from the dataform.json files, the dataform repositories, the Dataplex metadata, and other resources will be created.	string	true	-
dataform_repositories	Dataform repository remote settings required to attach the repository to a remote repository.	map(object({...}))	false	{}
dataform_repositories_git_token	Git token to access the dataform repositories, it will be stored as a secret in secret manager, and it will be used to connect and read the dataform.json to create the datasets.	string (sensitive)	true	-
create_data_buckets	Controls whether the referenced data buckets will be created. If false referenced buckets should exist.	bool	false	-
data_buckets	Data buckets.	map(object({...}))	false	{}
create_ddl_buckets	Controls whether the referenced buckets containing DDLs will be created. If false referenced buckets should exist.	bool	false	-
run_ddls_in_buckets	Controls whether the .sql files in the referenced DDL buckets should be run.	bool	false	-
ddl_buckets	Buckets containing .sql DDL scripts to be executed on Terraform deploy, It can be of flavors: bigquery, TODO	map(object({...}))	false	{}

Example

    project = "<PROJECT_ID>"
    region  = "<REGION>"
    domain  = "<BUSINESS_DOMAIN_NAME>"
    
    include_metadata_in_tfe_deployment = true
    
    create_dataform_repositories    = true
    compile_dataform_repositories   = true
    execute_dataform_repositories   = true
    create_dataform_datasets        = true
    dataform_repositories_git_token = "<GIT_TOKEN>"
    dataform_repositories           = {
      sample-repo-1 = {
        remote_repo_url = "<GIT_HUB_REPOSITORY_URL>"
        secret_name     = "<SECRET_NAME>"
      },
      ...
    }
    
    create_data_buckets = false
    data_buckets        = {
      data-bucket-1 = {
        name          = "<DATA_BUCKET_NAME>"
        region        = "<DATA_BUCKET_REGION>"
        project       = "<DATA_BUCKET_PROJECT_ID>"
        dataplex_lake = "<DATA_BUCKET_LAKE_NAME>"
        dataplex_zone = "<DATA_BUCKET_ZONE_NAME>"
      },
      ...
    }
    
    create_ddl_buckets  = false
    run_ddls_in_buckets = true
    ddl_buckets         = {
      ddl-bucket-1 = {
        bucket_name          = "<DDL_BUCKET_NAME>"
        bucket_region        = "<DDL_BUCKET_REGION>"
        bucket_project       = "<DDL_BUCKET_PROJECT_ID>"
        ddl_flavor           = "bigquery"
        ddl_project_id       = "<PROJECT_ID_TO_REPLACE_IN_DDL_FILES>"
        ddl_dataset_id       = "<DATASET_ID_TO_REPLACE_IN_DDL_FILES>"
        ddl_data_bucket_name = "<DATA_BUCKET_NAME_TO_REPLACE_IN_DDL_FILES>"
        ddl_connection_name  = "<CONNECTION_NAME_TO_REPLACE_IN_DDL_FILES>"
      }, 
      ...
    }

Run Terraform Plan / Apply using the variables you defined.

terraform plan -var-file="prod.tfvars"

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
cicd-deployers		cicd-deployers
metadata		metadata
sample-data		sample-data
terraform		terraform
.gitignore		.gitignore
.gitmodules		.gitmodules
AEF_repositories_data_model.png		AEF_repositories_data_model.png
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
data_model&metadata_trategy.png		data_model&metadata_trategy.png
renovate.json		renovate.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analytics Engineering Framework - Data Model

Concepts

Referenced Dataform Repositories

Usage

1. Dataplex:

2. Tracking your DDLs

Option 1 - Dataplex Auto-Discovery

Option 2 - Create a custom external table using SQL

Option 3 - Use dataform to track your DDLs

3. Terraform

Example

Run Terraform Plan / Apply using the variables you defined.

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Analytics Engineering Framework - Data Model

Concepts

Referenced Dataform Repositories

Usage

1. Dataplex:

2. Tracking your DDLs

Option 1 - Dataplex Auto-Discovery

Option 2 - Create a custom external table using SQL

Option 3 - Use dataform to track your DDLs

3. Terraform

Example

Run Terraform Plan / Apply using the variables you defined.

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages