Bug/databricks sql incremental (#144)

* MagicBot/2024-q3-updates
* update pkgs
* update run_models
* update changelog
* update yml
* regen docs
* update readme
* review updates
* add consistency tests
* update changelog
* update yml
* Apply suggestions from code review

  Co-authored-by: Joe Markiewicz <[email protected]>
* update macro dispatch
* regen docs
* Update packages.yml

---------

Co-authored-by: Joe Markiewicz <[email protected]>
fivetran · Aug 1, 2024 · 86b93f6
1 parent 946f459
commit 86b93f6

Showing 20 changed files with 193 additions and 60 deletions.
diff --git a/.buildkite/hooks/pre-command b/.buildkite/hooks/pre-command
@@ -22,4 +22,6 @@ export CI_SNOWFLAKE_DBT_WAREHOUSE=$(gcloud secrets versions access latest --secr
 export CI_DATABRICKS_DBT_HOST=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_HOST" --project="dbt-package-testing-363917")
 export CI_DATABRICKS_DBT_HTTP_PATH=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_HTTP_PATH" --project="dbt-package-testing-363917")
 export CI_DATABRICKS_DBT_TOKEN=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_TOKEN" --project="dbt-package-testing-363917")
-export CI_DATABRICKS_DBT_CATALOG=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_CATALOG" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_DBT_CATALOG=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_DBT_CATALOG" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_SQL_DBT_HTTP_PATH=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_SQL_DBT_HTTP_PATH" --project="dbt-package-testing-363917")
+export CI_DATABRICKS_SQL_DBT_TOKEN=$(gcloud secrets versions access latest --secret="CI_DATABRICKS_SQL_DBT_TOKEN" --project="dbt-package-testing-363917")
diff --git a/.buildkite/pipeline.yml b/.buildkite/pipeline.yml
@@ -71,4 +71,19 @@ steps:
             - "CI_DATABRICKS_DBT_TOKEN"
             - "CI_DATABRICKS_DBT_CATALOG"
     commands: |
-      bash .buildkite/scripts/run_models.sh databricks
+      bash .buildkite/scripts/run_models.sh databricks
+
+  - label: ":databricks: :database: Run Tests - Databricks SQL Warehouse"
+    key: "run_dbt_databricks_sql"
+    plugins:
+      - docker#v3.13.0:
+          image: "python:3.8"
+          shell: [ "/bin/bash", "-e", "-c" ]
+          environment:
+            - "BASH_ENV=/tmp/.bashrc"
+            - "CI_DATABRICKS_DBT_HOST"
+            - "CI_DATABRICKS_SQL_DBT_HTTP_PATH"
+            - "CI_DATABRICKS_SQL_DBT_TOKEN"
+            - "CI_DATABRICKS_DBT_CATALOG"
+    commands: |
+      bash .buildkite/scripts/run_models.sh databricks-sql
diff --git a/.buildkite/scripts/run_models.sh b/.buildkite/scripts/run_models.sh
@@ -16,6 +16,14 @@ db=$1
 echo `pwd`
 cd integration_tests
 dbt deps
+if [ "$db" = "databricks-sql" ]; then
+dbt seed --vars '{hubspot_schema: hubspot_sqlw_tests}' --target "$db" --full-refresh
+dbt compile --vars '{hubspot_schema: hubspot_sqlw_tests}' --target "$db"
+dbt run --vars '{hubspot_schema: hubspot_sqlw_tests}' --target "$db" --full-refresh
+dbt test --vars '{hubspot_schema: hubspot_sqlw_tests}' --target "$db"
+dbt run --vars '{hubspot_schema: hubspot_sqlw_tests, hubspot_marketing_enabled: true, hubspot_contact_merge_audit_enabled: true, hubspot_sales_enabled: false}' --target "$db"
+dbt run --vars '{hubspot_schema: hubspot_sqlw_tests, hubspot_marketing_enabled: false, hubspot_sales_enabled: true, hubspot_merged_deal_enabled: true, hubspot__pass_through_all_columns: true, hubspot_using_all_email_events: false, hubspot_owner_enabled: false}' --target "$db"
+else
 dbt seed --target "$db" --full-refresh
 dbt compile --target "$db" --select hubspot # source does not compile at this time
 dbt run --target "$db" --full-refresh
@@ -26,5 +34,5 @@ dbt test --target "$db"
 dbt run --vars '{hubspot_marketing_enabled: true, hubspot_contact_merge_audit_enabled: true, hubspot_sales_enabled: false}' --target "$db" --full-refresh
 dbt run --vars '{hubspot_marketing_enabled: false, hubspot_sales_enabled: true, hubspot_merged_deal_enabled: true, hubspot__pass_through_all_columns: true, hubspot_using_all_email_events: false, hubspot_owner_enabled: false}' --target "$db" --full-refresh
 dbt test --target "$db"
-
+fi
 dbt run-operation fivetran_utils.drop_schemas_automation --target "$db"
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,21 @@
+# dbt_hubspot v0.18.0
+
+[PR #144](https://github.com/fivetran/dbt_hubspot/pull/144) includes the following updates:
+
+## 🚨 Breaking Changes 🚨
+> ⚠️ Since the following changes result in the table format changing, we recommend running a `--full-refresh` after upgrading to this version to avoid possible incremental failures.
+
+- For Databricks All-Purpose clusters, incremental models will now be materialized using the delta table format (previously parquet).
+  - Delta tables are generally more performant than parquet and are also more widely available for Databricks users. This will also prevent compilation issues on customers' managed tables.
+
+- For Databricks SQL Warehouses, incremental materialization will not be used due to the incompatibility of the `insert_overwrite` strategy.
+
+## Under the Hood
+- The `is_incremental_compatible` macro has been added and will return `true` if the target warehouse supports our chosen incremental strategy.
+  - This update was applied as there have been other Databricks runtimes discovered (ie. an endpoint and external runtime) which do not support the `insert_overwrite` incremental strategy used. 
+- Added integration testing for Databricks SQL Warehouse.
+- Added consistency tests for `hubspot__daily_ticket_history`.
+
 # dbt_hubspot v0.17.2
 [PR #142](https://github.com/fivetran/dbt_hubspot/pull/142) includes the following updates:
 

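The changelog entry above describes `is_incremental_compatible` only by its behavior; the macro body is not part of this excerpt. The following is a minimal Python sketch of that kind of check, not the package's actual Jinja macro. It assumes, for illustration only, that a Databricks All-Purpose cluster can be recognized by an HTTP path containing `sql/protocolv1`, while a SQL Warehouse path looks like `/sql/1.0/warehouses/<id>`. The function signature and the list of non-Databricks targets are likewise assumptions drawn from the README's list of supported warehouses.

```python
def is_incremental_compatible(target_type: str, http_path: str = "") -> bool:
    """Return True when the target is assumed to support the package's
    chosen incremental strategy (hypothetical re-statement of the macro)."""
    # Assumption: these warehouses always support their default strategy.
    if target_type in {"bigquery", "snowflake", "redshift", "postgres"}:
        return True
    if target_type == "databricks":
        # Assumption: All-Purpose cluster paths contain "sql/protocolv1";
        # SQL Warehouse paths ("/sql/1.0/warehouses/<id>") do not, so they
        # are treated as incompatible with insert_overwrite.
        return "sql/protocolv1" in http_path
    # Unknown targets fall back to non-incremental materialization.
    return False


if __name__ == "__main__":
    # Example paths are made up for illustration.
    print(is_incremental_compatible("databricks", "sql/protocolv1/o/123/0801-abc"))
    print(is_incremental_compatible("databricks", "/sql/1.0/warehouses/abc123"))
```

Returning `False` for unrecognized targets is the conservative choice here: a model that falls back to a full table build is slower but cannot fail on an unsupported strategy.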
diff --git a/README.md b/README.md
@@ -60,20 +60,22 @@ dispatch:
 ```
 
 ### Database Incremental Strategies 
-Some of the models (`+hubspot__daily_ticket_history`) in this package are materialized incrementally. We have chosen `insert_overwrite` as the default strategy for **BigQuery** and **Databricks** databases, as it is only available for these dbt adapters. For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy.
+Many of the models in this package are materialized incrementally, so we have configured our models to work with the different strategies available to each supported warehouse.
 
-`insert_overwrite` is our preferred incremental strategy because it will be able to properly handle updates to records that exist outside the immediate incremental window. That is, because it leverages partitions, `insert_overwrite` will appropriately update existing rows that have been changed upstream instead of inserting duplicates of them--all without requiring a full table scan.
+For **BigQuery** and **Databricks All Purpose Cluster runtime** destinations, we have chosen `insert_overwrite` as the default strategy, which benefits from the partitioning capability. 
+> For Databricks SQL Warehouse destinations, models are materialized as tables without support for incremental runs.
 
-`delete+insert` is our second-choice as it resembles `insert_overwrite` but lacks partitions. This strategy works most of the time and appropriately handles incremental loads that do not contain changes to past records. However, if a past record has been updated and is outside of the incremental window, `delete+insert` will insert a duplicate record. 😱
-> Because of this, we highly recommend that **Snowflake**, **Redshift**, and **Postgres** users periodically run a `--full-refresh` to ensure a high level of data quality and remove any possible duplicates.
+For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy.  
+
+> Regardless of strategy, we recommend that users periodically run a `--full-refresh` to ensure a high level of data quality.
 
 ## Step 2: Install the package
 Include the following hubspot package version in your `packages.yml` file:
 > TIP: Check [dbt Hub](https://hub.getdbt.com/) for the latest installation instructions or [read the dbt docs](https://docs.getdbt.com/docs/package-management) for more information on installing packages.
 ```yaml
 packages:
   - package: fivetran/hubspot
-    version: [">=0.17.0", "<0.18.0"] # we recommend using ranges to capture non-breaking changes automatically
+    version: [">=0.18.0", "<0.19.0"] # we recommend using ranges to capture non-breaking changes automatically
 
 ```
 Do **NOT** include the `hubspot_source` package in this file. The transformation package itself has a dependency on it and will install the source package as well.
@@ -315,7 +317,7 @@ This dbt package is dependent on the following dbt packages. Please be aware tha
 ```yml
 packages:
     - package: fivetran/hubspot_source
-      version: [">=0.14.0", "<0.15.0"]
+      version: [">=0.15.0", "<0.16.0"]
 
     - package: fivetran/fivetran_utils
       version: [">=0.4.0", "<0.5.0"]

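The README hunk above states which strategy applies to each warehouse but not how a model would pick one. As a rough sketch under stated assumptions, the selection could be expressed as below. The function name `incremental_config`, its parameters, and the returned dictionary shape are hypothetical and only mirror the rules stated in the README and CHANGELOG; the package's real model configs are written in Jinja and are not shown in this excerpt.

```python
def incremental_config(target_type: str, incremental_compatible: bool) -> dict:
    """Hypothetical mapping from target to materialization settings,
    following the rules stated in the README and CHANGELOG."""
    # Databricks SQL Warehouse (or any incompatible target): plain table,
    # no incremental runs.
    if not incremental_compatible:
        return {"materialized": "table"}

    # BigQuery and Databricks All-Purpose clusters: insert_overwrite.
    if target_type in {"bigquery", "databricks"}:
        config = {
            "materialized": "incremental",
            "incremental_strategy": "insert_overwrite",
        }
        if target_type == "databricks":
            # v0.18.0 switches the Databricks table format from parquet to delta.
            config["file_format"] = "delta"
        return config

    # Snowflake, Redshift, and Postgres: delete+insert.
    return {"materialized": "incremental", "incremental_strategy": "delete+insert"}


if __name__ == "__main__":
    print(incremental_config("databricks", incremental_compatible=False))
    print(incremental_config("snowflake", incremental_compatible=True))
```

Because the table format changes for Databricks, an existing parquet-backed incremental table would not match the new config, which is why the changelog recommends a `--full-refresh` after upgrading.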
diff --git a/dbt_project.yml b/dbt_project.yml
@@ -1,5 +1,5 @@
 name: 'hubspot'
-version: '0.17.2'
+version: '0.18.0'
 
 config-version: 2
 require-dbt-version: [">=1.3.0", "<2.0.0"]

diff --git a/docs/catalog.json b/docs/catalog.json