Bug/string agg limit #133

Merged · 15 commits · Nov 12, 2024

Changes from all commits
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,18 @@
# dbt_jira v0.19.0
[PR #133](https://github.com/fivetran/dbt_jira/pull/133) contains the following updates:

## Breaking Changes
- This change is marked as breaking due to its impact on Redshift configurations.
- For Redshift users, comment data aggregated under the `conversations` field in the `jira__issue_enhanced` table is now disabled by default to prevent recurring errors caused by Redshift's varchar length limits.
- If you wish to re-enable `conversations` on Redshift, set the `jira_include_conversations` variable to `true` in your `dbt_project.yml`, as sketched below.
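
For example, a minimal sketch of this override in `dbt_project.yml` (the variable name and default are from this release; note Redshift's varchar limits may still surface with large comment threads):

```yml
vars:
  jira_include_conversations: true # Re-enables the conversations aggregation on Redshift.
```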

## Under the Hood
- Updated the `comment` seed data to ensure conversations are correctly disabled for Redshift by default.
- Renamed the `jira_is_databricks_sql_warehouse` macro to `jira_is_incremental_compatible`. The macro now returns `true` if the Databricks runtime is an all-purpose cluster (previously it checked only for a SQL warehouse runtime) or if the target is any other supported non-Databricks destination.
- This update addresses Databricks runtimes (e.g., endpoints and external runtimes) that do not support the `insert_overwrite` incremental strategy used in the `jira__daily_issue_field_history` and `int_jira__pivot_daily_field_history` models.
- For Databricks users, the `jira__daily_issue_field_history` and `int_jira__pivot_daily_field_history` models will now apply the incremental strategy only if running on an all-purpose cluster. All other Databricks runtimes will not utilize an incremental strategy.
- Added consistency tests for the `jira__project_enhanced` and `jira__user_enhanced` models.

# dbt_jira v0.18.0
[PR #131](https://github.com/fivetran/dbt_jira/pull/131) contains the following updates:
## Breaking Changes
28 changes: 22 additions & 6 deletions README.md
@@ -66,7 +66,7 @@ Include the following jira package version in your `packages.yml` file:
```yaml
packages:
- package: fivetran/jira
version: [">=0.18.0", "<0.19.0"]
version: [">=0.19.0", "<0.20.0"]

```
### Step 3: Define database and schema variables
@@ -82,14 +82,30 @@ vars:
Your Jira connector may not sync every table that this package expects. If you do not have the `SPRINT`, `COMPONENT`, or `VERSION` tables synced, add the respective variables to your root `dbt_project.yml` file. Additionally, if you want to remove comment aggregations from your `jira__issue_enhanced` model, add the `jira_include_comments` variable to your root `dbt_project.yml`:
```yml
vars:
jira_using_sprints: false # Disable if you do not have the sprint table or do not want sprint-related metrics reported
jira_using_components: false # Disable if you do not have the component table or do not want component-related metrics reported
jira_using_versions: false # Disable if you do not have the versions table or do not want versions-related metrics reported
jira_using_priorities: false # disable if you are not using priorities in Jira
jira_include_comments: false # This package aggregates issue comments so that you have a single view of all your comments in the jira__issue_enhanced table. This can cause limit errors if you have a large dataset. Disable to remove this functionality.
jira_using_sprints: false # Enabled by default. Disable if you do not have the sprint table or do not want sprint-related metrics reported.
jira_using_components: false # Enabled by default. Disable if you do not have the component table or do not want component-related metrics reported.
jira_using_versions: false # Enabled by default. Disable if you do not have the versions table or do not want versions-related metrics reported.
jira_using_priorities: false # Enabled by default. Disable if you are not using priorities in Jira.
jira_include_comments: false # Enabled by default. Disabling will remove the aggregation of comments via the `count_comments` and `conversations` columns in the `jira__issue_enhanced` table.
```

### (Optional) Step 5: Additional configurations

#### Controlling conversation aggregations in `jira__issue_enhanced`

The `dbt_jira` package offers variables to enable or disable conversation aggregations in the `jira__issue_enhanced` table. These settings allow you to manage the amount of data processed and avoid potential performance or limit issues with large datasets.

- `jira_include_conversations`: Controls only the `conversation` [column](https://github.com/fivetran/dbt_jira/blob/main/models/jira.yml#L125-L127) in the `jira__issue_enhanced` table.
- Default: Disabled for Redshift due to string size constraints; enabled for other supported warehouses.
- Setting this to `false` removes the `conversation` column but retains the `count_comments` field if `jira_include_comments` is still enabled. This is useful if you want a comment count without the full conversation details.

In your `dbt_project.yml` file:

```yml
vars:
jira_include_conversations: false/true # Disabled by default for Redshift; enabled for other supported warehouses.
```

#### Define daily issue field history columns
The `jira__daily_issue_field_history` model generates historical data for the columns specified by the `issue_field_history_columns` variable. By default, the only columns tracked are `status`, `status_id`, and `sprint`, but all fields found in the Jira `FIELD` table's `field_name` column can be included in this model. The most recent value of any tracked column is also captured in `jira__issue_enhanced`.
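
For example, a minimal sketch in `dbt_project.yml`, assuming the default `status`, `status_id`, and `sprint` columns remain tracked alongside the added fields (the field names below mirror this repo's integration tests):

```yml
vars:
  issue_field_history_columns: ['summary', 'story points', 'components'] # Any field_name values from the Jira FIELD table.
```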

2 changes: 1 addition & 1 deletion dbt_project.yml
@@ -1,5 +1,5 @@
name: 'jira'
version: '0.18.0'
version: '0.19.0'
config-version: 2
require-dbt-version: [">=1.3.0", "<2.0.0"]
vars:
2 changes: 1 addition & 1 deletion docs/catalog.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/manifest.json

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions integration_tests/dbt_project.yml
@@ -1,9 +1,11 @@
name: 'jira_integration_tests'
version: '0.18.0'
version: '0.19.0'
config-version: 2
profile: 'integration_tests'

vars:
# Comment out the below when generating docs
issue_field_history_columns: ['summary', 'story points', 'components']
jira_source:
jira_schema: jira_integrations_tests_41
jira_comment_identifier: "comment"
@@ -28,9 +30,6 @@ vars:
jira_user_identifier: "user"
jira_version_identifier: "version"

# Comment out the below when generating docs
issue_field_history_columns: ['summary', 'story points', 'components']

models:
jira:
+schema: "{{ 'jira_integrations_tests_sqlw' if target.name == 'databricks-sql' else 'jira' }}"
1 change: 1 addition & 0 deletions integration_tests/seeds/comment.csv

Large diffs are not rendered by default.

@@ -4,13 +4,15 @@
enabled=var('fivetran_validation_tests_enabled', false)
) }}

{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
{% set exclude_columns = ['open_duration_seconds', 'any_assignment_duration_seconds', 'last_assignment_duration_seconds'] %}
with prod as (
select *
select {{ dbt_utils.star(from=ref('jira__issue_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_prod.jira__issue_enhanced
),

dev as (
select *
select {{ dbt_utils.star(from=ref('jira__issue_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_dev.jira__issue_enhanced
),

@@ -0,0 +1,49 @@

{{ config(
tags="fivetran_validations",
enabled=var('fivetran_validation_tests_enabled', false)
) }}

{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
{% set exclude_columns = ['avg_age_currently_open_seconds', 'avg_age_currently_open_assigned_seconds', 'median_age_currently_open_seconds', 'median_age_currently_open_assigned_seconds', 'epics', 'components'] %}

with prod as (
select {{ dbt_utils.star(from=ref('jira__project_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_prod.jira__project_enhanced
),

dev as (
select {{ dbt_utils.star(from=ref('jira__project_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_dev.jira__project_enhanced
),

prod_not_in_dev as (
-- rows from prod not found in dev
select * from prod
except distinct
select * from dev
),

dev_not_in_prod as (
-- rows from dev not found in prod
select * from dev
except distinct
select * from prod
),

final as (
select
*,
'from prod' as source
from prod_not_in_dev

union all -- union since we only care if rows are produced

select
*,
'from dev' as source
from dev_not_in_prod
)

select *
from final
49 changes: 49 additions & 0 deletions integration_tests/tests/consistency/consistency_user_enhanced.sql
@@ -0,0 +1,49 @@

{{ config(
tags="fivetran_validations",
enabled=var('fivetran_validation_tests_enabled', false)
) }}

{# Exclude columns that depend on calculations involving the current time in seconds or aggregate strings in a random order, as they will differ between runs. #}
{% set exclude_columns = ['avg_age_currently_open_seconds', 'median_age_currently_open_seconds', 'projects'] %}

with prod as (
select {{ dbt_utils.star(from=ref('jira__user_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_prod.jira__user_enhanced
),

dev as (
select {{ dbt_utils.star(from=ref('jira__user_enhanced'), except=exclude_columns) }}
from {{ target.schema }}_jira_dev.jira__user_enhanced
),

prod_not_in_dev as (
-- rows from prod not found in dev
select * from prod
except distinct
select * from dev
),

dev_not_in_prod as (
-- rows from dev not found in prod
select * from dev
except distinct
select * from prod
),

final as (
select
*,
'from prod' as source
from prod_not_in_dev

union all -- union since we only care if rows are produced

select
*,
'from dev' as source
from dev_not_in_prod
)

select *
from final
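
A minimal sketch of enabling these consistency tests, assuming they are invoked through the standard dbt CLI using the `fivetran_validations` tag and `fivetran_validation_tests_enabled` variable defined above:

```yml
vars:
  fivetran_validation_tests_enabled: true # Gates the consistency tests; disabled by default.
```

With the variable enabled, a command like `dbt test --select tag:fivetran_validations` would run only these validations against the `_jira_prod` and `_jira_dev` schemas.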
@@ -1,14 +1,16 @@
{% macro jira_is_databricks_sql_warehouse() %}
{% macro jira_is_incremental_compatible() %}
{% if target.type in ('databricks') %}
{% set re = modules.re %}
{% set path_match = target.http_path %}
{% set regex_pattern = "sql/.+/warehouses/" %}
{% set regex_pattern = "sql/protocol" %}
{% set match_result = re.search(regex_pattern, path_match) %}
{% if match_result %}
{{ return(True) }}
{% else %}
{{ return(False) }}
{% endif %}
{% elif target.type in ('bigquery','snowflake','postgres','redshift','sqlserver') %}
{{ return(True) }}
{% else %}
{{ return(False) }}
{% endif %}
@@ -1,6 +1,6 @@
{{
config(
materialized='table' if jira.jira_is_databricks_sql_warehouse() else 'incremental',
materialized='incremental' if jira_is_incremental_compatible() else 'table',
partition_by = {'field': 'valid_starting_at_week', 'data_type': 'date'}
if target.type not in ['spark','databricks'] else ['valid_starting_at_week'],
cluster_by = ['valid_starting_at_week'],
@@ -176,4 +176,4 @@ final as (
)

select *
from final
from final
12 changes: 8 additions & 4 deletions models/intermediate/int_jira__issue_comments.sql
@@ -20,14 +20,18 @@ agg_comments as (

select
comment.issue_id,
{{ fivetran_utils.string_agg( "comment.created_at || ' - ' || jira_user.user_display_name || ': ' || comment.body", "'\\n'" ) }} as conversation,
count(comment.comment_id) as count_comments

from
comment
{%- if var('jira_include_conversations', False if target.type == 'redshift' else True) %}
,{{ fivetran_utils.string_agg(
"comment.created_at || ' - ' || jira_user.user_display_name || ': ' || comment.body",
"'\\n'" ) }} as conversation
{% endif %}

from comment
join jira_user on comment.author_user_id = jira_user.user_id

group by 1
)

select * from agg_comments
select * from agg_comments
2 changes: 1 addition & 1 deletion models/intermediate/int_jira__issue_join.sql
@@ -110,7 +110,7 @@ join_issue as (
{% endif %}

{% if var('jira_include_comments', True) %}
,issue_comments.conversation
{{ ',issue_comments.conversation' if var('jira_include_conversations', False if target.type == 'redshift' else True) }}
,coalesce(issue_comments.count_comments, 0) as count_comments
{% endif %}

1 change: 1 addition & 0 deletions models/jira.yml
@@ -125,6 +125,7 @@ models:
- name: conversation
description: >
Line-separated list of comments made on this issue, including the timestamp and author name of each comment.
(Disabled by default for Redshift.)
- name: count_comments
description: The number of comments made on this issue.
- name: first_assigned_at
2 changes: 1 addition & 1 deletion models/jira__daily_issue_field_history.sql
@@ -1,6 +1,6 @@
{{
config(
materialized='table' if jira.jira_is_databricks_sql_warehouse() else 'incremental',
materialized='incremental' if jira_is_incremental_compatible() else 'table',
partition_by = {'field': 'date_week', 'data_type': 'date'}
if target.type not in ['spark', 'databricks'] else ['date_week'],
cluster_by = ['date_week'],