Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-634] [Bug] Redshift fails to run several threads #501

Closed
2 tasks done
Tracked by #742
sphinks opened this issue Jun 20, 2023 · 8 comments
Closed
2 tasks done
Tracked by #742

[ADAP-634] [Bug] Redshift fails to run several threads #501

sphinks opened this issue Jun 20, 2023 · 8 comments
Labels
bug Something isn't working ra3_node issues relating to ra3 node support

Comments

@sphinks
Copy link

sphinks commented Jun 20, 2023

Is this a new bug in dbt-redshift?

  • I believe this is a new bug in dbt-redshift
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I have found several issues about performance degradation while using RA3 node. However, I'm expecting failure of dbt run if I switch from 1 thread to several.

My DBT profile:

    dev:
      type: redshift
      host: redshift-host

      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: dev
      schema:  dev
      threads: 4
      ra3_node: true

The error I get:

18:08:46    could not open relation with OID 615797

That happens not at the same model, sometimes the first model could fail. Other time it could be model at the end of list.
What helps a little bit - setup connect_timeout: 999999. I have found that solution at some bug. It still fails, but happens not at the very begining.

What I'm using:
Redshift RA3 cluster
DBT - 1.3.2
dbt-redshift - 1.3.0

I try to switch to the latest version of dbt (1.5.1) and dbt-redshift (1.5.5), not luck, but the error message became ' The read operation timed out'.

Expected Behavior

DBT runs in parallel with out any issues.

Steps To Reproduce

  1. Setup dbt project.
  2. Start with specified DBT profile and 1 thread.
  3. Switch to several threads and rerun project.

Relevant log output

No response

Environment

- OS:Windows Server 2016 (the same issue happens with remote dbt running on Linux OS)
- Python:Python 3.9.13
- dbt-core: 1.3.2
- dbt-redshift: 1.3.0

Additional Context

No response

@sphinks sphinks added bug Something isn't working triage labels Jun 20, 2023
@github-actions github-actions bot changed the title [Bug] Redshift fails to run several threads [ADAP-634] [Bug] Redshift fails to run several threads Jun 20, 2023
@sphinks
Copy link
Author

sphinks commented Jun 21, 2023

Another finding here:
Parallel run works great with any model except models with reference to source. If I run parallel run with staging layer (simple sql to bring the table into DBT layout with {{ source('some_landing', 'table') }} -I get that issue. If i run everything except those: works great!
Not sure if it give any clue, but hope it will narrow the scope a little bit.

I also find the issue: #427 that I suppose could be the reason why my task run with 1.5.1 did not work. So I try to rollback to 1.4.6 version and still no luck.

@dataders
Copy link
Contributor

@sphinks, you're right -- this is some strange behavior, you've certainly piqued my interest!

clarification questions:

  • you're seeing this issue on versions 1.3, 1.4, and 1.5 of the adapter?
  • does this timeout/latency issue persist after the first attempt to parallelize? Do the second and third time you try running with multi-threads fail in the same way as the first time?
  • could you perhaps fork https://github.com/dbt-labs/jaffle_shop and provide the exact commands to reproduce the error? specifically I'd love to see commands with --select and --threads so it's exactly reproducible

@dataders
Copy link
Contributor

in the dbt Community Slack's #db-redshift channel, someone suggested that you may be encountering an issue with autoWLM.

Is AutoWLM enabled or disabled? if it is enabled, disabled it, configure your queues manually, and try it again, and see if the problem goes away.

@sphinks
Copy link
Author

sphinks commented Jun 23, 2023

@dataders thanks for you attention to the issue.

  1. I could see the issue on all versions: 1.3, 1.4, 1.5. The only difference - message that returned during the error. 1.3 - could not open relation with OID and after 1.4 version it returns The read operation timed out. I suppose it is just better error handling in new versions of dbt-redshfit.
  2. Issue happens every time I try to run in parallel. It does not fail on the same model and does not take the same amount of time. It fails during the first iteration almost every time. I run in 4 threads, so during first iteration every thread take the first model to execute some of thread will fail with first model. Looks like it happens randomly. What is notice if I reduce number of threads and increase to max value (connect_timeout: 999999) it could run longer (sometimes much longer). With 2 threads you could get a fails on 3 iteration. It gives an idea that threads are not allowed to access some shared resource at the same time: less thread we have less chances to hit that limit.
  3. I try to use https://github.com/dbt-labs/jaffle_shop to reproduce the errors, but there are significant difference: it does not use source macro in staging and it has only 3 staging that are pretty small. It does not reproduce an error for some reason (models too small to create concurrent environment or because they do not use source macro - at least that difference I could see).
    About autoWLM: that could be the case, but something consufing me here. I believe a lot of DBT users able to run DBT with autoWLM and have no issue. Is it some case of bad luck of autoWLM if that helps? :) I will to switch it, but it is not so easy as we have pretty heavy used project and that change could cause some unexpected issues.

@sphinks
Copy link
Author

sphinks commented Jun 26, 2023

Here is another detail that I have figured out: the issues happens with models that involves 2 databases. I have landing layer in one database and staging layer in another database. Could it be the root cause of issue with paralell run over dbt?

@dataders
Copy link
Contributor

@sphinks the multi-db is an interesting scenario. I talked with some engineers at Redshift, they recommend that you open an AWS support ticket for this issue. once you do, they can grab the ticket if you share the case id here.

@dataders dataders removed the triage label Jun 27, 2023
@sphinks
Copy link
Author

sphinks commented Jun 29, 2023

@dataders I believe, I have found the issue: it is combination of cross-database setup and grant post hook. In my project I has a post-hook that is doing "grant select on all tables in schema...". It means that operation is done after each model is built. I suppose having several threads trying to do that operation on cross-database level - lead to issue. I do not know exactly what is wrong here, but switch to "grant select on table ..." helped.

Most interesting aspect that the same post hook is working in parallel run inside 1 database. That is confusing.

@sphinks
Copy link
Author

sphinks commented Jun 30, 2023

Closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ra3_node issues relating to ra3 node support
Projects
None yet
Development

No branches or pull requests

2 participants