Is your idea related to a problem? Please describe.
I'm running a batch computing job in an account that runs thousands of such jobs a day. The basic format of each job is to read in data via an Athena query, do some processing, and then write out JSON outputs. For an arbitrary subset of these jobs, identical queries may already have been run, but we perform different operations on the data they provide and produce different results.
In order to save on costs and time, I'd like to ensure that queries don't get rerun for those duplicate cases. I've looked into the athena_cache_settings option for read_sql_query, but the mechanism of that caching is missing some desirable properties:
I'd like there to be no limit on the lookback distance. The queries we're reusing may be tens or hundreds of thousands of executions back.
I'd prefer not to make many calls to the Athena API to do the linear search for a matching query.
Describe the solution you'd like
I think the easiest way to meet these requirements is to use the ClientRequestToken parameter available in StartQueryExecution (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html). Callers of read_sql_query would then have the option to pass their own token through. If the token matches that of an earlier request, they are simply served the existing results.
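To illustrate the idea, here is a rough sketch that calls boto3 directly rather than going through read_sql_query, since the token pass-through does not exist there yet. The helper names (make_request_token, start_query_idempotently) are hypothetical, and deriving the token from a SHA-256 hash of the database and query text is just one possible choice. The hex digest is 64 characters long, which fits the 32 to 128 character range the API documents for ClientRequestToken. I haven't verified how long Athena honors a given token, so the unlimited lookback is an assumption that would need checking.

```python
import hashlib


def make_request_token(sql: str, database: str) -> str:
    """Derive a deterministic token from the database name and exact query text.

    Hypothetical helper: identical (database, sql) pairs always map to the
    same 64-character hex string.
    """
    payload = f"{database}\n{sql}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def start_query_idempotently(athena_client, sql: str, database: str, output_location: str) -> str:
    """Start a query, passing a deterministic ClientRequestToken.

    athena_client is expected to be a boto3 Athena client, e.g.
    boto3.client("athena"). Returns the QueryExecutionId from the response.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        ClientRequestToken=make_request_token(sql, database),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Usage would look something like `start_query_idempotently(boto3.client("athena"), "SELECT ...", "my_db", "s3://my-bucket/results/")`, where the database and bucket names are placeholders. The token is hashed from the exact query string on purpose: the API reference says an error is returned if a parameter such as QueryString differs for a token that has already been used, so two queries that differ only in whitespace should not share a token.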