
feat: deletes fail if there are too many ongoing mutations on any replica in the cluster #6351

Merged
merged 5 commits into master on Oct 1, 2024

Conversation

kylemumma
Member

@kylemumma kylemumma commented Sep 26, 2024

this is for jira ticket sns-2876

This PR makes a delete request fail if there are too many ongoing mutations on the table we are trying to delete from. "Too many ongoing mutations" means that no replica may have more than 5 ongoing; the threshold of 5 can be changed.

  • delete_from_storage now raises a TooManyOngoingMutationsError if any replica has more than 5 ongoing mutations for the given table.
  • Our HTTP API catches this exception and returns a 503 to the user stating the delete can't be completed and to try again.

I wasn't sure how to write tests for this; please let me know if you have an idea how to test it.
I verified by hand on our clickhouse sandbox that the queries look good and work.
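The behavior described above can be sketched as follows. This is a minimal, self-contained illustration with assumed names (the real Snuba code queries ClickHouse for the count and uses its own request/response machinery); here the count is passed in directly so the sketch stands on its own.

```python
# Hypothetical sketch of the PR's control flow: the delete path raises
# when too many mutations are ongoing, and the HTTP layer maps that
# exception to a 503 response.

class TooManyOngoingMutationsError(Exception):
    pass

MAX_ONGOING_MUTATIONS = 5  # default threshold per the PR description


def delete_from_storage(num_ongoing_mutations: int) -> str:
    # In the real code this count comes from system.mutations; it is a
    # plain parameter here to keep the sketch self-contained.
    if num_ongoing_mutations > MAX_ONGOING_MUTATIONS:
        raise TooManyOngoingMutationsError(
            f"{num_ongoing_mutations} ongoing mutations exceed the limit "
            f"of {MAX_ONGOING_MUTATIONS}; try again later"
        )
    return "deleted"


def handle_delete_request(num_ongoing_mutations: int) -> tuple:
    # Stand-in for the HTTP API layer: catch the error, return a 503.
    try:
        return (200, delete_from_storage(num_ongoing_mutations))
    except TooManyOngoingMutationsError as e:
        return (503, str(e))
```

With 3 ongoing mutations the delete succeeds; with 6 the handler returns a 503 carrying the error message.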

@kylemumma kylemumma marked this pull request as ready for review September 26, 2024 20:20
@kylemumma kylemumma requested a review from a team as a code owner September 26, 2024 20:20
@kylemumma kylemumma changed the title init feat: deletes fail if there are too many ongoing mutations on any replica in the cluster Sep 26, 2024
@with_span()
def delete_from_storage(
    storage: WritableTableStorage,
    columns: Dict[str, list[Any]],
    attribution_info: Mapping[str, Any],
    max_ongoing_mutations: int = 5,
Member

this number should be a runtime config option

Member

but can still use 5 as the default for the config

Comment on lines 87 to 105

# fail if too many mutations ongoing
cluster = storage.get_cluster()
if cluster.is_single_node():
    query = f"""
        SELECT count() as cnt
        FROM system.mutations
        WHERE table IN ({", ".join(map(repr, delete_settings.tables))}) AND is_done=0
    """
else:
    query = f"""
        SELECT max(cnt)
        FROM (
            SELECT hostname() as host, count() as cnt
            FROM clusterAllReplicas('{cluster.get_clickhouse_cluster_name()}', 'system', mutations)
            WHERE table IN ({", ".join(map(repr, delete_settings.tables))}) AND is_done=0
            GROUP BY host
        )
    """
Member

Make the piece that determines the number of ongoing mutations its own function. Mock it in a test; then you can test this functionality.
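The refactor suggested above might look like the sketch below: the query construction is pulled out into its own function with assumed names, so tests can exercise or mock it without a live ClickHouse.

```python
# Sketch (assumed names): isolate the ongoing-mutation query so it can
# be unit-tested or mocked independently of ClickHouse.
from typing import Sequence


def build_ongoing_mutations_query(
    tables: Sequence[str], single_node: bool, cluster_name: str = ""
) -> str:
    table_list = ", ".join(repr(t) for t in tables)
    if single_node:
        return (
            "SELECT count() as cnt FROM system.mutations "
            f"WHERE table IN ({table_list}) AND is_done=0"
        )
    # Cluster case: count per host, then take the max across replicas.
    return (
        "SELECT max(cnt) FROM ("
        "SELECT hostname() as host, count() as cnt "
        f"FROM clusterAllReplicas('{cluster_name}', 'system', mutations) "
        f"WHERE table IN ({table_list}) AND is_done=0 "
        "GROUP BY host)"
    )
```

A caller would then execute the returned SQL and compare the resulting count against the threshold, while tests mock this function to return a canned number.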

    .results[0][0]
)
if num_ongoing_mutations > max_ongoing_mutations:
    raise TooManyOngoingMutationsError()
Member

Should this exception be raised with a message saying how many mutations are ongoing? Possibly log the exception with relevant details like how many concurrent mutations are allowed etc.?
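One way to address the comment above, sketched with assumed names: put both counts in the exception message and emit a log line with the same details.

```python
# Sketch: raise with the observed and allowed counts, and log them so
# operators can see why deletes are being rejected.
import logging

logger = logging.getLogger(__name__)


class TooManyOngoingMutationsError(Exception):
    pass


def check_ongoing_mutations(num_ongoing: int, max_allowed: int) -> None:
    if num_ongoing > max_allowed:
        logger.warning(
            "delete blocked: %d ongoing mutations, max allowed is %d",
            num_ongoing,
            max_allowed,
        )
        raise TooManyOngoingMutationsError(
            f"{num_ongoing} ongoing mutations on a replica, "
            f"max allowed is {max_allowed}"
        )
```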

FROM (
    SELECT hostname() as host, count() as cnt
    FROM clusterAllReplicas('{cluster.get_clickhouse_cluster_name()}', 'system', mutations)
    WHERE table IN ({", ".join(map(repr, delete_settings.tables))}) AND is_done=0
Member

This will combine the counts for the tables, right? Can we group by table and check the max of the results? So it would be the max of the per-table maxes for the cluster query, and the per-table max for the single-node query.

Member Author

I think what you're asking for is what it's currently doing: the max ongoing mutations, for this specific table, across all nodes.

The inner query will give something like
node-1-1, 4
node-1-2, 3
node-1-3, 0

then the outer query will give 4

Member

But if there are two tables on a node, it will combine their counts:

SELECT count(), table
FROM system.mutations
WHERE table IN ('outcomes_raw_local', 'outcomes_hourly_local')
GROUP BY table
[Screenshot: per-table ongoing mutation counts for one node]

If you don't group by table, the total count is 19

This is for one node. It won't be a problem for the search issues table, but I'm just saying that if we ever need to delete from multiple tables, the query will combine their counts.
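The per-table concern above can be demonstrated with a small sketch. The rows below are hypothetical stand-ins for system.mutations entries; grouping by host alone sums mutations across tables, while grouping by (host, table) keeps the check per-table.

```python
# Sketch: grouping only by host merges counts across tables, which can
# trip the limit even when no single table is over it.
from collections import Counter

# (host, table) pairs mimicking ongoing rows in system.mutations
rows = [
    ("node-1", "outcomes_raw_local"),
    ("node-1", "outcomes_raw_local"),
    ("node-1", "outcomes_hourly_local"),
]

per_host_only = Counter(host for host, _ in rows)  # sums across tables
per_host_table = Counter(rows)                     # keeps tables separate

assert max(per_host_only.values()) == 3   # combined count: the bug
assert max(per_host_table.values()) == 2  # per-table max: the fix
```

In SQL terms this corresponds to adding the table to the GROUP BY and taking the max over the grouped counts.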

Member Author

Oh I see what you mean now, fixed.

@kylemumma
Member Author

  • I made MAX_ONGOING_MUTATIONS_FOR_DELETE a runtime config option.
  • I added how many mutations are ongoing and the max allowed to the error message.
  • I added a test (it doesn't really test the SQL, but it tests the other logic; I wouldn't know how to test the SQL because it would involve fake data in ClickHouse).
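The testing approach described in the thread could look like the sketch below, with assumed names: the function that returns the ongoing-mutation count is mocked, so the limit logic is tested without touching ClickHouse or any SQL.

```python
# Sketch (assumed names): mock the count-fetching function and assert
# that the delete path raises when the count exceeds the limit.
from unittest import mock


class TooManyOngoingMutationsError(Exception):
    pass


def get_num_ongoing_mutations() -> int:
    raise NotImplementedError("queries ClickHouse in the real code")


def delete_from_storage(max_ongoing_mutations: int = 5) -> None:
    if get_num_ongoing_mutations() > max_ongoing_mutations:
        raise TooManyOngoingMutationsError()


def test_delete_blocked_when_too_many_mutations() -> None:
    # Patch the helper in this module so no database is needed.
    with mock.patch(f"{__name__}.get_num_ongoing_mutations", return_value=6):
        try:
            delete_from_storage()
        except TooManyOngoingMutationsError:
            return
        raise AssertionError("expected TooManyOngoingMutationsError")
```

This tests the threshold logic only; verifying the SQL itself would still require real or fixture data in ClickHouse, as the author notes.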


codecov bot commented Sep 30, 2024

❌ 1 Test Failed:

Tests completed | Failed | Passed | Skipped
1 | 1 | 0 | 0

Top failed test by shortest run time:
tests.web.rpc.v1alpha.test_trace_item_attribute_values
Stack Traces | 0s run time
No failure message available

To view individual test run time comparisons to the main branch, go to the Test Analytics Dashboard.

@kylemumma kylemumma merged commit a5c470e into master Oct 1, 2024
30 checks passed
@kylemumma kylemumma deleted the krm/delete-limit-mutations branch October 1, 2024 19:27