Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds citus_pause_node udf #7089

Merged
merged 69 commits into from
Sep 1, 2023
Merged

Adds citus_pause_node udf #7089

merged 69 commits into from
Sep 1, 2023

Conversation

gurkanindibay
Copy link
Contributor

@gurkanindibay gurkanindibay commented Jul 29, 2023

DESCRIPTION: Presenting citus_pause_node UDF enabling pausing by node_id.

citus_pause_node takes a node_id parameter and fetches all the shards in that node and puts AccessExclusiveLock on all the shards inside that node. With this lock, insert is disabled, until citus_pause_node transaction is closed.

@codecov
Copy link

codecov bot commented Jul 31, 2023

Codecov Report

Merging #7089 (43d961f) into main (4a1a549) will increase coverage by 0.00%.
The diff coverage is 89.28%.

@@           Coverage Diff           @@
##             main    #7089   +/-   ##
=======================================
  Coverage   93.20%   93.21%           
=======================================
  Files         274      274           
  Lines       59211    59230   +19     
=======================================
+ Hits        55187    55210   +23     
+ Misses       4024     4020    -4     



BackgroundWorkerHandle *
CheckBackgroundWorkerToObtainLocks(int32 lock_cooldown)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function needs a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment

}

permutation "s1-begin" "s1-pause-node" "s2-begin" "s2-insert-distributed" "s2-end" "s1-end"
permutation "s1-begin" "s1-pause-node-force" "s2-begin" "s2-insert-distributed" "s2-end" "s1-end"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's show the difference between force and non-force using these tests:

-- s1-pause-node should block
"s1-begin" "s2-begin" "s2-insert-distributed" "s1-pause-node" "s2-end" "s1-end"

-- s1-pause-node-force
"s1-begin" "s2-begin" "s2-insert-distributed" "s1-pause-node-force"(*) "s1-end"  "s2-end"

Copy link
Contributor Author

@gurkanindibay gurkanindibay Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not make it work this way. I changed the order. Could you check the results?

Comment on lines 155 to 168
-- Set statement_timeout for the session (in milliseconds)
SET client_min_messages = 'notice';

-- Variable to track if the INSERT statement was successful
DO $$
BEGIN

-- Execute the INSERT statement
insert into employee values(11,'e11',3);

END;

$$
LANGUAGE plpgsql;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This insert step can now contain only a single insert statement afaict (same for the insert-reference and delete-distributed step below).

Suggested change
-- Set statement_timeout for the session (in milliseconds)
SET client_min_messages = 'notice';
-- Variable to track if the INSERT statement was successful
DO $$
BEGIN
-- Execute the INSERT statement
insert into employee values(11,'e11',3);
END;
$$
LANGUAGE plpgsql;
insert into employee values(11,'e11',3);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 77 to 102
SET client_min_messages = 'notice';
DO $$
DECLARE
v_shard_id int;
v_node_id int;
v_node_name text;
v_node_port int;
BEGIN
--The first message in the block is being printed on the top of the code block. So adding a dummy message
--to make sure that the first message is printed in correct place.
raise notice '';
-- Get the shard id for the distribution column
SELECT get_shard_id_for_distribution_column('employee', 3) into v_shard_id;

--Get the node id for the shard id
SELECT nodename,nodeport into v_node_name,v_node_port FROM citus_shards WHERE shardid = v_shard_id limit 1;
raise notice 'node name is %',v_node_name;
raise notice 'node port is %',v_node_port;

-- Get the node id for the shard id
SELECT nodeid into v_node_id FROM pg_dist_node WHERE nodename = v_node_name and nodeport = v_node_port limit 1;


-- Pause the node
perform pg_catalog.citus_pause_node_within_txn(v_node_id) ;
END;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these notices are very helpful now. Let's remove them.

Suggested change
SET client_min_messages = 'notice';
DO $$
DECLARE
v_shard_id int;
v_node_id int;
v_node_name text;
v_node_port int;
BEGIN
--The first message in the block is being printed on the top of the code block. So adding a dummy message
--to make sure that the first message is printed in correct place.
raise notice '';
-- Get the shard id for the distribution column
SELECT get_shard_id_for_distribution_column('employee', 3) into v_shard_id;
--Get the node id for the shard id
SELECT nodename,nodeport into v_node_name,v_node_port FROM citus_shards WHERE shardid = v_shard_id limit 1;
raise notice 'node name is %',v_node_name;
raise notice 'node port is %',v_node_port;
-- Get the node id for the shard id
SELECT nodeid into v_node_id FROM pg_dist_node WHERE nodename = v_node_name and nodeport = v_node_port limit 1;
-- Pause the node
perform pg_catalog.citus_pause_node_within_txn(v_node_id) ;
END;
DO $$
DECLARE
v_shard_id int;
v_node_id int;
v_node_name text;
v_node_port int;
BEGIN
-- Get the shard id for the distribution column
SELECT get_shard_id_for_distribution_column('employee', 3) into v_shard_id;
--Get the node id for the shard id
SELECT nodename,nodeport into v_node_name,v_node_port FROM citus_shards WHERE shardid = v_shard_id limit 1;
-- Get the node id for the shard id
SELECT nodeid into v_node_id FROM pg_dist_node WHERE nodename = v_node_name and nodeport = v_node_port limit 1;
-- Pause the node
perform pg_catalog.citus_pause_node_within_txn(v_node_id) ;
END;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 60 to 69
BEGIN
select nextval('pg_dist_node_nodeid_seq')::int into v_node_id;
select citus_pause_node_within_txn(v_node_id) ;
EXCEPTION
WHEN SQLSTATE 'P0002' THEN
GET STACKED DIAGNOSTICS v_exception_message = MESSAGE_TEXT;
v_expected_exception_message := 'node ' || v_node_id || ' not found';
if v_exception_message = v_expected_exception_message then
RAISE NOTICE 'Node not found.';
end if;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when you don't transform the ERROR to a NOTICE like this:

Suggested change
BEGIN
select nextval('pg_dist_node_nodeid_seq')::int into v_node_id;
select citus_pause_node_within_txn(v_node_id) ;
EXCEPTION
WHEN SQLSTATE 'P0002' THEN
GET STACKED DIAGNOSTICS v_exception_message = MESSAGE_TEXT;
v_expected_exception_message := 'node ' || v_node_id || ' not found';
if v_exception_message = v_expected_exception_message then
RAISE NOTICE 'Node not found.';
end if;
BEGIN
select nextval('pg_dist_node_nodeid_seq')::int into v_node_id;
select citus_pause_node_within_txn(v_node_id) ;

I think we'll see the error in the test output, which would be a fine way to test, and reduced the testing code.

Copy link
Contributor Author

@gurkanindibay gurkanindibay Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node id is a changing value and error message includes node id. Therefore, I'm adding this exception to have a constant error message

permutation "s1-begin" "s2-begin" "s1-pause-node" "s2-insert-reference" "s1-end" "s2-end"
permutation "s1-begin" "s1-pause-node" "s1-pause-node" "s1-end"
permutation "s1-begin" "s1-node-not-found" "s1-end"
permutation "s1-begin" "s2-begin" "s2-insert-distributed" "s1-pause-node-force"(*) "s2-end" "s1-end"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be swapped to test that the "force" argument kills the s2 query

Suggested change
permutation "s1-begin" "s2-begin" "s2-insert-distributed" "s1-pause-node-force"(*) "s2-end" "s1-end"
permutation "s1-begin" "s2-begin" "s2-insert-distributed" "s1-pause-node-force"(*) "s1-end" "s2-end"

Copy link
Contributor Author

@gurkanindibay gurkanindibay Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test isolation_setup              ... ok           31 ms
test isolation_cluster_management ... ok           77 ms
test isolation_citus_pause_node   ... FAILED    11001 ms

In that case it takes 10 seconds to test the case. If we use so, I need to give a timeout; e.g. 1000 ms

Here in this scneario, there are two problems

  1. tests take too much time and set statement_timeout does not work
  2. I need to put two expected test result file since in some cases below value is printed twice
FATAL:  terminating connection due to administrator command 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used lock_cooldown parameter with the suggestion of @JelteF and it worked

@gurkanindibay gurkanindibay merged commit b8bded6 into main Sep 1, 2023
37 checks passed
@gurkanindibay gurkanindibay deleted the citus_pause_node branch September 1, 2023 08:39
francisjodi pushed a commit that referenced this pull request Nov 13, 2023
DESCRIPTION: Presenting citus_pause_node UDF enabling pausing by
node_id.

citus_pause_node takes a node_id parameter and fetches all the shards in
that node and puts AccessExclusiveLock on all the shards inside that
node. With this lock, insert is disabled, until citus_pause_node
transaction is closed.

---------

Co-authored-by: Hanefi Onaldi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants