
Cherry-pick from main to release-13.0 #7840

Merged

naisila merged 5 commits into release-13.0 from release-13.0-naisila on Jan 12, 2025

Conversation

@naisila (Member) commented Jan 10, 2025

No description provided.


codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (release-13.0@3e924db). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             release-13.0    #7840   +/-   ##
===============================================
  Coverage                ?   89.47%           
===============================================
  Files                   ?      274           
  Lines                   ?    59967           
  Branches                ?     7506           
===============================================
  Hits                    ?    53656           
  Misses                  ?     4167           
  Partials                ?     2144           

@naisila naisila marked this pull request as ready for review January 12, 2025 18:49
emelsimsek and others added 5 commits January 12, 2025 22:05
…ceiver timeouts during large shard splits. (#7229)

DESCRIPTION: Send keepalive messages during the logical replication
phase of large shard splits to avoid timeouts.

During the logical replication part of the shard split process, the split
decoder filters out the WAL records produced by the initial copy. If the
number of WAL records is large, the split decoder ends up processing for
a long time before sending out any WAL records through pgoutput. Hence
the WAL receiver may time out and restart repeatedly, causing the
catch-up logic in our split driver code to fail.

Notes:

1. If `wal_receiver_timeout` is set to a very small value, e.g. 600ms,
the receiver may still time out before receiving the keepalives. My tests
show that this code works best when `wal_receiver_timeout` is set to
1 minute, which is the default value.

2. Once a logical replication worker times out, a new one gets launched.
The new logical replication worker resets the pg_stat_subscription columns
to their initial values, e.g. latest_end_lsn is set to 0. Our driver logic
in `WaitForGroupedLogicalRepTargetsToCatchUp` cannot handle the LSN value
going backwards, which is the main reason it gets stuck in an infinite
loop.

(cherry picked from commit e9035f6)
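A minimal sketch of the keepalive idea described in the shard split commit above, assuming a pgoutput-style decoding plugin and the PostgreSQL 15/16 `OutputPluginUpdateProgress()` API; the callback, predicate, and threshold names here are illustrative, not the actual Citus code:

```c
#include "postgres.h"

#include "replication/logical.h"
#include "replication/output_plugin.h"
#include "replication/reorderbuffer.h"
#include "utils/rel.h"

/* how many filtered-out changes to skip before reporting progress (assumed) */
#define CHANGES_PER_KEEPALIVE 10000

static uint64 SkippedChangeCount = 0;

/*
 * Hypothetical predicate: does this change belong to the shard being split?
 * A real implementation would consult the shard split metadata.
 */
static bool
ChangeBelongsToTargetShard(Relation relation, ReorderBufferChange *change)
{
	return false;	/* placeholder: the sketch focuses on the keepalive path */
}

static void
shard_split_change_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
					  Relation relation, ReorderBufferChange *change)
{
	if (!ChangeBelongsToTargetShard(relation, change))
	{
		/*
		 * We are dropping this record. Periodically report progress so the
		 * walsender still produces keepalive traffic while nothing is being
		 * streamed, keeping the subscriber's wal_receiver_timeout from firing.
		 */
		if (++SkippedChangeCount % CHANGES_PER_KEEPALIVE == 0)
		{
			OutputPluginUpdateProgress(ctx, true);	/* PG 15/16 signature */
		}

		return;
	}

	/* otherwise forward the change to the wrapped pgoutput handler (omitted) */
}
```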
When executing a prepared CALL, which is not pure SQL but is available with
some drivers such as npgsql and pgjdbc, Citus entered a code path where no
plan is defined while trying to increase the plan's cost, hence a SIGSEGV
(signal 11) when the plan is a NULL pointer.

Fix by only increasing the plan cost when the plan is not NULL.

However, it is a bit suspicious to get here with a NULL plan, and maybe a
better change would be to not call
ShardPlacementForFunctionColocatedWithDistTable() with a NULL plan at
all (in call.c:134).
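A minimal sketch of the shape of that guard, with an assumed variable name (`localPlan`) and cost constant; not the exact Citus patch:

```c
#include "postgres.h"

#include "nodes/plannodes.h"
#include "optimizer/cost.h"

/*
 * Sketch only: discourage the local plan only when one actually exists, so a
 * prepared CALL that reaches this path with localPlan == NULL no longer
 * dereferences a NULL pointer. DISABLE_COST stands in for whatever bump the
 * real code applies.
 */
static void
DiscourageLocalPlanIfAny(PlannedStmt *localPlan)
{
	if (localPlan != NULL)
	{
		localPlan->planTree->total_cost += DISABLE_COST;
	}
}
```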

The bug was hit with, for example:
```java
CallableStatement proc = con.prepareCall("{CALL p(?)}");
proc.registerOutParameter(1, java.sql.Types.BIGINT);
proc.setInt(1, -100);
proc.execute();
```

where `p(bigint)` is a distributed "function" and the parameter is the
distribution key (also in a distributed table); see #7242 for details.

Fixes #7242

(cherry picked from commit 0678a2f)
This fixes #7230.

First of all, using HeapTupleHeaderGetDatumLength(heapTuple) is
definitely wrong: it gives a number that is 4 times smaller than the
correct tuple size (heapTuple.t_len). See

https://github.com/postgres/postgres/blob/REL_16_0/src/include/access/htup_details.h#L455-L456

https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L279

https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L225-L226
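Following those definitions, HeapTupleHeaderGetDatumLength() is just VARSIZE(), and VARSIZE_4B() shifts the 4-byte header right by two bits to strip the varlena flag bits; passing a HeapTuple (whose first field is t_len) instead of a HeapTupleHeader therefore returns roughly t_len / 4, which is where the "4 times less" comes from. An illustrative snippet (not the Citus code):

```c
#include "postgres.h"

#include "access/htup_details.h"

/* Illustration only: contrast the wrong and the correct size computation. */
static Size
IntermediateResultTupleSize(HeapTuple heapTuple)
{
	/* wrong: VARSIZE() of the struct reads t_len and shifts it, ~t_len / 4 */
	Size wrongSize = HeapTupleHeaderGetDatumLength(heapTuple);

	/* right: the actual length of the heap tuple */
	Size rightSize = heapTuple->t_len;

	Assert(wrongSize <= rightSize);
	return rightSize;
}
```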

When I fixed it, the limit_intermediate_size test failed, so I tried to
understand what was going on there. In the original commit fd546cf these
queries were supposed to fail. Then in b3af63c three of the queries that
were supposed to fail suddenly worked, and the tests were changed to pass
without understanding why the output had changed or how to keep the test
testing what it was meant to test. Even the comments saying that these
queries should fail were left untouched. The commit message gives no clue
about why exactly the test output changed:

> It seems that when we use adaptive executor instead of task tracker, we
> exceed the intermediate result size less in the test. Therefore updated
> the tests accordingly.

Then 3fda2c3 also blindly raised the limit for one of the queries to
keep it working:

3fda2c3#diff-a9b7b617f9dfd345318cb8987d5897143ca1b723c87b81049bbadd94dcc86570R19

When that HeapTupleHeaderGetDatumLength(heapTuple) call was finally added
in fe3caf3, one of those test queries started failing again.

The other two of them are now also failing after the fix. I don't
understand how exactly the calculation of the "intermediate result size"
that is limited by citus.max_intermediate_result_size changed through
b3af63c and fe3caf3, but these numbers are now closer to what they
originally were when this limitation was added in fd546cf. So these
queries should fail, as in the original version of the
limit_intermediate_size test.

Co-authored-by: Karina Litskevich <[email protected]>
(cherry picked from commit 20dc58c)
LoadShardList is called twice, which is not necessary, and there is no
need to sort the shard placement list since we only want to know the list
length.

(cherry picked from commit 8e979f7)
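A minimal sketch of that refactor; LoadShardList is the function named in the commit above, while `ShardPlacementList` and the other identifiers are assumed for illustration:

```c
#include "postgres.h"

#include "nodes/pg_list.h"

/* LoadShardList is named above; ShardPlacementList is an assumed helper. */
extern List *LoadShardList(Oid relationId);
extern List *ShardPlacementList(uint64 shardId);

/* Sketch only: load the shard list once, and count placements without sorting. */
static int
FirstShardPlacementCount(Oid relationId)
{
	List *shardIdList = LoadShardList(relationId);
	uint64 *firstShardId = (uint64 *) linitial(shardIdList);
	List *placementList = ShardPlacementList(*firstShardId);

	return list_length(placementList);	/* only the length is needed */
}
```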
This change refactors the code to use generate_qualified_relation_name to
build the relation name directly from the relation id, instead of calling
a sequence of functions to construct it.

Fixes #6602

(cherry picked from commit ee11492)
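A sketch of that simplification; generate_qualified_relation_name is the helper named in the commit above (assumed to be declared in Citus' headers), and the replaced call chain in the comment is an assumption about the old code:

```c
#include "postgres.h"

/* helper named in the commit above; declaration assumed for this sketch */
extern char *generate_qualified_relation_name(Oid relationId);

/*
 * Sketch only: derive "schema"."relation" straight from the relation id,
 * instead of assembling it from several lookups (e.g. get_rel_name(),
 * get_namespace_name(), quote_qualified_identifier()).
 */
static char *
QualifiedRelationName(Oid relationId)
{
	return generate_qualified_relation_name(relationId);
}
```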
@naisila naisila force-pushed the release-13.0-naisila branch from 73610dd to ddef972 Compare January 12, 2025 19:09
@naisila naisila merged commit ddef972 into release-13.0 Jan 12, 2025
153 of 155 checks passed
@naisila naisila deleted the release-13.0-naisila branch January 12, 2025 19:35
@naisila naisila changed the title Bump pg versions cherry-pick PR Jan 12, 2025
@naisila naisila changed the title cherry-pick PR cherry-pick from main to release-13.0 PR Jan 12, 2025
@naisila naisila changed the title cherry-pick from main to release-13.0 PR Cherry-pick from main to release-13.0 Jan 12, 2025