Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"pg_probackup-15 archive-get" ignores archived wals while recovering to latest state #621

Open
gsl23 opened this issue May 29, 2024 · 2 comments

Comments

@gsl23
Copy link

gsl23 commented May 29, 2024

Hi guys ! I'm testing wal archive recovery and cant to understand why archive-get just lose some last wals :

postgresql.auto.conf:

 Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.

# recovery settings added by pg_probackup restore of backup SE7QA1 at '2024-05-29 15:19:25+05'
## recovery settings
recovery_target_timeline = 'current'
recovery_target_action = 'pause'
restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f  --remote-host=bkp_host'

wals in pg_probackup archive :

-rw-------. 1 postgres postgres    95587 May 29 10:41 0000000A0000000000000031.gz
-rw-------. 1 postgres postgres    95648 May 29 11:11 0000000A0000000000000032.gz
-rw-------. 1 postgres postgres    96702 May 29 11:42 0000000A0000000000000033.gz
-rw-------. 1 postgres postgres   106265 May 29 12:12 0000000A0000000000000034.gz
-rw-------. 1 postgres postgres    95951 May 29 12:41 0000000A0000000000000035.gz
-rw-------. 1 postgres postgres    79297 May 29 12:44 0000000A0000000000000036.gz
-rw-------. 1 postgres postgres 16777216 May 29 12:44 0000000A0000000000000037.partial
-rw-------. 1 postgres postgres      378 May 28 19:11 0000000A.history
-rw-------. 1 postgres postgres   128597 May 29 13:14 0000000B0000000000000037.gz
-rw-------. 1 postgres postgres    79741 May 29 13:18 0000000B0000000000000038.gz
-rw-------. 1 postgres postgres    83013 May 29 13:25 0000000B0000000000000039.gz
-rw-------. 1 postgres postgres    98973 May 29 13:55 0000000B000000000000003A.gz
-rw-------. 1 postgres postgres   100455 May 29 14:25 0000000B000000000000003B.gz
-rw-------. 1 postgres postgres   101462 May 29 14:55 0000000B000000000000003C.gz
-rw-------. 1 postgres postgres   117154 May 29 15:25 0000000B000000000000003D.gz
-rw-------. 1 postgres postgres      422 May 29 12:44 0000000B.history

postgresql.log while recovery :

2024-05-29 15:24:58 +05 [400240]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400240]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 145ms
2024-05-29 15:24:59.040 +05 [400156] LOG:  restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59.052 +05 [400156] LOG:  invalid resource manager ID 114 at 0/370000A0
2024-05-29 15:24:59.052 +05 [400156] LOG:  redo done at 0/37000028 system usage: CPU: user: 0.02 s, system: 0.17 s, elapsed: 5.91 s
2024-05-29 15:24:59.052 +05 [400156] LOG:  last completed transaction was at log time 2024-05-29 12:44:00.017459+05
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 140ms
2024-05-29 15:24:59.199 +05 [400156] LOG:  restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000B.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 144ms
2024-05-29 15:24:59.369 +05 [400156] LOG:  restored log file "0000000B.history" from archive
2024-05-29 15:24:59 +05 [400251]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000C.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400251]: [1-1]: ERROR: pg_probackup archive-get failed to deliver WAL file: 0000000C.history, time elapsed: 186ms
2024-05-29 15:24:59.566 +05 [400156] LOG:  selected new timeline ID: 12
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 139ms
2024-05-29 15:24:59.784 +05 [400156] LOG:  restored log file "0000000A.history" from archive
2024-05-29 15:24:59.793 +05 [400156] LOG:  archive recovery complete

Why it is ended on 0000000A0000000000000037 last completed transaction was at log time 2024-05-29 12:44:00.017459+05 ??
Where is
-rw-------. 1 postgres postgres 79741 May 29 13:18 0000000B0000000000000038.gz
-rw-------. 1 postgres postgres 83013 May 29 13:25 0000000B0000000000000039.gz
etc.... ???

@gsl23
Copy link
Author

gsl23 commented May 29, 2024

At 2024-05-29 12:44 standby instance have been promoted to primary :

2024-05-29 12:44:17.079 +05 [33061] LOG:  replication terminated by primary server
2024-05-29 12:44:17.079 +05 [33061] DETAIL:  End of WAL reached on timeline 10 at 0/370000A0.
2024-05-29 12:44:17.079 +05 [33061] FATAL:  could not send end-of-streaming message to primary: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
        no COPY in progress
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.087 +05 [33048] LOG:  invalid resource manager ID 114 at 0/370000A0
2024-05-29 12:44:17.109 +05 [350162] FATAL:  could not connect to the primary server: connection to server at "192.168.130.23", port 5432 failed: FATAL:  the
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.112 +05 [33048] LOG:  waiting for WAL to become available at 0/370000B8
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000036, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 66ms
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.887 +05 [33048] LOG:  received promote request
2024-05-29 12:44:17.887 +05 [33048] LOG:  redo done at 0/37000028 system usage: CPU: user: 0.75 s, system: 0.83 s, elapsed: 63143.22 s
2024-05-29 12:44:17.887 +05 [33048] LOG:  last completed transaction was at log time 2024-05-29 12:44:00.017459+05
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.898 +05 [33048] LOG:  selected new timeline ID: 11
cp: cannot stat '/pgsql/data/../wal_archive/0000000A.history': No such file or directory
2024-05-29 12:44:18.037 +05 [33048] LOG:  archive recovery complete
2024-05-29 12:44:18.068 +05 [33046] LOG:  checkpoint starting: force
2024-05-29 12:44:18.075 +05 [33042] LOG:  database system is ready to accept connections
2024-05-29 12:44:18.082 +05 [350180] LOG:  pg_cron scheduler started
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000B.history, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 6ms
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000037.partial, threads: 1/1, batch: 1/1, compression: zl
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 78ms
2024-05-29 12:44:19.011 +05 [33046] LOG:  checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.897 s, sync=0.011
2024-05-29 12:44:19.011 +05 [33046] LOG:  checkpoint starting: immediate force wait
2024-05-29 12:44:19.060 +05 [33046] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.009 s, sync=0.001 s
2024-05-29 12:45:00.007 +05 [350180] LOG:  cron job 1 starting: 

Do we have any chances to apply later wals after this occasion ?

@gsl23
Copy link
Author

gsl23 commented May 29, 2024

Looks like it is pg_probackup archive-get problem.

I copied all needed wals manually to server local dir and setted up restore by cp :

# restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f'
restore_command = 'cp /pgsql/data/../wal_archive2/%f %p'

And now its recovered forward after 0000000B0000000000000037 without any problems:

2024-05-29 17:32:07.935 +05 [439475] LOG:  restored log file "0000000A0000000000000032" from archive
2024-05-29 17:32:08.025 +05 [439475] LOG:  restored log file "0000000A0000000000000033" from archive
2024-05-29 17:32:08.115 +05 [439475] LOG:  restored log file "0000000A0000000000000034" from archive
2024-05-29 17:32:08.193 +05 [439475] LOG:  restored log file "0000000A0000000000000035" from archive
2024-05-29 17:32:08.271 +05 [439475] LOG:  restored log file "0000000A0000000000000036" from archive
2024-05-29 17:32:08.355 +05 [439475] LOG:  restored log file "0000000B0000000000000037" from archive
2024-05-29 17:32:08.436 +05 [439475] LOG:  restored log file "0000000B0000000000000038" from archive
2024-05-29 17:32:08.528 +05 [439475] LOG:  restored log file "0000000B0000000000000039" from archive
2024-05-29 17:32:08.606 +05 [439475] LOG:  restored log file "0000000B000000000000003A" from archive
2024-05-29 17:32:08.693 +05 [439475] LOG:  restored log file "0000000B000000000000003B" from archive
2024-05-29 17:32:08.770 +05 [439475] LOG:  restored log file "0000000B000000000000003C" from archive
2024-05-29 17:32:08.865 +05 [439475] LOG:  restored log file "0000000B000000000000003D" from archive
2024-05-29 17:32:08.946 +05 [439475] LOG:  restored log file "0000000B000000000000003E" from archive
2024-05-29 17:32:09.011 +05 [439475] LOG:  recovery stopping before commit of transaction 8004, time 2024-05-29 15:46:00.001085+05
2024-05-29 17:32:09.011 +05 [439475] LOG:  pausing at the end of recovery
2024-05-29 17:32:09.011 +05 [439475] HINT:  Execute pg_wal_replay_resume() to promote.
2024-05-29 17:33:27.613 +05 [439475] LOG:  redo done at 0/3E00FB88 system usage: CPU: user: 0.02 s, system: 0.20 s, elapsed: 81.25 s
2024-05-29 17:33:27.613 +05 [439475] LOG:  last completed transaction was at log time 2024-05-29 15:45:00.024303+05

So, my test case :

  1. Set up standby and pg_probackup with wal archiving
  2. Take full backup from standby
  3. Promote standby
  4. Try to restore by using full backup(before promotion) and apply wals to timeline after promotion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant