Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: (Atomic) copy fails to detect errors and restarts itself in the middle, causing duplicate key errors #17864

Open
wiebeytec opened this issue Feb 25, 2025 · 0 comments
Labels
Needs Triage This issue needs to be correctly labelled and triaged Type: Bug

Comments

@wiebeytec
Copy link
Contributor

wiebeytec commented Feb 25, 2025

Overview of the Issue

When the atomic copy failed because of #17862, the result was that it got confused. It though it was done, so it ended, but then the calling context still saw a table list to do, and restarted itself. It then quickly errorred because of duplicate key reasons.

Rohit said:

The root cause seems to be that the source grpc serr := vc.vr.sourceVStreamer.VStreamTables(ctx, func(resp *binlogdatapb.VStreamTablesResponse) error returned without an error or the context expiring which should only happen if all tables had finished streaming. It then found the additional uncopied tables in the _vt.copy_state and assumed it was just starting the atomic copy.

(in vcopier_atomic.go)

What's also interesting to note: is that as you can see below from my start parameters, the copy phase duration is 5 minutes. This translates into the source tablet (by GetVReplicationMaxExecutionTimeQueryHint() I guess) logging this:

Streaming query: select /*+ MAX_EXECUTION_TIME(300000) */ * from EventLog

So, even tough it tried to set 300000, that didn't work. It was at the 10 minute mark that it failed. See the log output later.

To summarize, I see these issues that need to be fixed:

  • The MAX_EXECUTION_TIME(300000) query hint doesn't seem to work.
  • The max_execution_time global setting may need to be verified before starting.
  • The amount of rows copied may need to be confirmed, at least in atomic mode. It restarting all the way from the beginning was a bug, but I'm glad it did, because it did prevent data loss. The only reason I noticed the error, was because it restarted all the way from the start. Had the table list been progressively popped so it continued with the next table, I would not have noticed.

Reproduction Steps

Perform an atomic copy of a large enough table with a value for max_execution_time on the source short enough for the source to abort the query.

We used MySQL 8.0.35 at AWS RDS as source (to move away from).

Tablet start params:

vttablet \
--topo_implementation etcd2 \
--topo_global_server_address 172.31.35.179:2379 \
--topo_global_root /vitess/global \
--vreplication_copy_phase_duration 0h5m0s \
--queryserver-config-transaction-cap 500 \
--logtostderr \
--tablet-path vic-eu-central-1-live-0000000600 \
--init_keyspace vrm2023 \
--init_shard 0 \
--init_tablet_type replica \
--tablet_hostname 172.31.39.97 \
--port 18600 \
--grpc_port 15600 \
--service_map grpc-queryservice,grpc-tabletmanager,grpc-updatestream \
--table-acl-config /opt/vitess/vttablet_acl.json \
--queryserver-config-strict-table-acl \
--enforce-tableacl-config \
--pid_file /var/lib/vitess/vt_0000000600/vttablet.pid \
--db_host rds-foobar.eu-central-1.rds.amazonaws.com \
--db_port 3306 \
--db_app_user vitess_dml \
--db_app_password password_removed \
--db_dba_user vitess_dml \
--db_dba_password password_removed \
--db_repl_user vitess_dml \
--db_repl_password password_removed \
--db_filtered_user vitess_dml \
--db_filtered_password password_removed \
--db_allprivs_user vitess_dml \
--db_allprivs_password password_removed \
--sql-max-length-errors 20 \
--sanitize_log_messages \
--queryserver-config-terse-errors \
--redact-debug-ui-queries \
--init_db_name_override legacy_mysql_db_name

Binary Version

vttablet version Version: 21.0.1 (Git revision 3d4f41db2fbc32611c7d2ea2af3dc68b9d962415 branch 'HEAD') built on Tue Dec  3 05:39:35 UTC 2024 by runner@fv-az2029-313 using go1.23.3 linux/amd64

Operating System and Environment details

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"

Log Fragments

-- First mention of (copious) log entry of moving 'eventLog' (you can see it's the first because #fields > 0):
feb 20 11:00:12 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:00:12.866786 1029525 vcopier_atomic.go:114] VStreamTablesResponse: received table eventLog, #fields 10, #rows 0, gtid MySQL56/1a654e8e-a2c3-11e7-ab96-06a079b41438:1-4845581054, lastpk <nil>

-- It thought it completed here 10 minutes later because of https://github.com/vitessio/vitess/issues/17862
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.090405 1029525 vcopier_atomic.go:311] Deleting copy state for table eventLog
feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.131445 1029525 vcopier_atomic.go:304] Completed copy of all tables

feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.131467 1029525 vcopier_atomic.go:306] Returning from copyAll for vrm_all

feb 20 11:10:15 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:15.132133 1029525 vcopier_atomic.go:71] Starting copyAll for vrm_all

feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.573519 1029525 vcopier_atomic.go:148] starting copy phase with table SshIpToVncRelay

feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.575926 1029525 vcopier_atomic.go:285] VStreamTables failed: Duplicate entry....

feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: failed inserting rows

feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: task error

feb 20 11:10:36 vrm-vitess-mysql-vrm2024-b1 start_vttablet[1029525]: I0220 11:10:36.575997 1029525 vcopier_atomic.go:286] Returning from copyAll for vrm_all

Slack discussion

https://vitess.slack.com/archives/C0PQY0PTK/p1740042281348869

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Triage This issue needs to be correctly labelled and triaged Type: Bug
Projects
None yet
Development

No branches or pull requests

1 participant