Overview of the Issue
When the atomic copy failed because of #17862, the copier got confused. It thought it was done, so it ended, but the calling context still saw a list of tables to copy and restarted it. It then quickly errored out with duplicate key errors.
Rohit said:
The root cause seems to be that the source gRPC call, serr := vc.vr.sourceVStreamer.VStreamTables(ctx, func(resp *binlogdatapb.VStreamTablesResponse) error, returned without an error and without the context expiring, which should only happen if all tables had finished streaming. It then found the additional uncopied tables in _vt.copy_state and assumed it was just starting the atomic copy.
(in vcopier_atomic.go)
It's also interesting to note that, as you can see from my start parameters below, the copy phase duration is 5 minutes. This translates into the source tablet (via GetVReplicationMaxExecutionTimeQueryHint(), I guess) logging this:
Streaming query: select /*+ MAX_EXECUTION_TIME(300000) */ * from EventLog
So, even though it tried to set a limit of 300000 ms (5 minutes), that didn't work: the copy failed at the 10 minute mark. See the log output later.
To summarize, I see these issues that need to be fixed:
The MAX_EXECUTION_TIME(300000) query hint doesn't seem to work.
The max_execution_time global setting may need to be verified before starting.
The number of rows copied may need to be confirmed, at least in atomic mode. Restarting all the way from the beginning was a bug, but I'm glad it happened, because it prevented data loss. The only reason I noticed the error was that the copy restarted from the start. Had the table list been progressively popped so that it continued with the next table, I would not have noticed.
Reproduction Steps
Perform an atomic copy of a large enough table with a value for max_execution_time on the source short enough for the source to abort the query.
We used MySQL 8.0.35 on AWS RDS as the source (the system we are migrating away from).
Tablet start params:
Binary Version
vttablet version Version: 21.0.1 (Git revision 3d4f41db2fbc32611c7d2ea2af3dc68b9d962415 branch 'HEAD') built on Tue Dec 3 05:39:35 UTC 2024 by runner@fv-az2029-313 using go1.23.3 linux/amd64
Operating System and Environment details
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"
Log Fragments
Slack discussion
https://vitess.slack.com/archives/C0PQY0PTK/p1740042281348869