After connection problems: cannot receive incremental stream: destination ... has been modified #203
Comments
Your report of using mismatched zrep versions makes me think you may also
be running mismatched zfs versions.
Of the two, it is far more important to have ZFS in sync.
If they are not, I suggest you focus on standardizing zfs to the latest
practical version for you.
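As a quick sanity check, something along these lines (with zrep-target standing in for however you reach the remote host) will show whether the two ends agree:
# print the ZFS userland/kmod versions on the source and on the target
zfs --version
ssh zrep-target zfs --version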
|
I reproduced the issue with freshly installed VMs (Ubuntu Server 22.04) and the latest version of zrep.
The problem only occurs when the connection is dropped after the actual data has been copied by zfs - if it drops mid-stream the next run won't face the problem. I assume this explains why in our production situation we almost only see the problem for a dataset with low activity and basically never on a much busier dataset. To reproduce, I killed ssh on the target host.
Another transcript with a few annotations (the full transcripts are preserved in the quoted reply below):
Successful run before reproducing
Write a bit of data
Next run, triggering the problem (note that I kill SSH on the target just as the process stalls briefly at the end; you killed it at the right time if the new snapshot becomes visible a few seconds later on the target)
Try to do next runs, fails
Fix by renaming the latest actually transported snapshot
Run again successfully
|
Thank you for taking the time to reproduce it cleanly.
Do you have any suggestions on how you think Zrep could best handle this
oddity?
…On Wed, Aug 16, 2023 at 12:23 AM Pascal Schwarz ***@***.***> wrote:
I reproduced the issue with freshly installed VMs (Ubuntu Server 22.04)
and the latest version of zrep:
root@zrep-source:~# zfs --version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1
The problem only occurs when the connection is dropped after the actual
data has been copied by zfs - if it drops mid-stream the next run won't
face the problem.
To reproduce I killed ssh on the target host with kill $(ps -ef | grep
"sshd: ***@***.***" | grep -v grep | gawk '{print $2}').
Another transcript with a few annotations
Successful run before reproducing
root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000014
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000014 to zrep-target:rpool/ds1
send from @zrep_000012 to rpool/ds1@zrep_000013_unsent estimated size is 100M
send from @zrep_000013_unsent to rpool/ds1@zrep_000014 estimated size is 624B
total estimated size is 100M
TIME SENT SNAPSHOT rpool/ds1@zrep_000013_unsent
08:02:49 3.13M rpool/ds1@zrep_000013_unsent
[...]
08:03:02 97.5M rpool/ds1@zrep_000013_unsent
Also running expire on zrep-target:rpool/ds1 now...
Expiring zrep snaps on rpool/ds1
Write a bit of data
root@zrep-source:~# dd bs=1M if=/dev/urandom of=/rpool/ds1/50M.bin count=50 conv=fsync
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 12.9225 s, 4.1 MB/s
Next run, triggering the problem
Note that I kill SSH on the target just as the process stalls briefly at
the end. You killed it at the right time if rpool/ds1@zrep_000015
becomes visible a few seconds later on the target (zfs list -t snap -r
rpool/ds1).
root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000015
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000015 to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015 estimated size is 50.1M
total estimated size is 50.1M
TIME SENT SNAPSHOT rpool/ds1@zrep_000015
08:04:33 6.40M rpool/ds1@zrep_000015
08:04:34 16.0M rpool/ds1@zrep_000015
08:04:35 25.6M rpool/ds1@zrep_000015
08:04:36 35.1M rpool/ds1@zrep_000015
08:04:37 44.6M rpool/ds1@zrep_000015
Connection to zrep-target closed by remote host.
Error: Problem doing sync for rpool/ds1@zrep_000015. Renamed to rpool/ds1@zrep_000015_unsent
Try to do next runs, fails
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000015_unsent to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015_unsent estimated size is 50.1M
total estimated size is 50.1M
cannot receive incremental stream: destination rpool/ds1 has been modified
since most recent snapshot
Error: Problem doing sync for rpool/ds1@zrep_000015_unsent. Renamed to rpool/ds1@zrep_000015_unsent_unsent
root@zrep-source:~# zrep snaponly rpool/ds1
rpool/ds1@zrep_000016
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000016 to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015_unsent_unsent estimated size is 50.1M
send from @zrep_000015_unsent_unsent to rpool/ds1@zrep_000016 estimated size is 624B
total estimated size is 50.1M
cannot receive incremental stream: destination rpool/ds1 has been modified
since most recent snapshot
Error: Problem doing sync for rpool/ds1@zrep_000016. Renamed to rpool/ds1@zrep_000016_unsent
Fix by renaming latest actually transported snapshot
root@zrep-source:~# zfs rename rpool/ds1@zrep_000015_unsent_unsent rpool/ds1@zrep_000015
Run again successfully
root@zrep-source:~# zrep synconly rpool/ds1
sending rpool/ds1@zrep_000016_unsent to zrep-target:rpool/ds1
send from @zrep_000014 to rpool/ds1@zrep_000015 estimated size is 50.1M
send from @zrep_000015 to rpool/ds1@zrep_000016_unsent estimated size is 624B
total estimated size is 50.1M
TIME SENT SNAPSHOT rpool/ds1@zrep_000015
08:05:46 11.5M rpool/ds1@zrep_000015
08:05:47 21.2M rpool/ds1@zrep_000015
08:05:48 30.7M rpool/ds1@zrep_000015
08:05:49 40.2M rpool/ds1@zrep_000015
08:05:50 49.7M rpool/ds1@zrep_000015
Also running expire on zrep-target:rpool/ds1 now...
Expiring zrep snaps on rpool/ds1
|
I see mostly two ways in which this could be resolved... but I am not an expert on zfs, so take this with a grain of salt.
Potential solutions
a) Improve detecting snapshots on the target
To me it seems that in the output above, after renaming the snapshot on the source, the next run succeeds, so zrep could presumably detect by itself that the snapshot it renamed to _unsent had in fact already reached the target (a rough sketch of that idea follows below).
b) Don't rename to _unsent
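A rough sketch of what I mean by a), purely illustrative and not meant as actual zrep code (host, dataset and snapshot names are the ones from my repro):
# does the supposedly unsent snapshot on the source share its GUID with the
# newest snapshot on the target? If yes, the data already arrived and the
# original name can simply be restored.
SRC_SNAP="rpool/ds1@zrep_000015_unsent"
src_guid=$(zfs get -H -o value guid "$SRC_SNAP")
tgt_guid=$(ssh zrep-target zfs list -H -t snapshot -d 1 -o guid -S creation rpool/ds1 | head -n 1)
if [ "$src_guid" = "$tgt_guid" ]; then
    zfs rename "$SRC_SNAP" "${SRC_SNAP%_unsent}"
fi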
|
First off, I should say that one needs to take my words cautiously,
since my brain is not fully immersed in zrep and zfs these days :))
That being said, the summary of the situation seems to me to be as follows:
1. You have an unreliable overall system in one way or another. So,
this isn't a bug in zrep; this is looking for a zrep-customer-friendly
workaround.
2. Your remote system is a little quirky. It is reporting "fs changed"
and refusing to update the filesystem when, if my memory is correct,
OTHER systems don't complain about that sort of thing.
I will also make the ASSUMPTION that you are using the pair of systems
as intended in the zrep design: that is to say, the remote system is
pure archive, and is not written to or modified in any way other than
by zrep.
If all of the above is true, then my suggestion to you is:
Use the "force" flag for syncing, in your standard run.
I believe there is a flag that says, "go force the sync to happen,
even if the remote side has been modified".
That should give you a viable workaround, I think?
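For context, the ZFS mechanism such a force option boils down to is the -F flag of zfs receive, which rolls the destination back to its most recent snapshot before applying the stream. Roughly like this, using the names from your transcript (how exactly zrep passes that flag through, I would have to double-check):
# force the receive even though rpool/ds1 on the target looks "modified"
zfs send -I @zrep_000015 rpool/ds1@zrep_000016 | ssh zrep-target zfs receive -F rpool/ds1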
|
I disagree on calling my suggestion above a workaround. Connection issues do happen, and a replication tool intended to be used across the network should recover from them. We do create readonly clones of the replicated dataset on the target, but in the simple example above where the issue was reproduced, nothing was done on the target except for killing ssh to simulate connection loss. Our mechanism which creates & cleans up the clones has never caused a "destination has been modified" error over the years (actually we never saw it at all). The problem we are seeing now is purely due to the connection issues paired with the rename to _unsent. |
Maybe a third strategy ("do nothing") next to "rename to _unsent" could be added? If zrep simply keeps the snapshot but does not rename it, the system would be in the same state as it is after we manually rename the problematic snapshot on the source. Or do I miss something? |
your most recent debug output was described as replicating the problem, so
I only looked deeply at that.
But now you say that the "destination modified" problem is not part of
your real problem.
So then, sounds like we need a proper test case that shows your "real
problem" properly.
Regarding your suggestion of "do nothing"...
it seems a bit iffy to me.
But if you'd like to try it out, you might replace the "else" at line 1865
with
elif [[ "$ZREP_RENAME_UNSENT" != "hold" ]] ; then
and try that out.
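To spell out what that change would do, here is a rough, hypothetical sketch of the failure handling (illustrative only, not the actual code around line 1865):
# hypothetical sketch of the behaviour after a failed sync
handle_failed_sync() {
    snapname=$1    # e.g. rpool/ds1@zrep_000015
    if [[ "$ZREP_RENAME_UNSENT" != "hold" ]] ; then
        # current default: flag the snapshot so later runs treat it as unsent
        zfs rename "$snapname" "${snapname}_unsent"
    else
        # ZREP_RENAME_UNSENT=hold: keep the name untouched ("do nothing"),
        # the same state you reach today by renaming it back manually
        :
    fi
}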
|
I have now tried it. I noticed that this is not exposed as a normal CLI argument.
Transcript:
|
well.. focusing on the ultimate cause behind the scenes for a minute:
If the "destination modified" thing is the main problem affecting you..
then we should probably focus on that.
I'm presuming this is because of your readonly clone making.
Even if it theoretically shouldn't... I think it is pretty clear that it is.
Best case, it isn't strictly a bug, but a contention problem.
In which case, you need to create some kind of mutual exclusion lock, where
the machines are never doing (create a clone) and (zfs sync) at the same
time.
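A minimal sketch of such a lock, assuming both the sync and the clone refresh are triggered from the source host so one local lock file can serialize them (the lock path and script name are made up):
# crontab on the source: flock -n skips a run instead of piling up behind the lock
*/15 * * * * flock -n /var/lock/zrep-ds1.lock zrep synconly rpool/ds1
5 * * * *    flock -n /var/lock/zrep-ds1.lock ssh zrep-target /usr/local/sbin/refresh-readonly-clones.sh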
But if you are guaranteed never to overlap those operations already... then
this should be considered a ZFS bug, and you should complain to your ZFS
vendor, I think.
|
Could you please give insight on ZREP_RESUME=1? Are there additional risks / known reasons not to use it?
Regarding "I'm presuming this is because of your readonly clone making": it is unrelated. Again: on the VMs where I now reproduced the issue I am not doing any cloning. |
I believe ZREP_RESUME is an environment setting, because
1. it was difficult to detect whether a particular zfs implementation
supported it in the expected manner
2. if you choose to use it, you should always use it. Therefore, it
should be set in environment, or the zrep config file, rather than a
flag.
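So if you decide to use it, something like this in whatever wrapper or cron job drives zrep would follow that advice (the wrapper itself is just an example):
#!/bin/sh
# always run zrep with resumable receives enabled
export ZREP_RESUME=1
zrep snaponly rpool/ds1
zrep synconly rpool/ds1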
…On Fri, Aug 18, 2023 at 8:34 AM Pascal Schwarz ***@***.***> wrote:
Could you please give insight on ZREP_RESUME=1? Are there additional risks / known reasons not to use it?
I'm presuming this is because of your readonly clone making.
Even if it theoretically shouldnt... I think it is pretty clear that it is.
It is unrelated. Again: On the VMs where I now reproduced the issue I am not doing any cloning.
|
We have been using zrep for multiple years to replicate through SSH, but in the last few months we suffer from less-than-optimal connectivity, which is causing pain with zrep. The error message is inaccurate, resulting from zrep being confused by its own error handling. When we encounter the issue, manual steps are needed to resolve it.
Order of events
Connection problem
We encounter a connection problem during zrep synconly; this log is from the source where we run the command.
Next run fails
This happens with all subsequent runs:
We usually do zrep snaponly followed by zrep synconly to ensure consistency with systems running on top of the replicated dataset.
Failing run with ZREP_VERBOSE
Further info / workaround
Latest snapshot on target
But we see above that zrep thinks it needs to send from @zrep_000bc5 instead of @zrep_000bc6. @zrep_000bc6 has been renamed to @zrep_000bc6_unsent on the source.
Comparing @zrep_000bc6
Source
Target
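A comparison of this kind can be made with zfs get (tank/data and target-host stand in for our real names):
# on the source, where the snapshot has been renamed
zfs get -H -o value guid tank/data@zrep_000bc6_unsent
# on the target
ssh target-host zfs get -H -o value guid tank/data@zrep_000bc6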
Fixing replication
Rename
Since @zrep_000bc6 has actually been successfully transferred (same GUID), we rename it on the source so it has the same name as on the target:
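Concretely, the rename looks like this (again with a placeholder dataset name):
# give the snapshot back its original name so it matches the target again
zfs rename tank/data@zrep_000bc6_unsent tank/data@zrep_000bc6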
Rerun
... and run it again, successfully.
Version used
We are running zrep 1.9.0 on the source and 1.8.0 on the target. We are going to update (& align versions) soon, but from the changelog it's not obvious whether there were any related changes.