-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#64] Implement copy paste protection #245
base: YuriRomanowski/#64-refactor-markdown-scanner
Are you sure you want to change the base?
[#64] Implement copy paste protection #245
Conversation
Problem: Currently xrefcheck is not able to detect possible bad copy-pastes, when some links are referring the same file, but from the link names it seems that one of those links should refer other file. Solution: Implement check, add corresponding settings to the config.
There are two minor questions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is really well implemented 👍. Just added a few comments to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good 👍
I'm leaving quite a lot of comments, but for such a PR that effectively adds a completely new large feature - I think this is not even so much.
src/Xrefcheck/Config/Default.hs
Outdated
@@ -67,6 +67,9 @@ scanners: | |||
# | |||
# This affects which anchors are generated for headers. | |||
flavor: #s{flavor} | |||
|
|||
# Whether copy-paste check is enabled globally. | |||
copyPasteCheckEnabled: True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about whether this should be true
by default (which may provide false positives and force the user to take manual actions) or false
(in such case the user may never notice that we provide such a feature).
And you know, after looking at this and at #250, I suspect that we should have two modes in dump-config
command depending on how much strict checks the user wants to have.
- In weak mode we will have copy-paste checks disabled, and all redirects will be considered OK.
- In eager mode we will have copy-paste checks enabled, and will treat permanent redirects as not OK and maybe something smart for temporary redirects.
@YuriRomanowski @aeqz What do you think?
No need to do it here, if we agree on this I'll create a separate ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By being consistent with the aims of this project in the README
file, I think that it should be false
by default, but I also think that then the user may never notice about the feature.
To consider other options, we could also try to add an intermediate configuration setting:
copyPasteCheck: (error|warning|disabled)
where:
error
shows them as errors.warning
shows them as warnings, which do not make the program to exit with error code.disabled
disables the feature.
and the default would be warning
.
In the case of going for the different dump-config
strictness modes, I guess that this mode argument should be mandatory. Otherwise, we would be again discussing what should be the default, and the weak one could make the user to miss some features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning shows them as warnings, which do not make the program to exit with error code.
Hm do we need this? I think that the user can look at the error decriptions and decide whether to just ignore those links, or to fix them immediately, it seems that showing them as warnings is rather useless. So, I would agree with @Martoon-00, we can just provide some modes of the default config, and using true/false for different modes perfectly suites the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood correctly, the doubt was in deciding whether to enable this check by default, which could produce false positives in CI, or to disable it and possibly make the user to never know about the feature.
I think that providing modes of default config can preserve the same problem, because then we would have to decide whether to set a strong one as the default and produce false positives in CI, or to set a weak one and make the user to never know about some features, although it would be easier to manage.
The point in my suggestion was to consider an intermediate option that does not produce false positives in CI, but keeps showing messages that the user can see in order to know about the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmhmhm 🤔 🤔 Both your points make sense to me, I'll need some time to digest this and produce my thoughts.
progressRef <- newIORef $ initVerifyProgress (map snd toScan) | ||
|
||
accumulated <- loopAsyncUntil (printer progressRef) do | ||
forConcurrentlyCaching toScan ifExternalThenCache $ \(file, ref) -> | ||
verifyReference config mode progressRef repoInfo' root file ref | ||
case accumulated of | ||
(, copyPasteErrors) <$> case accumulated of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmhm, I'm worried about how it's integrated into existing verification:
- We do not reflect the status in the progress bar like for local and external links verification;
- We apply no parallelization at all.
Also currently this logic is quite hanging at the end, we could extract copy-paste checking to a separate function outside of verifyRepo
and mostly avoid any duplication.
But I suppose that simultaneously resolving all the 3 things I mentioned above is impossible.
So let's discuss what is worth doing here. How would you go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I somehow missed your first comment in this PR. Let's probably discuss it here.
Regarding the progress bar, options that came to my mind:
- Add a third progress bar - not ideal, this will stop fitting into normal 80 chars width terminal.
- Run copy-paste protection check strictly after other verification, and show another progress bar after the first two are gone - not ideal, there is no paralellisation.
- Make the progress bar for local links contain the status for the copy-paste detection. This would be the best, but this may be hard to glue together. In verification we count links, and copy-paste protection - looks like it does not scale from the overall number of links, so it's not so obvious what should we count here. But this option is worth a thorough look.
- Run copy-paste protection in parallel with other verification, but display its separate progress bar only if copy-paste detection is still working after other verification is complete. This is an improved 1st option, but still sounds quite weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 3: I think we can't combine reference checks and copypaste checks in one progress bar, because the number of links for the latter check should be much fewer. We could of course say that check already passed for those links, but it's strange. And we will need the common storage for storing statuses of links, this will cause extra problems for the concurrency. It seems that there will be much more overhead than worth.
I mostly like option 4, most of the time the progress bar simply won't appear because this check is much faster than checking external links.
Review: formatting, partition
Review: fix config, README, CHANGES
Description
Problem: Currently xrefcheck is not able to detect possible bad
copy-pastes, when some links are referring the same file, but
from the link names it seems that one of
those links should refer other file.
Solution: Implement check, add corresponding settings to the config.
Related issue(s)
Fixes #64
✅ Checklist for your Pull Request
Ideally a PR has all of the checkmarks set.
If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).
Related changes (conditional)
Tests
silently reappearing again.
Documentation
Public contracts
of Public Contracts policy.
and
Stylistic guide (mandatory)
✓ Release Checklist
package.yaml
.under the "Unreleased" section to a new section for this release version.
vX.Y.Z
.xrefcheck-action
.