[#64] Implement copy paste protection #245

Open
wants to merge 5 commits into
base: YuriRomanowski/#64-refactor-markdown-scanner

Contributor

@YuriRomanowski YuriRomanowski commented Dec 16, 2022

Description

Problem: Currently xrefcheck cannot detect likely copy-paste mistakes,
where several links refer to the same file but the link names suggest
that one of them should refer to a different file.

Solution: Implement the check and add the corresponding settings to the config.
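
To make the problem statement concrete, here is a minimal sketch of one possible heuristic. It is not the code from this PR: the `Link` type, the `normalize` rule, and the choice of the first matching link as the "original" are all assumptions made for illustration.

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.Function (on)
import Data.List (groupBy, isInfixOf, sortOn)

-- | Hypothetical view of a link: its visible text and the file it targets.
data Link = Link
  { linkText :: String
  , linkTarget :: FilePath
  } deriving (Show, Eq)

-- | Lowercase and keep only letters and digits, so that
-- "Second file" can be compared with "second-file.md".
normalize :: String -> String
normalize = map toLower . filter isAlphaNum

-- | Assumed heuristic: the link text "names" the target if its normalized
-- form occurs inside the normalized target path.
textMatchesTarget :: Link -> Bool
textMatchesTarget (Link text target) =
  let t = normalize text
  in not (null t) && t `isInfixOf` normalize target

-- | For every target referenced by several links, if at least one link text
-- matches the target, report each non-matching link together with the first
-- matching link, which is assumed to be the original.
suspiciousLinks :: [Link] -> [(Link, Link)]
suspiciousLinks links =
  [ (suspect, original)
  | grp <- byTarget
  , original : _ <- [filter textMatchesTarget grp]
  , suspect <- filter (not . textMatchesTarget) grp
  ]
  where
    -- sortOn is stable, so links keep their source order within a group.
    byTarget = groupBy ((==) `on` linkTarget) (sortOn linkTarget links)
```

With this sketch, a link with text "First file" that targets `second-file.md` is reported when another link with text "Second file" targets the same file. Which link counts as the original when several match is exactly the open question raised further down in this conversation.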

Related issue(s)

Fixes #64

✅ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    • I checked whether I should update the docs and did so if necessary:
  • Public contracts

    • Any modifications of public contracts comply with the Evolution
      of Public Contracts
      policy.
    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

✓ Release Checklist

  • I updated the version number in package.yaml.
  • I updated the changelog and moved everything
    under the "Unreleased" section to a new section for this release version.
  • (After merging) I edited the auto-release.
    • Change the tag and title using the format vX.Y.Z.
    • Write a summary of all user-facing changes.
    • Deselect the "This is a pre-release" checkbox at the bottom.
  • (After merging) I updated xrefcheck-action.
  • (After merging) I uploaded the package to hackage.

@YuriRomanowski YuriRomanowski changed the base branch from master to YuriRomanowski/#64-refactor-markdown-scanner December 16, 2022 12:10
@YuriRomanowski
Contributor Author

There are two minor questions:

  1. Should we add a progress bar (or change the existing one) for copy-paste checking?
  2. Which link should be considered the original if there are several possible candidates?

Contributor

@aeqz aeqz left a comment

I think that this is really well implemented 👍. Just added a few comments to discuss.

Member

@Martoon-00 Martoon-00 left a comment

Looks very good 👍

I'm leaving quite a lot of comments, but for a PR that effectively adds a completely new large feature, I think this is not even that many.

@@ -67,6 +67,9 @@ scanners:
#
# This affects which anchors are generated for headers.
flavor: #s{flavor}

# Whether copy-paste check is enabled globally.
copyPasteCheckEnabled: True
Member

I'm thinking about whether this should be true by default (which may produce false positives and force the user to take manual action) or false (in which case the user may never notice that we provide such a feature).

And you know, after looking at this and at #250, I suspect that we should have two modes in the dump-config command, depending on how strict the user wants the checks to be.

  • In weak mode we will have copy-paste checks disabled, and all redirects will be considered OK.
  • In eager mode we will have copy-paste checks enabled, and will treat permanent redirects as not OK and maybe something smart for temporary redirects.
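
As a rough illustration of the two modes above, the presets could be modelled like this. All names here are hypothetical (only `copyPasteCheckEnabled` mirrors the config key in this PR), and temporary redirects are deliberately left out because their handling is still undecided.

```haskell
-- Hypothetical strictness levels for a dump-config mode argument.
data Strictness = Weak | Eager
  deriving (Show, Eq)

-- Hypothetical subset of the settings a preset would fix.
-- Temporary redirects are omitted: their eager behaviour is undecided.
data Preset = Preset
  { copyPasteCheckEnabled :: Bool
  , permanentRedirectsOk :: Bool
  } deriving (Show, Eq)

-- | Weak: copy-paste check off, permanent redirects accepted.
-- Eager: copy-paste check on, permanent redirects rejected.
presetFor :: Strictness -> Preset
presetFor Weak = Preset { copyPasteCheckEnabled = False, permanentRedirectsOk = True }
presetFor Eager = Preset { copyPasteCheckEnabled = True, permanentRedirectsOk = False }
```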

@YuriRomanowski @aeqz What do you think?

No need to do it here, if we agree on this I'll create a separate ticket.

Contributor

To be consistent with the aims of this project as stated in the README, I think that it should be false by default, but then the user may never notice the feature.

To consider other options, we could also try to add an intermediate configuration setting:

copyPasteCheck: (error|warning|disabled)

where:

  • error shows them as errors.
  • warning shows them as warnings, which do not make the program exit with an error code.
  • disabled disables the feature.

and the default would be warning.
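
A minimal sketch of how such a three-valued setting could drive the exit code, assuming hypothetical names (`CopyPasteCheck`, `parseCopyPasteCheck`, `exitCodeFor`) that do not exist in xrefcheck:

```haskell
import System.Exit (ExitCode (..))

-- | Hypothetical representation of the proposed copyPasteCheck setting.
data CopyPasteCheck = CheckError | CheckWarning | CheckDisabled
  deriving (Show, Eq)

-- | Parse the textual config value; unknown values are rejected.
parseCopyPasteCheck :: String -> Maybe CopyPasteCheck
parseCopyPasteCheck s = case s of
  "error" -> Just CheckError
  "warning" -> Just CheckWarning
  "disabled" -> Just CheckDisabled
  _ -> Nothing

-- | The default suggested in this comment.
defaultCopyPasteCheck :: CopyPasteCheck
defaultCopyPasteCheck = CheckWarning

-- | Exit code from the mode, the number of copy-paste findings, and the
-- number of other verification errors. Findings fail the run only in
-- error mode; other errors always fail it.
exitCodeFor :: CopyPasteCheck -> Int -> Int -> ExitCode
exitCodeFor mode copyPasteFindings otherErrors
  | otherErrors > 0 = ExitFailure 1
  | mode == CheckError && copyPasteFindings > 0 = ExitFailure 1
  | otherwise = ExitSuccess
```

In warning mode the findings would still be printed, but a run with no other errors would pass in CI.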

If we go for the different dump-config strictness modes, I guess that this mode argument should be mandatory. Otherwise, we would again be discussing what the default should be, and the weak one could make the user miss some features.

Contributor Author

warning shows them as warnings, which do not make the program exit with an error code.

Hm, do we need this? I think that the user can look at the error descriptions and decide whether to just ignore those links or to fix them immediately, so showing them as warnings seems rather useless. So I would agree with @Martoon-00: we can just provide several modes of the default config, and using true/false for the different modes suits the idea perfectly.

Contributor

If I understood correctly, the doubt was about whether to enable this check by default, which could produce false positives in CI, or to disable it and possibly leave the user unaware of the feature.

I think that providing modes of the default config preserves the same problem, because we would then have to decide whether to make a strong one the default and produce false positives in CI, or to make a weak one the default and leave the user unaware of some features, although it would be easier to manage.

The point of my suggestion was to consider an intermediate option that does not produce false positives in CI, but keeps showing messages that let the user learn about the feature.

Member

Hmhmhm 🤔 🤔 Both your points make sense to me, I'll need some time to digest this and produce my thoughts.

   progressRef <- newIORef $ initVerifyProgress (map snd toScan)

   accumulated <- loopAsyncUntil (printer progressRef) do
     forConcurrentlyCaching toScan ifExternalThenCache $ \(file, ref) ->
       verifyReference config mode progressRef repoInfo' root file ref
-  case accumulated of
+  (, copyPasteErrors) <$> case accumulated of
Member

Hmhm, I'm worried about how it's integrated into the existing verification:

  • We do not reflect the status in the progress bar as we do for local and external link verification;
  • We apply no parallelization at all.

Also, this logic currently hangs awkwardly at the end; we could extract copy-paste checking to a separate function outside of verifyRepo and mostly avoid any duplication.

But I suppose that simultaneously resolving all the 3 things I mentioned above is impossible.
So let's discuss what is worth doing here. How would you go?

Member

Oh I somehow missed your first comment in this PR. Let's probably discuss it here.

Regarding the progress bar, options that came to my mind:

  1. Add a third progress bar. Not ideal: this will no longer fit into a normal 80-character-wide terminal.
  2. Run the copy-paste protection check strictly after the other verification, and show another progress bar after the first two are gone. Not ideal: there is no parallelisation.
  3. Make the progress bar for local links include the status of the copy-paste detection. This would be the best, but it may be hard to glue together. In verification we count links, whereas copy-paste protection does not seem to scale with the overall number of links, so it's not obvious what we should count here. But this option is worth a thorough look.
  4. Run copy-paste protection in parallel with the other verification, but display its separate progress bar only if copy-paste detection is still running after the other verification is complete. This is an improved version of the 1st option, but it still sounds quite weird to me.
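
Option 4 could be sketched roughly as follows, using only `MVar` and `forkIO` from base. The function name and its arguments are hypothetical, and the real code uses its own helpers (such as `loopAsyncUntil`), which this sketch does not attempt to reproduce.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar, tryReadMVar)

-- | Run the copy-paste check in the background while the main verification
-- runs. If the check is still running once verification is done, run the
-- action that shows a separate progress bar. The returned Bool says whether
-- that action was run.
--
-- Caveat of this sketch: an exception in the background check is not
-- propagated, so the final takeMVar would never receive a result.
runWithLateProgress
  :: IO a   -- main verification (local and external links)
  -> IO b   -- copy-paste check
  -> IO ()  -- show the extra progress bar
  -> IO (a, b, Bool)
runWithLateProgress verify copyPaste showBar = do
  done <- newEmptyMVar
  _ <- forkIO (copyPaste >>= putMVar done)
  verified <- verify
  -- Non-blocking peek: has the background check already finished?
  finished <- tryReadMVar done
  shown <- case finished of
    Just _ -> pure False
    Nothing -> showBar >> pure True
  result <- takeMVar done
  pure (verified, result, shown)
```

If the copy-paste check usually finishes well before external link verification, the extra progress bar would rarely be shown at all.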

Contributor Author

For 3: I think we can't combine reference checks and copy-paste checks in one progress bar, because the number of links involved in the latter check should be much smaller. We could of course say that the check has already passed for the remaining links, but that would be strange. We would also need common storage for link statuses, which would cause extra concurrency problems. It seems that there would be more overhead than it is worth.
I mostly like option 4: most of the time the progress bar simply won't appear, because this check is much faster than checking external links.

Development

Successfully merging this pull request may close these issues.

Try to implement copy-paste protection checks
3 participants