Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#64] Copy paste detection in lists #102

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Heimdell
Copy link
Contributor

@Heimdell Heimdell commented Nov 9, 2021

Description

Xrefcheck recognises a bad copy paste now:

A list with bad copy-paste:

- [a](a) e
- [b](b) e
- [c](a) e

A list that is completely fine:

- [a](a) d
- [b](b) d
- [c](c) d

The first list contains [c](a) which should be [c](c).

Only the first [...](...) construct is taken for recognition.

Related issue(s)

#64

✅ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    • I checked whether I should update the docs and did so if necessary:
  • Public contracts

    • Any modifications of public contracts comply with the Evolution
      of Public Contracts
      policy.
    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

@Heimdell Heimdell force-pushed the kirill.andreev/copy-paste-in-lists branch from 36b3152 to 94748c5 Compare November 9, 2021 15:54
@Heimdell Heimdell requested a review from rvem November 9, 2021 16:18
@Heimdell Heimdell changed the title [#64] Copy paste detection in lists [WIP] [#64] Copy paste detection in lists Nov 9, 2021
}
makeLenses ''FileInfoDiff

diffToFileInfo :: FileInfoDiff -> FileInfo
diffToFileInfo (FileInfoDiff refs anchors) =
FileInfo (DList.toList refs) (DList.toList anchors)
diffToFileInfo (FileInfoDiff refs anchors pastas) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I believe here should be pastes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍝

@@ -102,26 +102,36 @@ data Anchor = Anchor
deriving stock (Show, Eq, Generic)
deriving anyclass NFData

data CopyPaste = CopyPaste
{ cpAnchorText :: Text
Copy link
Contributor

@alyoanton9 alyoanton9 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like AnchorText and PlainText sound confusing. The issue doesn't focus on anchors, but all type of links, external and internal. Perhaps, more generic LinkText and URLText would be better?

UPD: Okay, I'm not sure if all the links were intended to check, as there is only example with files in issue description. But it's also not clear that this feature is for files only

@@ -219,10 +223,19 @@ verifyRepo

progressRef <- newIORef $ initVerifyProgress (map snd toScan)

errorss <- for (M.toList repoInfo) $ \(file, info) -> do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below

Suggested change
errorss <- for (M.toList repoInfo) $ \(file, info) -> do
errors <- for (M.toList repoInfo) $ \(file, info) -> do

Just res -> return (pos, nodes, res)
Nothing -> []

urlIsASubsequence :: CopyPaste -> Bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: AFAICS, here we check if link text is a subsequence of link URL, so if name is a URL subsequence, not URL is a subsequence of name

copyPaste (Node _ (LIST _) nodes) = do
case items of
top : rest | urlIsASubsequence top -> do
let bad = filter (not . urlIsASubsequence) rest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the acceptance criteria, we want to report copy-paste if there are two links [T1](L1) and [T2](L1) in a file, and T1 is substring of L1 modulo casing and all the non-alphanum characters, while T2 is not substring of L1 modulo the same things;

However, AFAIU here you check that the first item satisfy the law -- T1 is a subsequence of L1 -- and report all other links from the list that doesn't satisfy as invalid, so [Tn](Ln) would also be reported if Tn isn't a subsequence of Ln, e.g.:

- [Foo Bar](foo-bar) e
- [Foo Qux](foo-qux) e
- [Foo Kek](foo-kek) e
- [Just text](file) e

produces

  ➥  In file tests/markdowns/without-annotations/copy/copy-paste.md
     bad reference (local) at src:18:4-12:
       - text: ""
       - link:
       - anchor: -

     ⛂  Possibly incorrect copy-paste in list with references
        the url is file
           but the text is Just text

I am not sure if we need to consider such list items as bad copy-paste, I believe it would be better to don't take them into account at all and report only those which strictly satisfy the law from acceptance criteria -- there are two links [T1](L1) and [T2](L1) in a file, and T1 is substring of L1 modulo casing and all the non-alphanum characters, while T2 is not substring of L1 modulo the same things;

@dcastro dcastro marked this pull request as draft December 7, 2021 16:15
@dcastro dcastro changed the title [WIP] [#64] Copy paste detection in lists [#64] Copy paste detection in lists Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants