-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#64] Copy paste detection in lists #102
base: master
Are you sure you want to change the base?
Conversation
36b3152
to
94748c5
Compare
} | ||
makeLenses ''FileInfoDiff | ||
|
||
diffToFileInfo :: FileInfoDiff -> FileInfo | ||
diffToFileInfo (FileInfoDiff refs anchors) = | ||
FileInfo (DList.toList refs) (DList.toList anchors) | ||
diffToFileInfo (FileInfoDiff refs anchors pastas) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I believe here should be pastes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍝
@@ -102,26 +102,36 @@ data Anchor = Anchor | |||
deriving stock (Show, Eq, Generic) | |||
deriving anyclass NFData | |||
|
|||
data CopyPaste = CopyPaste | |||
{ cpAnchorText :: Text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like AnchorText
and PlainText
sound confusing. The issue doesn't focus on anchors, but all type of links, external and internal. Perhaps, more generic LinkText
and URLText
would be better?
UPD: Okay, I'm not sure if all the links were intended to check, as there is only example with files in issue description. But it's also not clear that this feature is for files only
@@ -219,10 +223,19 @@ verifyRepo | |||
|
|||
progressRef <- newIORef $ initVerifyProgress (map snd toScan) | |||
|
|||
errorss <- for (M.toList repoInfo) $ \(file, info) -> do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below
errorss <- for (M.toList repoInfo) $ \(file, info) -> do | |
errors <- for (M.toList repoInfo) $ \(file, info) -> do |
Just res -> return (pos, nodes, res) | ||
Nothing -> [] | ||
|
||
urlIsASubsequence :: CopyPaste -> Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: AFAICS, here we check if link text is a subsequence of link URL, so if name is a URL subsequence, not URL is a subsequence of name
copyPaste (Node _ (LIST _) nodes) = do | ||
case items of | ||
top : rest | urlIsASubsequence top -> do | ||
let bad = filter (not . urlIsASubsequence) rest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the acceptance criteria, we want to report copy-paste if there are two links [T1](L1) and [T2](L1) in a file, and T1 is substring of L1 modulo casing and all the non-alphanum characters, while T2 is not substring of L1 modulo the same things;
However, AFAIU here you check that the first item satisfy the law -- T1
is a subsequence of L1
-- and report all other links from the list that doesn't satisfy as invalid, so [Tn](Ln)
would also be reported if Tn
isn't a subsequence of Ln
, e.g.:
- [Foo Bar](foo-bar) e
- [Foo Qux](foo-qux) e
- [Foo Kek](foo-kek) e
- [Just text](file) e
produces
➥ In file tests/markdowns/without-annotations/copy/copy-paste.md
bad reference (local) at src:18:4-12:
- text: ""
- link:
- anchor: -
⛂ Possibly incorrect copy-paste in list with references
the url is file
but the text is Just text
I am not sure if we need to consider such list items as bad copy-paste, I believe it would be better to don't take them into account at all and report only those which strictly satisfy the law from acceptance criteria -- there are two links [T1](L1) and [T2](L1) in a file, and T1 is substring of L1 modulo casing and all the non-alphanum characters, while T2 is not substring of L1 modulo the same things;
Description
Xrefcheck recognises a bad copy paste now:
The first list contains
[c](a)
which should be[c](c)
.Only the first
[...](...)
construct is taken for recognition.Related issue(s)
#64
✅ Checklist for your Pull Request
Ideally a PR has all of the checkmarks set.
If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).
Related changes (conditional)
Tests
silently reappearing again.
Documentation
Public contracts
of Public Contracts policy.
and
Stylistic guide (mandatory)