Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate ISBNs detected via regex #171

Merged
merged 3 commits into from
Jan 15, 2025
Merged

Validate ISBNs detected via regex #171

merged 3 commits into from
Jan 15, 2025

Conversation

JPrevost
Copy link
Member

@JPrevost JPrevost commented Jan 14, 2025

Why are these changes being introduced:

  • It is not possible to validate detected ISBNs via regex alone
  • We are seeing false positives for ISBNs in production that are invalid
  • We validate ISSNs, but had not gone back and added this logic to ISBN

Relevant ticket(s):

How does this address that need:

  • Follows the pattern to strip detected ISBNs in a way similar to ISSNs
  • Follows the ISBN-10 and ISBN-13 validation specifications
  • Updates tests for valid ISBNs are actually using valid ISBNs (previously, the ISBN-10s were incorrect converted to ISBN-13s)
  • Adds additional ISBN-13 examples to test positive validations
  • Adds additional ISBN-13 examples to test negative validations

Document any side effects to this change:

  • Removed "ISBN: " optional portion of the ISBN detection. It didn't feel useful and needed to be stripped out before validation (which felt like it supported to me that it wasn't part of the ISBN and shouldn't be detected)
  • Updated yardopts to include private methods. We do a lot of important work in private methods. While we shouldn't call them from outside of the instance, having them in the docs feels better as some of our docs were effectively making Classes look like black boxes without this change.
  • Some of our test ISBN-13s were invalid. The ISBN-10 list was run through an external ISBN-10 to ISBN-13 convertor. As all of those will start with 978 prefixes, I manually added some valid ISBN-13 examples to the tests that started with the other valid prefix (979).

Example query. Run in prod, this will show a detected ISBN that then fails the details lookup as it isn't actually valid. In the PR build or locally with this branch, it does not detect the invalid ISBN

{
  logSearchEvent(
    searchTerm: "Artin, M. Algebra (2nd Edition). Addison Wesley, 2010. ISBN: 9780132413771"
    sourceSystem: "playground"
  ) {
    phrase
    categories {
      name
      confidence
    }
    detectors {
      standardIdentifiers {
        kind
        value
        details {
          linkResolverUrl
        }
      }
    }
  }
}

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

Why are these changes being introduced:

* It is not possible to validate detected ISBNs via regex alone
* We are seeing false positives for ISBNs in production that are invalid
* We validate ISSNs, but had not gone back and added this logic to ISBN

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-114

How does this address that need:

* Follows the pattern to strip detected ISBNs in a way similar to ISSNs
* Follows the ISBN-10 and ISBN-13 validation specifications
* Updates tests for valid ISBNs are actually using valid ISBNs
  (previously, the ISBN-10s were incorrect converted to ISBN-13s)
* Adds additional ISBN-13 examples to test positive validations
* Adds additional ISBN-13 examples to test negative validations

Document any side effects to this change:

* Removed "ISBN: " optional portion of the ISBN detection. It didn't feel useful
  and needed to be stripped out before validation (which felt like it supported to
  me that it wasn't part of the ISBN and shouldn't be detected)
* Updated yardopts to include private methods. We do a lot of important
  work in private methods. While we shouldn't call them from outside of
  the instance, having them in the docs feels better as some of our docs
  were effectively making Classes look like black boxes without this
  change.
* Some of our test ISBN-13s were invalid. The ISBN-10 list was run
  through an external ISBN-10 to ISBN-13 convertor. As all of those will
  start with 978 prefixes, I manually added some valid ISBN-13 examples
  to the tests that started with the other valid prefix (979).
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-171 January 14, 2025 15:49 Inactive
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-171 January 14, 2025 15:53 Inactive
#
# @param digits Array. An array of strings representing each character from a detected ISBN candidate.
# @return boolean
def valid_isbn_10?(digits)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an alternate way to write this taking advantage of the reduce method. I wasn't sure which felt more clear and chose not to use reduce but wanted to show the other option to get input from code review as to which feels easier to understand (as they both do the same thing).

acc is a conventional name for the accumulator. If another name would be clearer, there is nothing special about acc.

def valid_isbn_10?(digits)
  sum = digits.each_with_index.reduce(0) do |acc, (digit, index)|
    digit = '10' if digit.casecmp('x').zero?
    acc += digit.to_i * (10 - index)
  end
  (sum % 11).zero?
end

#
# @param digits Array. An array of strings representing each character from a detected ISBN candidate.
# @return boolean
def valid_isbn_13?(digits)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment above, this is an example using reduce for input on whether one feels easier to understand on a quick read of the method.

def valid_isbn_13?(digits)
  sum = digits.map(&:to_i).each_with_index.reduce(0) do |acc, (digit, index)|
    acc += digit * (index.even? ? 1 : 3)
  end

  (sum % 10).zero?
end

@matt-bernhardt matt-bernhardt self-assigned this Jan 15, 2025
Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few thoughts about this, nothing significantly blocking but I one question I wanted to get your thoughts on prior to approval. The rest is conversational.

The question / request is whether we need a test for the ISBN validation that's similar to the ISSN test at https://github.com/MITLibraries/tacos/blob/main/test/models/detector/standard_identifiers_test.rb#L93-L124 - which feels like a thoroug confirmation that the check digit validation is correct compared to testing a set of (apparently random?) values.


Beyond that comment, my thoughts about the variations on the validation logic are that I'm fine with either using the reduce method or leaving it out, although my brain latched onto the method without more easily. Having now looked up the reduce documentation, it feels similar to a ternary - a simpler syntax that's relatively easy to follow, but something to keep in mind. It's also relatively easy to look up if I find myself forgetting it.

The bit that trips me up the most, actually, is remembering that digit.casecmp('x').zero? returns either 0 or -1, so there's a bit of parsing there which I find digit.downcase == 'x' avoids entirely. Was that flagged by a linter?

Other thoughts:

  • I like the addition of private methods to the yardopts - I agree that there are private methods in our application which feel awkward to not have documented (I think we've been writing some already, but including them for future-us in this way feels right)
  • Seeing the amount of validation logic growing in the StandardIdentifiers class makes me wonder whether the validation might be more suitable in the lookup classes as an public method? That might allow this class to purely focus on the regexes and pattern matching, while leaving all the "rubber meets the road" logic for each format in its own models. For the moment I think the class is fine, at least.
  • Thank you for parsing the math on these validations - the wikipedia articles you've referenced can (as usual for wikipedia) be a bit dense to come to terms with, but I think the appliation is better for it.

@JPrevost
Copy link
Member Author

@matt-bernhardt yes, the linter suggested casecmp. It's listed as a performance cop so I'm inclined to stick with it even if it is less readable.

I'll not introduce reduce with this work. I think it's such a toss up to just go with what we both feel reads easier.

I'll add an explicit test of ISBNs with invalid check digits rather than implicitly testing that via the couple I added.

Re: refactoring StandardIdentifiers to pull validation out. Yes, that makes sense. I'm not sure exactly how to approach that yet but am in support of it. For a while I wanted to remove it entirely but I think it might make sense to keep it for the regexes as you noted and move the validation to separate classes. Maybe Refactoring the Lookup classes to not just be lookups. So LookupISBN becomes ISBN which can handle validation and lookups rather than just adding to the Lookup classes as-is. It feels if they are changing scope the naming may change too. We'll see. I like the gist of your idea though.

@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-171 January 15, 2025 18:41 Inactive
Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a performance implication significant enough that the linter flags it, then I'll deal with the complexity of parsing digit.casecmp('x').zero? . TIL.

Thanks for adding the negative tests - this all looks good to me!

:shipit:

@JPrevost JPrevost merged commit 1c8ce6d into main Jan 15, 2025
6 checks passed
@JPrevost JPrevost deleted the tco-114-validate-isbns branch January 15, 2025 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants