Validate ISBNs detected via regex #171

JPrevost · 2025-01-14T15:47:59Z

Why are these changes being introduced:

It is not possible to validate detected ISBNs via regex alone
We are seeing false positives for ISBNs in production that are invalid
We validate ISSNs, but had not gone back and added this logic to ISBN

Relevant ticket(s):

https://mitlibraries.atlassian.net/browse/TCO-114

How does this address that need:

Follows the pattern to strip detected ISBNs in a way similar to ISSNs
Follows the ISBN-10 and ISBN-13 validation specifications
Updates tests for valid ISBNs are actually using valid ISBNs (previously, the ISBN-10s were incorrect converted to ISBN-13s)
Adds additional ISBN-13 examples to test positive validations
Adds additional ISBN-13 examples to test negative validations

Document any side effects to this change:

Removed "ISBN: " optional portion of the ISBN detection. It didn't feel useful and needed to be stripped out before validation (which felt like it supported to me that it wasn't part of the ISBN and shouldn't be detected)
Updated yardopts to include private methods. We do a lot of important work in private methods. While we shouldn't call them from outside of the instance, having them in the docs feels better as some of our docs were effectively making Classes look like black boxes without this change.
Some of our test ISBN-13s were invalid. The ISBN-10 list was run through an external ISBN-10 to ISBN-13 convertor. As all of those will start with 978 prefixes, I manually added some valid ISBN-13 examples to the tests that started with the other valid prefix (979).

Example query. Run in prod, this will show a detected ISBN that then fails the details lookup as it isn't actually valid. In the PR build or locally with this branch, it does not detect the invalid ISBN

{
  logSearchEvent(
    searchTerm: "Artin, M. Algebra (2nd Edition). Addison Wesley, 2010. ISBN: 9780132413771"
    sourceSystem: "playground"
  ) {
    phrase
    categories {
      name
      confidence
    }
    detectors {
      standardIdentifiers {
        kind
        value
        details {
          linkResolverUrl
        }
      }
    }
  }
}

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-###

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

Why are these changes being introduced: * It is not possible to validate detected ISBNs via regex alone * We are seeing false positives for ISBNs in production that are invalid * We validate ISSNs, but had not gone back and added this logic to ISBN Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TCO-114 How does this address that need: * Follows the pattern to strip detected ISBNs in a way similar to ISSNs * Follows the ISBN-10 and ISBN-13 validation specifications * Updates tests for valid ISBNs are actually using valid ISBNs (previously, the ISBN-10s were incorrect converted to ISBN-13s) * Adds additional ISBN-13 examples to test positive validations * Adds additional ISBN-13 examples to test negative validations Document any side effects to this change: * Removed "ISBN: " optional portion of the ISBN detection. It didn't feel useful and needed to be stripped out before validation (which felt like it supported to me that it wasn't part of the ISBN and shouldn't be detected) * Updated yardopts to include private methods. We do a lot of important work in private methods. While we shouldn't call them from outside of the instance, having them in the docs feels better as some of our docs were effectively making Classes look like black boxes without this change. * Some of our test ISBN-13s were invalid. The ISBN-10 list was run through an external ISBN-10 to ISBN-13 convertor. As all of those will start with 978 prefixes, I manually added some valid ISBN-13 examples to the tests that started with the other valid prefix (979).

JPrevost · 2025-01-15T14:48:09Z

app/models/detector/standard_identifiers.rb

+    #
+    # @param digits Array. An array of strings representing each character from a detected ISBN candidate.
+    # @return boolean
+    def valid_isbn_10?(digits)


There's an alternate way to write this taking advantage of the reduce method. I wasn't sure which felt more clear and chose not to use reduce but wanted to show the other option to get input from code review as to which feels easier to understand (as they both do the same thing).

acc is a conventional name for the accumulator. If another name would be clearer, there is nothing special about acc.

def valid_isbn_10?(digits) sum = digits.each_with_index.reduce(0) do |acc, (digit, index)| digit = '10' if digit.casecmp('x').zero? acc += digit.to_i * (10 - index) end (sum % 11).zero? end

JPrevost · 2025-01-15T14:49:23Z

app/models/detector/standard_identifiers.rb

+    #
+    # @param digits Array. An array of strings representing each character from a detected ISBN candidate.
+    # @return boolean
+    def valid_isbn_13?(digits)


Similar to the comment above, this is an example using reduce for input on whether one feels easier to understand on a quick read of the method.

def valid_isbn_13?(digits) sum = digits.map(&:to_i).each_with_index.reduce(0) do |acc, (digit, index)| acc += digit * (index.even? ? 1 : 3) end (sum % 10).zero? end

matt-bernhardt

I have a few thoughts about this, nothing significantly blocking but I one question I wanted to get your thoughts on prior to approval. The rest is conversational.

The question / request is whether we need a test for the ISBN validation that's similar to the ISSN test at https://github.com/MITLibraries/tacos/blob/main/test/models/detector/standard_identifiers_test.rb#L93-L124 - which feels like a thoroug confirmation that the check digit validation is correct compared to testing a set of (apparently random?) values.

Beyond that comment, my thoughts about the variations on the validation logic are that I'm fine with either using the reduce method or leaving it out, although my brain latched onto the method without more easily. Having now looked up the reduce documentation, it feels similar to a ternary - a simpler syntax that's relatively easy to follow, but something to keep in mind. It's also relatively easy to look up if I find myself forgetting it.

The bit that trips me up the most, actually, is remembering that digit.casecmp('x').zero? returns either 0 or -1, so there's a bit of parsing there which I find digit.downcase == 'x' avoids entirely. Was that flagged by a linter?

Other thoughts:

I like the addition of private methods to the yardopts - I agree that there are private methods in our application which feel awkward to not have documented (I think we've been writing some already, but including them for future-us in this way feels right)
Seeing the amount of validation logic growing in the StandardIdentifiers class makes me wonder whether the validation might be more suitable in the lookup classes as an public method? That might allow this class to purely focus on the regexes and pattern matching, while leaving all the "rubber meets the road" logic for each format in its own models. For the moment I think the class is fine, at least.
Thank you for parsing the math on these validations - the wikipedia articles you've referenced can (as usual for wikipedia) be a bit dense to come to terms with, but I think the appliation is better for it.

JPrevost · 2025-01-15T18:33:40Z

@matt-bernhardt yes, the linter suggested casecmp. It's listed as a performance cop so I'm inclined to stick with it even if it is less readable.

I'll not introduce reduce with this work. I think it's such a toss up to just go with what we both feel reads easier.

I'll add an explicit test of ISBNs with invalid check digits rather than implicitly testing that via the couple I added.

Re: refactoring StandardIdentifiers to pull validation out. Yes, that makes sense. I'm not sure exactly how to approach that yet but am in support of it. For a while I wanted to remove it entirely but I think it might make sense to keep it for the regexes as you noted and move the validation to separate classes. Maybe Refactoring the Lookup classes to not just be lookups. So LookupISBN becomes ISBN which can handle validation and lookups rather than just adding to the Lookup classes as-is. It feels if they are changing scope the naming may change too. We'll see. I like the gist of your idea though.

matt-bernhardt

If there's a performance implication significant enough that the linter flags it, then I'll deal with the complexity of parsing digit.casecmp('x').zero? . TIL.

Thanks for adding the negative tests - this all looks good to me!

mitlib temporarily deployed to tacos-api-pipeline-pr-171 January 14, 2025 15:49 Inactive

Use casecmp instead of downcase

1f8d59c

JPrevost temporarily deployed to tacos-api-pipeline-pr-171 January 14, 2025 15:53 Inactive

JPrevost requested review from jazairi and matt-bernhardt January 14, 2025 15:59

JPrevost commented Jan 15, 2025

View reviewed changes

matt-bernhardt self-assigned this Jan 15, 2025

matt-bernhardt reviewed Jan 15, 2025

View reviewed changes

Add tests for ISBNs with invalid check digits

624e6fc

JPrevost temporarily deployed to tacos-api-pipeline-pr-171 January 15, 2025 18:41 Inactive

JPrevost requested a review from matt-bernhardt January 15, 2025 18:42

matt-bernhardt approved these changes Jan 15, 2025

View reviewed changes

JPrevost merged commit 1c8ce6d into main Jan 15, 2025
6 checks passed

JPrevost deleted the tco-114-validate-isbns branch January 15, 2025 20:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate ISBNs detected via regex #171

Validate ISBNs detected via regex #171

JPrevost commented Jan 14, 2025 •

edited by matt-bernhardt

Loading

JPrevost Jan 15, 2025

JPrevost Jan 15, 2025

matt-bernhardt left a comment

JPrevost commented Jan 15, 2025

matt-bernhardt left a comment

Validate ISBNs detected via regex #171

Validate ISBNs detected via regex #171

Conversation

JPrevost commented Jan 14, 2025 • edited by matt-bernhardt Loading

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

JPrevost Jan 15, 2025

Choose a reason for hiding this comment

JPrevost Jan 15, 2025

Choose a reason for hiding this comment

matt-bernhardt left a comment

Choose a reason for hiding this comment

JPrevost commented Jan 15, 2025

matt-bernhardt left a comment

Choose a reason for hiding this comment

JPrevost commented Jan 14, 2025 •

edited by matt-bernhardt

Loading