-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Engx242 identifierpatterns #13
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has my tentative approval. Before merging, I think we should make a naming decision vis-à-vis StandardIdentifiers
and PatternDetector
(see my comment in the reference doc).
Also, I'm not sure how much rebasing you plan to do, but I agree with your Slack comment about adding Matt as coauthor, and possibly adding a note that some of these algorithms came from our TIMDEX-UI experiment. That seems like it may be useful to our future selves.
app/models/standard_identifiers.rb
Outdated
end | ||
|
||
def validate_issn(candidate) | ||
# This model is only called when the regex for an ISSN has indicated an ISSN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, I think: "model" should be "method" here.
docs/reference/patterns.md
Outdated
@@ -0,0 +1,58 @@ | |||
## Pattern detection and metadata enhancement | |||
|
|||
The PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID. Other techniques than regular expressions may also occur here, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Pattern Detector
is a more descriptive name for what the model and GraphQL type are trying to do, but less descriptive for the GraphQL field than StandardIdentifiers
. Would it be confusing for the field to have a different name than the model and type? Is that even possible?
This documentation also mentions an Enhancer
, which is not yet implemented in this PR. (I assume it will be soon, though, based on the experiment you shared last week.) If that model/type is not named Enhancer
, then an alternative approach could be to clarify which models/types are an example of Pattern Detector
, and which are an example of Enhancer
. In other words, Pattern Detector
and Enhancer
are how we're describing the features at a high level, whereas StandardIdentifiers
and RecordsFound
(or whatever) are the actual implementations of those features. In that case, we should make that clarification in this reference doc.
I do think we should make a decision on this before merging, because I suspect it will be confusing if the naming conventions in the docs don't match what's in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super helpful feedback and I agree. I'll give it some thought, likely ask for a synchronous chat, then make appropriate changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards having PatternDetectors and Enhancers be the high level concept for this type of work, and StandardIdentifiers being our first implementation of a PatternDetector and coming soon, LookupDoi (etc) being implementations of Enhancers.
I'll be adding a commit that reflects this and then reaching out for further discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation reads much clearer to me, and I think this approach gives us the best of both naming conventions. Thanks for incorporating that feedback. 👍
@@ -1,10 +1,10 @@ | |||
## Pattern detection and metadata enhancement | |||
|
|||
The PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID. Other techniques than regular expressions may also occur here, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses. | |||
A PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers Class). Other techniques than regular expressions may also occur as PatternDetectors, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is non-blocking: it may be useful to write this as Pattern Detector
rather than PatternDetector
, so it's clear that we're talking about a concept and not a model/module.
Why are these changes being introduced: Detection of patterns will eventually help us with categorizing the incoming search terms, which is the ultimate goal of this application. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/ENGX-242 * https://mitlibraries.atlassian.net/browse/ENGX-241 The StandardIdentifier patterns and tests were originally developed as part of our TIMDEX UI application and have been extracted and modified for use in this application. https://github.com/MITLibraries/timdex-ui co-authored-by: Matt Bernhardt <[email protected]>
d6a51b2
to
f14464e
Compare
Why these changes are being introduced
Adds detection and return of detected ISSN, ISBN, DOI, PMID if requested via GraphQL
If the field is not requested via GraphQL, we do not do the detection.
Not included at this time is external lookups to return additional information, but we may want to discuss how that would work as part of this to ensure that need fits in cleanly with this GraphQL design
Side effects of this change
phrase
to SearchEventType even though it is technically part of the TermType for convenienceRelevant ticket(s)