diff --git a/docs/architecture-decisions/0009-define-categorization-architecture.md b/docs/architecture-decisions/0009-define-categorization-architecture.md new file mode 100644 index 0000000..abb3d51 --- /dev/null +++ b/docs/architecture-decisions/0009-define-categorization-architecture.md @@ -0,0 +1,427 @@ +# 9. Define categorization architecture + +Date: 2024-09-06 + +## Status + +Accepted + +## Context + +We need to define the data model and workflow for TACOS and its users to place search terms into categories. This +includes a discussion about how those categories themselves will be represented (and what they are), and how existing +structures like Detectors contribute to that categorization activity. + +A future decision, which should be considered now although not yet resolved, is how to enable users to validate these +categorization actions. + +### The relationship between Terms, Detectors, and Categories + +At a very high level, TACOS works according to the following flowchart: + +```mermaid +flowchart LR + + Terms + Detectors + Categories + + Terms -- are evaluated by --> Detectors + Detectors <-- are mapped to --> Categories + Categories -- get linked with --> Terms +``` + +Search terms are received from a contributing system, and are evaluated by a set of Detectors which look for specific +patterns. Those Detectors are mapped to one or more Categories. As a result of these detections and their relationship +with each category, TACOS is able to calculate the strength of the link between each term and category. + +The decision being documented here is how we achieve this relationship. + +## Options considered + +We evaluated multiple ways of implementing these relationships through prototyping, diagramming, and extensive +discussions. Each is documented here. 
+ +Each of the options described below uses the same graphic language: + +* Terms, which flow in continuously with Search Events; +* A knowledge graph, which includes the categories, detectors, and relationships + between the two which TACOS defines and maintains, and which is consulted during categorization; and +* The linkages between these terms and the graph, which record which signals are + detected in each term, and how those signals are interpreted to place the term into a category. + +A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue +tables in the diagrams below. + +### Prototype Zero + +The simplest option to relate these elements is a single three-way join model, which would have pointers back to each +of the Term, Detector, and Category models. + +```mermaid +classDiagram + direction TB + + Term <-- Link + Detector <-- Link + Category <-- Link + + + class Term + Term: +Integer id + Term: +String phrase + + class Category + Category: +Integer id + Category: +String name + + class Link:::styleClass + Link: +Integer id + Link: +Integer term_id + Link: +Integer category_id + Link: +Integer detector_id + + class Detector + Detector: +Integer id + Detector: +String name + + + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + + style Category fill:#000,stroke:#fc8d62,color:#fc8d62 + style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 + + style Link fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +``` + +This option was rejected almost immediately because it does not allow for enough flexibility and would spawn far too +many extraneous records. + +### Prototype A + +The "A" prototype defined its linking records in two large models. The `Detection` model would record the relationship +between every `Term` and each detector in the application, with a field for each output. 
The `Categorization` model +would then build upon those detections, with a field for a calculated score according to each category. The category +with the highest score would finally be stored in the `Term` model for better performance. + +The knowledge graph in this prototype would be comparatively sparse, with models for each lookup-style detector. The +relationships between detectors and categories would be defined directly within methods in the `Categorization` model. + +```mermaid +classDiagram + direction LR + + Term --< Detection: has many + Detection <-- Categorization: based on + Categorization --> SuggestedResource: looks up + Detection --> SuggestedResource: looks up + Detection --> Journal: looks up + + + class Term + Term: +Integer id + Term: +String phrase + Term: +Enum category + + class SuggestedResource + SuggestedResource: +Integer id + SuggestedResource: +String title + SuggestedResource: +String url + SuggestedResource: +String phrase + SuggestedResource: +String fingerprint + SuggestedResource: +Enum category + SuggestedResource: calculateFingerprint() + + class Journal + Journal: +Integer id + Journal: +String title + + class Detection + Detection: +Integer id + Detection: +Integer term_id + Detection: +Integer detector_version + Detection: +Boolean DOI + Detection: +Boolean ISBN + Detection: +Boolean ISSN + Detection: +Boolean PMID + Detection: +Boolean Journal + Detection: +Boolean SuggestedResource + Detection: initialize() + Detection: setDetectionVersion() + Detection: recordDetections() + Detection: recordPatterns() + Detection: recordJournals() + Detection: recordSuggestedResource() + + class Categorization + Categorization: +Integer id + Categorization: +Integer detection_id + Categorization: +Float information_score + Categorization: +Float navigation_score + Categorization: +Float transaction_score + Categorization: initialize() + Categorization: assign() + Categorization: evaluate() + Categorization: calculateAll() + Categorization: 
calculateInformation() + Categorization: calculateNavigation() + Categorization: calculateTransaction() + + + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + + style Category fill:#000,stroke:#fc8d62,color:#fc8d62 + style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 + style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 + style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 + + style Detection fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; + style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +``` + +A benefit of this prototype is that the `Detection` and `Categorization` models would be very intuitive to work with, +and would allow for repeated classification as our application evolves. Querying these models from the controller level +would be very simple. + +An area of uncertainty in this prototype was how to calculate confidence values and categorization scores for each +detector and category. We discussed multiple options for this question, but ultimately did not decide on a single +approach. + +### Prototype B + +The "B" prototype makes a different choice for recording both the knowledge graph and the linkages to the terms flowing +into the application. The knowledge graph is more explicitly modeled in the database, with models for `Category`, +`Detectinator`, and the `DetectinatorCategory` model which maps between the two. + +Because each of these records is now a separate entry, this prototype further breaks up the large models for detection +and categorization outputs. The detection result is spread across multiple records in the `TermDetectinator` and +`TermSuggestedResource` models. The final categorization process is also recorded in multiple `TermCategory` records. 
+ +Because of this dispersion of information across multiple records, the methods needed to do the work end up being +defined in the `Term` model - shown here as methods like `evaluate_detectinators()` and `categorize()`. + + +```mermaid +classDiagram + direction LR + + Term >-- TermDetectinator + TermDetectinator --> Detectinator + Category <-- DetectinatorCategory + DetectinatorCategory --> Detectinator + Term --> TermCategory + TermCategory <-- Category + SuggestedResource --> Category + Term <-- TermSuggestedResource + TermSuggestedResource --> SuggestedResource + + + class Term + Term: +Integer id + Term: +String phrase + Term: categorize() + Term: evaluate_detectinators() + Term: evaluate_identifiers() + Term: evaluate_journals() + Term: evaluate_suggested_resources() + + class TermDetectinator + TermDetectinator: +Integer term_id + TermDetectinator: +Integer detector_id + TermDetectinator: +Boolean result + + class Detectinator + Detectinator: +Integer id + Detectinator: +String name + Detectinator: +Float confidence + + class Category + Category: +Integer id + Category: +String name + Category: +String note + + class DetectinatorCategory + DetectinatorCategory: +Integer detectinator_id + DetectinatorCategory: +Integer category_id + DetectinatorCategory: +Float confidence + + class TermCategory + TermCategory: +Integer term_id + TermCategory: +Integer category_id + TermCategory: +Float confidence + TermCategory: +Integer user_id + + class SuggestedResource + SuggestedResource: +Integer id + SuggestedResource: +String title + SuggestedResource: +String fingerprint + SuggestedResource: +URL url + SuggestedResource: +Integer category_id + + class TermSuggestedResource + TermSuggestedResource: +Integer term_id + TermSuggestedResource: +Integer suggested_resource_id + TermSuggestedResource: +Boolean result + + + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + + style Category fill:#000,stroke:#fc8d62,color:#fc8d62 + style Detectinator 
fill:#000,stroke:#fc8d62,color:#fc8d62 + style DetectinatorCategory fill:#000,stroke:#fc8d62,color:#fc8d62 + style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 + + style TermDetectinator fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; + style TermSuggestedResource fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; + style TermCategory fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +``` + +One immediate advantage of this approach is that we have appropriate fields in the knowledge graph for storing +confidence values, which would be multiplied together to generate the final `score` value that is recorded in the +`TermCategory` records. + +A drawback to this prototype is the duplication between the Detectinator and SuggestedResource models (remembering that +SuggestedResource is one of the application's detectors). While this set of models was meant to allow different +SuggestedResource records to be affiliated with different categories, that feature can be supported via code, rather +than relying on the data model. + +### Prototype C + +The "C" prototype was a further evolution of the "A" prototype, which attempted to combine all detection and +categorization outputs in a single model. By changing the `Detection` table to store floats rather than boolean +values, we attempted to reduce the number of models needed in the application. 
+ +```mermaid +classDiagram + direction LR + + Term --< Detection: has many + + + class Term + Term: +Integer id + Term: +String phrase + Term: calculateCategory() + + class Detection + Detection: +Integer id + Detection: +Integer term_id + Detection: +Integer detector_version + Detection: +Float DOI + Detection: +Float ISBN + Detection: +Float ISSN + Detection: +Float PMID + Detection: +Float Journal + Detection: +Float SuggestedResource + Detection: initialize() + Detection: setDetectionVersion() + Detection: recordDetections() + Detection: recordPatterns() + Detection: recordJournals() + Detection: recordSuggestedResource() + + + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + + style Detection fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +``` + +Development of this prototype was halted fairly early, after realizing that the calculation of categorization values +would not necessarily be helped by combining models in this way. + +### Prototype D + +The "D" prototype was a further evolution of the "B" prototype, focused primarily on removing the separate structures +for SuggestedResources. There is still a knowledge graph spread across Detectors, Categories, and the mapping between +them. Detection and Categorization results are also spread across multiple link records. + +Further refinements in this prototype are the inclusion of a `detector_version` value in the Detection model, and the +removal of a `user_id` field from the Categorization model (we are still debating the role of user-supplied +categorizations, compared to the user-supplied validation of existing categorizations). 
+ +```mermaid +classDiagram + direction LR + + Term "1" --> "1..*" Detection + Term "1" --> "0..*" Categorization + Detection "0..*" --> "1" Detector + + DetectionCategory "0..*" --> "1" Category + + Categorization "0..*" --> "1" Category + + Detector "1" --> "0..*" DetectionCategory + + + class Term + Term: +Integer id + Term: +String phrase + Term: calculateCategory() + + class Detection + Detection: +Integer id + Detection: +Integer term_id + Detection: +Integer detector_id + Detection: +Integer detector_version + Detection: +Float confidence + Detection: initialize() + Detection: setDetectionVersion() + Detection: recordDetections() + Detection: recordPatterns() + Detection: recordJournals() + Detection: recordSuggestedResource() + + class Detector + Detector: +Integer id + Detector: +String name + Detector: +Float confidence + Detector: incrementConfidence() + Detector: decrementConfidence() + + class Category + Category: +Integer id + Category: +String name + + class Categorization + Categorization: +Integer category_id + Categorization: +Integer term_id + Categorization: +Float confidence + + class DetectionCategory + DetectionCategory: +Integer id + DetectionCategory: +Integer detector_id + DetectionCategory: +Integer category_id + DetectionCategory: +Float confidence + DetectionCategory: incrementConfidence() + DetectionCategory: decrementConfidence() + + + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + + style Category fill:#000,stroke:#fc8d62,color:#fc8d62 + style DetectionCategory fill:#000,stroke:#fc8d62,color:#fc8d62 + style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 + + style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; + style Detection fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +``` + +The significant benefit of this prototype is the removal of the SuggestedResource models, which leaves a more +straightforward data model which records only Detectors and Categories, 
without special consideration for any one +Detector. + +## Decision + +We will pursue the "D" prototype, with explicit models for the application's knowledge graph, and detection and +categorization outputs spread across linking records rather than concentrated in a single record. + +## Consequences + +There are still unknowns which we will confront while implementing this design. Among those are how the user permissions +model will intersect with these models, and how the controller and view layers will be defined to enable this to +function. Additionally, while we have discussed the process of calculating confidence values, writing this +implementation may reveal shortcomings we have not yet recognized. + +Our commitment at this stage, due to these uncertainties, is that we will further develop the "D" prototype by +attempting to implement it. Only time will tell whether we will successfully do so, or if we will need to change course. diff --git a/docs/explanation/categorization-workflow.md b/docs/explanation/categorization-workflow.md new file mode 100644 index 0000000..35037ae --- /dev/null +++ b/docs/explanation/categorization-workflow.md @@ -0,0 +1,107 @@ +# Categorization workflow + +## Conceptual diagram + +There are three basic models which we are attempting to relate to each other: +Terms, Detectors, and Categories. The relationship looks like this: + +```mermaid +flowchart LR + + Terms + Detectors + Categories + + Terms -- are evaluated by --> Detectors + Detectors <-- are mapped to --> Categories + Categories -- get linked with --> Terms +``` + +## Example data + +### Terms + +| id | phrase | +|----|---------------------------------------| +| 1 | web of science | +| 2 | pitchbook | +| 3 | vaibbhav taraate | +| 4 | doi.org/10.1080/17460441.2022.2084607 | +--- + +We have received more than 40,000 unique search terms from the Bento system in +the first three months of TACOS' operation. 
+ +### Categories + +| id | name | note | +|----|---------------|-------------------------------------------------------------------------------------------| +| 1 | Transactional | The user wants to complete an _action_ (i.e. to receive an item) | +| 2 | Navigational | The user wants to reach a _place_ which might be a web page, or perhaps talk to a person. | +| 3 | Informational | The user wants _information_ about an idea or concept. | + +Thus far, we have only focused on these three categories of search intent. It +should be noted that the SEO literature references additional categories, such +as "commercial" or "conversational". + +Additionally, some of these categories may be sub-divided. Transactional +searches might be looking for a book, a journal article, or a thesis. +Navigational searches might be satisfied by visiting the desired webpage, or +contacting a liaison. + +### Detectors + +| id | name | note | +|----|--------------------|-----------------| +| 1 | DOI | Regex detection | +| 2 | ISBN | Regex detection | +| 3 | ISSN | Regex detection | +| 4 | PMID | Regex detection | +| 5 | Journal name | Term lookup | +| 6 | Suggested resource | Term lookup | + +Our detectors so far fall into one of two broad types: those which use regular expressions to detect patterns within +the search term, and those which check whether the search term appears in an external list of resources. + +--- + +## Workflow + +Most of the time, this workflow will be followed automatically when a new search phrase is recorded by the application +for the first time. Occasionally, we will re-run this workflow (either manually, or via a schedule) when the application +changes enough that a prior workflow is no longer valid. Our method of determining when prior work is no longer valid is +to rely on the `detector_version` value in the Detection model. 
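The staleness check described above could be sketched in plain Ruby as follows. This is a minimal illustration, not the application's implementation: the `Detection` struct is a stand-in for the model, `stale?` and `term_ids_needing_rerun` are hypothetical helper names, and the assumption that `detector_version` is an integer that only increases (and that the current version is simply passed in) is ours.

```ruby
# Hypothetical stand-in for the Detection record; field names follow the class diagram.
Detection = Struct.new(:term_id, :detector_id, :detector_version, :confidence, keyword_init: true)

# Assumption: detector_version is an integer that only ever increases,
# so a detection is outdated when its version is below the current one.
def stale?(detection, current_version)
  detection.detector_version < current_version
end

# Collect the ids of terms that have at least one outdated detection.
def term_ids_needing_rerun(detections, current_version)
  detections.select { |d| stale?(d, current_version) }.map(&:term_id).uniq
end

detections = [
  Detection.new(term_id: 1, detector_id: 1, detector_version: 1, confidence: 1.0),
  Detection.new(term_id: 2, detector_id: 1, detector_version: 2, confidence: 1.0),
  Detection.new(term_id: 1, detector_id: 4, detector_version: 1, confidence: 1.0)
]

p term_ids_needing_rerun(detections, 2) # => [1]
```

In the application itself this would more likely be a database query than an in-memory filter; the sketch only shows the comparison that decides whether prior work is still valid.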
+ +When a search phrase is received which has already been categorized, the prior scores are used without needing to follow +this workflow again. + +### Pass the term through our suite of detectors + +Passing the search phrase through all of our detectors is done via a method like `recordDetections()`, which is part of +the `Detection` model. Should only a subset of detectors need to be consulted, there are internal methods which can +accomplish this. + +### Calculate the categorization scores based on these detections + +The `Term` model has a method which looks up all the Detectors which found a positive result for that term. This +`calculateCategory()` method performs the necessary math to determine the score for each Category in the system, and +creates the needed `Categorization` records. The calculated score is stored in the `confidence` field of this model. + +One detector in this application is associated with different categories on a record-by-record basis - the +SuggestedResource detector. The `calculateCategory()` method includes a lookup for this detector to make sure that any +detections are scored appropriately. + +### Human validation of these operations + +There will be an ability for humans to inspect these operations, and to submit feedback about any actions which were +not correct. These validations will be used to further refine the confidence values associated with our `Detector` and +`DetectionCategory` records, as well as to refine the operation of the detectors, or the mappings between these +elements. + +This validation workflow has not been defined yet, nor has the data model been expanded to support this feedback. We do +anticipate, however, that successful or unsuccessful validations would end up adjusting the relevant confidence values +via the `incrementConfidence()` or `decrementConfidence()` methods. + +--- + +Further discussion of this design can be found in the [Classes diagram](../reference/classes.md). 
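The scoring step in the workflow above can be sketched in plain Ruby. This is a rough illustration under stated assumptions rather than the actual `calculateCategory()` logic: the structs are framework-free stand-ins for the models, `category_scores` is a hypothetical helper name, and the rule used here (multiply the detection, detector, and mapping confidence values, then sum the products per category) is one candidate calculation, not a settled decision. The record-by-record handling of the SuggestedResource detector is left out.

```ruby
# Hypothetical, framework-free stand-ins for the records in this design.
Detector          = Struct.new(:id, :name, :confidence, keyword_init: true)
Category          = Struct.new(:id, :name, keyword_init: true)
DetectionCategory = Struct.new(:detector_id, :category_id, :confidence, keyword_init: true)
Detection         = Struct.new(:term_id, :detector_id, :detector_version, :confidence, keyword_init: true)

# Assumed scoring rule: for every detection, multiply its confidence by the
# detector's confidence and the mapping's confidence, then sum per category.
def category_scores(detections, detectors, mappings, categories)
  names  = categories.to_h { |category| [category.id, category.name] }
  scores = categories.to_h { |category| [category.name, 0.0] }

  detections.each do |detection|
    detector = detectors.find { |d| d.id == detection.detector_id }
    next if detector.nil?

    mappings.select { |m| m.detector_id == detector.id }.each do |mapping|
      name = names[mapping.category_id]
      next if name.nil?

      scores[name] += detection.confidence * detector.confidence * mapping.confidence
    end
  end

  scores
end

# Illustrative values only; the confidence figures are made up for this example.
categories = [
  Category.new(id: 1, name: "Transactional"),
  Category.new(id: 2, name: "Navigational"),
  Category.new(id: 3, name: "Informational")
]
detectors = [
  Detector.new(id: 1, name: "DOI", confidence: 0.95),
  Detector.new(id: 4, name: "PMID", confidence: 0.95)
]
mappings = [
  DetectionCategory.new(detector_id: 1, category_id: 1, confidence: 0.95),
  DetectionCategory.new(detector_id: 4, category_id: 1, confidence: 0.95)
]
detections = [
  Detection.new(term_id: 4, detector_id: 1, detector_version: 1, confidence: 1.0),
  Detection.new(term_id: 4, detector_id: 4, detector_version: 1, confidence: 1.0)
]

p category_scores(detections, detectors, mappings, categories)
```

With these made-up values, the two positive detections each contribute 0.9025 to Transactional, giving a score of about 1.8, while the other two categories stay at 0.0. Summing products means scores are not bounded by 1.0, which is one of the open questions any final calculation would need to address.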
diff --git a/docs/explanation/validation-workflow-a.md b/docs/explanation/validation-workflow-a.md deleted file mode 100644 index 6264b9f..0000000 --- a/docs/explanation/validation-workflow-a.md +++ /dev/null @@ -1,161 +0,0 @@ -# The categorization and validation workflow - -This document describes the workflow for categorizing, and then validating, how -a given term has been processed by TACOS. - -## Preparation - -Pick what record we're working with. In production, this would happen as new -terms are recorded, but for now we're working with a randomly chosen example. - -```ruby -t = Term.all.sample -``` - -## Pass the term through our suite of detectors - -This assumes that all of our detection algorithms are integrated with the -Detector model, which creates a record of their output for processing during the -Categorization phase. - -```ruby -d = Detection.new(t) -d.save -``` - -To this point the Detection model only records activations by each detection, as -boolean values. Future development might add more details, such as which records -are matched, or what external lookups return. It might also be relevant to note -whether multiple patterns are found. - -```ruby -irb(main):013> d -=> -# -``` - -In this example, none of the detectors found anything. - -The `detection_version` value in these records gets stored in ENV, and -incremented as our detection algorithms change. This helps identify whether a -Detection is outdated and needs to be refreshed. - -## Generate the Categorization values based on these detections - -```ruby -c = Categorization.new(d) -c.save -``` - -The creation of the record includes the calculation of scores for each of the -three categories. To this point, the logic is exceedingly simple, but this can -be made more nuanced with time. - -```ruby -irb(main):019> c -=> -# -``` - -These scores are used by the `evaluate` method to assign the term to a category, -if relevant. 
Because none of the detectors fired in the previous step, all of -the category scores are 0.0 and the term will be placed in the "unknown" -category. - -```ruby -t.category = c.evaluate -t.save -``` - -There is also an `assign` method at the moment, which combines the above steps. -This may not make sense in production, however. - -The result of the Categorization workflow is that the original Term record now -has been placed in a category: - -```ruby -irb(main):008> t -=> -# -``` - -From end to end, the code to categorize all untouched term records is then this: - -```ruby -Term.where("category is null").each { |t| - d = Detection.new(t) - d.save - c = Categorization.new(d) - c.assign -} -``` - -## Validation - -Humans will be asked to inspect the outcomes of the previous steps, and provide -feedback about whether any decisions were made incorrectly. - -```ruby -v = Validation.new(c) -v.save -``` - -Validation records have a boolean flag for each decision which went into the -process thus far: - -```ruby -irb(main):011> v -=> -# -``` - -This includes a flag for the final result, each component score, each individual -detection, and a final flag that indicates the Term itself needs review. The -intent of this final flag is for the case where a search term is somehow -problematic and needs to be expunged. - -There are no methods yet on this model, because all values are meant to be set -individually via the web interface. - -There is not - yet - a notes field on the Validation model, but this is -something that we've discussed in case the validator has more detailed feedback -about some part of the decision-making that is being reviewed. - diff --git a/docs/explanation/validation-workflow-b.md b/docs/explanation/validation-workflow-b.md deleted file mode 100644 index e801366..0000000 --- a/docs/explanation/validation-workflow-b.md +++ /dev/null @@ -1,101 +0,0 @@ -# The categorization and validation workflow - -Start with a term record somehow... 
- -```ruby -t = Term.all.sample -``` - -All the methods in this prototype are part of the Term model. Because the data model is so distributd across so many -tables, the Term model feels like it could be the most stable place for both detection recording and categorization. -There is a Validation model, so that workflow be built out there. - -## Detections - -Detection results are created and stored via the `evaluate_*` methods. Calling these methods multiple times will result -in duplicate records. - -Only positive detections are stored in this prototype. Doing this makes categorization easier, but might hamper our -visibility into system behavior. - -```ruby -# This calls each of the sub-methods in turn for identifiers, journals, and suggested resources. -t.evalute_detectinators -``` - -### The "detection_version" environment variable - -The other prototype introduces an environment variable `DETECTION_VERSION` in order to recognize that TACOS' -capabilities will likely expand over time. - -While such a variable might be useful for this prototype, we should recognize that key aspects of the application's -behavior are recorded only in database records - such as the linkages between detectors and categories. Because those -records can change so easily, we will need to consider carefully how to implement a versioning feature to capture how -system performance changes over time. - -## Categorization - -At this point, the positive outputs of our detectors has been recorded. The next step is to perform the categorizations. - -This is not functional in this prototype, but the `categorize` method indicates a possible direction: - -```ruby -irb(main):051> t.categorize - - INFO -- : This method will calculate the confidence scores for this term. - INFO -- : Transactional-PMID: 0.95 * 0.95 = 0.9025 - INFO -- : Transactional-DOI: 0.95 * 0.95 = 0.9025 -``` - -In this example, both the DOI and PMID detectors returned positive results. 
Each of these detectors are joined to the -"Transactional" category, so the method multiplies the confidence values of the detector by the confidence value of the -mapping, and generates a score. These scores would be added together, resulting in the following scores for each -category: - -| Category | CategoryScore | -|---------------|---------------| -| Informational | 0.0 | -| Navigational | 0.0 | -| Transactional | ~1.8 | - -In SQL terms, the sort of querying logic that this method would need would be something like: - -```SQL -SELECT c.name AS Category, SUM(d.confidence * dc.confidence) AS CategoryScore -FROM terms t -LEFT OUTER JOIN TermDetectinator td ON t.id = td.term_id -LEFT OUTER JOIN detectors d ON td.detectinator_id = d.id -LEFT OUTER JOIN Mapping dc ON d.id = dc.detectinator_id -LEFT OUTER JOIN categories c ON dc.category_id = c.id -WHERE t.id = 4 -GROUP BY c.id -``` - -_Note: if we end up storing negative results from the detection workflow, the equation above would need to be expanded -to include the detector result as an integer: `d.confidence * dc.confidence * td.result`. This would end up dropping -the negative results and associated confidence values._ - -For convenience, the winning category could be stored back into the `Term` model, similarly to the other prototype. The -category scores would be stored as values in the TermCategory table. - -If we ever ask colleagues to manually categorize Term records - which is a fundamental break with these prototypes' -assumptions - taht TermCategory table would need to have an optional field to record who performed that categorization. - -## Validation - -Validation has been modeled in the classes prototype, but not executed in code. - -```ruby -v = Validation.new(t) - -v.report -# This would return a list of all linked recorects for the given Term. 
-``` - -This would end up querying all records from the validatable tables (TermDetectinator, TermCategory, and -TermSuggestedResource), and list everything that is returned. Every such record would spawn a related record in the -validation tables (ValidTermDetectinator, ValidTermCategory, and ValidTermSuggestedResource), with a boolean value to -indicate whether the detection is confirmed or invalidated. - -While there is a discussion elsewhere in this prototype about whether to store only positive or all results from the -Detection workflow, in terms of Validation I think it makes the most sense to store both types. diff --git a/docs/reference/classes-prototype-a-minus.md b/docs/reference/classes-prototype-a-minus.md deleted file mode 100644 index e0390ea..0000000 --- a/docs/reference/classes-prototype-a-minus.md +++ /dev/null @@ -1,213 +0,0 @@ -# Prototype A-minus ("Code, but leaves out display of tangentially related models") - -This prototype relies on fewer tables, with one record in each, and leans more heavily on behavior in code. - -> [!NOTE] -> There are no changes to this model other than removing the display of `Journal` and `SuggestedResource` tables which are used by some Detectors, but are not themselves part of the Detection/Categorization data model. - -## Shared preface - -The same color scheme is used for both prototypes: - -* Terms, which flow in continuously with Search Events; -* A knowledge graph, which includes the categories, detectors, and relationships - between the two which TACOS defines and maintains, and which is consulted during categorization; and -* The linkages between these terms and the graph, which record which signals are - detected in each term, and how those signals are interpreted to place the term into a category. - -A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue -tables in the diagrams below. 
- -## Categorization - -```mermaid -classDiagram - direction LR - - Term --< Detection: has many - Detection <-- Categorization: based on - - class Term - Term: +Integer id - Term: +String phrase - Term: +Enum category - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_version - Detection: +Boolean DOI - Detection: +Boolean ISBN - Detection: +Boolean ISSN - Detection: +Boolean PMID - Detection: +Boolean Journal - Detection: +Boolean SuggestedResource - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - class Categorization - Categorization: +Integer id - Categorization: +Integer detection_id - Categorization: +Float information_score - Categorization: +Float navigation_score - Categorization: +Float transaction_score - Categorization: initialize() - Categorization: assign() - Categorization: evaluate() - Categorization: calculateAll() - Categorization: calculateInformation() - Categorization: calculateNavigation() - Categorization: calculateTransaction() - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb - style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb -``` - -### Order of operations - -1. A new `Term` is registered. -2. A `Detection` record for that `Term` is created (which allows repeat detection operations as TACOS gains new - capabilities). -3. The various `Detection` records (either the most recent for each term, or all detections over time) are processed via - code to generate scores for each potential category. 
These results are stored as `Categorization` records. -4. The three category scores are compared, and the one with the highest score is stored back in the `Term` record. - -### Category values - -There is no `Category` table, but two models have separate enumerated fields. The `Detector::SuggestedResource` model -has three possible values (Informational, Navigational, and Transactional), while the `Term` model has an additional -value ("Unknown") which is assigned during Categorization if two category scores are equal. - -(This lack of a category table is not a fundamental aspect of this prototype, but it does indicate the general choice to -rely on code, rather than database records, as much as possible. Such a model could be accommodated, or implemented via -a shared helper method perhaps) - -### Calculating the category scores - -At the moment, category scores are assigned in methods like: - -```ruby -# FILE: app/models/categorization.rb - def calculate_transactional - self.transaction_score = 0.0 - self.transaction_score = 1.0 if %i[doi isbn issn pmid journal].any? do |signal| - self.detection[signal] - end - self.transaction_score = 1.0 if Detector::SuggestedResource.full_term_match(self.detection.term.phrase).first&.category == 'transactional' - end -``` - -This is effectively an "all or nothing" approach, where any detection at all results in the maximum possible score. This -lacks nuance, obviously, and we've talked about ways to include a confidence value in these calculations. As yet, this -prototype has not attempted to include that feature however. - -**Note:** I've tried to anticipate how to include confidence values appropriately in this prototype, and it is not at -all clear how that might happen. This gets to the mathematical operations involved in calculating the category scores, -which might need to be documented separately. 
I've started a [Tidbit article to explore this issue](https://mitlibraries.atlassian.net/wiki/spaces/D/pages/4019814405/Calculating+categorization+scores+via+confidence+values). - -## Validations - -```mermaid -classDiagram - direction LR - - Term --< Detection: has many - Detection <-- Categorization: based on - Categorization --> SuggestedResource: looks up - Detection --> SuggestedResource: looks up - Detection --> Journal: looks up - Categorization >-- Validation: subject to - - class Term - Term: +Integer id - Term: +String phrase - Term: +Enum category - - class SuggestedResource - SuggestedResource: +Integer id - SuggestedResource: +String title - SuggestedResource: +String url - SuggestedResource: +String phrase - SuggestedResource: +String fingerprint - SuggestedResource: +Enum category - SuggestedResource: calculateFingerprint() - - class Journal - Journal: +Integer id - Journal: +String title - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_version - Detection: +Boolean DOI - Detection: +Boolean ISBN - Detection: +Boolean ISSN - Detection: +Boolean PMID - Detection: +Boolean Journal - Detection: +Boolean SuggestedResource - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - class Categorization - Categorization: +Integer id - Categorization: +Integer detection_id - Categorization: +Float information_score - Categorization: +Float navigation_score - Categorization: +Float transaction_score - Categorization: initialize() - Categorization: assign() - Categorization: evaluate() - Categorization: calculateAll() - Categorization: calculateInformation() - Categorization: calculateNavigation() - Categorization: calculateTransaction() - - class Validation - Validation: +Integer id - Validation: +Integer categorization_id - Validation: +Integer user_id - Validation: 
+Boolean approve_transaction - Validation: +Boolean approve_information - Validation: +Boolean approve_navigation - Validation: +Boolean approve_doi - Validation: +Boolean approve_isbn - Validation: +Boolean approve_issn - Validation: +Boolean approve_pmid - Validation: +Boolean approve_journal - Validation: +Boolean approve_suggested_resource - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb - style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb - - style Validation fill:#000,stroke:#ffd407,color:#ffd407 -``` - -Validations, in this prototype, are collected in a single table with a field for each decision which came before it. As -the application expands, any new detectors or categories would result in new fields, both in the Detection or -Categorization models and also in the Validation model. - -Multiple validations are possible for a single Categorization decision, enabled by the user_id field, which allows for -feedback provided by multiple users if bandwidth allows. diff --git a/docs/reference/classes-prototype-a.md b/docs/reference/classes-prototype-a.md deleted file mode 100644 index fb4cb40..0000000 --- a/docs/reference/classes-prototype-a.md +++ /dev/null @@ -1,226 +0,0 @@ -# Prototype A ("Code") - -This prototype relies on fewer tables, with one record in each, and leans more heavily on behavior in code. 
- -## Shared preface - -The same color scheme is used for both prototypes: - -* Terms, which flow in continuously with Search Events; -* A knowledge graph, which includes the categories, detectors, and relationships - between the two which TACOS defines and maintains, and which is consulted during categorization; and -* The linkages between these terms and the graph, which record which signals are - detected in each term, and how those signals are interpreted to place the term into a category. - -A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue -tables in the diagrams below. - -## Categorization - -```mermaid -classDiagram - direction LR - - Term --< Detection: has many - Detection <-- Categorization: based on - Categorization --> SuggestedResource: looks up - Detection --> SuggestedResource: looks up - Detection --> Journal: looks up - - class Term - Term: +Integer id - Term: +String phrase - Term: +Enum category - - class SuggestedResource - SuggestedResource: +Integer id - SuggestedResource: +String title - SuggestedResource: +String url - SuggestedResource: +String phrase - SuggestedResource: +String fingerprint - SuggestedResource: +Enum category - SuggestedResource: calculateFingerprint() - - class Journal - Journal: +Integer id - Journal: +String title - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_version - Detection: +Boolean DOI - Detection: +Boolean ISBN - Detection: +Boolean ISSN - Detection: +Boolean PMID - Detection: +Boolean Journal - Detection: +Boolean SuggestedResource - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - class Categorization - Categorization: +Integer id - Categorization: +Integer detection_id - Categorization: +Float information_score - Categorization: +Float 
navigation_score - Categorization: +Float transaction_score - Categorization: initialize() - Categorization: assign() - Categorization: evaluate() - Categorization: calculateAll() - Categorization: calculateInformation() - Categorization: calculateNavigation() - Categorization: calculateTransaction() - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb - style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb -``` - -### Order of operations - -1. A new `Term` is registered. -2. A `Detection` record for that `Term` is created (which allows repeat detection operations as TACOS gains new - capabilities). -3. The various `Detection` records (either the most recent for each term, or all detections over time) are processed via - code to generate scores for each potential category. These results are stored as `Categorization` records. -4. The three category scores are compared, and the one with the highest score is stored back in the `Term` record. - -### Category values - -There is no `Category` table, but two models have separate enumerated fields. The `Detector::SuggestedResource` model -has three possible values (Informational, Navigational, and Transactional), while the `Term` model has an additional -value ("Unknown") which is assigned during Categorization if two category scores are equal. - -(This lack of a category table is not a fundamental aspect of this prototype, but it does indicate the general choice to -rely on code, rather than database records, as much as possible. 
Such a model could be accommodated, or implemented via -a shared helper method perhaps) - -### Calculating the category scores - -At the moment, category scores are assigned in methods like: - -```ruby -# FILE: app/models/categorization.rb - def calculate_transactional - self.transaction_score = 0.0 - self.transaction_score = 1.0 if %i[doi isbn issn pmid journal].any? do |signal| - self.detection[signal] - end - self.transaction_score = 1.0 if Detector::SuggestedResource.full_term_match(self.detection.term.phrase).first&.category == 'transactional' - end -``` - -This is effectively an "all or nothing" approach, where any detection at all results in the maximum possible score. This -lacks nuance, obviously, and we've talked about ways to include a confidence value in these calculations. As yet, this -prototype has not attempted to include that feature however. - -**Note:** I've tried to anticipate how to include confidence values appropriately in this prototype, and it is not at -all clear how that might happen. This gets to the mathematical operations involved in calculating the category scores, -which might need to be documented separately. I've started a [Tidbit article to explore this issue](https://mitlibraries.atlassian.net/wiki/spaces/D/pages/4019814405/Calculating+categorization+scores+via+confidence+values). 
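
Step 4 of the order of operations, together with the "Unknown" rule described under "Category values", can be sketched as follows. This is a minimal illustration rather than code from the prototype: the `winning_category` helper and the `SCORE_TO_CATEGORY` constant are hypothetical names, and the sketch assumes that "two category scores are equal" means a tie for the highest score.

```ruby
# Hypothetical helper, not part of the prototype: maps the three score fields
# on a Categorization record to the category names used by the Term enum.
SCORE_TO_CATEGORY = {
  information_score: 'informational',
  navigation_score: 'navigational',
  transaction_score: 'transactional'
}.freeze

# Picks the category with the highest score from a Hash of score fields.
# Assumption: a tie for first place means there is no clear winner, so the
# term is treated as 'unknown'.
def winning_category(scores)
  ranked = SCORE_TO_CATEGORY.keys
                            .map { |field| [field, scores.fetch(field, 0.0)] }
                            .sort_by { |_field, value| -value }
  (top_field, top_value), (_second_field, second_value) = ranked
  return 'unknown' if top_value == second_value

  SCORE_TO_CATEGORY.fetch(top_field)
end
```

With the "all or nothing" scoring shown earlier, any term that triggers no detection scores 0.0 in every category, so under this assumption it would be recorded as unknown.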
- -## Validations - -```mermaid -classDiagram - direction LR - - Term --< Detection: has many - Detection <-- Categorization: based on - Categorization --> SuggestedResource: looks up - Detection --> SuggestedResource: looks up - Detection --> Journal: looks up - Categorization >-- Validation: subject to - - class Term - Term: +Integer id - Term: +String phrase - Term: +Enum category - - class SuggestedResource - SuggestedResource: +Integer id - SuggestedResource: +String title - SuggestedResource: +String url - SuggestedResource: +String phrase - SuggestedResource: +String fingerprint - SuggestedResource: +Enum category - SuggestedResource: calculateFingerprint() - - class Journal - Journal: +Integer id - Journal: +String title - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_version - Detection: +Boolean DOI - Detection: +Boolean ISBN - Detection: +Boolean ISSN - Detection: +Boolean PMID - Detection: +Boolean Journal - Detection: +Boolean SuggestedResource - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - class Categorization - Categorization: +Integer id - Categorization: +Integer detection_id - Categorization: +Float information_score - Categorization: +Float navigation_score - Categorization: +Float transaction_score - Categorization: initialize() - Categorization: assign() - Categorization: evaluate() - Categorization: calculateAll() - Categorization: calculateInformation() - Categorization: calculateNavigation() - Categorization: calculateTransaction() - - class Validation - Validation: +Integer id - Validation: +Integer categorization_id - Validation: +Integer user_id - Validation: +Boolean approve_transaction - Validation: +Boolean approve_information - Validation: +Boolean approve_navigation - Validation: +Boolean approve_doi - Validation: +Boolean 
approve_isbn - Validation: +Boolean approve_issn - Validation: +Boolean approve_pmid - Validation: +Boolean approve_journal - Validation: +Boolean approve_suggested_resource - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb - style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb - - style Validation fill:#000,stroke:#ffd407,color:#ffd407 -``` - -Validations, in this prototype, are collected in a single table with a field for each decision which came before it. As -the application expands, any new detectors or categories would result in new fields, both in the Detection or -Categorization models and also in the Validation model. - -Multiple validations are possible for a single Categorization decision, enabled by the user_id field, which allows for -feedback provided by multiple users if bandwidth allows. diff --git a/docs/reference/classes-prototype-b.md b/docs/reference/classes-prototype-b.md deleted file mode 100644 index e649791..0000000 --- a/docs/reference/classes-prototype-b.md +++ /dev/null @@ -1,257 +0,0 @@ -# Prototype B ("Data") - -This prototype relies on more models, more linking records, and as a result relies less on behavior in code. - -## Shared preface - -* Terms, which flow in continuously with Search Events; -* A knowledge graph, which includes the categories, detectors, and relationships - between the two which TACOS defines and maintains, and which is consulted during categorization; and -* The linkages between these terms and the graph, which record which signals are - detected in each term, and how those signals are interpreted to place the term into a category. 
- -A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue -tables in the diagrams below. - -## Categorization - -```mermaid -classDiagram - direction LR - - Term >-- TermDetectinator - TermDetectinator --> Detectinator - Category <-- Mapping - Mapping --> Detectinator - Term --> TermCategory - TermCategory <-- Category - SuggestedResource --> Category - Term <-- TermSuggestedResource - TermSuggestedResource --> SuggestedResource - - class Term:::primarytable - Term: +Integer id - Term: +String phrase - Term: categorize() - Term: evaluate_detectinators() - Term: evaluate_identifiers() - Term: evaluate_journals() - Term: evaluate_suggested_resources() - - class TermDetectinator - TermDetectinator: +Integer term_id - TermDetectinator: +Integer detector_id - TermDetectinator: +Boolean result - - class Detectinator - Detectinator: +Integer id - Detectinator: +String name - Detectinator: +Float confidence - - class Category - Category: +Integer id - Category: +String name - Category: +String note - - class Mapping - Mapping: +Integer detectinator_id - Mapping: +Integer category_id - Mapping: +Float confidence - - class TermCategory - TermCategory: +Integer term_id - TermCategory: +Integer category_id - TermCategory: +Integer user_id - - class SuggestedResource - SuggestedResource: +Integer id - SuggestedResource: +String title - SuggestedResource: +String fingerprint - SuggestedResource: +URL url - SuggestedResource: +Integer category_id - - class TermSuggestedResource - TermSuggestedResource: +Integer term_id - TermSuggestedResource: +Integer suggested_resource_id - TermSuggestedResource: +Boolean result - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detectinator fill:#000,stroke:#fc8d62,color:#fc8d62 - style Mapping fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style 
TermDetectinator fill:#000,stroke:#8da0cb,color:#8da0cb - style TermSuggestedResource fill:#000,stroke:#8da0cb,color:#8da0cb - style TermCategory fill:#000,stroke:#8da0cb,color:#8da0cb -``` - -### The "knowledge graph" - -The relationship between Detectors and Categories would be generally set ahead of time. Detectors produce a boolean -output in the cleanest case - they either detect a signal or they do not. Relatedly, detectors have an influence over -whether a given Category is relevant or not: - -* If the Detector for a DOI pattern returns `true`, then this influences the `transactional` Category to a significant - degree. -* However, the Detector for a DOI pattern does almost nothing to influence the `navigational` Category. -* If Categorization is a zero-sum activity, however, the DOI pattern detector would _exclusively_ claim a Term for the - `transactional` Category - so it would effectively rule out the other two Categories. - -The exception to this Detector rule is the SuggestedResource detector - which has variability in its records. Some -SuggestedResources are in each of the three Categories, so there is a more complicated decision-making algorithm, and -thus a different set of database tables. - -### Category scores - -At the moment, category scores are intended to be calculated by combining the confidence values for both the detector -and the DetectorCategory link (as well as the result of the detection pass, if negative results are stored). See the -workflow document for this prototype for an explanation of this math. I've begun an implementation of this approach -in the `Term.categorize` method in this prototype, but this is not finished. - -### Order of operations - -The linkages between these tables are filled in at different moments. - -The Detector-Category linkage is maintained as either set of resources evolves over time, and on a relatively slow -cadence. Operationally, the links which matter are made as new Terms flow into TACOS. - -1. 
A new Term is recorded in the system. -2. That Term is compared with each Detector, and any positive responses are recorded. Negative responses may be - discarded, or recorded for the sake of completeness (to confirm that the link was tested). These outcomes are stored - as several records across the TermDetectinator and TermSuggestedResource tables. -3. Those detection records are then used to perform the Categorization work, comparing the confidence values of each - Detectinator and Mapping. The responses are then used to perform the Categorization work, which results in records - being created in the TermCategory table. - -### Questions - -* The application defines a `Detector` module/namespace. Ideally I want a `Detector` class for the records of our - various detectors, but I'm not sure this is possible (or I haven't figured out how). If `Detector` is not possible, - should we use an un-namespaced option like `Detectinator`, or instead go with something like `Detector::Detector` or - `Detector::Base` ? - * One of the reasons why I went with an un-namespaced class here is to make defining link tables easier - (`Term_Detectinator` instead of `Term_DetectorBase`) -* The `TermDetectinator` table records the results of our suite of detectors in response to a given term. Should we - record only positive results, or should we also record negative results? - * The `Mappings` table (which should be named `CategoryDetectinator`) has a similar question - whether we should - record no-confidence mappings (for example, a DOI detection would have 0 confidence toward a navigational - categorization) - -## Validations - -Validations might get thorny in this model, because the results we are validating are spread across multiple records in -the same class. For example, a single term record like `Collins HK. When listening is spoken. doi: 10.1016/j.copsyc.2022.101402.
PMID: 35841883.` -would result in multiple records in the `TermDetectinator` table, each of which would be subject to validation. As a -result it might make sense to embed the validation throughout the data model, rather than in a separate field? - -```mermaid -classDiagram - direction LR - - Term >-- TermDetectinator - TermDetectinator --> Detectinator - Category <-- Mapping - Mapping --> Detectinator - Term --> TermCategory - TermCategory <-- Category - SuggestedResource --> Category - Term <-- TermSuggestedResource - TermSuggestedResource --> SuggestedResource - Validation <-- ValidTermDetectinator - ValidTermDetectinator --> TermDetectinator - Validation <-- ValidTermCategory - ValidTermCategory --> TermCategory - Validation <-- ValidTermSuggestedResource - ValidTermSuggestedResource --> TermSuggestedResource - - class Term:::primarytable - Term: +Integer id - Term: +String phrase - Term: categorize() - Term: evaluate_detectinators() - Term: evaluate_identifiers() - Term: evaluate_journals() - Term: evaluate_suggested_resources() - - class TermDetectinator - TermDetectinator: +Integer term_id - TermDetectinator: +Integer detector_id - TermDetectinator: +Boolean result - - class Detectinator - Detectinator: +Integer id - Detectinator: +String name - Detectinator: +Float confidence - - class Category - Category: +Integer id - Category: +String name - Category: +String note - - class Mapping - Mapping: +Integer detectinator_id - Mapping: +Integer category_id - Mapping: +Float confidence - - class TermCategory - TermCategory: +Integer term_id - TermCategory: +Integer category_id - TermCategory: +Integer user_id - - class SuggestedResource - SuggestedResource: +Integer id - SuggestedResource: +String title - SuggestedResource: +String fingerprint - SuggestedResource: +URL url - SuggestedResource: +Integer category_id - - class TermSuggestedResource - TermSuggestedResource: +Integer term_id - TermSuggestedResource: +Integer suggested_resource_id - 
TermSuggestedResource: +Boolean result - - class Validation - Validation: +Integer id - Validation: +Integer user_id - - class ValidTermCategory - ValidTermCategory: +Integer validation_id - ValidTermCategory: +Integer termcategory_id - ValidTermCategory: +Boolean valid - - class ValidTermDetectinator - ValidTermDetectinator: +Integer validation_id - ValidTermDetectinator: +Integer termdetectinator_id - ValidTermDetectinator: +Boolean valid - - class ValidTermSuggestedResource - ValidTermSuggestedResource: +Integer validation_id - ValidTermSuggestedResource: +Integer termsuggestedresource_id - ValidTermSuggestedResource: +Boolean valid - - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detectinator fill:#000,stroke:#fc8d62,color:#fc8d62 - style Mapping fill:#000,stroke:#fc8d62,color:#fc8d62 - style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 - - style TermDetectinator fill:#000,stroke:#8da0cb,color:#8da0cb - style TermSuggestedResource fill:#000,stroke:#8da0cb,color:#8da0cb - style TermCategory fill:#000,stroke:#8da0cb,color:#8da0cb - - style Validation fill:#000,stroke:#ffd407,color:#ffd407 - style ValidTermCategory fill:#000,stroke:#ffd407,color:#ffd407 - style ValidTermDetectinator fill:#000,stroke:#ffd407,color:#ffd407 - style ValidTermSuggestedResource fill:#000,stroke:#ffd407,color:#ffd407 -``` - -This is an extension of the original class diagram, adding the validation data model in yellow. The thesis of the model -is that every decision made during Categorization is subject to review during Validation, potentially by multiple -reviewers. - -If validation is only performed once, we don't need any of the yellow tables, and we instead could just add a boolean -`valid` flag to each categorization table. 
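
Returning to the "Category scores" section of this prototype: the idea of combining the confidence of a detector with the confidence of its mapping to a category could look roughly like the sketch below. The actual math lives in the workflow document, which is not reproduced here, so both the multiplication and the "keep the strongest contribution" rule are assumptions made purely for illustration. The `Detectinator` and `Mapping` structs are in-memory stand-ins for the database records, and `category_scores` is a hypothetical helper.

```ruby
# Hypothetical in-memory stand-ins for the Detectinator and Mapping records.
Detectinator = Struct.new(:name, :confidence)
Mapping = Struct.new(:detectinator, :category, :confidence)

# Returns a Hash of category name => score for the detectinators that fired.
# Assumption: each positive detection contributes
#   detectinator confidence * mapping confidence
# to a category, and a category keeps the strongest contribution it receives.
def category_scores(positive_detectinators, mappings)
  scores = Hash.new(0.0)
  mappings.each do |mapping|
    next unless positive_detectinators.include?(mapping.detectinator)

    contribution = mapping.detectinator.confidence * mapping.confidence
    scores[mapping.category] = [scores[mapping.category], contribution].max
  end
  scores
end

# Illustrative values only; none of these confidences come from TACOS data.
doi = Detectinator.new('doi', 0.95)
journal = Detectinator.new('journal', 0.5)
mappings = [
  Mapping.new(doi, 'transactional', 1.0),
  Mapping.new(journal, 'transactional', 0.5),
  Mapping.new(journal, 'informational', 0.5)
]
scores = category_scores([doi, journal], mappings)
```

Whether negative detections or zero-confidence mappings are stored (two of the open questions above) does not change this sketch, since only positive detections and existing mappings contribute to a score.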
\ No newline at end of file diff --git a/docs/reference/classes-prototype-c.md b/docs/reference/classes-prototype-c.md deleted file mode 100644 index 486442c..0000000 --- a/docs/reference/classes-prototype-c.md +++ /dev/null @@ -1,77 +0,0 @@ -# Prototype C ("Detections with confidence") - -This prototype relies on fewer tables, with one record in each, and leans more heavily on behavior in code. - -> [!WARNING] -> The intent was to collapse Categorizations into Detections by moving booleans to floats, but this loses important -nuance from the original prototype A-minus it was based on. - -## Shared preface - -The same color scheme is used for both prototypes: - -* Terms, which flow in continuously with Search Events; -* A knowledge graph, which includes the categories, detectors, and relationships - between the two which TACOS defines and maintains, and which is consulted during categorization; and -* The linkages between these terms and the graph, which record which signals are - detected in each term, and how those signals are interpreted to place the term into a category. - -A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue -tables in the diagrams below.
- -## Categorization - -```mermaid -classDiagram - direction LR - - Term --< Detection: has many - - class Term - Term: +Integer id - Term: +String phrase - Term: calculateCategory() - - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_version - Detection: +Float DOI - Detection: +Float ISBN - Detection: +Float ISSN - Detection: +Float PMID - Detection: +Float Journal - Detection: +Float SuggestedResource - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb -``` - -### Order of operations - -1. A new `Term` is registered. -2. A `Detection` record for that `Term` is created (which allows repeat detection operations as TACOS gains new - capabilities). Rather than storing a boolean, we store a float to represent how confident we are that the detection is able to be used for categorization. This approach feels flawed. - -### Category values - -Not worked out as the model seems flawed and was abandoned after initial discussion. - -### Calculating the category scores - -Not worked out as the model seems flawed. - -## Validations - -Not worked out as the model seems flawed. diff --git a/docs/reference/classes-prototype-d.md b/docs/reference/classes-prototype-d.md deleted file mode 100644 index 138b0ba..0000000 --- a/docs/reference/classes-prototype-d.md +++ /dev/null @@ -1,117 +0,0 @@ -# Prototype D ("Detectors have many Detections") - -This prototype attempts to store only positive detections, which seems valuable as most of our Terms have no `Detections` and other Prototypes stored "misses" in addition to "hits".
- -This comes at the cost of potentially storing more than one `Detection` for some Terms (i.e. if an ISSN and a JournalName match, we'll store 2 `Detection` records). - -`Category` and `Categorization` are optional tables. They would serve as a sort of cache for the `calculateCategory()` method for a `Term`. This allows a `Term` to have multiple categories and would serve as a way to quickly report on what we know about the system rather than calculating everything on the fly. As each `Categorization` stores a confidence float, this should allow us to return to consuming systems how confident we are in each `Category` we return. - -`Detector` is a place where we keep track of each algorithm we have coded and how confident we are in its ability to predict a Category. - -## Shared preface - -The same color scheme is used for both prototypes: - -* Terms, which flow in continuously with Search Events; -* A knowledge graph, which includes the categories, detectors, and relationships - between the two which TACOS defines and maintains, and which is consulted during categorization; and -* The linkages between these terms and the graph, which record which signals are - detected in each term, and how those signals are interpreted to place the term into a category. - -A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue -tables in the diagrams below.
- -## Categorization - -```mermaid -classDiagram - - Term "1" --> "1..*" Detection - Term "1" --> "0..*" Categorization - Detection "0..*" --> "1" Detector - - DetectionCategory "0..*" --> "1" Category - - Categorization "0..*" --> "1" Category - - Detector "1" --> "0..*" DetectionCategory - - class Term - Term: +Integer id - Term: +String phrase - Term: calculateCategory() - - - class Detection - Detection: +Integer id - Detection: +Integer term_id - Detection: +Integer detector_id - Detection: +Integer detector_version - Detection: +Float confidence - - Detection: initialize() - Detection: setDetectionVersion() - Detection: recordDetections() - Detection: recordPatterns() - Detection: recordJournals() - Detection: recordSuggestedResource() - - class Detector - Detector: +Integer id - Detector: +String name - Detector: +Float confidence - Detector: incrementConfidence() - Detector: decrementConfidence() - - class Category - Category: +Integer id - Category: +String name - - class Categorization - Categorization: +Integer category_id - Categorization: +Integer term_id - Categorization: +Float confidence - - class DetectionCategory - DetectionCategory: +Integer id - DetectionCategory: +Integer detector_id - DetectionCategory: +Integer category_id - DetectionCategory: +Float confidence - DetectionCategory: incrementConfidence() - DetectionCategory: decrementConfidence() - - - style Term fill:#000,stroke:#66c2a5,color:#66c2a5 - - style Category fill:#000,stroke:#fc8d62,color:#fc8d62 - style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb - style DetectionCategory fill:#000,stroke:#fc8d62,color:#fc8d62 - - style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 - style Detection fill:#000,stroke:#8da0cb,color:#8da0cb -``` - -### Order of operations - -1. A term enters the system -2. If Categories exist, return the existing Categories. -3. If no Categories exist, run all Detectors and create Detection and Categorization records. 
If no Detections are made, we should consider Categorizing the Term as "Unknown" Category to allow for not running Detections again. -4. If new Detectors are created/adjusted, Categorizations should be deleted or expired in some way to allow for new Detections/Categorizations to be created. - -### Category values - -These are largely algorithmic in this model. We'd know what was detected from the Detections table and the `Term` or `Category` model would handle `Categorization` based on business logic we put in place. For example, having a DOI is high confidence for being a Specific Item. - -Unsolved in this model: one Detector (so far) has `Categories` built into the `Detector` (SuggestedResources). These would need to be passed into the `calculateCategory()` method in some way to allow for appropriate `categorization`. - -### Calculating the category scores - -One interesting feature of DetectionCategory is that it stores the confidence of each algorithm to accurately predict a category. During validation, if a Detection made by an algorithm is confirmed, we can run `incrementConfidence()`, whatever that ends up meaning. Similarly, if a Detection is validated as inaccurate, we can run `decrementConfidence()`. - -> [!NOTE] ->DetectionCategory is a join table represented in Prototype B as `mapping`. This tries to nudge it towards a better name. - -Some detectors are themselves non-binary in terms of prediction so they maintain a confidence level as well (namely JournalName detection is fairly weak compared to many other algorithms to date). - -## Validations - -diff --git a/docs/reference/classes-prototype-zero.md b/docs/reference/classes-prototype-zero.md deleted file mode 100644 index 7330a7c..0000000 --- a/docs/reference/classes-prototype-zero.md +++ /dev/null @@ -1,36 +0,0 @@ -# Prototype Zero - -This was the simplest possible way to join the three basic resources (Terms, Detectors, and Categories).
- -```mermaid -classDiagram - direction TB - - Term --> Link - Category --> Link - Detector --> Link - - class Term - Term: +Integer id - Term: +String phrase - - class Category - Category: +Integer id - Category: +String name - - class Link - Link: +Integer - Link: +Integer term_id - Link: +Integer category_id - Link: +Integer detector_id - - class Detector - Detector: +Integer id - Detector: +String name -``` - -This was not developed further, because the other two prototypes (A and B) immediately seemed more capable than this -approach. Having a single join table link all three resources is a recipe for duplicate and inconsistent data that is -hard to work with. - -It is included here only for the sake of completeness. diff --git a/docs/reference/classes.md b/docs/reference/classes.md index b194159..2b26047 100644 --- a/docs/reference/classes.md +++ b/docs/reference/classes.md @@ -1,149 +0,90 @@ # Modeling categorization -## Initial proposal +The application includes the following entities, most of which can be broken into one of the following three areas: + +* Search activity, which flows in continuously with Terms and Search Events; +* A knowledge graph, which includes the categories, detectors, and relationships + between the two which TACOS defines and maintains, and which is consulted during categorization; and +* The linkages between these search terms and the graph, which record which signals are + detected in each term, and how those signals are interpreted to place the term into a category.
```mermaid classDiagram - direction TB + direction LR - AdminUser --> User : Is a Type of Term --> SearchEvent : has many - User --> Categorization : Creates a - User --> Category : Proposes a - Categorization --> Term : Includes a - Categorization --> Category : Includes a + Term "1" --> "1..*" Detection + Term "1" --> "0..*" Categorization + Detection "0..*" --> "1" Detector + + DetectionCategory "0..*" --> "1" Category + + Categorization "0..*" --> "1" Category + Detector "1" --> "0..*" DetectionCategory + + class User + User: +String uid + User: +String email + User: +Boolean admin + class Term Term: id Term: +String phrase - Term: calculate_certainty(term) - Term: list_unique_terms_with_counts() - Term: uncategorized_term() - Term: categorized_term() + Term: calculateCategory() class SearchEvent SearchEvent: +Integer id SearchEvent: +Integer term_id SearchEvent: +String source - SearchEvent: +Timestamp timestamp - - class User - User: +String kerbid - User: +Boolean admin - User: categorize_term(term, category, notes (optional)) - User: propose_category(name, description, reason) - User: view_next_term() - - class AdminUser - AdminUser: approve_category() - AdminUser: create_category() - AdminUser: upload_batch() - AdminUser: view_proposed_categories() - - class Category - Category: +String name - Category: +String reason - Category: +Boolean approved - Category: +Text description - - class Categorization - Categorization: id - Categorization: +Integer category_id - Categorization: +Integer term_id - Categorization: +Integer user_id - Categorization: +Text notes - - class DetectorCategorization - DetectorCategorization: +Integer categorization_id - DetectorCategorization: +Integer detector_id - DetectorCategorization: +Float confidence # maybe this is a wrap up of multiple Detector confidences (calculated value) + SearchEvent: +Timestamp created_at + SearchEvent: single_month() + + class Detection + Detection: +Integer id + Detection: +Integer term_id + 
Detection: +Integer detector_id + Detection: +Integer detector_version + Detection: +Float confidence + Detection: initialize() + Detection: setDetectionVersion() + Detection: recordDetections() + Detection: recordPatterns() + Detection: recordJournals() + Detection: recordSuggestedResource() class Detector Detector: +Integer id Detector: +String name - Detector: +Float confidence # determined by validation yes/no votes - - class Report - Report: percent_categorized() - Report: category_history() -``` ---- - -## Conceptual diagram - -There are three basic models which we are attempting to relate to each other: -Terms, Detectors, and Categories. The relationship looks like this: - -```mermaid -classDiagram - direction TB - - Term --> Category: are placed into - Detector --> Term: get applied to - Category --> Detector: are informed by - - class Term - Term: +Integer id - Term: +String phrase + Detector: +Float confidence + Detector: incrementConfidence() + Detector: decrementConfidence() class Category Category: +Integer id Category: +String name - class Detector - Detector: +Integer id - Detector: +String name - -``` - -Some sample data in each table might be: - -### Terms - -| id | phrase | -|----|---------------------------------------| -| 1 | web of science | -| 2 | pitchbook | -| 3 | vaibbhav taraate | -| 4 | doi.org/10.1080/17460441.2022.2084607 | ---- - -We have received more than 40,000 unique search terms from the Bento system in -the first three months of TACOS operation. - -### Categories - -| id | name | note | -|----|---------------|-------------------------------------------------------------------------------------------| -| 1 | Transactional | The user wants to complete an _action_ (i.e. to receive an item) | -| 2 | Navigational | The user wants to reach a _place_ which might be a web page, or perhaps talk to a person. | -| 3 | Informational | The user wants _information_ about an idea or concept. 
| - -Thus far, we have only focused on these three categories of search intent. It -should be noted that the SEO literature references additional categories, such -as "commercial" or "conversational". - -Additionally, some of these categories may be sub-divided. Transactional -searches might be looking for a book, a journal article, or a thesis. -Navigational searches might be satisfied by visiting the desired webpage, or -contacting a liaison. - -### Detectors + class Categorization + Categorization: +Integer category_id + Categorization: +Integer term_id + Categorization: +Float confidence -| id | name | note | -|----|--------------------|-----------------| -| 1 | DOI | Regex detection | -| 2 | ISBN | Regex detection | -| 3 | ISSN | Regex detection | -| 4 | PMID | Regex detection | -| 5 | Journal name | Term lookup | -| 6 | Suggested resource | Term lookup | + class DetectionCategory + DetectionCategory: +Integer id + DetectionCategory: +Integer detector_id + DetectionCategory: +Integer category_id + DetectionCategory: +Float confidence + DetectionCategory: incrementConfidence() + DetectionCategory: decrementConfidence() ---- + style SearchEvent fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; + style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px; -Further discussion of the class diagram can be found in the three prototype files: + style Category fill:#000,stroke:#fc8d62,color:#fc8d62 + style DetectionCategory fill:#000,stroke:#fc8d62,color:#fc8d62 + style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 -* [Prototype zero (abandoned)](./classes-prototype-zero.md) -* [Prototype A ("Code")](./classes-prototype-a.md) -* [Prototype B ("Data")](./classes-prototype-b.md) + style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; + style Detection fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5; +```
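
The class diagram above can be read as a small scoring pipeline: each `Detection` links a `Term` to a `Detector`, each `DetectionCategory` maps a `Detector` to a `Category` with a confidence, and a `Categorization` records the resulting confidence for a `Term` and `Category`. The following is a minimal Python sketch of that flow, not the TACOS implementation. It assumes that the score for a category is the highest product of a detection's confidence and the matching mapping's confidence, and that `incrementConfidence()` and `decrementConfidence()` move the value by a fixed step. The source leaves both rules unspecified, so the combination rule, the `STEP` value, and the function name `calculate_categories` are hypothetical.

```python
from dataclasses import dataclass

STEP = 0.05  # hypothetical adjustment size; the source does not define one


@dataclass
class DetectionCategory:
    """Knowledge-graph mapping between a detector and a category."""
    detector_id: int
    category_id: int
    confidence: float

    def increment_confidence(self) -> None:
        # Assumed behaviour: nudge upward after a confirmed detection, capped at 1.0.
        self.confidence = min(1.0, self.confidence + STEP)

    def decrement_confidence(self) -> None:
        # Assumed behaviour: nudge downward after a rejected detection, floored at 0.0.
        self.confidence = max(0.0, self.confidence - STEP)


@dataclass
class Detection:
    """Records that a detector fired for a term, with the detector's own confidence."""
    term_id: int
    detector_id: int
    confidence: float


def calculate_categories(detections, mappings):
    """Return {category_id: confidence} for one term's detections.

    Assumed rule: score = detection.confidence * mapping.confidence,
    keeping the highest score seen for each category.
    """
    scores = {}
    for detection in detections:
        for mapping in mappings:
            if mapping.detector_id != detection.detector_id:
                continue
            score = detection.confidence * mapping.confidence
            previous = scores.get(mapping.category_id, 0.0)
            scores[mapping.category_id] = max(previous, score)
    return scores


if __name__ == "__main__":
    # Illustrative values only: detector 1 is a strong signal, detector 5 a weak one.
    mappings = [DetectionCategory(1, 1, 0.9), DetectionCategory(5, 1, 0.5)]
    detections = [Detection(4, 1, 1.0), Detection(4, 5, 0.6)]
    print(calculate_categories(detections, mappings))
```

Taking the maximum rather than summing keeps the result in the range 0 to 1 without normalization; a real implementation might prefer a different aggregation once validation data is available.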