Write up documentation for workflow so far
Update validation workflow doc
Update workflow explanation
This adds the Ruby code block to categorize all terms
Separate prototype data model document
This needs to be cleaned up, along with classes.md
Further documentation work
Updates to documentation
1 parent 6a6bb4a, commit bba7881
Showing 6 changed files with 686 additions and 224 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
# The categorization and validation workflow | ||
|
||
This document describes the workflow for categorizing a given term in TACOS, and | ||
then validating how that term was processed. | ||
|
||
## Preparation | ||
|
||
Pick the record to work with. In production, this would happen as new terms are | ||
recorded, but for now we're working with a randomly chosen example. | ||
|
||
```ruby | ||
t = Term.all.sample | ||
``` | ||
|
||
## Pass the term through our suite of detectors | ||
|
||
This assumes that all of our detection algorithms are integrated with the | ||
Detection model, which creates a record of their output for processing during | ||
the Categorization phase. | ||
|
||
```ruby | ||
d = Detection.new(t) | ||
d.save | ||
``` | ||
|
||
So far, the Detection model only records whether each detector fired, as | ||
boolean values. Future development might add more details, such as which records | ||
are matched, or what external lookups return. It might also be relevant to note | ||
whether multiple patterns are found. | ||
|
||
```ruby | ||
irb(main):013> d | ||
=> | ||
#<Detection:0x0000000122606878 | ||
id: 5, | ||
term_id: 53558, | ||
detection_version: 1, | ||
doi: false, | ||
isbn: false, | ||
issn: false, | ||
pmid: false, | ||
journal: false, | ||
suggestedresource: false, | ||
created_at: Fri, 23 Aug 2024 13:38:21.631333000 UTC +00:00, | ||
updated_at: Fri, 23 Aug 2024 13:38:21.631333000 UTC +00:00> | ||
``` | ||
|
||
In this example, none of the detectors found anything. | ||
|
||
The current `detection_version` is stored in ENV and incremented as our | ||
detection algorithms change; each Detection records the version it was created | ||
under. This helps identify whether a Detection is outdated and needs to be | ||
refreshed. | ||
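As a minimal sketch of how that staleness check might look, assuming the current version is read from an environment variable named `DETECTION_VERSION` (a hypothetical name; the text only says the value lives in ENV) and using a plain `Struct` in place of the ActiveRecord model:

```ruby
# Stand-in for the Detection model; only the fields needed here.
DetectionStub = Struct.new(:term_id, :detection_version)

# DETECTION_VERSION is an assumed variable name, not taken from TACOS.
def current_detection_version(env = ENV)
  Integer(env.fetch('DETECTION_VERSION', '1'))
end

# A detection is outdated when it was created under an older version.
def outdated?(detection, env = ENV)
  detection.detection_version < current_detection_version(env)
end

env = { 'DETECTION_VERSION' => '2' }
puts outdated?(DetectionStub.new(53_558, 1), env) # prints true
puts outdated?(DetectionStub.new(53_558, 2), env) # prints false
```

Passing the environment in as an argument keeps the check easy to exercise in a console without touching the real ENV.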
|
||
## Generate the Categorization values based on these detections | ||
|
||
```ruby | ||
c = Categorization.new(d) | ||
c.save | ||
``` | ||
|
||
The creation of the record includes the calculation of scores for each of the | ||
three categories. To this point, the logic is exceedingly simple, but this can | ||
be made more nuanced with time. | ||
|
||
```ruby | ||
irb(main):019> c | ||
=> | ||
#<Categorization:0x0000000117c3a920 | ||
id: 2, | ||
detection_id: 5, | ||
transaction_score: 0.0, | ||
information_score: 0.0, | ||
navigation_score: 0.0, | ||
created_at: Fri, 23 Aug 2024 13:43:17.640485000 UTC +00:00, | ||
updated_at: Fri, 23 Aug 2024 13:43:17.640485000 UTC +00:00> | ||
``` | ||
|
||
These scores are used by the `evaluate` method to assign the term to a category, | ||
if relevant. Because none of the detectors fired in the previous step, all of | ||
the category scores are 0.0 and the term will be placed in the "unknown" | ||
category. | ||
|
||
```ruby | ||
t.category = c.evaluate | ||
t.save | ||
``` | ||
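The body of `evaluate` is not shown in this document. A rough sketch of the comparison it performs, assuming that the highest score wins and that a shared top score yields "unknown" (the tie rule and the category strings other than "unknown" are assumptions here), could look like this with a plain `Struct` standing in for the model:

```ruby
# Stand-in for a Categorization record; field names follow the output above.
Scores = Struct.new(:transaction_score, :information_score, :navigation_score)

# Return the winning category, or "unknown" when the top score is shared.
# The tie-breaking rule is an assumption made for this sketch.
def evaluate(scores)
  ranked = {
    'transactional' => scores.transaction_score,
    'informational' => scores.information_score,
    'navigational' => scores.navigation_score
  }
  top = ranked.values.max
  winners = ranked.select { |_name, score| score == top }.keys
  winners.size == 1 ? winners.first : 'unknown'
end

puts evaluate(Scores.new(0.0, 0.0, 0.0)) # prints unknown
puts evaluate(Scores.new(1.0, 0.0, 0.0)) # prints transactional
```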
|
||
There is also an `assign` method at the moment, which combines the above steps. | ||
This may not make sense in production, however. | ||
|
||
The result of the Categorization workflow is that the original Term record now | ||
has been placed in a category: | ||
|
||
```ruby | ||
irb(main):008> t | ||
=> | ||
#<Term:0x00000001073c56d8 | ||
id: 53558, | ||
phrase: "Darfur: A Short History of a Long War ", | ||
created_at: Tue, 20 Aug 2024 13:26:23.628215000 UTC +00:00, | ||
updated_at: Tue, 20 Aug 2024 13:26:23.628215000 UTC +00:00, | ||
category: "unknown"> | ||
``` | ||
|
||
From end to end, the code to categorize all untouched term records is then this: | ||
|
||
```ruby | ||
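# find_each loads records in batches rather than all at once. | ||
Term.where(category: nil).find_each do |t| | ||
  d = Detection.new(t) | ||
  d.save | ||
  c = Categorization.new(d) | ||
  c.assign # combines evaluate with saving the term's category | ||
end | ||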
``` | ||
|
||
## Validation | ||
|
||
Humans will be asked to inspect the outcomes of the previous steps, and provide | ||
feedback about whether any decisions were made incorrectly. | ||
|
||
```ruby | ||
v = Validation.new(c) | ||
v.save | ||
``` | ||
|
||
Validation records have a boolean flag for each decision which went into the | ||
process thus far: | ||
|
||
```ruby | ||
irb(main):011> v | ||
=> | ||
#<Validation:0x0000000116296870 | ||
id: 1, | ||
categorization_id: 3, | ||
valid_category: nil, | ||
valid_transaction: nil, | ||
valid_information: nil, | ||
valid_navigation: nil, | ||
valid_doi: nil, | ||
valid_isbn: nil, | ||
valid_issn: nil, | ||
valid_pmid: nil, | ||
valid_journal: nil, | ||
valid_suggested_resource: nil, | ||
flag_term: nil, | ||
created_at: Fri, 23 Aug 2024 14:57:09.627620000 UTC +00:00, | ||
updated_at: Fri, 23 Aug 2024 14:57:09.627620000 UTC +00:00> | ||
``` | ||
|
||
This includes a flag for the final result, each component score, each individual | ||
detection, and a final flag that indicates the Term itself needs review. The | ||
intent of this final flag is for the case where a search term is somehow | ||
problematic and needs to be expunged. | ||
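One way to read these flags is as three-state values: `nil` until a reviewer has looked at a decision, then `true` or `false`. The sketch below illustrates that reading with a plain `Struct`; the helper names and the interpretation of `nil` as "not yet reviewed" are assumptions, not part of the Validation model.

```ruby
# Decision flags, named as in the console output above.
DECISION_FLAGS = %i[valid_category valid_transaction valid_information
                    valid_navigation valid_doi valid_isbn valid_issn
                    valid_pmid valid_journal valid_suggested_resource].freeze

# Stand-in for the Validation model.
ValidationStub = Struct.new(:categorization_id, *DECISION_FLAGS, :flag_term,
                            keyword_init: true)

# Flags still awaiting a reviewer's decision (nil is read as "not reviewed").
def pending_flags(validation)
  DECISION_FLAGS.select { |flag| validation[flag].nil? }
end

# Flags the reviewer explicitly marked as incorrect.
def rejected_flags(validation)
  DECISION_FLAGS.select { |flag| validation[flag] == false }
end

v = ValidationStub.new(categorization_id: 3)
v.valid_category = true
v.valid_isbn = false
puts pending_flags(v).size     # prints 8
puts rejected_flags(v).inspect # prints [:valid_isbn]
```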
|
||
There are no methods yet on this model, because all values are meant to be set | ||
individually via the web interface. | ||
|
||
There is not - yet - a notes field on the Validation model, but this is | ||
something that we've discussed in case the validator has more detailed feedback | ||
about some part of the decision-making that is being reviewed. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# The categorization and validation workflow | ||
|
||
Need to write up how Prototype B would operate from start to end... |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,226 @@ | ||
# Prototype A ("Code") | ||
|
||
This prototype relies on fewer tables, with one record in each, and leans more heavily on behavior in code. | ||
|
||
## Shared preface | ||
|
||
The same color scheme is used for both prototypes: | ||
|
||
* <font style="color:#66c2a5">Terms</font>, which flow in continuously with Search Events; | ||
* A <font style="color:#fc8d62">knowledge graph</font>, which includes the categories, detectors, and relationships | ||
between the two which TACOS defines and maintains, and which is consulted during categorization; and | ||
* The <font style="color:#8da0cb">linkages between these terms and the graph</font>, which record which signals are | ||
detected in each term, and how those signals are interpreted to place the term into a category. | ||
|
||
A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue | ||
tables in the diagrams below. | ||
|
||
## Categorization | ||
|
||
```mermaid | ||
classDiagram | ||
direction LR | ||
Term --< Detection: has many | ||
Detection <-- Categorization: based on | ||
Categorization --> SuggestedResource: looks up | ||
Detection --> SuggestedResource: looks up | ||
Detection --> Journal: looks up | ||
class Term | ||
Term: +Integer id | ||
Term: +String phrase | ||
Term: +Enum category | ||
class SuggestedResource | ||
SuggestedResource: +Integer id | ||
SuggestedResource: +String title | ||
SuggestedResource: +String url | ||
SuggestedResource: +String phrase | ||
SuggestedResource: +String fingerprint | ||
SuggestedResource: +Enum category | ||
SuggestedResource: calculateFingerprint() | ||
class Journal | ||
Journal: +Integer id | ||
Journal: +String title | ||
class Detection | ||
Detection: +Integer id | ||
Detection: +Integer term_id | ||
Detection: +Integer detection_version | ||
Detection: +Boolean DOI | ||
Detection: +Boolean ISBN | ||
Detection: +Boolean ISSN | ||
Detection: +Boolean PMID | ||
Detection: +Boolean Journal | ||
Detection: +Boolean SuggestedResource | ||
Detection: initialize() | ||
Detection: setDetectionVersion() | ||
Detection: recordDetections() | ||
Detection: recordPatterns() | ||
Detection: recordJournals() | ||
Detection: recordSuggestedResource() | ||
class Categorization | ||
Categorization: +Integer id | ||
Categorization: +Integer detection_id | ||
Categorization: +Float information_score | ||
Categorization: +Float navigation_score | ||
Categorization: +Float transaction_score | ||
Categorization: initialize() | ||
Categorization: assign() | ||
Categorization: evaluate() | ||
Categorization: calculateAll() | ||
Categorization: calculateInformation() | ||
Categorization: calculateNavigation() | ||
Categorization: calculateTransaction() | ||
style Term fill:#000,stroke:#66c2a5,color:#66c2a5 | ||
style Category fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb | ||
style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb | ||
``` | ||
|
||
### Order of operations | ||
|
||
1. A new `Term` is registered. | ||
2. A `Detection` record for that `Term` is created (which allows repeat detection operations as TACOS gains new | ||
capabilities). | ||
3. The various `Detection` records (either the most recent for each term, or all detections over time) are processed via | ||
code to generate scores for each potential category. These results are stored as `Categorization` records. | ||
4. The three category scores are compared, and the one with the highest score is stored back in the `Term` record. | ||
|
||
### Category values | ||
|
||
There is no `Category` table, but two models have separate enumerated fields. The `Detector::SuggestedResource` model | ||
has three possible values (Informational, Navigational, and Transactional), while the `Term` model has an additional | ||
value ("Unknown") which is assigned during Categorization if two category scores are equal. | ||
|
||
(This lack of a category table is not a fundamental aspect of this prototype, but it does indicate the general choice to | ||
rely on code, rather than database records, as much as possible. A `Category` model could be accommodated, or the shared | ||
values could perhaps be implemented via a helper method.) | ||
|
||
### Calculating the category scores | ||
|
||
At the moment, category scores are assigned in methods like: | ||
|
||
```ruby | ||
# FILE: app/models/categorization.rb | ||
def calculate_transactional | ||
  # Any pattern or journal detection yields the maximum score. | ||
  pattern_hit = %i[doi isbn issn pmid journal].any? { |signal| detection[signal] } | ||
  # A suggested resource counts only if its own category is transactional. | ||
  resource = Detector::SuggestedResource.full_term_match(detection.term.phrase).first | ||
  resource_hit = resource&.category == 'transactional' | ||
  self.transaction_score = (pattern_hit || resource_hit) ? 1.0 : 0.0 | ||
end | ||
``` | ||
|
||
This is effectively an "all or nothing" approach, where any detection at all results in the maximum possible score. It | ||
lacks nuance, and we've talked about ways to include a confidence value in these calculations. So far, however, this | ||
prototype has not attempted to include that feature. | ||
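To make the "all or nothing" behavior concrete, here is a small self-contained sketch that uses plain hashes in place of Detection records. It covers only the boolean signals and leaves out the suggested resource lookup; the `transaction_score` helper is illustrative, not the model method itself.

```ruby
SIGNALS = %i[doi isbn issn pmid journal].freeze

# Any single signal yields the maximum score; more signals add nothing.
def transaction_score(detection)
  fired = SIGNALS.any? { |signal| detection[signal] }
  fired ? 1.0 : 0.0
end

none = { doi: false, isbn: false, issn: false, pmid: false, journal: false }
one  = none.merge(isbn: true)
all  = none.transform_values { true }

puts transaction_score(none) # prints 0.0
puts transaction_score(one)  # prints 1.0
puts transaction_score(all)  # prints 1.0
```

A term that matches one detector and a term that matches all five receive the same score, which is the gap a confidence value would need to close.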
|
||
**Note:** I've tried to anticipate how to include confidence values appropriately in this prototype, and it is not at | ||
all clear how that might happen. This gets to the mathematical operations involved in calculating the category scores, | ||
which might need to be documented separately. | ||
|
||
## Validations | ||
|
||
```mermaid | ||
classDiagram | ||
direction LR | ||
Term --< Detection: has many | ||
Detection <-- Categorization: based on | ||
Categorization --> SuggestedResource: looks up | ||
Detection --> SuggestedResource: looks up | ||
Detection --> Journal: looks up | ||
Categorization >-- Validation: subject to | ||
class Term | ||
Term: +Integer id | ||
Term: +String phrase | ||
Term: +Enum category | ||
class SuggestedResource | ||
SuggestedResource: +Integer id | ||
SuggestedResource: +String title | ||
SuggestedResource: +String url | ||
SuggestedResource: +String phrase | ||
SuggestedResource: +String fingerprint | ||
SuggestedResource: +Enum category | ||
SuggestedResource: calculateFingerprint() | ||
class Journal | ||
Journal: +Integer id | ||
Journal: +String title | ||
class Detection | ||
Detection: +Integer id | ||
Detection: +Integer term_id | ||
Detection: +Integer detection_version | ||
Detection: +Boolean DOI | ||
Detection: +Boolean ISBN | ||
Detection: +Boolean ISSN | ||
Detection: +Boolean PMID | ||
Detection: +Boolean Journal | ||
Detection: +Boolean SuggestedResource | ||
Detection: initialize() | ||
Detection: setDetectionVersion() | ||
Detection: recordDetections() | ||
Detection: recordPatterns() | ||
Detection: recordJournals() | ||
Detection: recordSuggestedResource() | ||
class Categorization | ||
Categorization: +Integer id | ||
Categorization: +Integer detection_id | ||
Categorization: +Float information_score | ||
Categorization: +Float navigation_score | ||
Categorization: +Float transaction_score | ||
Categorization: initialize() | ||
Categorization: assign() | ||
Categorization: evaluate() | ||
Categorization: calculateAll() | ||
Categorization: calculateInformation() | ||
Categorization: calculateNavigation() | ||
Categorization: calculateTransaction() | ||
class Validation | ||
Validation: +Integer id | ||
Validation: +Integer categorization_id | ||
Validation: +Integer user_id | ||
Validation: +Boolean approve_transaction | ||
Validation: +Boolean approve_information | ||
Validation: +Boolean approve_navigation | ||
Validation: +Boolean approve_doi | ||
Validation: +Boolean approve_isbn | ||
Validation: +Boolean approve_issn | ||
Validation: +Boolean approve_pmid | ||
Validation: +Boolean approve_journal | ||
Validation: +Boolean approve_suggested_resource | ||
style Term fill:#000,stroke:#66c2a5,color:#66c2a5 | ||
style Category fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Journal fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb | ||
style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb | ||
style Validation fill:#000,stroke:#ffd407,color:#ffd407 | ||
``` | ||
|
||
Validations, in this prototype, are collected in a single table with a field for each decision that came before it. As | ||
the application expands, any new detector or category would result in a new field, both in the Detection or | ||
Categorization model and in the Validation model. | ||
|
||
Multiple validations are possible for a single Categorization decision, enabled by the user_id field, which lets | ||
several users provide feedback on the same decision if bandwidth allows. |
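How several validations of the same decision would be combined is not specified here. As one possible illustration, the sketch below groups validation rows by `categorization_id` and counts approvals and rejections for a single field; the `tally_transaction` helper and the idea of counting votes are assumptions made for the example.

```ruby
# Stand-in for Validation rows, limited to one approval field.
ValidationRow = Struct.new(:categorization_id, :user_id, :approve_transaction)

# Tally approvals and rejections per categorization; nil flags are skipped.
def tally_transaction(validations)
  validations.group_by(&:categorization_id).transform_values do |rows|
    votes = rows.map(&:approve_transaction).compact
    { approve: votes.count(true), reject: votes.count(false) }
  end
end

rows = [
  ValidationRow.new(3, 10, true),
  ValidationRow.new(3, 11, false),
  ValidationRow.new(3, 12, true),
  ValidationRow.new(4, 10, nil)
]
p tally_transaction(rows)
```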