This is the model before reintroducing the `mapping` table from Prototype B that became clearer during documentation
Showing 1 changed file with 100 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
# Prototype D ("Detectors have many Detections") | ||
|
||
This prototype attempts to store only positive detections, which seems valuable because most of our Terms have no `Detections`, and other prototypes stored "misses" in addition to "hits". | ||
|
||
This comes at the cost of potentially storing more than one `Detection` for some Terms (e.g. if both an ISSN and a JournalName match, we'll store two `Detection` records). | ||
|
||
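As a rough illustration of the "hits only" idea, the sketch below runs two detectors against a phrase and returns a record only for the ones that match. The detector functions, the simplified patterns, and the `KNOWN_JOURNALS` set are all hypothetical stand-ins, not the real detection rules.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    term_id: int
    detector_name: str


def detect_issn(phrase: str) -> bool:
    # Simplified, hypothetical pattern: four digits, a hyphen, three digits,
    # then a digit or X. No checksum validation is attempted.
    return re.search(r"\b\d{4}-\d{3}[\dXx]\b", phrase) is not None


# Hypothetical stand-in for a journal-name lookup.
KNOWN_JOURNALS = {"nature", "science"}


def detect_journal_name(phrase: str) -> bool:
    # Naive substring match, for illustration only.
    return any(name in phrase.lower() for name in KNOWN_JOURNALS)


DETECTORS = {"ISSN": detect_issn, "JournalName": detect_journal_name}


def record_detections(term_id: int, phrase: str) -> list:
    """Return one Detection per matching detector; misses store nothing."""
    return [
        Detection(term_id, name)
        for name, detect in DETECTORS.items()
        if detect(phrase)
    ]
```

A phrase that matches both detectors yields two records, while a phrase that matches none yields an empty list, so nothing is stored for it.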
`Category` and `Categorization` are optional tables. They would serve as a sort of cache for the `calculateCategory()` method of a `Term`. This allows a `Term` to have multiple categories and would serve as a way to quickly report on what we know about the system rather than calculating everything on the fly. Because each `Categorization` stores a confidence float, we can tell consuming systems how confident we are in each `Category` we return. | ||
|
||
`Detector` is where we keep track of each algorithm we have coded and how confident we are in its ability to predict a Category. | ||
|
||
## Shared preface | ||
|
||
The same color scheme is used for both prototypes: | ||
|
||
* <font style="color:#66c2a5">Terms</font>, which flow in continuously with Search Events; | ||
* A <font style="color:#fc8d62">knowledge graph</font>, which includes the categories, detectors, and relationships | ||
between the two which TACOS defines and maintains, and which is consulted during categorization; and | ||
* The <font style="color:#8da0cb">linkages between these terms and the graph</font>, which record which signals are | ||
detected in each term, and how those signals are interpreted to place the term into a category. | ||
|
||
A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue | ||
tables in the diagrams below. | ||
|
||
## Categorization | ||
|
||
```mermaid | ||
classDiagram | ||
direction LR | ||
Term "1" --> "0..*" Detection | ||
Detector "1" --> "0..*" Detection | ||
Term "1" --> "0..*" Categorization | ||
Categorization "0..*" --> "1" Category | ||
class Term | ||
Term: +Integer id | ||
Term: +String phrase | ||
Term: calculateCategory() | ||
class Detection | ||
Detection: +Integer id | ||
Detection: +Integer term_id | ||
Detection: +Integer detector_id | ||
Detection: +Integer detector_version | ||
Detection: +Float confidence | ||
Detection: initialize() | ||
Detection: setDetectionVersion() | ||
Detection: recordDetections() | ||
Detection: recordPatterns() | ||
Detection: recordJournals() | ||
Detection: recordSuggestedResource() | ||
class Detector | ||
Detector: +Integer id | ||
Detector: +String name | ||
Detector: +Float confidence | ||
Detector: incrementConfidence() | ||
Detector: decrementConfidence() | ||
class Category | ||
Category: +Integer id | ||
Category: +String name | ||
class Categorization | ||
Categorization: +Integer category_id | ||
Categorization: +Integer term_id | ||
Categorization: +Float confidence | ||
style Term fill:#000,stroke:#66c2a5,color:#66c2a5 | ||
style Category fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb | ||
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62 | ||
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb | ||
``` | ||
|
||
### Order of operations | ||
|
||
1. A Term enters the system. | ||
2. If Categories exist for the Term, return the existing Categories. | ||
3. If no Categories exist, run all Detectors and create Detection and Categorization records. If no Detections are made, we should consider categorizing the Term as "Unknown" so that Detectors are not run again for it. | ||
4. If Detectors are created or adjusted, Categorizations should be deleted or expired in some way so that new Detections and Categorizations can be created. | ||
|
||
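The steps above can be sketched as a small cache-then-detect routine. This is a minimal in-memory sketch, not the actual data layer: the `Store` class, the simplified DOI check, the confidence values, and the choice to give "Unknown" a confidence of 0.0 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for the Detection and Categorization tables."""
    detections: dict = field(default_factory=dict)       # phrase -> [detector names]
    categorizations: dict = field(default_factory=dict)  # phrase -> [(category, confidence)]


# Hypothetical detectors: name -> (predicate, category, confidence).
# The DOI check is deliberately naive.
DETECTORS = {
    "DOI": (lambda p: p.startswith("10.") and "/" in p, "Specific Item", 0.95),
}


def categorize(store: Store, phrase: str) -> list:
    # Step 2: return existing Categories if we already have them.
    if phrase in store.categorizations:
        return store.categorizations[phrase]

    # Step 3: run every Detector and keep only the hits.
    hits = [
        (name, category, confidence)
        for name, (matches, category, confidence) in DETECTORS.items()
        if matches(phrase)
    ]
    if hits:
        store.detections[phrase] = [name for name, _, _ in hits]
    else:
        store.detections.pop(phrase, None)

    # No hits: record "Unknown" so the Detectors are not run again.
    categories = [(c, conf) for _, c, conf in hits] or [("Unknown", 0.0)]
    store.categorizations[phrase] = categories
    return categories


def expire_categorizations(store: Store) -> None:
    # Step 4: after Detectors change, drop the cached results so that
    # the next lookup recomputes them.
    store.categorizations.clear()
```

Note that the "Unknown" entry is what prevents repeated detection runs for Terms with no hits, which is why expiring Categorizations matters whenever a Detector is added or adjusted.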
### Category values | ||
|
||
These are largely algorithmic in this model. We'd know what was detected from the Detections table, and the `Term` or `Category` model would handle `Categorization` based on business logic we put in place. For example, a detected DOI gives high confidence that the Term is a Specific Item. | ||
|
||
Unsolved in this model: one Detector (so far) has `Categories` built into the `Detector` (SuggestedResources). These would need to be passed into the `calculateCategory()` method in some way to allow for appropriate `categorization`. | ||
|
||
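One possible shape for this logic, offered only as a sketch: a rule table maps Detectors to Categories, and a Detection may optionally carry its own Category for the SuggestedResources-style case. The `category` field on `Detection`, the `RULES` table, the confidence numbers, and `SUPPLIED_CONFIDENCE` are hypothetical; only the DOI to "Specific Item" pairing comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    detector: str
    # Hypothetical field: set only when the Detector supplies its own Category.
    category: Optional[str] = None


# Hypothetical rule table: detector name -> (category, confidence).
RULES = {"DOI": ("Specific Item", 0.95)}

# Assumed confidence for a Category supplied directly by a Detector.
SUPPLIED_CONFIDENCE = 0.5


def calculate_category(detections: list) -> dict:
    """Return {category: confidence}, keeping the strongest signal per category."""
    scores = {}
    for detection in detections:
        if detection.category is not None:
            category, confidence = detection.category, SUPPLIED_CONFIDENCE
        elif detection.detector in RULES:
            category, confidence = RULES[detection.detector]
        else:
            continue  # no rule for this Detector; it contributes nothing
        scores[category] = max(scores.get(category, 0.0), confidence)
    return scores
```

Taking the maximum per Category is just one way to combine signals; the real business logic may weigh them differently.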
### Calculating the category scores | ||
|
||
One interesting feature of `Detector` is that it stores how confident we are that each algorithm accurately predicts a category. During validation, if a Detection made by an algorithm is confirmed, we can run `incrementConfidence()`, whatever that ends up meaning. Similarly, if a Detection is validated as inaccurate, we can run `decrementConfidence()`. | ||
|
||
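Since the meaning of these methods is still open, the following is only one possible interpretation: nudge the confidence by a fixed step and clamp it to the range 0.0 to 1.0. The `STEP` value and the clamping behavior are assumptions, and the snake_case names simply mirror `incrementConfidence()` and `decrementConfidence()` from the diagram.

```python
from dataclasses import dataclass

# Hypothetical step size; the document does not define one.
STEP = 0.05


@dataclass
class Detector:
    name: str
    confidence: float

    def increment_confidence(self) -> None:
        # A validated Detection nudges confidence up, capped at 1.0.
        self.confidence = min(1.0, self.confidence + STEP)

    def decrement_confidence(self) -> None:
        # A rejected Detection nudges confidence down, floored at 0.0.
        self.confidence = max(0.0, self.confidence - STEP)
```

A fixed step is the simplest choice; a ratio of confirmed to total validations would be another reasonable reading.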
Note: Prototype B likely has this feature as well, and has a mapping table between Detector and Category, which makes a lot of sense. The confidence likely belongs on that join table rather than on the Detector itself. Some detectors are themselves non-binary in their predictions, so it is possible that a confidence should also be stored for each Detection (notably, JournalName detection is fairly weak compared to many other algorithms to date). | ||
|
||
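To make the note concrete, here is a sketch in which confidence lives on the Detector-to-Category mapping and is combined with a per-Detection confidence. The `MAPPING` rows, the "Example Category" name, the numbers, and the choice to multiply the two confidences are all assumptions for illustration, not a description of Prototype B.

```python
# Hypothetical rows of a Detector-to-Category mapping table,
# keyed by (detector name, category name) with a confidence value.
MAPPING = {
    ("DOI", "Specific Item"): 0.95,
    ("JournalName", "Example Category"): 0.4,
}


def score(detector: str, detection_confidence: float) -> dict:
    """Combine mapping confidence with how sure this Detection is.

    Multiplying the two values is an assumed combination rule.
    """
    return {
        category: mapping_confidence * detection_confidence
        for (name, category), mapping_confidence in MAPPING.items()
        if name == detector
    }
```

Under this sketch a binary detector would always pass a detection confidence of 1.0, while a weaker detector such as JournalName could pass something lower.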
## Validations | ||
|