Diagramming work prior to prototyping

Reword this... tmp best commit message evah DO NOT MERGE: More verbose class diagram The changes here are still workshopping how we want to handle the data model - there isn't a single change that's being proposed yet. Add validation to code approach diagram
MITLibraries · Aug 29, 2024 · 6a6bb4a · 6a6bb4a
1 parent ab821df
commit 6a6bb4a
Show file tree

Hide file tree

Showing 2 changed files with 315 additions and 2 deletions.
diff --git a/docs/explanation/work-activity-analysis.md b/docs/explanation/work-activity-analysis.md
@@ -50,7 +50,7 @@ _Our initial minimal product will only include this staff workflow to allow us t
 
 ```mermaid
 graph TD
-  A("Liaison 🧑") --> C{Dashboard}
+  A("Expert 🧑") --> C{Dashboard}
   C --> E(View uncategorized)
   G --> E
   E --> G("Enter categorization (and optional comments)")
@@ -68,7 +68,7 @@ One way to frame this is "Is this search a match with this category" (a yes/no q
 
 ```mermaid
 graph TD
-  A("Liaison 🧑") --> C{Dashboard}
+  A("Expert 🧑") --> C{Dashboard}
   G --> D
   C --> D(View algorithm predictions)
   D --> F{Correct prediction?}

diff --git a/docs/reference/classes.md b/docs/reference/classes.md
@@ -1,3 +1,7 @@
+# Modeling categorization
+
+## Initial proposal
+
 ```mermaid
 classDiagram
   direction TB
@@ -50,7 +54,316 @@ classDiagram
     Categorization: +Integer user_id
     Categorization: +Text notes
 
+  class DetectorCategorization
+    DetectorCategorization: +Integer categorization_id
+    DetectorCategorization: +Integer detector_id
+    DetectorCategorization: +Float confidence # maybe this is a wrap up of multiple Detector confidences (calculated value)
+
+  class Detector
+    Detector: +Integer id
+    Detector: +String name
+    Detector: +Float confidence # determined by validation yes/no votes
+
   class Report
     Report: percent_categorized()
     Report: category_history()
 ```
+---
+
+## Conceptual diagram
+
+There are three basic models which we are attempting to relate to each other:
+Terms, Detectors, and Categories. The relationship looks like this:
+
+```mermaid
+classDiagram
+  direction TB
+
+  Term --> Category: are placed into
+  Detector --> Term: get applied to
+  Category --> Detector: are informed by
+
+  class Term
+    Term: +Integer id
+    Term: +String phrase
+
+  class Category
+    Category: +Integer id
+    Category: +String name
+
+  class Detector
+    Detector: +Integer id
+    Detector: +String name
+
+```
+
+Some sample data in each table might be:
+
+### Terms
+
+| id | phrase                                |
+|----|---------------------------------------|
+| 1  | web of science                        |
+| 2  | pitchbook                             |
+| 3  | vaibbhav taraate                      |
+| 4  | doi.org/10.1080/17460441.2022.2084607 |
+---
+
+We have received more than 40,000 unique search terms from the Bento system in
+the first three months of TACOS operation.
+
+### Categories
+
+| id | name          | note                                                                                      |
+|----|---------------|-------------------------------------------------------------------------------------------|
+| 1  | Transactional | The user wants to complete an _action_ (i.e. to receive an item)                          |
+| 2  | Navigational  | The user wants to reach a _place_ which might be a web page, or perhaps talk to a person. |
+| 3  | Informational | The user wants _information_ about an idea or concept.                                    |
+
+Thus far, we have only focused on these three categories of search intent. It
+should be noted that the SEO literature references additional categories, such
+as "commercial" or "conversational".
+
+Additionally, some of these categories may be sub-divided. Transactional
+searches might be looking for a book, a journal article, or a thesis.
+Navigational searches might be satisfied by visiting the desired webpage, or
+contacting a liaison.
+
+### Detectors
+
+| id | name               | note            |
+|----|--------------------|-----------------|
+| 1  | DOI                | Regex detection |
+| 2  | ISBN               | Regex detection |
+| 3  | ISSN               | Regex detection |
+| 4  | PMID               | Regex detection |
+| 5  | Journal name       | Term lookup     |
+| 6  | Suggested resource | Term lookup     |
+
+
+## One central join table
+```mermaid
+classDiagram
+  direction TB
+
+  Term --> Link
+  Category --> Link
+  Detector --> Link
+
+  class Term
+    Term: +Integer id
+    Term: +String phrase
+
+  class Category
+    Category: +Integer id
+    Category: +String name
+
+  class Link
+    Link: +Integer
+    Link: +Integer term_id
+    Link: +Integer category_id
+    Link: +Integer detector_id
+
+  class Detector
+    Detector: +Integer id
+    Detector: +String name
+```
+---
+# Sets of two-way join tables
+
+```mermaid
+classDiagram
+  direction LR
+
+  Term >-- TermDetector
+  TermDetector --> Detector
+  Category <-- DetectorCategory
+  DetectorCategory --> Detector
+  Term --> TermCategory
+  TermCategory <-- Category
+  SuggestedResource --> Category
+  Term <-- TermSuggestedResource
+  TermSuggestedResource --> SuggestedResource
+
+  class Term:::primarytable
+    Term: +Integer id
+    Term: +String phrase
+
+  class TermDetector
+    TermDetector: +Integer term_id
+    TermDetector: +Integer detector_id
+    TermDetector: +Boolean result
+
+  class Detector
+    Detector: +Integer id
+    Detector: +String name
+    Detector: hasMatch()
+
+  class Category
+    Category: +Integer id
+    Category: +String name
+    Category: +String note
+
+  class DetectorCategory
+    DetectorCategory: +Integer detector_id
+    DetectorCategory: +Integer category_id
+
+  class TermCategory
+    TermCategory: +Integer term_id
+    TermCategory: +Integer category_id
+    TermCategory: +Integer user_id
+
+  class SuggestedResource
+    SuggestedResource: +Integer id
+    SuggestedResource: +String title
+    SuggestedResource: +String fingerprint
+    SuggestedResource: +URL url
+    SuggestedResource: +Integer category_id
+
+  class TermSuggestedResource
+    TermSuggestedResource: +Integer term_id
+    TermSuggestedResource: +Integer suggested_resource_id
+    TermSuggestedResource: +Boolean result
+
+  style Category fill:#000,stroke:#ffd407,color:#ffd407
+  style Detector fill:#000,stroke:#ffd407,color:#ffd407
+  style Term fill:#000,stroke:#ffd407,color:#ffd407
+```
+
+The principle resources are Terms, Categories, and Detectors. Terms flow in
+continuously. Detectors are less fluid, but might still be expected to change as
+we improve our operations. Categories are the slowest changing.
+
+The relationship between Detectors and Categories would be generally set ahead
+of time. Detectors produce a boolean output in the cleanest case - they either
+detect a signal, or they do not. Relatedly, detectors have an influence over
+whether a given Category is relevant, or not:
+
+* If the Detector for a DOI pattern returns `true`, then this influences the
+  `transactional` Category to a significant degree.
+* However, the Detector for a DOI pattern does almost nothing to influence the
+  `navigational` Category.
+* If Categorization is a zero-sum activity, however, the DOI pattern detector
+  would _exclusively_ claim a Term for the `transactional` Category - so it
+  would effectively rule out the other two Categories.
+
+The exception to this Detector rule is the SuggestedResource detector - which
+has variability in its records. Some SuggestedResources are in each of the three
+Categories, so there is a more complicated decision-making algorithm, and thus
+a different set of database tables.
+
+## Order of operations
+
+The linkages between these tables are filled in at different moments.
+
+The Detector-Category linkage is determined as either set of resource is made,
+and on a relatively slow cadence. Operationally, the links which matter are made
+as new Terms flow into TACOS.
+
+1. A new Term is recorded in the system.
+2. That Term is compared with each Detector, and any positive responses are
+   recorded. Negative responses may be discarded, or recorded for the sake of
+   completeness (to confirm that the link was tested).
+3. Those Term-Detector responses are then used to perform the Categorization
+   work, which results in records being created in the TermCategory table.
+
+---
+
+# Less "pure" implementation
+```mermaid
+classDiagram
+
+  Term >-- Detection: has many
+  Detection >-- Categorization: based on
+  Category >-- SuggestedResource: belongs to
+  Categorization --> SuggestedResource: looks up
+  Detection --> SuggestedResource: looks up
+  Detection --> Journal: looks up
+  Categorization >-- Validation: subject to
+
+  class Term
+    Term: +Integer id
+    Term: +String phrase
+
+  class SuggestedResource
+    SuggestedResource: +Integer id
+    SuggestedResource: +String title
+    SuggestedResource: +String url
+    SuggestedResource: +String phrase
+    SuggestedResource: +String fingerprint
+    SuggestedResource: +Integer category_id
+    SuggestedResource: calculateFingerprint()
+
+  class Journal
+    Journal: +Integer id
+    Journal: +String title
+
+  class Detection
+    Detection: +Integer id
+    Detection: +Integer term_id
+    Detection: +Integer detector_version
+    Detection: +Boolean DOI
+    Detection: +Boolean ISBN
+    Detection: +Boolean ISSN
+    Detection: +Boolean PMID
+    Detection: +Boolean Journal
+    Detection: +Integer journal_id
+    Detection: +Boolean SuggestedResource
+    Detection: +Integer suggested_resource_id
+    Detection: +Boolean LCSH
+    Detection: +Boolean WebsitePageTitle
+    Detection: hasDOI()
+    Detection: hasISBN()
+    Detection: hasISSN()
+    Detection: hasPMID()
+    Detection: hasJournal()
+    Detection: hasSuggestedResource()
+    Detection: hasLCSH()
+    Detection: hasWebsitePageTitle()
+
+  class Detector
+    Detector: +Integer id
+    Detector: +String name
+    Detector: +Float DOI_Confidence
+
+  class Category
+    Category: +Integer id
+    Category: +String name
+
+  class Categorization
+    Categorization: +Integer id
+    Categorization: +Integer detection_id
+    Categorization: +Float transaction_score
+    Categorization: +Float information_score
+    Categorization: +Float navigation_score
+    Categorization: evaluateTransaction()
+    Categorization: evaluateInformation()
+    Categorization: evaluateNavigation()
+
+  class Validation
+    Validation: +Integer id
+    Validation: +Integer categorization_id
+    Validation: +Boolean approve_transaction
+    Validation: +Boolean approve_information
+    Validation: +Boolean approve_navigation
+    Validation: +Boolean approve_doi
+    Validation: +Boolean approve_isbn
+    Validation: +Boolean approve_issn
+    Validation: +Boolean approve_pmid
+    Validation: +Boolean approve_journal
+    Validation: +Boolean approve_suggested_resource
+    Validation: +Boolean approve_lcsh
+    Validation: +Boolean approve_webpage
+
+  style Term fill:#000,stroke:#ffd407,color:#ffd407
+  style Detector fill:#000,stroke:#ffd407,color:#ffd407
+  style Category fill:#000,stroke:#ffd407,color:#ffd407
+```
+This makes the order of operation a bit more explicit:
+
+1. A new Term is registered.
+2. The Detection table entry for that Term is populated (which allows repeat
+   Detection passes as the detector models change).
+3. The output of various Detection passes (either the most recent for each term,
+   or all detections over time) are processed via code to generate scores for
+   each potential category.