Commit

Cleanup this PR to focus on needed files
matt-bernhardt committed Sep 6, 2024
1 parent a5922d1 commit 1dd3a8d
Showing 11 changed files with 177 additions and 1,314 deletions.
@@ -66,6 +66,7 @@ classDiagram
Detector <-- Link
Category <-- Link
class Term
Term: +Integer id
Term: +String phrase
@@ -84,6 +85,7 @@ classDiagram
Detector: +Integer id
Detector: +String name
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
@@ -115,6 +117,7 @@ classDiagram
Detection --> SuggestedResource: looks up
Detection --> Journal: looks up
class Term
Term: +Integer id
Term: +String phrase
@@ -164,6 +167,7 @@ classDiagram
Categorization: calculateNavigation()
Categorization: calculateTransaction()
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
@@ -211,7 +215,8 @@ classDiagram
Term <-- TermSuggestedResource
TermSuggestedResource --> SuggestedResource
class Term:::primarytable
class Term
Term: +Integer id
Term: +String phrase
Term: categorize()
@@ -258,6 +263,7 @@ classDiagram
TermSuggestedResource: +Integer suggested_resource_id
TermSuggestedResource: +Boolean result
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
@@ -291,12 +297,12 @@ classDiagram
Term --< Detection: has many
class Term
Term: +Integer id
Term: +String phrase
Term: calculateCategory()
class Detection
Detection: +Integer id
Detection: +Integer term_id
@@ -314,10 +320,8 @@ classDiagram
Detection: recordJournals()
Detection: recordSuggestedResource()
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb,stroke-dasharray: 3 5;
```
@@ -349,19 +353,18 @@ classDiagram
Detector "1" --> "0..*" DetectionCategory
class Term
Term: +Integer id
Term: +String phrase
Term: calculateCategory()
class Detection
Detection: +Integer id
Detection: +Integer term_id
Detection: +Integer detector_id
Detection: +Integer detector_version
Detection: +Float confidence
Detection: initialize()
Detection: setDetectionVersion()
Detection: recordDetections()
107 changes: 107 additions & 0 deletions docs/explanation/categorization-workflow.md
@@ -0,0 +1,107 @@
# Categorization workflow

## Conceptual diagram

There are three basic models which we are attempting to relate to each other:
Terms, Detectors, and Categories. The relationship looks like this:

```mermaid
flowchart LR
Terms
Detectors
Categories
Terms -- are evaluated by --> Detectors
Detectors <-- are mapped to --> Categories
Categories -- get linked with --> Terms
```

## Example data

### Terms

| id | phrase |
|----|---------------------------------------|
| 1 | web of science |
| 2 | pitchbook |
| 3 | vaibbhav taraate |
| 4 | doi.org/10.1080/17460441.2022.2084607 |
---

We have received more than 40,000 unique search terms from the Bento system in
the first three months of TACOS' operation.

### Categories

| id | name | note |
|----|---------------|-------------------------------------------------------------------------------------------|
| 1 | Transactional | The user wants to complete an _action_ (e.g. to receive an item) |
| 2 | Navigational | The user wants to reach a _place_ which might be a web page, or perhaps talk to a person. |
| 3 | Informational | The user wants _information_ about an idea or concept. |

Thus far, we have only focused on these three categories of search intent. It
should be noted that the SEO literature references additional categories, such
as "commercial" or "conversational".

Additionally, some of these categories may be sub-divided. Transactional
searches might be looking for a book, a journal article, or a thesis.
Navigational searches might be satisfied by visiting the desired webpage, or
contacting a liaison.

### Detectors

| id | name | note |
|----|--------------------|-----------------|
| 1 | DOI | Regex detection |
| 2 | ISBN | Regex detection |
| 3 | ISSN | Regex detection |
| 4 | PMID | Regex detection |
| 5 | Journal name | Term lookup |
| 6 | Suggested resource | Term lookup |

Our detectors so far fall into one of two broad types: those which use regular expressions to detect patterns within
the search term, and those which check whether the search term appears in an external list of resources.
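
The two detector types can be illustrated with a minimal Python sketch. Nothing here is taken from the application itself: the DOI pattern is an assumed approximation rather than the regex the application uses, and `JOURNAL_NAMES`, `detect_doi()`, and `detect_journal()` are hypothetical names introduced only for illustration.

```python
import re

# Assumed, approximate DOI pattern; not the application's actual regex.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)

# Hypothetical stand-in for an external list of journal names.
JOURNAL_NAMES = {"nature", "science", "the lancet"}


def detect_doi(phrase: str) -> bool:
    """Regex detection: look for a DOI-like pattern anywhere in the phrase."""
    return DOI_PATTERN.search(phrase) is not None


def detect_journal(phrase: str) -> bool:
    """Term lookup: check whether the normalized phrase appears in the list."""
    return phrase.strip().lower() in JOURNAL_NAMES
```

Under these assumptions, the example term `doi.org/10.1080/17460441.2022.2084607` would be caught by the regex detector, while `pitchbook` would match neither function.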

---

## Workflow

Most of the time, this workflow will be followed automatically when a new search phrase is recorded by the application
for the first time. Occasionally, we will re-run this workflow (either manually, or via a schedule) when the application
changes enough that a prior workflow is no longer valid. We determine when prior work is no longer valid by relying on
the `detector_version` value in the Detection model.

When a search phrase is received which has already been categorized, the prior scores are used without needing to follow
this workflow again.
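
The version check described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the application's code: the `Detection` dataclass mirrors the fields shown in the class diagram, while `has_stale_detections()` and the `current_versions` mapping are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Fields mirror the Detection class in the diagram.
    term_id: int
    detector_id: int
    detector_version: int
    confidence: float


def has_stale_detections(
    detections: list[Detection], current_versions: dict[int, int]
) -> bool:
    """Return True if any detection was recorded by an outdated detector.

    `current_versions` is a hypothetical mapping of detector id to the
    version of that detector currently deployed.
    """
    return any(
        d.detector_version != current_versions.get(d.detector_id)
        for d in detections
    )
```

A term whose detections all carry the current version would keep its prior scores; one with a stale detection would be a candidate for re-running the workflow.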

### Pass the term through our suite of detectors

Passing the search phrase through all of our detectors is done via a method like `recordDetections()`, which is part of
the `Detection` model. Should only a subset of detectors need to be consulted, there are internal methods which can
accomplish this.
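
One way to picture running either the full suite or a subset is the Python sketch below. It is illustrative only: `run_detectors()`, the `only` parameter, and the toy detectors are hypothetical and do not correspond to the application's internal methods.

```python
from typing import Callable, Iterable, Optional

Detector = Callable[[str], bool]

# Toy detectors, purely for illustration.
TOY_DETECTORS: dict[str, Detector] = {
    "has_digits": lambda phrase: any(ch.isdigit() for ch in phrase),
    "is_short": lambda phrase: len(phrase.split()) <= 2,
}


def run_detectors(
    phrase: str,
    detectors: dict[str, Detector],
    only: Optional[Iterable[str]] = None,
) -> list[str]:
    """Return the names of the detectors that matched the phrase.

    When `only` is given, just that subset of detectors is consulted.
    """
    names = list(detectors) if only is None else [n for n in only if n in detectors]
    return [name for name in names if detectors[name](phrase)]
```

Calling `run_detectors(phrase, TOY_DETECTORS)` consults every detector, while passing `only=["is_short"]` restricts the run to that one.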

### Calculate the categorization scores based on these detections

The `Term` model has a method which looks up all the Detectors which found a positive result for that term. This
`calculateCategory()` method performs the necessary math to determine the score for each Category in the system, and
creates the needed `Categorization` records. The calculated score is stored in the `confidence` field of this model.

One detector in this application, the SuggestedResource detector, is associated with different categories on a
record-by-record basis. The `calculateCategory()` method includes a lookup for this detector to make sure that any
detections are scored appropriately.
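
The scoring math is not spelled out here, so the following Python sketch is only one possible reading. It assumes that each detector-to-category mapping carries a confidence and that a category's score is the highest confidence among the detectors that matched. The `MAPPINGS` values and the `calculate_scores()` name are hypothetical, and the per-record lookup for the SuggestedResource detector is left out.

```python
# Hypothetical mapping confidences: (detector name, category name) -> confidence.
MAPPINGS = {
    ("DOI", "Transactional"): 0.95,
    ("ISSN", "Transactional"): 0.6,
    ("Journal name", "Navigational"): 0.5,
    ("Journal name", "Transactional"): 0.3,
}


def calculate_scores(matched_detectors, mappings=MAPPINGS):
    """Score each category reachable from the detectors that matched.

    Assumption: a category's score is the maximum mapping confidence
    among the matched detectors. The real calculation may differ.
    """
    scores = {}
    for (detector, category), confidence in mappings.items():
        if detector in matched_detectors:
            scores[category] = max(scores.get(category, 0.0), confidence)
    return scores
```

Other combination rules, such as summing or multiplying confidences, would fit the same structure; the maximum is used here only because it is the simplest to follow.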

### Human validation of these operations

There will be an ability for humans to inspect these operations, and to submit feedback about any actions which were
not correct. These validations will be used to further refine the confidence values associated with our `Detector` and
`DetectionCategory` records, as well as to refine the operation of the detectors, or the mappings between these
elements.

This validation workflow has not been defined yet, nor has the data model been expanded to support this feedback. We do
anticipate, however, that successful or unsuccessful validations would end up adjusting the relevant confidence values
via the `incrementConfidence()` or `decrementConfidence()` methods.

---

Further discussion of this design can be found in the [Classes diagram](../reference/classes.md).
161 changes: 0 additions & 161 deletions docs/explanation/validation-workflow-a.md

This file was deleted.
