Write up documentation for workflow so far
Update validation workflow doc

Update workflow explanation

This adds the Ruby code block to categorize all terms

Separate prototype data model document

This needs to be cleaned up, along with classes.md

Further documentation work

Updates to documentation
matt-bernhardt committed Aug 30, 2024
1 parent 6a6bb4a commit bba7881
Showing 6 changed files with 686 additions and 224 deletions.
161 changes: 161 additions & 0 deletions docs/explanation/validation-workflow-a.md
@@ -0,0 +1,161 @@
# The categorization and validation workflow

This document describes the workflow for categorizing, and then validating, how
a given term has been processed by TACOS.

## Preparation

Pick what record we're working with. In production, this would happen as new
terms are recorded, but for now we're working with a randomly chosen example.

```ruby
t = Term.all.sample
```

## Pass the term through our suite of detectors

This assumes that all of our detection algorithms are integrated with the
Detection model, which creates a record of their output for processing during
the Categorization phase.

```ruby
d = Detection.new(t)
d.save
```

So far, the Detection model only records whether each detector was activated,
as boolean values. Future development might add more details, such as which
records are matched, or what external lookups return. It might also be relevant
to note whether multiple patterns are found.

```ruby
irb(main):013> d
=>
#<Detection:0x0000000122606878
id: 5,
term_id: 53558,
detection_version: 1,
doi: false,
isbn: false,
issn: false,
pmid: false,
journal: false,
suggestedresource: false,
created_at: Fri, 23 Aug 2024 13:38:21.631333000 UTC +00:00,
updated_at: Fri, 23 Aug 2024 13:38:21.631333000 UTC +00:00>
```

In this example, none of the detectors found anything.
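
As a rough illustration of how boolean activations like these could be produced, the sketch below runs a phrase against a few regular expressions. The `PATTERNS` constant, the `detect_patterns` helper, and the expressions themselves are assumptions made for this example; they are not the detectors TACOS actually uses.

```ruby
# Hypothetical patterns, for illustration only; the real detectors may differ.
PATTERNS = {
  doi:  /\b10\.\d{4,9}\/[-._;()\/:A-Za-z0-9]+/,
  issn: /\b\d{4}-\d{3}[\dXx]\b/,
  pmid: /\bPMID:?\s*\d{1,8}\b/i
}.freeze

# Returns one boolean per pattern, mirroring the boolean columns on Detection.
def detect_patterns(phrase)
  PATTERNS.transform_values { |pattern| phrase.match?(pattern) }
end

# The example term from this document activates none of these patterns.
p detect_patterns("Darfur: A Short History of a Long War ")
```

A phrase such as `"doi:10.1000/xyz123"` would set only the `doi` value to `true` under these assumed patterns.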

The `detection_version` value in these records is read from a value stored in
ENV, which is incremented as our detection algorithms change. This helps
identify whether a Detection is outdated and needs to be refreshed.
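
A minimal sketch of that staleness check follows. The environment variable name `DETECTION_VERSION`, the `DetectionStub` struct, and the `outdated?` helper are all assumptions made so the example runs outside of Rails; the source only states that the version lives in ENV.

```ruby
# Assumed variable name; the source only says the value is stored in ENV.
def current_detection_version
  ENV.fetch("DETECTION_VERSION", "1").to_i
end

# Stand-in for a Detection record, so the sketch runs without Rails.
DetectionStub = Struct.new(:term_id, :detection_version)

# A detection is outdated when it was created under an older version.
def outdated?(detection)
  detection.detection_version < current_detection_version
end

ENV["DETECTION_VERSION"] = "2"
puts outdated?(DetectionStub.new(53558, 1))
```

With a check like this, a refresh job could select only the terms whose latest detection predates the current algorithms.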

## Generate the Categorization values based on these detections

```ruby
c = Categorization.new(d)
c.save
```

The creation of the record includes the calculation of scores for each of the
three categories. To this point, the logic is exceedingly simple, but this can
be made more nuanced with time.

```ruby
irb(main):019> c
=>
#<Categorization:0x0000000117c3a920
id: 2,
detection_id: 5,
transaction_score: 0.0,
information_score: 0.0,
navigation_score: 0.0,
created_at: Fri, 23 Aug 2024 13:43:17.640485000 UTC +00:00,
updated_at: Fri, 23 Aug 2024 13:43:17.640485000 UTC +00:00>
```

These scores are used by the `evaluate` method to assign the term to a category,
if relevant. Because none of the detectors fired in the previous step, all of
the category scores are 0.0 and the term will be placed in the "unknown"
category.

```ruby
t.category = c.evaluate
t.save
```

There is also an `assign` method at the moment, which combines the above steps.
This may not make sense in production, however.

The result of the Categorization workflow is that the original Term record now
has been placed in a category:

```ruby
irb(main):008> t
=>
#<Term:0x00000001073c56d8
id: 53558,
phrase: "Darfur: A Short History of a Long War ",
created_at: Tue, 20 Aug 2024 13:26:23.628215000 UTC +00:00,
updated_at: Tue, 20 Aug 2024 13:26:23.628215000 UTC +00:00,
category: "unknown">
```

From end to end, the code to categorize all untouched term records is then this:

```ruby
Term.where(category: nil).each do |t|
  d = Detection.new(t)
  d.save
  c = Categorization.new(d)
  c.assign
end
```

## Validation

Humans will be asked to inspect the outcomes of the previous steps, and provide
feedback about whether any decisions were made incorrectly.

```ruby
v = Validation.new(c)
v.save
```

Validation records have a boolean flag for each decision which went into the
process thus far:

```ruby
irb(main):011> v
=>
#<Validation:0x0000000116296870
id: 1,
categorization_id: 3,
valid_category: nil,
valid_transaction: nil,
valid_information: nil,
valid_navigation: nil,
valid_doi: nil,
valid_isbn: nil,
valid_issn: nil,
valid_pmid: nil,
valid_journal: nil,
valid_suggested_resource: nil,
flag_term: nil,
created_at: Fri, 23 Aug 2024 14:57:09.627620000 UTC +00:00,
updated_at: Fri, 23 Aug 2024 14:57:09.627620000 UTC +00:00>
```

This includes a flag for the final result, each component score, each individual
detection, and a final flag that indicates the Term itself needs review. The
intent of this final flag is for the case where a search term is somehow
problematic and needs to be expunged.
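
Because each flag starts as `nil`, a record can distinguish "not yet reviewed" from an explicit `false`. The sketch below shows one way that distinction might be used. The `ValidationStub` struct and the `unreviewed` and `rejected` helpers are hypothetical and do not exist on the model; the field names follow the console output above.

```ruby
# Decision flags, named as in the console output above.
FLAGS = %i[valid_category valid_transaction valid_information valid_navigation
           valid_doi valid_isbn valid_issn valid_pmid valid_journal
           valid_suggested_resource].freeze

# Stand-in for a Validation record, so the sketch runs without Rails.
ValidationStub = Struct.new(*FLAGS, :flag_term, keyword_init: true)

# Decisions the validator has not answered yet (still nil).
def unreviewed(validation)
  FLAGS.select { |flag| validation[flag].nil? }
end

# Decisions the validator explicitly marked as incorrect (false, not nil).
def rejected(validation)
  FLAGS.select { |flag| validation[flag] == false }
end

v = ValidationStub.new(valid_category: false, valid_doi: true)
p rejected(v)
p unreviewed(v).size
```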

There are no methods yet on this model, because all values are meant to be set
individually via the web interface.

There is not - yet - a notes field on the Validation model, but this is
something that we've discussed in case the validator has more detailed feedback
about some part of the decision-making that is being reviewed.

3 changes: 3 additions & 0 deletions docs/explanation/validation-workflow-b.md
@@ -0,0 +1,3 @@
# The categorization and validation workflow

Need to write up how Prototype B would operate from start to end...
226 changes: 226 additions & 0 deletions docs/reference/classes-prototype-a.md
@@ -0,0 +1,226 @@
# Prototype A ("Code")

This prototype relies on fewer tables, with one record in each, and leans more heavily on behavior in code.

## Shared preface

The same color scheme is used for both prototypes:

* <font style="color:#66c2a5">Terms</font>, which flow in continuously with Search Events;
* A <font style="color:#fc8d62">knowledge graph</font>, which includes the categories, detectors, and relationships
between the two which TACOS defines and maintains, and which is consulted during categorization; and
* The <font style="color:#8da0cb">linkages between these terms and the graph</font>, which record which signals are
detected in each term, and how those signals are interpreted to place the term into a category.

A simple way to describe the Categorization workflow would be to say that Categorization involves populating the blue
tables in the diagrams below.

## Categorization

```mermaid
classDiagram
direction LR
Term --< Detection: has many
Detection <-- Categorization: based on
Categorization --> SuggestedResource: looks up
Detection --> SuggestedResource: looks up
Detection --> Journal: looks up
class Term
Term: +Integer id
Term: +String phrase
Term: +Enum category
class SuggestedResource
SuggestedResource: +Integer id
SuggestedResource: +String title
SuggestedResource: +String url
SuggestedResource: +String phrase
SuggestedResource: +String fingerprint
SuggestedResource: +Enum category
SuggestedResource: calculateFingerprint()
class Journal
Journal: +Integer id
Journal: +String title
class Detection
Detection: +Integer id
Detection: +Integer term_id
Detection: +Integer detection_version
Detection: +Boolean DOI
Detection: +Boolean ISBN
Detection: +Boolean ISSN
Detection: +Boolean PMID
Detection: +Boolean Journal
Detection: +Boolean SuggestedResource
Detection: initialize()
Detection: setDetectionVersion()
Detection: recordDetections()
Detection: recordPatterns()
Detection: recordJournals()
Detection: recordSuggestedResource()
class Categorization
Categorization: +Integer id
Categorization: +Integer detection_id
Categorization: +Float information_score
Categorization: +Float navigation_score
Categorization: +Float transaction_score
Categorization: initialize()
Categorization: assign()
Categorization: evaluate()
Categorization: calculateAll()
Categorization: calculateInformation()
Categorization: calculateNavigation()
Categorization: calculateTransaction()
style Term fill:#000,stroke:#66c2a5,color:#66c2a5
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62
style Journal fill:#000,stroke:#fc8d62,color:#fc8d62
style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb
style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb
```

### Order of operations

1. A new `Term` is registered.
2. A `Detection` record for that `Term` is created (which allows repeat detection operations as TACOS gains new
capabilities).
3. The various `Detection` records (either the most recent for each term, or all detections over time) are processed via
code to generate scores for each potential category. These results are stored as `Categorization` records.
4. The three category scores are compared, and the one with the highest score is stored back in the `Term` record.

### Category values

There is no `Category` table, but two models have separate enumerated fields. The `Detector::SuggestedResource` model
has three possible values (Informational, Navigational, and Transactional), while the `Term` model has an additional
value ("Unknown") which is assigned during Categorization if two category scores are equal.

(This lack of a category table is not a fundamental aspect of this prototype, but it does indicate the general choice to
rely on code, rather than database records, as much as possible. Such a model could be accommodated, or implemented via
a shared helper method perhaps.)
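
The comparison of the three scores, including the fallback to "unknown", could look something like the following sketch. It assumes that "unknown" applies whenever the highest score is shared by more than one category (which covers the case where every score is 0.0). The `CategorizationStub` struct is a stand-in for the real model, and this `evaluate` body is a guess at the behavior rather than the prototype's actual code.

```ruby
# Stand-in for a Categorization record, so the sketch runs without Rails.
CategorizationStub = Struct.new(:information_score, :navigation_score, :transaction_score) do
  # Returns the category with the single highest score, or "unknown" on a tie.
  def evaluate
    scores = {
      "informational" => information_score,
      "navigational"  => navigation_score,
      "transactional" => transaction_score
    }
    top = scores.values.max
    winners = scores.select { |_category, score| score == top }.keys
    winners.size == 1 ? winners.first : "unknown"
  end
end

puts CategorizationStub.new(0.0, 0.0, 0.0).evaluate
puts CategorizationStub.new(0.0, 0.0, 1.0).evaluate
```

Keeping the category names in a single hash like this is one way the shared helper mentioned above might avoid duplicating the enumerated values.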

### Calculating the category scores

At the moment, category scores are assigned in methods like:

```ruby
# FILE: app/models/categorization.rb
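def calculate_transactional
  # Start from zero, then jump to the maximum score if any signal fired.
  self.transaction_score = 0.0
  self.transaction_score = 1.0 if %i[doi isbn issn pmid journal].any? { |signal| detection[signal] }

  # A suggested resource categorized as transactional also yields the maximum score.
  match = Detector::SuggestedResource.full_term_match(detection.term.phrase).first
  self.transaction_score = 1.0 if match&.category == 'transactional'
end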
```

This is effectively an "all or nothing" approach, where any detection at all results in the maximum possible score. This
lacks nuance, and we've talked about ways to include a confidence value in these calculations. So far, however, this
prototype has not attempted to include that feature.

**Note:** I've tried to anticipate how to include confidence values appropriately in this prototype, and it is not at
all clear how that might happen. This gets to the mathematical operations involved in calculating the category scores,
which might need to be documented separately.
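
Purely as an illustration of one direction this could take, and not something the prototype does or has decided on, the sketch below combines per-signal confidences so that each additional signal raises the score without exceeding 1.0 (sometimes called a "noisy-OR" combination). The `CONFIDENCE` values and the `transaction_score` helper are invented for this example.

```ruby
# Hypothetical confidences per signal; none of these numbers come from TACOS.
CONFIDENCE = { doi: 0.95, isbn: 0.8, issn: 0.6, pmid: 0.9, journal: 0.5 }.freeze

# Combines the confidences of the signals that fired.
# With no signals the score is 0.0; more signals move it closer to 1.0.
def transaction_score(fired_signals)
  miss = fired_signals.reduce(1.0) do |acc, signal|
    acc * (1.0 - CONFIDENCE.fetch(signal))
  end
  1.0 - miss
end

puts transaction_score([])
puts transaction_score(%i[issn journal])
```

Whether multiplication is the right operation, and where the individual confidence values would come from, are exactly the open questions raised in the note above.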

## Validations

```mermaid
classDiagram
direction LR
Term --< Detection: has many
Detection <-- Categorization: based on
Categorization --> SuggestedResource: looks up
Detection --> SuggestedResource: looks up
Detection --> Journal: looks up
Categorization >-- Validation: subject to
class Term
Term: +Integer id
Term: +String phrase
Term: +Enum category
class SuggestedResource
SuggestedResource: +Integer id
SuggestedResource: +String title
SuggestedResource: +String url
SuggestedResource: +String phrase
SuggestedResource: +String fingerprint
SuggestedResource: +Enum category
SuggestedResource: calculateFingerprint()
class Journal
Journal: +Integer id
Journal: +String title
class Detection
Detection: +Integer id
Detection: +Integer term_id
Detection: +Integer detection_version
Detection: +Boolean DOI
Detection: +Boolean ISBN
Detection: +Boolean ISSN
Detection: +Boolean PMID
Detection: +Boolean Journal
Detection: +Boolean SuggestedResource
Detection: initialize()
Detection: setDetectionVersion()
Detection: recordDetections()
Detection: recordPatterns()
Detection: recordJournals()
Detection: recordSuggestedResource()
class Categorization
Categorization: +Integer id
Categorization: +Integer detection_id
Categorization: +Float information_score
Categorization: +Float navigation_score
Categorization: +Float transaction_score
Categorization: initialize()
Categorization: assign()
Categorization: evaluate()
Categorization: calculateAll()
Categorization: calculateInformation()
Categorization: calculateNavigation()
Categorization: calculateTransaction()
class Validation
Validation: +Integer id
Validation: +Integer categorization_id
Validation: +Integer user_id
Validation: +Boolean approve_transaction
Validation: +Boolean approve_information
Validation: +Boolean approve_navigation
Validation: +Boolean approve_doi
Validation: +Boolean approve_isbn
Validation: +Boolean approve_issn
Validation: +Boolean approve_pmid
Validation: +Boolean approve_journal
Validation: +Boolean approve_suggested_resource
style Term fill:#000,stroke:#66c2a5,color:#66c2a5
style Category fill:#000,stroke:#fc8d62,color:#fc8d62
style Detector fill:#000,stroke:#fc8d62,color:#fc8d62
style Journal fill:#000,stroke:#fc8d62,color:#fc8d62
style SuggestedResource fill:#000,stroke:#fc8d62,color:#fc8d62
style Detection fill:#000,stroke:#8da0cb,color:#8da0cb
style Categorization fill:#000,stroke:#8da0cb,color:#8da0cb
style Validation fill:#000,stroke:#ffd407,color:#ffd407
```

Validations, in this prototype, are collected in a single table with a field for each decision which came before it. As
the application expands, any new detector or category would result in new fields, both in the Detection or
Categorization model and in the Validation model.

The `user_id` field makes multiple validations possible for a single Categorization decision, so that several users can
provide feedback if bandwidth allows.