[blocked: UI decisions, testing capacity] modify interactions with workflow service #1363

jmartin-sul · 2020-02-11T19:49:17Z

blocked till we answer a design question

we want to stop using fake workflows as an error reporting mechanism for preservation catalog.

TBD: where will the errors be exposed? @andrewjbtw currently tries to watch the preservation errors via argo. the errors are exposed in argo because workflow information in general is indexed (including preservationAuditWF). andrew would prefer if we kept some easy way to view these errors in argo. we have a meeting this afternoon with andrew and @astridu to discuss the desired high-level usage, and then we can figure out a replacement.

two ideas so far:

- expose audit status via a pres cat REST endpoint, and have dor_indexing_app index that info (similar to the way it reaches out to WFS to index workflow info). expose the indexed fields in argo faceting and search.
- have a pres cat page that andrew and other users can visit, and have that page list currently known audit errors.

as with the current workflow solution, what will be reported/exposed is the result of the most recently run audit. audits can be expensive for large objects, and so we don't run them on demand from a web UI. at present, if someone really wants audit results on demand, they can synchronously run an audit on a given object from rails console. frequency and triggering mechanism for the audit code are outside the scope of this ticket (and this work cycle).


mjgiarlo · 2020-02-12T00:26:26Z

@jmartin-sul intentionally moved to out-of-scope? cc: @jcoyne

jmartin-sul · 2020-02-12T00:58:03Z

@mjgiarlo yeah, i need to leave some notes from this afternoon's meeting, but the short answer is that @andrewjbtw thinks we want to block versioning progress if there are preservation issues. we get that for free with the current setup, and i thought that refactoring to still use preservationAuditWF for gating that (while not using it for reporting) might be more work on that front than we have capacity for in this work cycle.

but i think that warrants some discussion as a team (including whether the gating is desirable, though that's more of a direct/repo manager decision than a developer decision, i'd think). so, moving to out of scope was likely presumptuous of me (even if that ends up being the decision in the end).

will leave notes this afternoon, and maybe we can make this a discussion topic after standup tomorrow?

mjgiarlo · 2020-02-12T00:59:29Z

@jmartin-sul sounds like a plan. thank you!

ndushay · 2020-02-13T22:56:11Z

basic write up from Tues afternoon meeting with @jmartin-sul, @andrewjbtw and myself:

Preservation Audit Reporting Plan

Requirements:

(per Andrew)

(A) can view details of existing preservation audit errors for a particular object in Argo
- currently done with WF error details
(B) audit errors on Moabs (not replicated objects) block new versions from being created
- workflow errors currently accomplish this
- it is a happy accident that currently, there are no audit errors reported to WF for replicated objects [jmartin-sul: and we'd like to re-enable replication error reporting, but messages were overrunning old WFS field limits]
(C) notifications of new audit errors
- currently done via WF and honeybadger
(D) an easy way to monitor preservation errors overall (overall count, possibly broken down by type)
- workflow errors currently accomplish this
- currently (and generally) we only have invalid_checksum errors

Nice to Have

(E) a way to be able to interact with the objects that have a particular error
- currently (and generally) we only have invalid_checksum errors
(F) surfacing info in Argo is nice as a "one-stop-shop"
- currently accomplished via WF

(per Justin Coyne)

(G) avoid adding to Argo Solr index as it's already bloated and slow

Short Term Plan:

1. send results of prescat audits to event service. (issue #1357)

WHY:

Use an appropriate service for recording results over time
Historical record of prescat audits available for an object
The event service will soon allow events to be displayed for an individual Argo object (this is WIP.)

In the future, this will address requirement (A).

2. keep reporting to preservation audit WF

WHY:

it blocks a new version from being opened on an object when there is a preservation audit error

This currently addresses requirements (B), (D), (E) and (F)

BEWARE:

if we expand auditing of replicated copies, we do NOT want audit errors to block new versions of an object ... only online Moab errors should block new versions of objects.
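
A minimal sketch of the gating rule in the BEWARE note, in plain Ruby. `AuditResult`, `copy_type`, the `:online_moab` and `:replicated` symbols, and the `'ok'` status are hypothetical names for illustration, not prescat's actual model or API; only `invalid_checksum` comes from the notes above.

```ruby
# Hypothetical record of one audit result; field names are illustrative only.
AuditResult = Struct.new(:druid, :copy_type, :status, keyword_init: true)

# Statuses treated as "no error" in this sketch (an assumption).
OK_STATUSES = %w[ok].freeze

# True only when an online Moab copy has an audit error.
# Errors on replicated copies never block, per the BEWARE note.
def blocks_new_version?(result)
  result.copy_type == :online_moab && !OK_STATUSES.include?(result.status)
end

# An object is blocked if any of its audit results is a blocking one.
def version_blocked?(results)
  results.any? { |result| blocks_new_version?(result) }
end
```

With this shape, an `invalid_checksum` result on a `:replicated` copy leaves `version_blocked?` false, while the same status on an `:online_moab` copy makes it true.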

Future Plans

(A)
- event details will have this information available.
- if event history display isn't sufficient, Argo code could surface a current error outside of the event history display
(B)
- Could prescat audit send a blocking status to the versioning workflow or versioning service to address this? We could ensure via prescat code that we only block for audit errors for CompleteMoab objects.
(C)
- Could accomplish notifications via emails (individual or aggregated), or Honeybadger, or ...
(D), (E), (F)
- We could set up a way to query the event service or prescat db (See issue Web Interface to Show Audit Results (per root? per druid?) #1320) to get aggregated info on status errors, including druids? Could display in Argo??
- We could add audit status field to Argo Solr - it would effectively be an enum field with the specific statuses enumerated in CompleteMoab model. This could be used as a facet.
- Monitoring aggregate audit error stats doesn't have to be in Argo necessarily, but Andrew wants a way to continue monitoring if audit errors are increasing or decreasing ...
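
For (D), a rough sketch of the aggregation itself, independent of where the statuses come from (event service, prescat db, or elsewhere). It assumes audit statuses arrive as plain strings and that `'ok'` means no error; `error_counts_by_status` and `total_errors` are hypothetical helper names, and `hypothetical_other_error` in the usage line is a placeholder, not a real status.

```ruby
# Count audit errors per status, skipping the assumed 'ok' status.
def error_counts_by_status(statuses)
  statuses.each_with_object(Hash.new(0)) do |status, counts|
    counts[status] += 1 unless status == 'ok'
  end
end

# Overall error count, for tracking whether errors are increasing or decreasing.
def total_errors(statuses)
  error_counts_by_status(statuses).values.sum
end
```

Usage might look like `error_counts_by_status(%w[ok invalid_checksum hypothetical_other_error])`, with the resulting hash feeding a facet, a report page, or a periodic email.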

jmartin-sul · 2020-02-13T23:23:20Z

> Could prescat audit send a blocking status to the versioning workflow or versioning service to address this? We could ensure via prescat code that we only block for audit errors for CompleteMoab objects.

i think this would not be that hard... but given that we feel pressed for time in this work cycle, and given that it's not strictly necessary for the storage migration, we've decided to defer this work for now (probably as much for the testing effort as anything).

jmartin-sul mentioned this issue Feb 11, 2020

Send preservationAuditWF events to the event service #1357

Closed

jmartin-sul added the question label Feb 11, 2020

jmartin-sul changed the title ~~stop sending events to preservationAuditWF~~ [blocked] stop sending events to preservationAuditWF Feb 11, 2020

jmartin-sul changed the title ~~[blocked] stop sending events to preservationAuditWF~~ [blocked] modify interactions with workflow service [was: stop sending events to preservationAuditWF] Feb 12, 2020

jmartin-sul changed the title ~~[blocked] modify interactions with workflow service [was: stop sending events to preservationAuditWF]~~ [blocked, UI decisions] modify interactions with workflow service Feb 13, 2020

jmartin-sul changed the title ~~[blocked, UI decisions] modify interactions with workflow service~~ [blocked: UI decisions, testing capacity] modify interactions with workflow service Feb 13, 2020

jmartin-sul mentioned this issue Apr 27, 2020

workflow service expects a version, but isn't getting one from pres cat #1515

Closed
