[blocked: UI decisions, testing capacity] modify interactions with workflow service #1363

Open
jmartin-sul opened this issue Feb 11, 2020 · 5 comments

@jmartin-sul
Member

jmartin-sul commented Feb 11, 2020

blocked till we answer a design question

we want to stop using fake workflows as an error reporting mechanism for preservation catalog.

TBD: where will the errors be exposed? @andrewjbtw currently tries to watch the preservation errors via argo. the errors are exposed in argo because workflow information in general is indexed (including preservationAuditWF). andrew would prefer if we kept some easy way to view these errors in argo. we have a meeting this afternoon with andrew and @astridu to discuss the desired high-level usage, and then we can figure out a replacement.

two ideas so far:

  1. expose audit status via pres cat REST endpoint, have dor_indexing_app index that info (similar to the way it reaches out to WFS to index workflow info). expose the indexed fields in argo faceting and search.
  2. have a pres cat page that andrew and other users can visit, and have that page list currently known audit errors.

as with the current workflow solution, what will be reported/exposed is the result of the most recently run audit. audits can be expensive for large objects, and so we don't run them on demand from a web UI. at present, if someone really wants audit results on demand, they can synchronously run an audit on a given object from rails console. frequency and triggering mechanism for the audit code are outside the scope of this ticket (and this work cycle).
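a rough sketch of what idea 1 could return, assuming only the most recently run audit is exposed per object. this is python for illustration only (not pres cat code); `AuditResult`, `latest_audit_payload`, the field names and the example druid are all made up here, and `invalid_checksum` is the only status name taken from this thread.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditResult:
    """Hypothetical record of one audit run for one object."""
    druid: str
    status: str            # e.g. "ok" or "invalid_checksum"
    checked_at: datetime
    message: str = ""


def latest_audit_payload(results, druid):
    """Return a JSON-serializable dict for the most recent audit of `druid`.

    Returns None when no audit has been recorded for the object.
    No audit is triggered here; only stored results are consulted.
    """
    matching = [r for r in results if r.druid == druid]
    if not matching:
        return None
    latest = max(matching, key=lambda r: r.checked_at)
    return {
        "druid": latest.druid,
        "status": latest.status,
        "checked_at": latest.checked_at.isoformat(),
        "message": latest.message,
    }


# Example data (made-up druid and timestamps)
EXAMPLE_RESULTS = [
    AuditResult("druid:bb123cd4567", "ok",
                datetime(2020, 2, 1, tzinfo=timezone.utc)),
    AuditResult("druid:bb123cd4567", "invalid_checksum",
                datetime(2020, 2, 10, tzinfo=timezone.utc),
                "checksum mismatch"),
]

if __name__ == "__main__":
    print(json.dumps(latest_audit_payload(EXAMPLE_RESULTS, "druid:bb123cd4567")))
```

an indexer (or a pres cat page, per idea 2) could consume the same payload, so the two ideas would not need different data.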

@jmartin-sul jmartin-sul changed the title stop sending events to preservationAuditWF [blocked] stop sending events to preservationAuditWF Feb 11, 2020
@mjgiarlo
Member

@jmartin-sul intentionally moved to out-of-scope? cc: @jcoyne

@jmartin-sul
Member Author

@mjgiarlo yeah, i need to leave some notes from this afternoon's meeting, but the short answer is that @andrewjbtw thinks we want to block versioning progress if there are preservation issues. we get that for free with the current setup, and i thought that refactoring to still use preservationAuditWF for gating that (while not using it for reporting) might be more work on that front than we have capacity for in this work cycle.

but i think that warrants some discussion as a team (including whether the gating is desirable, though that's more of a direct/repo manager decision than a developer decision, i'd think). so, moving to out of scope was likely presumptuous of me (even if that ends up being the decision in the end).

will leave notes this afternoon, and maybe we can make this a discussion topic after standup tomorrow?

@mjgiarlo
Member

@jmartin-sul sounds like a plan. thank you!

@jmartin-sul jmartin-sul changed the title [blocked] stop sending events to preservationAuditWF [blocked] modify interactions with workflow service [was: stop sending events to preservationAuditWF] Feb 12, 2020
@ndushay
Contributor

ndushay commented Feb 13, 2020

basic write up from Tues afternoon meeting with @jmartin-sul, @andrewjbtw and myself:

Preservation Audit Reporting Plan

Requirements:

(per Andrew)

  • (A) can view details of existing preservation audit errors for a particular object in Argo
    • currently done with WF error details
  • (B) audit errors on Moabs (not replicated objects) block new versions from being created
    • workflow errors currently accomplish this
    • it is a happy accident that currently, there are no audit errors reported to WF for replicated objects [jmartin-sul: and we'd like to re-enable replication error reporting, but messages were overrunning old WFS field limits]
  • (C) notifications of new audit errors
    • currently done via WF and honeybadger
  • (D) an easy way to monitor preservation errors overall (overall count, possibly broken down by type)
    • workflow errors currently accomplish this
    • currently (and generally) we only have invalid_checksum errors

Nice to Have

  • (E) a way to be able to interact with the objects that have a particular error
    • currently (and generally) we only have invalid_checksum errors
  • (F) surfacing info in Argo is nice as a "one-stop-shop"
    • currently accomplished via WF

(per Justin Coyne)

  • (G) avoid adding to Argo Solr index as it's already bloated and slow

Short Term Plan:

1. send results of prescat audits to event service. (issue #1357)

WHY:

  • Use an appropriate service for recording results over time
  • Historical record of prescat audits available for an object
  • The event service will soon allow events to be displayed for an individual Argo object (this is WIP.)

In the future, this will address requirement (A).
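To illustrate why an event log suits "results over time", here is a minimal in-memory sketch in Python. It is not the event service's API: `InMemoryEventLog`, its methods, the event type string and the payload fields are all hypothetical, chosen only to show an append-only history that can be read back per object.

```python
from collections import defaultdict
from datetime import datetime, timezone


class InMemoryEventLog:
    """Hypothetical stand-in for an event service: append-only, keyed by druid."""

    def __init__(self):
        self._events = defaultdict(list)

    def record(self, druid, event_type, data, timestamp=None):
        """Append one event for `druid` and return it."""
        event = {
            "druid": druid,
            "type": event_type,
            "data": dict(data),  # copy so later caller mutations don't leak in
            "timestamp": timestamp or datetime.now(timezone.utc),
        }
        self._events[druid].append(event)
        return event

    def history_for(self, druid):
        """Return all events for `druid`, oldest first."""
        return sorted(self._events.get(druid, []), key=lambda e: e["timestamp"])


def report_audit(log, druid, status, timestamp=None):
    """Record one audit outcome as an event (event type name is an assumption)."""
    return log.record(druid, "preservation_audit", {"status": status}, timestamp)
```

Because nothing is overwritten, both the current state (the last event) and the history for requirement (A) come from the same data.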

2. keep reporting to preservation audit WF

WHY:

  • it blocks a new version from being opened on an object when there is a preservation audit error

This currently addresses requirements (B), (D), (E) and (F)

BEWARE:

  • if we expand auditing of replicated copies, we do NOT want audit errors to block new versions of an object ... only online Moab errors should block new versions of objects.
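The gating rule in this warning can be sketched as a small predicate. This is illustrative Python, not prescat or workflow service code; `AuditError`, the `source` labels and `blocks_new_version` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical labels for where an audited copy lives
ONLINE_MOAB = "online_moab"
REPLICATED_COPY = "replicated_copy"


@dataclass(frozen=True)
class AuditError:
    """Hypothetical record of one outstanding audit error."""
    druid: str
    status: str   # e.g. "invalid_checksum"
    source: str   # ONLINE_MOAB or REPLICATED_COPY


def blocks_new_version(errors, druid):
    """True only if `druid` has at least one error on its online Moab.

    Errors on replicated copies are still reported elsewhere,
    but they never gate versioning.
    """
    return any(e.druid == druid and e.source == ONLINE_MOAB for e in errors)
```

Keeping the check in one place would let replication error reporting be re-enabled without changing what blocks versioning.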

Future Plans

  • (A)

    • event details will have this information available.
    • if event history display isn't sufficient, Argo code could surface a current error outside of the event history display
  • (B)

    • Could prescat audit send a blocking status to the versioning workflow or versioning service to address this? We could ensure via prescat code that we only block for audit errors for CompleteMoab objects.
  • (C)

    • Could accomplish notifications via emails (individual or aggregated), or Honeybadger, or ...
  • (D), (E), (F)

    • We could set up a way to query the event service or prescat db (see issue #1320, "Web Interface to Show Audit Results (per root? per druid?)") to get aggregated info on status errors, including druids. Could this be displayed in Argo?
    • We could add audit status field to Argo Solr - it would effectively be an enum field with the specific statuses enumerated in CompleteMoab model. This could be used as a facet.
    • Monitoring aggregate audit error stats doesn't have to be in Argo necessarily, but Andrew wants a way to continue monitoring if audit errors are increasing or decreasing ...
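For (D) and (E), the aggregation itself is small wherever the data ends up living. A Python sketch, assuming a mapping from druid to its most recent audit status is available; the function names and the "ok" status label are assumptions for illustration, and `invalid_checksum` is the only status name taken from this thread.

```python
from collections import Counter


def error_counts_by_status(status_by_druid, ok_status="ok"):
    """Count objects per non-ok audit status (requirement D)."""
    return Counter(s for s in status_by_druid.values() if s != ok_status)


def druids_with_status(status_by_druid, status):
    """List the druids currently in a given status, sorted (requirement E)."""
    return sorted(d for d, s in status_by_druid.items() if s == status)
```

Comparing the counts from two runs would show whether audit errors are increasing or decreasing, without requiring the numbers to live in Argo.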

@jmartin-sul
Member Author

> Could prescat audit send a blocking status to the versioning workflow or versioning service to address this? We could ensure via prescat code that we only block for audit errors for CompleteMoab objects.

i think this would not be that hard... but given that we feel pressed for time in this work cycle, and given that it's not strictly necessary for the storage migration, we've decided to defer this work for now (probably as much for the testing effort as anything).

@jmartin-sul jmartin-sul changed the title [blocked] modify interactions with workflow service [was: stop sending events to preservationAuditWF] [blocked, UI decisions] modify interactions with workflow service Feb 13, 2020
@jmartin-sul jmartin-sul changed the title [blocked, UI decisions] modify interactions with workflow service [blocked: UI decisions, testing capacity] modify interactions with workflow service Feb 13, 2020

3 participants