
Create anomaly detection workflow #3340

Open · 1 of 3 tasks

pabloarosado opened this issue Sep 30, 2024 · 7 comments

@pabloarosado
Contributor

pabloarosado commented Sep 30, 2024

Summary

We should perform better data quality checks, and ideally have tools to help us identify data anomalies, as part of the normal flow of our data work.

Problem

We currently perform sanity checks via assertions, with ad-hoc code outside of ETL, or via visual inspection in the Indicator Upgrader explorer tool or chart diff. But many data issues still remain in our data, and users often point them out.

In most cases, the issues were "not our fault", since they were already in the original data. However, we should, at least, be aware of these issues, contact data providers early on, and fix them when possible.

Impact

These data issues can lead to the following undesired outcomes:

  • It's time-consuming for data managers to add ad-hoc sanity checks
  • Users become aware of data quality issues that we didn't spot first, which erodes trust
  • We miss spurious data points that we might otherwise have removed, meaning spurious trends may be picked up and amplified by our users

Scope

List of PRs

Open issues and enhancements

We were discussing issues on Slack, but from now on (since we'll have to wrap up next week), we can keep adding issues here:
#3436

@lucasrodes
Member

Notes taken during our workshop at the offsite

[image: workshop notes]

@Marigold
Collaborator

Marigold commented Oct 1, 2024

Dataset Quality Checker Proposal

This is a rough idea for a basic dataset quality checker. The goal is to see how technically challenging it might be while helping us spot important anomalies we can actually do something about. We're looking for things like fat-finger errors, out-of-range values, or extreme differences between countries. We're not trying to catch more complex stuff like regime shifts or changes in definitions—those are a bit too advanced for now.

  1. Calculate statistics:
    For each indicator in the dataset, let's calculate some basic stats (min, max, mean, etc.). We'll also look at differences between data points to catch any sudden jumps (a rough sketch is included at the end of this comment). Maybe we'll throw in some other random checks to see what sticks.

  2. Add Some Context:
    Attach relevant metadata for each indicator (like unit, description, etc.) so we can better understand the numbers.

  3. Ask ChatGPT:
    Feed all this info into ChatGPT and ask it to flag any weird indicators or countries. While we're at it, we can also ask for countries that might be useful for comparison.

  4. Visualize the Results:
    Once we have the flagged data, let's plot it as charts so we can easily see what's going on.

This could be relatively easy to try out for a few datasets to see how well it works. Ideally, we'd have a list of real anomalies to test it against, but if not, we can just artificially mess up the data.
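
As a purely illustrative sketch of step 1, assuming each indicator arrives as a long-format table with hypothetical country, year and value columns, something like this could compute the basic statistics and flag the largest jumps:

import pandas as pd


def indicator_statistics(df: pd.DataFrame) -> dict:
    """Basic stats plus the most extreme year-on-year jumps (column names are assumptions)."""
    df = df.sort_values(["country", "year"]).copy()
    # Absolute year-on-year change within each country series.
    df["diff"] = df.groupby("country")["value"].diff().abs()
    return {
        "min": df["value"].min(),
        "max": df["value"].max(),
        "mean": df["value"].mean(),
        "std": df["value"].std(),
        "missing_share": df["value"].isna().mean(),
        # Candidate fat-finger errors: the largest single-step changes.
        "largest_jumps": df.nlargest(5, "diff")[["country", "year", "diff"]].to_dict("records"),
    }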

@pabloarosado
Contributor Author

Proposal A

Summary

We create a new dedicated page in wizard for anomaly detection. To begin with, this will be a table with one row per indicator, covering the indicators from datasets that have changed in a given PR.

Changes in wizard

We need a new dedicated page for anomaly detection.

New backend code for anomaly detection in wizard

We need code to execute anomaly detection at the individual indicator level and when comparing indicator versions.

Changes in indicator upgrader (optional)

We may also need to look into how to handle variable mappings. In the easiest version, we just store a JSON file locally, but in the long term it would be good to be able to "restore" mappings in indicator upgrader, and store mappings in the grapher database.
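
For illustration only, the "easiest version" could be as small as this; the file name and ID types are assumptions rather than anything we've agreed on:

import json
from pathlib import Path

# Hypothetical location for the locally stored mapping.
MAPPING_FILE = Path(".variable_mapping.json")


def save_mapping(mapping: dict[int, int]) -> None:
    """Persist a mapping of old variable IDs to new variable IDs."""
    MAPPING_FILE.write_text(json.dumps(mapping, indent=2))


def load_mapping() -> dict[int, int]:
    """Restore the mapping, e.g. to pre-fill indicator upgrader later."""
    if not MAPPING_FILE.exists():
        return {}
    return {int(old): int(new) for old, new in json.loads(MAPPING_FILE.read_text()).items()}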

Distribution of tasks

We thought we could roughly divide the work as follows:

  • @lucasrodes can take care of creating the streamlit app.
    • Lucas can also think a bit more about how to handle variable mappings, and whether the changes in indicator upgrader are necessary.
  • @Marigold can take care of indicator-related anomaly detection, possibly using LLMs.
  • @pabloarosado can take care of the indicator versions comparison, e.g. comparing each of the old and new indicators in an updated dataset, and detecting abrupt changes.

@lucasrodes
Member

To add to Pablo's summary:

  • An extra motivation for working on something that keeps variable mappings is that we could use this information for other projects. Relating old → new indicators could enable things like pointing users to older indicator versions on our site, etc.

@lucasrodes
Member

lucasrodes commented Oct 2, 2024

Workflow

How I see the data manager workflow:

  1. User adds new indicators to DB.
  2. Anomaly detector (AD) finds anomalies.
  3. Streamlit app presents these anomalies to the user.
  4. User decides what to do with them: go back to the ETL code and fix the issues, ignore the anomalies, etc.

Challenges

  • If the user sees an anomaly and tries to fix it (step 4), they will want to (i) execute the code again and (ii) check in the app whether anything has changed (e.g. the anomaly is no longer present). They might expect the app to be refreshed, meaning we must run the AD again. Is this scalable? Is the latency OK?

Possible structure of AD output

Step 2 generates an output summary of the anomalies, which is then used in the Streamlit app (step 3). We should agree on its format.

anomalies:
  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_1"
    descriptions:
      - description: "Sudden spike in ..."
      - description: "Sudden drop in ..."
      - description: "Missing values in period ..."

  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_2"
    descriptions:
      - description: "Values way higher in ..."
        indicator_old_slug: "grapher/bla/2023-01-01/bla/bla#indicator_2"

Note: some anomalies could be just for one indicator, and others could be for that indicator relative to its old version (see indicator_old_slug).

Optional: We could try to pinpoint the country and years for a single anomaly (i.e. 'when that anomaly happened'), though that could be 'a single country-year', 'a single country over a period of years', 'multiple countries and years', etc. So I am unsure about the format at the moment.

@larsyencken
Collaborator

larsyencken commented Oct 2, 2024

From our discussion:

First version

The UI and workflow / Lucas

  • In the Wizard, there is a page for anomaly detection
  • The page should present you with a list of all changed grapher indicators
    • It should read from local disk what indicators are upgrades of other indicators
    • You may optionally ask about any arbitrary ETL path for a grapher indicator
  • You click "Detect all anomalies"
  • It begins fetching data from the S3 API and finds things whilst you wait
    • It should run at interactive speed, i.e. take less than one minute
    • If an indicator is an upgrade, it also detects within the context of the previous indicator
    • (optional) It would be dreamy if they appear one by one whilst you wait
  • The anomaly detector tries to return only the few "most important" anomalies
  • It stores them in some YAML file or SQLite DB (a possible table layout is sketched after this list); they're still there if you come back later
  • It provides a one-click link for any anomaly that takes you to the indicator (either in a chart, or in the admin)
  • There is a button to regenerate them all
  • There is a button to delete them all
  • They are ephemeral: when you merge your branch, and your staging server dies, they disappear
    • But you can still manually add something important to "What you should know about this indicator" for its data page
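
A possible table layout for the SQLite option, purely as a sketch (table and column names are assumptions, not an agreed schema):

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS anomalies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    indicator_slug TEXT NOT NULL,
    indicator_old_slug TEXT,          -- set when the anomaly is relative to a previous version
    description TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

# Anomalies persist across Wizard sessions until the staging server goes away.
with sqlite3.connect("anomalies.db") as conn:
    conn.executescript(SCHEMA)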

Anomalies within an indicator / Mojmir

We will probably use a ChatGPT-based approach. A major challenge is packing the dataset into the 128k context window. We could:

  • Use summary statistics or otherwise compress each country series in some way (a rough sketch follows this list)
  • Split the data and use multiple ChatGPT calls
  • Split the data, generate thumbnails, and do detection based on thumbnail images
  • Consider Claude with a 200k context window
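
A rough sketch of the first option, compressing each country series into one line of text before it goes into the prompt (column names are hypothetical):

import pandas as pd


def summarize_series(df: pd.DataFrame) -> str:
    """One compact line per country, to keep the prompt within the context window."""
    lines = []
    for country, g in df.sort_values("year").groupby("country"):
        values = g["value"].dropna()
        lines.append(
            f"{country}: {int(g['year'].min())}-{int(g['year'].max())}, "
            f"min={values.min():.3g}, max={values.max():.3g}, "
            f"mean={values.mean():.3g}, last={values.iloc[-1]:.3g}"
        )
    return "\n".join(lines)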

Anomalies relative to a baseline / Pablo

We can experiment with an approach based on percentage change.
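
As a sketch of what that could look like, assuming old and new versions come as long-format tables with country, year and value columns (the 20% threshold is an arbitrary assumption):

import pandas as pd


def flag_abrupt_changes(old: pd.DataFrame, new: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    """Country-years where the new value differs from the old one by more than `threshold`."""
    merged = old.merge(new, on=["country", "year"], suffixes=("_old", "_new"))
    merged["pct_change"] = (merged["value_new"] - merged["value_old"]).abs() / merged["value_old"].abs()
    return merged[merged["pct_change"] > threshold].sort_values("pct_change", ascending=False)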

Later options

If all of this went well, we could consider:

  • Automatically trigger this on change (e.g. on push), or otherwise do them as a batch
  • Some kind of workflow where we save the most important stuff into metadata
  • Some kind of visualisation that can show you the chart with the anomaly

@larsyencken larsyencken changed the title Perform data quality checks, and help identify anomalies Create anomaly detection workflow Oct 4, 2024
@lucasrodes
Member

Today we had a long discussion about which tasks are pending and how to organize ourselves to work on this project.

From our meeting, we concluded that there are three main parts:

  1. Backend: Work on a rule-based anomaly detection tool for single indicators and indicator upgrades.
  2. Backend/DB: Make sense of the detected anomalies, group similar ones, and store them in a database.
  3. UI: Present the relevant anomalies to the user in a Streamlit app.

Some other important points:

  • We might need pagination
  • Review/Pending anomaly status is just optional (we don't force users to review, like in chart diff)
  • We need filter options to show more or fewer anomalies, show only certain anomalies, etc. (a minimal UI sketch follows this list)
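
A minimal sketch of how the Streamlit page could filter and present stored anomalies; the CSV source and column names are placeholders for whatever part 2 ends up writing to the database:

import pandas as pd
import streamlit as st

st.title("Anomaly detection")

# Placeholder: anomalies previously written by the detector, one row each.
anomalies = pd.read_csv("anomalies.csv")

indicators = st.multiselect("Indicators", sorted(anomalies["indicator_slug"].unique()))
max_rows = st.slider("Show at most", min_value=10, max_value=500, value=50)

filtered = anomalies if not indicators else anomalies[anomalies["indicator_slug"].isin(indicators)]
st.dataframe(filtered.head(max_rows))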

Other pending discussion points

  • Agree on each phase's input and output specs (1, 2, 3).
  • It needs to be clarified what gets saved in the database.

[image: Figma board]
