Convert database notebooks to use SQL Runner #20

iaindillingham · 2022-10-03T10:25:34Z

opensafely/database-notebooks contains four database notebooks; three are published on OpenSAFELY Reports (builds, history, schema). This playbook describes how we update them.

We should convert each database notebook to use SQL Runner. We should start with the schema notebook (#9).

convert builds notebook (repo, workspace, report)
convert history notebook (repo, workspace, report)
convert schema notebook (repo, workspace, report)
update playbook
check documentation for links
close PRs, close issues, update README, and archive database-notebooks

Considerations

Create or update

Why create a new repo? If we created a new repo from opensafely-core/repo-template, then we would have a more developer-friendly setup. This would make it easier to incorporate some SQL linting (e.g. SQLFluff has a pre-commit hook).

Why update the existing repo? The existing repo, opensafely/database-notebooks, has seen nearly two years of development; it's not immediately clear which parts should be retained and which parts should be discarded, so updating it would be safer than creating a new repo; opensafely-core/reports and https://reports.opensafely.org/ may have references to the existing repo.

Repos and notebooks

Should there be one repo for all notebooks or one repo for each notebook?

inglesp · 2022-10-03T12:03:34Z

If possible, can we do the schema one first?

iaindillingham · 2022-10-03T15:26:06Z

I'm leaning towards creating a new repo for each report (starting with the schema report, as per #9 and Peter's comment).

Rationale:

The developer-friendly setup would be useful. The existing repo contains requirements.txt, which is built from requirements.in. Whilst we can't change the production dependencies much (i.e. those on python-docker), we'd like to specify some development dependencies. For example, the existing repo requires nbval. And as I mentioned, we could incorporate some SQL linting, which could help us find and fix errors faster.
The new repo would be very similar to that of a reusable action: a study repo with developer tools. See, for example, opensafely-actions/cohort-joiner.
The one-repo-per-report model would make a report more like a study, and would make it easier to determine when the schema report was generated: OpenSAFELY Jobs would record when the schema report was generated, rather than when all reports were generated.
Much of the existing repo contains code that is unrelated to the schema report, or is for configuring/running Jupyter Notebook. opensafely jupyter may have made this code redundant.

iaindillingham · 2022-10-03T15:33:55Z

It looks as though pointing OpenSAFELY Reports to a new repo would be straightforward; the existing repo isn't hard-coded; the fields on the admin page in OpenSAFELY Reports are clear.

wjchulme · 2022-10-04T09:10:17Z

Starting with a new repo instead of updating the existing repo is absolutely the right thing to do. Start afresh, cut the chaff.

I'm not convinced of the benefit of splitting each notebook into different repos. It feels like part of the same suite of information that will likely all be updated on the same schedule, and so can be accessed and updated all in one place. Although if they're set to be updated automatically and viewed on the reports site, then maybe having them all in one place matters less. I don't have a strong preference so if there are developery reasons that I'm not fully grasping then go for it.

wjchulme · 2022-10-04T09:12:44Z

Thinking a bit more -- the "history" and "builds" notebooks should be combined, as raised in this issue. So maybe one repo for that combined report, and one for the schema report does make sense after all!

wjchulme · 2022-10-04T09:13:51Z

Another thing in passing. There is currently no sych between the repo on the L2/3 server and on github because git pushing out from L2/3 is no longer possible for mere mortals like me. So when we convert we need to check for any server side updates that haven't made it to github. Happy to flag these when needed

iaindillingham · 2022-10-04T14:08:28Z

Thanks for your comments @wjchulme.

I'm still leaning towards a one-to-one mapping between repos and reports. As I understand it, each report (notebook) queries the database, transforms the results, and displays the transformed results. There's one action in project.yaml for each report,¹ so regenerating a report means rerunning that action.

When we convert each report to use SQL Runner, there will be separate actions in project.yaml for querying (sqlrunner), transforming (python), and displaying (python). Regenerating a report will mean rerunning three actions. If there were a one-to-many mapping between repos and reports, then we'd need to run the actions we want to run, without running the actions we don't want to run. If there were a one-to-one mapping between repos and reports, however, then we'd not need to be as careful.

This isn't quite correct; there's one action for each output format for each report 🙂. ↩

wjchulme · 2022-10-04T14:50:19Z

Thanks @iaindillingham that makes sense. Often in research repos you often may only want to run a subset of actions at a time -- to run a specific subgroup analysis for example -- so that feels like less of a problem to me. Particularly in this case as I think all these reports will likely end up always being run at the same time (= after each database update).

But as long as there's no inelegant duplication across reports then there is certainly a logic to a one-repo-per-report approach.

iaindillingham · 2023-05-04T12:59:18Z

Phew, so all check boxes checked. We're good to close this issue, right?

Yes and no.

There are two open issues that give reports parity with notebooks:

And three that make reports a bit more user-friendly:

And three that remove a potential error, but have a dependency on SQL Runner:

There's also the issue of running the TPP database reports after L2 access is clarified.

Nevertheless, these issues are captured elsewhere; breaking them out into smaller issues means I think we are good to close this (larger, long-running) issue.

🎉

This was referenced Oct 3, 2022

Run database notebooks regularly #21

Open

Take ownership of database notebooks opensafely-core/cohort-extractor#831

Closed

Replace notebooks with pipelines that use SQL Runner #9

Closed

iaindillingham assigned iaindillingham and milanwiedemann Oct 3, 2022

wjchulme mentioned this issue Dec 6, 2022

remove python output from database build notebook opensafely/database-notebooks#36

Closed

inglesp unassigned milanwiedemann Mar 13, 2023

iaindillingham closed this as completed May 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert database notebooks to use SQL Runner #20

Convert database notebooks to use SQL Runner #20

iaindillingham commented Oct 3, 2022 •

edited

Loading

inglesp commented Oct 3, 2022

iaindillingham commented Oct 3, 2022

iaindillingham commented Oct 3, 2022

wjchulme commented Oct 4, 2022

wjchulme commented Oct 4, 2022

wjchulme commented Oct 4, 2022

iaindillingham commented Oct 4, 2022

wjchulme commented Oct 4, 2022

iaindillingham commented May 4, 2023

Convert database notebooks to use SQL Runner #20

Convert database notebooks to use SQL Runner #20

Comments

iaindillingham commented Oct 3, 2022 • edited Loading

Considerations

Create or update

Repos and notebooks

inglesp commented Oct 3, 2022

iaindillingham commented Oct 3, 2022

iaindillingham commented Oct 3, 2022

wjchulme commented Oct 4, 2022

wjchulme commented Oct 4, 2022

wjchulme commented Oct 4, 2022

iaindillingham commented Oct 4, 2022

Footnotes

wjchulme commented Oct 4, 2022

iaindillingham commented May 4, 2023

iaindillingham commented Oct 3, 2022 •

edited

Loading