Discuss: package cohort validation strategy #83

dgkf · 2024-04-11T15:03:02Z

As @mmengelbier brought up at today's meeting, we face the challenge of producing validation documentation that is inter-dependent.

A package's validation results are dependent on the broader set of available packages, and we should be intentional with how we manage that relationship.

In this issue we hope to settle on a strategy for managing inter-package relationships and their effects on metrics, with the goal of aligning on which steps should be taken by a validation pipeline.

dgkf · 2024-04-11T15:45:01Z

Assuming the r-hub/repos-style repository proves to be a viable path forward, I propose that we schedule a recurring process to update packages (let's say nightly, since that's how often r-hub/repos runs, though the exact frequency isn't important). I propose that a pipeline should:

Pipeline Steps

1. Compare the existing set of packages with the latest set of packages
2. For each new/updated package, as well as their direct reverse dependencies, calculate new risk metrics
   (Note that when running riskmetric, dependencies should be installed from the new list of packages)
3. Update the registry with the new packages and their metrics.

This will mean that within a single snapshot of the repo, the package and all the dependencies used to produce risk metrics would have existed simultaneously at that time.
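The first two pipeline steps can be sketched as below. This is only an illustration in Python rather than the pipeline's actual code: the `{name: version}` and `{name: [dependencies]}` dictionaries are assumed stand-ins for whatever is read from the old and new package indexes, and all function and package names are hypothetical.

```python
def changed_packages(old, new):
    """Names that are new, or whose version string differs, in the new index."""
    return {name for name, version in new.items() if old.get(name) != version}


def direct_reverse_deps(targets, deps):
    """Packages that directly depend on at least one package in `targets`."""
    return {pkg for pkg, needs in deps.items() if targets & set(needs)}


def packages_to_assess(old, new, deps):
    """New/updated packages plus their direct reverse dependencies."""
    changed = changed_packages(old, new)
    return changed | direct_reverse_deps(changed, deps)


# Hypothetical snapshot data, for illustration only.
old_index = {"pkgA": "1.0.0", "pkgB": "2.1.0", "pkgC": "0.3.1"}
new_index = {"pkgA": "1.0.1", "pkgB": "2.1.0", "pkgC": "0.3.1", "pkgD": "0.1.0"}
dependencies = {
    "pkgA": [],
    "pkgB": [],
    "pkgC": ["pkgA", "pkgB"],  # pkgC depends on the updated pkgA
    "pkgD": ["pkgB"],
}

to_assess = packages_to_assess(old_index, new_index, dependencies)
```

Here `pkgA` (updated) and `pkgD` (new) are selected because they changed, and `pkgC` is selected because it directly depends on `pkgA`. Packages removed from the index are not handled in this sketch.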

Alternatives

In place of step 2, there are more comprehensive, but more computationally intensive alternatives:

- Re-run any (direct or indirect) reverse dependency
- Re-run all packages in the repository
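As a rough sketch of the first alternative, the set of direct and indirect reverse dependencies can be collected with a breadth-first walk over an inverted dependency graph. The `{name: [dependencies]}` input and the function name are assumptions made for illustration, not part of any existing pipeline.

```python
from collections import deque


def all_reverse_deps(targets, deps):
    """Direct and indirect reverse dependencies of `targets`, excluding the targets."""
    # Invert the graph: for each package, record who depends on it.
    dependents = {}
    for pkg, needs in deps.items():
        for dep in needs:
            dependents.setdefault(dep, set()).add(pkg)

    found = set()
    queue = deque(targets)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in found and parent not in targets:
                found.add(parent)
                queue.append(parent)
    return found


# Hypothetical dependency chain: pkgC -> pkgB -> pkgA, with pkgD unrelated.
dependencies = {"pkgA": [], "pkgB": ["pkgA"], "pkgC": ["pkgB"], "pkgD": []}
affected = all_reverse_deps({"pkgA"}, dependencies)
```

Each package is visited at most once, so the cost grows with the size of the dependency graph rather than with the number of dependency paths.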

yannfeat · 2024-04-18T12:28:31Z

This seems very feasible @dgkf. I think the current validation pipeline could readily be adapted into such an algorithm. A not-too-greedy implementation of the first alternative should not be too difficult either.

yannfeat · 2024-04-19T11:45:37Z

@dgkf this process seems straightforward to me, at least if we restrict ourselves, for this MVP, to the metrics that can be produced when the source argument of riskmetric::pkg_ref is set to "pkg_cran_remote", such as downloads_1yr, has_maintainer, news_current...

dgkf · 2024-04-19T15:16:23Z

Sounds good @yannfeat - Initially, I'd do whatever is most actionable for getting the overall process in place. Even if the metrics aren't 100% accurate, just getting it hooked up to the repos build process would be a huge step forward.

Longer-term, thinking more about the details of the process, I think we'll need to invest some time in ensuring we're grabbing the packages defined in the repos' PACKAGES file, and not from public CRAN.
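To make the idea concrete: a repository's PACKAGES index is a plain-text file of `Field: value` records separated by blank lines, with continuation lines indented. A minimal sketch of reading package names and versions from such text follows. It is written in Python purely for illustration, the sample records are made up, and the function name is hypothetical.

```python
def parse_packages_index(text):
    """Parse PACKAGES-style text into a list of {field: value} records."""
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field:
            # An indented line continues the previous field.
            current[last_field] += " " + line.strip()
        else:
            field, _, value = line.partition(":")
            last_field = field.strip()
            current[last_field] = value.strip()
    if current:
        records.append(current)
    return records


# Made-up index content, for illustration only.
sample = """Package: pkgA
Version: 1.0.0
Imports: pkgB,
    pkgC

Package: pkgB
Version: 0.2-1
"""

records = parse_packages_index(sample)
versions = {r["Package"]: r["Version"] for r in records}
```

A `versions` mapping built this way from the repos' own index, rather than from public CRAN, is what a snapshot comparison would consume.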

yannfeat · 2024-05-02T16:53:01Z

@dgkf I am implementing the comparison of the packages: pharmaR/repos@1618191d8ce878fc9c894ecbf29e86a0458355c4. Do we actually want to see if a new package version has been published, or if there is a more recent release?
See:

The name of the most recent GitHub release of the first package is colorspace_2.1-0_b1_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz, whereas the name referenced in the PACKAGES file is colorspace_2.1-0_b1_R4.4_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz ("R4.5" vs. "R4.4").

dgkf · 2024-05-02T17:55:59Z

Ah, I see. It looks like r-hub/repos maintains multiple repositories (for each architecture, one for R4.4 and one for R4.5).

For simplicity, I would suggest picking one specific architecture/R version for development. I would probably start with ubuntu + R4.4.

So to answer your question, we should only care about new releases for the particular version of R that the repo corresponds to.
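One way to apply this filter is to parse the R version out of the release asset name and keep only assets for the chosen target. The sketch below assumes the naming scheme `name_version_build_Rx.y_platform.tar.gz`, inferred only from the two file names quoted above; the regular expression and function names are hypothetical and may not cover every asset r-hub/repos produces.

```python
import re

# Assumed layout: <name>_<version>_<build>_R<major.minor>_<platform>.tar.gz
ASSET_PATTERN = re.compile(
    r"^(?P<name>[^_]+)_(?P<version>[^_]+)_(?P<build>b\d+)"
    r"_R(?P<r_version>\d+\.\d+)_(?P<platform>.+)\.tar\.gz$"
)


def parse_asset_name(filename):
    """Split an asset file name into its parts, or return None if it does not match."""
    match = ASSET_PATTERN.match(filename)
    return match.groupdict() if match else None


def matches_target(filename, r_version="4.4", platform_suffix="ubuntu-22.04"):
    """True if the asset was built for the chosen R version and platform."""
    info = parse_asset_name(filename)
    return (
        info is not None
        and info["r_version"] == r_version
        and info["platform"].endswith(platform_suffix)
    )


assets = [
    "colorspace_2.1-0_b1_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz",
    "colorspace_2.1-0_b1_R4.4_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz",
]
relevant = [a for a in assets if matches_target(a)]
```

With the defaults shown, only the R4.4 asset is kept, so a newer R4.5 build alone would not be treated as a new release.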

yannfeat · 2024-05-03T10:35:27Z

Ok, I have calculated the risk metrics on the set of packages with differences: https://github.com/pharmaR/repos/blob/feature/riskscore/dev/poc_cohort_validation.R

This approach is not very solid, as I am comparing the versions of releases on GitHub but calculating the metrics from CRAN. Unfortunately, riskmetric hasn't yet implemented the assessment of remote Git repositories. It doesn't really matter, however, as in the end we will need to assess packages downloaded locally to get all of the metrics, and also to update the PACKAGES file via the packages' DESCRIPTION files, as per #88. We will need to coordinate with whoever picks up that last issue.

dgkf · 2024-05-03T21:55:22Z

> This approach is not very solid, as I am comparing the versions of releases on GitHub but calculating the metrics from CRAN.

That's okay! The GitHub repos that r-hub/repos refers to are mirrors of CRAN, so the versions should be identical (so long as we pull the same version of the package from CRAN). The tagline for the https://github.com/cran/ org is "Unofficial read-only mirror of all CRAN R packages", so I think we should be okay here.

Certainly for a first proof of concept, I think we can safely assume that the source code in github.com/cran is an accurate reflection of the source code tarball that you would get from CRAN.

dgkf changed the title ~~Draft cohort validation strategy~~ Map a package cohort validation strategy Apr 11, 2024

dgkf added the discussion label Apr 11, 2024

dgkf changed the title ~~Map a package cohort validation strategy~~ Discuss: package cohort validation strategy Apr 11, 2024

yannfeat self-assigned this Apr 30, 2024

yannfeat added the review needed label May 3, 2024
