Discuss: package cohort validation strategy #83

Open
dgkf opened this issue Apr 11, 2024 · 8 comments

Comments

@dgkf
Collaborator

dgkf commented Apr 11, 2024

As @mmengelbier brought up at today's meeting, we face the challenge of producing validation documentation for packages whose results are inter-dependent.

A package's validation results are dependent on the broader set of available packages, and we should be intentional with how we manage that relationship.

In this issue we hope to settle on a strategy for managing inter-package relationships and their effects on metrics, with the goal of aligning on which steps should be taken by a validation pipeline.

@dgkf
Collaborator Author

dgkf commented Apr 11, 2024

Assuming the r-hub/repos-style repository proves to be a viable path forward, I propose that we schedule a recurring process to update packages (let's say nightly, since that's how often r-hub/repos runs, though the exact frequency isn't important). The pipeline should:

Pipeline Steps

  1. Compare the existing set of packages with the latest set of packages
  2. For each new/updated package, as well as its direct reverse dependencies, calculate new risk metrics
    (Note that when running riskmetric, dependencies should be installed from the new list of packages)
  3. Update the registry with the new packages and their metrics.

This will mean that within a single snapshot of the repo, the package and all the dependencies used to produce risk metrics would have existed simultaneously at that time.
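The selection logic in steps 1 and 2 can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, each snapshot is assumed to be available as a mapping of package name to version, and `deps` is assumed to map each package in the new snapshot to its direct dependencies. Computing the metrics themselves (riskmetric) is out of scope here.

```python
from typing import Dict, Set


def changed_packages(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Step 1: names that are new, or whose version differs from the old snapshot."""
    return {name for name, version in new.items() if old.get(name) != version}


def direct_reverse_deps(targets: Set[str], deps: Dict[str, Set[str]]) -> Set[str]:
    """Packages that directly depend on at least one of the target packages."""
    return {pkg for pkg, requires in deps.items() if requires & targets}


def packages_to_assess(
    old: Dict[str, str], new: Dict[str, str], deps: Dict[str, Set[str]]
) -> Set[str]:
    """Step 2: changed packages plus their direct reverse dependencies."""
    changed = changed_packages(old, new)
    return changed | direct_reverse_deps(changed, deps)


# Illustrative data only; package names A-D are placeholders.
old_snapshot = {"A": "1.0", "B": "1.0", "C": "1.0"}
new_snapshot = {"A": "1.1", "B": "1.0", "C": "1.0", "D": "0.1"}
dependencies = {"A": set(), "B": {"A"}, "C": {"B"}, "D": set()}

to_assess = packages_to_assess(old_snapshot, new_snapshot, dependencies)
```

In this toy example `C` is not re-assessed, because it only depends on `A` indirectly through `B`.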

Alternatives

In place of step 2, there are more comprehensive, but more computationally intensive alternatives:

  • Re-run any (direct or indirect) reverse dependency
  • Re-run all packages in the repository
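For the first alternative, the set of packages to re-run is the transitive closure over reverse dependencies. A minimal sketch, assuming the same hypothetical `deps` mapping of package name to direct dependencies; the function name is illustrative and not part of any existing pipeline:

```python
from collections import deque
from typing import Dict, Iterable, Set


def all_reverse_deps(targets: Iterable[str], deps: Dict[str, Set[str]]) -> Set[str]:
    """Every package that depends, directly or indirectly, on any target."""
    # Invert the dependency graph: dependency -> packages that require it.
    reverse: Dict[str, Set[str]] = {}
    for pkg, requires in deps.items():
        for dependency in requires:
            reverse.setdefault(dependency, set()).add(pkg)

    start = set(targets)
    found: Set[str] = set()
    queue = deque(start)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            # Skip packages already visited so cycles cannot loop forever.
            if dependent not in found and dependent not in start:
                found.add(dependent)
                queue.append(dependent)
    return found


# Illustrative data only; package names A-E are placeholders.
graph = {"A": set(), "B": {"A"}, "C": {"B"}, "D": set(), "E": {"C", "D"}}
affected = all_reverse_deps({"A"}, graph)
```

The cost grows with how far down the dependency graph the updated package sits, which is why this is more computationally intensive than the direct-only rule.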

@dgkf dgkf changed the title Draft cohort validation strategy Map a package cohort validation strategy Apr 11, 2024
@dgkf dgkf changed the title Map a package cohort validation strategy Discuss: package cohort validation strategy Apr 11, 2024
@yannfeat
Collaborator

This seems very feasible @dgkf. I think that the current validation pipeline could be well transformed into such an algorithm. A not-too-greedy implementation of the first alternative should not be too difficult either.

@yannfeat
Collaborator

yannfeat commented Apr 19, 2024

@dgkf this process seems straightforward to me, at least if we restrict ourselves, for this MVP, to the metrics that can be produced when the source argument of riskmetric::pkg_ref is set to "pkg_cran_remote", such as downloads_1yr, has_maintainer, news_current...

@dgkf
Collaborator Author

dgkf commented Apr 19, 2024

Sounds good @yannfeat - Initially, I'd do whatever is most actionable for getting the overall process in place. Even if the metrics aren't 100% accurate, just getting it hooked up to the repos build process would be a huge step forward.

Longer-term, thinking more about the details of the process, I think we'll need to invest some time in ensuring we're grabbing the packages defined in the repos' PACKAGES file, and not from public CRAN.
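As a rough sketch of reading versions from the repository index rather than from public CRAN: a PACKAGES file is plain text in DCF form, with blank-line-separated records of `Field: value` lines and indented continuation lines. The parser below is a simplified illustration (in R itself, `read.dcf` does this job); the function names are hypothetical and the sample text is made up for the example.

```python
from typing import Dict, List


def parse_packages(text: str) -> List[Dict[str, str]]:
    """Parse DCF-style text into one dict of fields per package record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0] in " \t" and last_key is not None:
            # Indented lines continue the previous field's value.
            current[last_key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


def versions_by_name(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Map package name to version for records that carry both fields."""
    return {
        r["Package"]: r["Version"]
        for r in records
        if "Package" in r and "Version" in r
    }


# Made-up sample index; "examplepkg" is a placeholder name.
sample = (
    "Package: colorspace\n"
    "Version: 2.1-0\n"
    "Depends: R (>= 3.0.0),\n"
    "    methods\n"
    "\n"
    "Package: examplepkg\n"
    "Version: 0.1\n"
)
index = versions_by_name(parse_packages(sample))
```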

@yannfeat yannfeat self-assigned this Apr 30, 2024
@yannfeat
Collaborator

yannfeat commented May 2, 2024

@dgkf I am implementing the comparison of the packages: pharmaR/repos@1618191d8ce878fc9c894ecbf29e86a0458355c4. Do we actually want to see if a new package version has been published, or if there is a more recent release?
For example (screenshot not preserved), the name of the most recent GitHub release of the first package is colorspace_2.1-0_b1_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz, whereas the name referenced in the PACKAGES file is colorspace_2.1-0_b1_R4.4_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz ("R4.5" vs. "R4.4").

@dgkf
Collaborator Author

dgkf commented May 2, 2024

Ah, I see. It looks like r-hub/repos maintains multiple repositories (for each architecture, one for R4.4 and one for R4.5).

For simplicity, I would suggest picking one specific architecture/R version for development. I would probably start with ubuntu + R4.4.

So to answer your question, we should only care about new releases for the particular version of R that the repo corresponds to.
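One way to restrict the comparison to a single R version is to parse it out of the release file name. The sketch below assumes the naming pattern `<package>_<version>_b<build>_R<major.minor>_<platform>.tar.gz`, which is inferred only from the two colorspace file names quoted above and may not hold in general; the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Pattern inferred from the two example file names in this thread (an assumption).
BINARY_NAME = re.compile(
    r"^(?P<name>[^_]+)_(?P<version>[^_]+)_(?P<build>b\d+)"
    r"_R(?P<r_version>\d+\.\d+)_(?P<platform>.+)\.tar\.gz$"
)


def parse_binary_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a release file name into its parts, or return None if it does not match."""
    match = BINARY_NAME.match(filename)
    return match.groupdict() if match else None


def select_for_r(filenames: List[str], r_version: str) -> List[str]:
    """Keep only the releases built for the given R version, e.g. "4.4"."""
    selected = []
    for filename in filenames:
        parts = parse_binary_name(filename)
        if parts is not None and parts["r_version"] == r_version:
            selected.append(filename)
    return selected


releases = [
    "colorspace_2.1-0_b1_R4.5_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz",
    "colorspace_2.1-0_b1_R4.4_x86_64-pc-linux-gnu-ubuntu-22.04.tar.gz",
]
for_r44 = select_for_r(releases, "4.4")
```

With this filter, the R4.5 release of colorspace would not register as a change for a repository that targets R4.4.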

@yannfeat
Collaborator

yannfeat commented May 3, 2024

Ok, I have calculated the risk metrics on the set of packages with differences: https://github.com/pharmaR/repos/blob/feature/riskscore/dev/poc_cohort_validation.R

This approach is not very solid, as I am comparing the versions of releases on GitHub but calculating the metrics from CRAN. Unfortunately, riskmetric hasn't yet implemented the assessment of remote Git repositories. It doesn't really matter, however, as in the end we will need to assess packages downloaded locally to get all of the metrics, and also to update the PACKAGES file via the packages' DESCRIPTION files, as per #88. We will need to coordinate with whoever picks up that last issue.

@dgkf
Collaborator Author

dgkf commented May 3, 2024

> This approach is not very solid, as I am comparing the versions of releases on GitHub but calculating the metrics from CRAN.

That's okay! The GitHub repos that r-hub/repos refers to are mirrors of CRAN, so the versions should be identical (so long as we pull the same version of the package from CRAN). The tagline for the https://github.com/cran/ org is "Unofficial read-only mirror of all CRAN R packages", so I think we should be okay here.

Certainly for a first proof of concept, I think we can safely assume that the source code in github.com/cran is an accurate reflection of the source code tarball that you would get from CRAN.
