Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore/gen_pypi_cache overhaul #225

Merged
merged 18 commits into from
Jan 29, 2024

Conversation

blast-hardcheese
Copy link
Collaborator

@blast-hardcheese blast-hardcheese commented Jan 24, 2024

Why

Statically analyzing which modules are provided by a given package can give us some serious speed improvements.

Pseudoscientific performance graph of execution time before/after this PR:

Where Time Graph Enjoyment
Current HEAD 1h30m 📈 📉
With this PR 16m 📉 📈

Additionally, looking at some user reports, relying on pkgutil seems to miss packages that are valid. As an example:

$ mkdir foo/bar/baz
$ touch foo/bar/baz/blix.py
$ python -c 'import foo.bar.baz.blix'

exits cleanly, though pkgutil would fail to see inside foo due to foo not being a module itself.

A real-world example of no longer overreporting modules:

-{"name":"3scale-api",...,"modules":["certifi","charset_normalizer","idna","requests","threescale_api","urllib3"]}
+{"name":"3scale-api",...,"modules":["threescale_api"]}
-{"name":"ailever",...,"modules":["FinanceDataReader","PIL","_brotli","_plotly_future_","_plotly_utils","ailever","brotli","bs4","certifi","charset_normalizer","click","cycler","dash","dash_bootstrap_components","dash_core_components","dash_html_components","dash_table","dateutil","flask","flask_compress","fontTools","idna","importlib_metadata","itsdangerous","jinja2","jupyterlab_plotly","kiwisolver","lxml","markupsafe","matplotlib","numpy","packaging","pandas","patsy","plotly","pylab","pyparsing","pytz","requests","requests_file","requests_futures","scipy","seaborn","six","soupsieve","statsmodels","tenacity","tqdm","urllib3","werkzeug","yahooquery","zipp"]}
+{"name":"ailever",...,"modules":["ailever"]}

What changed

  • Add test-one subcommand for easy manual testing
  • Rely on the python runtime to determine which file extensions are valid for considering whether they are modules or not
  • Use pip install --no-deps to limit search blast radius. Important for "extension"-style modules which depend on the underlying package, which would obscure which package supplies the extension (flaskext.mysql, as an example, may also falsely report flask as a provided module)

Test plan

$ go run ./gen_pypi_map test -force

observe that it is faster

$ go run ./gen_pypi_map test-one -force flask-mysql
...

observe output

@blast-hardcheese blast-hardcheese requested a review from a team as a code owner January 24, 2024 22:45
@blast-hardcheese blast-hardcheese requested review from airportyh and removed request for a team January 24, 2024 22:45
Initially I didn't realize that the whole error info was being
accumulated for serialization into retval
@blast-hardcheese blast-hardcheese force-pushed the dstewart/chore/gen_pypi_cache-overhaul branch from 22bc862 to 1816364 Compare January 24, 2024 23:31
@blast-hardcheese blast-hardcheese force-pushed the dstewart/chore/gen_pypi_cache-overhaul branch from 1816364 to 81bb490 Compare January 24, 2024 23:51
@blast-hardcheese blast-hardcheese force-pushed the dstewart/chore/gen_pypi_cache-overhaul branch from 37aeb7d to a36ac96 Compare January 25, 2024 02:13

func pipInstall(pkg string, root string, timeout time.Duration) error {
// Run pip to install just the package, so we can statically analyze it
cmd := exec.Command("pip", "install", "--no-deps", "--target", root, pkg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! no deps makes this fast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added unconditional retry here to try and catch transient failures, it added a few minutes more to execution time but not much more

@airportyh
Copy link
Contributor

Exciting! I am testing this...

Copy link
Contributor

@airportyh airportyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@blast-hardcheese blast-hardcheese merged commit a5fa4e2 into main Jan 29, 2024
3 checks passed
@blast-hardcheese blast-hardcheese deleted the dstewart/chore/gen_pypi_cache-overhaul branch January 29, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants