This repository has been archived by the owner on Sep 5, 2023. It is now read-only.

Unconditional dependency on pandas/numpy increases package size by ~24x (<6MB -> 135MB) #29

Open
huonw opened this issue Jun 6, 2022 · 1 comment

huonw commented Jun 6, 2022

Thank you for publishing a client library!

Issue

The here-location-services package currently unconditionally depends on pandas, which depends on numpy, pytz and python-dateutil. On x86-64 Linux (for Python 3.9), these end up being very large (~130MB), with all of the rest of the dependencies being ~5MB. However, pandas is only used for converting the result for two functions associated with the matrix routing API:

class MatrixRoutingResponse(ApiResponse):
    """A class representing Matrix routing response data."""

    def __init__(self, **kwargs):
        super().__init__()
        self._filters = {"matrix": None}
        for param, default in self._filters.items():
            setattr(self, param, kwargs.get(param, default))

    def to_geojson(self):
        """Return API response as GeoJSON."""
        raise NotImplementedError("This method is not valid for MatrixRoutingResponse.")

    def to_distnaces_matrix(self):
        """Return distnaces matrix in a dataframe."""
        if self.matrix and self.matrix.get("distances"):
            distances = self.matrix.get("distances")
            dest_count = self.matrix.get("numDestinations")
            nested_distances = [
                distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
            ]
            return DataFrame(nested_distances, columns=range(dest_count))

    def to_travel_times_matrix(self):
        """Return travel times matrix in a dataframe."""
        if self.matrix and self.matrix.get("travelTimes"):
            distances = self.matrix.get("travelTimes")
            dest_count = self.matrix.get("numDestinations")
            nested_distances = [
                distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
            ]
            return DataFrame(nested_distances, columns=range(dest_count))

It seems unfortunate to require these huge dependencies to be installed for only these two functions when many people are likely not to call them anyway, and when the dependencies seemingly aren't required for any other functionality within this client library.

Potential alternatives

  1. Have pandas be an optional dependency (for example, via extras_require={"pandas": ["pandas"]} in setup.py), and import it on-demand in the individual functions that need it. For example:
    def to_distnaces_matrix(self):
        """Return distnaces matrix in a dataframe."""
        try:
            from pandas import DataFrame
        except ImportError as e:
            raise ImportError("pandas is not installed, run `pip install here-location-services[pandas]`") from e
    
        # ... existing implementation as before ...
    For an example of prior art, this option is what the popular Pydantic library does.
  2. Remove the pandas dependency totally, and have the functions return the nested lists (nested_distances) without converting to a DataFrame. A user who wants to use pandas can still convert to a DataFrame themselves: DataFrame(result.to_distnaces_matrix()) (the columns= argument seems to be unnecessary, as doing that call gives the same result AFAICT).
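To make the two alternatives concrete, here is a minimal sketch rather than the library's actual code. The helper names `to_nested_matrix` and `to_dataframe` and the sample response are hypothetical, and the `here-location-services[pandas]` extra is the one proposed above, which does not exist today. The first function covers option 2 in plain Python; the second shows the on-demand import from option 1.

```python
from typing import List, Sequence


def to_nested_matrix(flat: Sequence[float], num_destinations: int) -> List[List[float]]:
    """Split a flat, row-major list into one row per origin (hypothetical helper)."""
    if num_destinations <= 0:
        raise ValueError("num_destinations must be positive")
    return [
        list(flat[i : i + num_destinations])
        for i in range(0, len(flat), num_destinations)
    ]


def to_dataframe(rows):
    """Convert nested rows to a DataFrame, importing pandas only when called."""
    try:
        from pandas import DataFrame  # optional dependency, imported on demand
    except ImportError as e:
        # The [pandas] extra is the one proposed in this issue, not an existing one.
        raise ImportError(
            "pandas is not installed, run `pip install here-location-services[pandas]`"
        ) from e
    return DataFrame(rows)


# Made-up response data: 2 origins, 3 destinations.
matrix = {"distances": [0, 10, 20, 10, 0, 15], "numDestinations": 3}
rows = to_nested_matrix(matrix["distances"], matrix["numDestinations"])
print(rows)  # [[0, 10, 20], [10, 0, 15]]
```

With this split, importing the module never touches pandas; only a caller who asks for a DataFrame pays for the dependency.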

Both of these are probably best considered as breaking changes.

Context

We were attempting to use this package in an AWS Lambda, which has strict limits on the size of the code asset; exceeding them results in errors like 'Unzipped size must be smaller than 262144000 bytes' when deploying (relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#function-configuration-deployment-and-execution "Deployment package (.zip file archive)"). Additionally, larger packages result in slower cold starts: https://mikhail.io/serverless/coldstarts/aws/ .

There are various ways to provide more code beyond the size limits (layers or docker images), but this provides some context for why someone might care about the size of a package and its dependencies. (Those methods are fiddly enough and the cold start impact large enough that we've actually switched away from using this client library for now.)
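For anyone wanting to check a build directory against that limit before deploying, here is a rough sketch. The function names are my own, and it sums apparent file sizes, which can differ slightly from what du reports or from how Lambda itself counts. The limit is the one quoted in the error message above, i.e. 250 * 1024 * 1024 bytes.

```python
import os
import tempfile

# Limit quoted in the deployment error above (250 * 1024 * 1024 bytes).
LAMBDA_UNZIPPED_LIMIT_BYTES = 262_144_000


def directory_size_bytes(root: str) -> int:
    """Sum the apparent size of every regular file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total


def fits_in_lambda(root: str, limit: int = LAMBDA_UNZIPPED_LIMIT_BYTES) -> bool:
    """Return True if the directory is within the unzipped size limit."""
    return directory_size_bytes(root) <= limit


# Small demo on a temporary directory holding a single 1000-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "module.py"), "wb") as f:
        f.write(b"x" * 1000)
    demo_size = directory_size_bytes(tmp)
    demo_fits = fits_in_lambda(tmp)

print(demo_size, demo_fits)  # 1000 True
```

In practice this would be pointed at a `pip install --target` directory like the ones in the next section. At 135MB, this one package and its dependencies take up more than half of the 250MB budget before any application code is added.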

Package size details

Here are some commands I used to investigate the size impact, leveraging pip install --target to install a set of packages to a specific directory:

uname -a # Linux 322c9a327f85 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 GNU/Linux
python --version # Python 3.9.10

pip install --target=everything here-location-services
pip install --target=deps-pandas requests geojson flexpolyline pyhocon requests_oauthlib pandas
pip install --target=deps-no-pandas requests geojson flexpolyline pyhocon requests_oauthlib

du -sh everything # 135M
du -sh deps-pandas # 134M
du -sh deps-no-pandas # 5.1M
du -sh everything/here_location_services # 484K

That is, without pandas, the total installed package size would be 5.1M (deps-no-pandas) + 484K (everything/here_location_services) = ~5.6MB, down from 135MB (everything).

Summary of individual packages (reported by du -sh everything/*, ignoring the $package.dist-info directories that are mostly less than 50k anyway):

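| package | size | only required for pandas? |
| --- | --- | --- |
| pandas | 58M | yes |
| numpy.libs | 35M | yes |
| numpy | 33M | yes |
| pytz | 2.8M | yes |
| oauthlib | 1.4M | |
| urllib3 | 872K | |
| dateutil | 748K | yes |
| idna | 496K | |
| here_location_services | 484K | |
| 8 others | 1.5M | |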
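As a rough cross-check of the figures above, the arithmetic can be redone in a few lines. This assumes du's M and K suffixes mean MiB and KiB, and the variable names are just for illustration. The sizes are the rounded values du reported, so the results are approximate.

```python
KIB_PER_MIB = 1024

# Packages marked "yes" in the table, in MiB (dateutil converted from 748K).
pandas_only_mib = {
    "pandas": 58,
    "numpy.libs": 35,
    "numpy": 33,
    "pytz": 2.8,
    "dateutil": 748 / KIB_PER_MIB,
}
pandas_total_mib = sum(pandas_only_mib.values())

# deps-no-pandas (5.1M) plus the here_location_services package itself (484K).
without_pandas_mib = 5.1 + 484 / KIB_PER_MIB

# The "everything" directory (135M) relative to the pandas-free install.
ratio = 135 / without_pandas_mib

print(f"{pandas_total_mib:.1f}")    # 129.5
print(f"{without_pandas_mib:.1f}")  # 5.6
print(f"{ratio:.0f}x")              # 24x
```

These line up with the ~130MB, ~5.6MB and ~24x figures quoted in this issue.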
@gravinci

Upvoting this as well. We just ran into this with our AWS Lambda too. Are there any alternatives to make this work with Lambda without too much of a workaround?
