Unconditional dependency on pandas/numpy increases package size by ~24x (<6MB -> 135MB) #29

huonw · 2022-06-06T02:28:43Z

Thank you for publishing a client library!

Issue

The here-location-services package currently unconditionally depends on pandas, which depends on numpy, pytz and python-dateutil. On x86-64 Linux (for Python 3.9), these end up being very large (~130MB), with all of the rest of the dependencies being ~5MB. However, pandas is only used for converting the result for two functions associated with the matrix routing API:

here-location-services-python/here_location_services/responses.py

Lines 151 to 182 in 325b4c0

    
    class MatrixRoutingResponse(ApiResponse):
        """A class representing Matrix routing response data."""

        def __init__(self, **kwargs):
            super().__init__()
            self._filters = {"matrix": None}
            for param, default in self._filters.items():
                setattr(self, param, kwargs.get(param, default))

        def to_geojson(self):
            """Return API response as GeoJSON."""
            raise NotImplementedError("This method is not valid for MatrixRoutingResponse.")

        def to_distnaces_matrix(self):
            """Return distnaces matrix in a dataframe."""
            if self.matrix and self.matrix.get("distances"):
                distances = self.matrix.get("distances")
                dest_count = self.matrix.get("numDestinations")
                nested_distances = [
                    distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
                ]
                return DataFrame(nested_distances, columns=range(dest_count))

        def to_travel_times_matrix(self):
            """Return travel times matrix in a dataframe."""
            if self.matrix and self.matrix.get("travelTimes"):
                distances = self.matrix.get("travelTimes")
                dest_count = self.matrix.get("numDestinations")
                nested_distances = [
                    distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
                ]
                return DataFrame(nested_distances, columns=range(dest_count))

It seems unfortunate to require these huge dependencies to be installed for only these two functions, when many people are likely not calling them anyway and when the dependencies don't seem to be required for any other functionality within this client library.

Potential alternatives

1. Make pandas an optional dependency (for example, via extras_require={"pandas": ["pandas"]} in setup.py), and import it on demand in the individual functions that need it. For example:
```
def to_distnaces_matrix(self):
    """Return distnaces matrix in a dataframe."""
    try:
        from pandas import DataFrame
    except ImportError as e:
        raise ImportError("pandas is not installed, run `pip install here-location-services[pandas]`") from e

    # ... existing implementation as before ...
```
For an example of prior art, this option is what the popular Pydantic library does:
- 'extra' dependency on python-dotenv: https://github.com/samuelcolvin/pydantic/blob/8846ec4685e749b93907081450f592060eeb99b1/setup.py#L134-L137
- importing from dotenv within a function (not at the top level) and catching the ImportError to provide additional help to the user: https://github.com/samuelcolvin/pydantic/blob/8846ec4685e749b93907081450f592060eeb99b1/pydantic/env_settings.py#L297-L300

2. Remove the pandas dependency entirely, and have the functions return the nested lists (nested_distances) without converting to a DataFrame. A user who wants to use pandas can still convert to a DataFrame themselves: DataFrame(result.to_distnaces_matrix()) (the columns= argument seems to be unnecessary, as doing that call gives the same result AFAICT).

Both of these are probably best considered as breaking changes.
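
To make the second alternative concrete, here is a minimal sketch of what returning nested lists could look like. MatrixResult and nest_matrix are hypothetical names made up for illustration (they are not part of here-location-services); only the slicing logic and the "distances" / "numDestinations" keys come from the snippet quoted above.

```python
from typing import List, Optional


def nest_matrix(flat: List[float], dest_count: int) -> List[List[float]]:
    """Split a flat row-major list into rows of dest_count entries."""
    return [flat[i : i + dest_count] for i in range(0, len(flat), dest_count)]


class MatrixResult:
    """Hypothetical stand-in for MatrixRoutingResponse, for illustration only."""

    def __init__(self, matrix: Optional[dict] = None):
        self.matrix = matrix

    def to_distnaces_matrix(self) -> Optional[List[List[float]]]:
        """Return the distances matrix as nested lists (no pandas required)."""
        if self.matrix and self.matrix.get("distances"):
            return nest_matrix(self.matrix["distances"], self.matrix["numDestinations"])
        return None


# Two origins, three destinations, flattened row by row.
result = MatrixResult({"numDestinations": 3, "distances": [0, 10, 20, 10, 0, 15]})
rows = result.to_distnaces_matrix()
print(rows)
```

A caller who does have pandas installed could then wrap the result themselves with DataFrame(rows), so nothing is lost for those users.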

Context

We were attempting to use this package in an AWS Lambda function. Lambda has strict limits on the size of the code asset, and exceeding them results in errors like 'Unzipped size must be smaller than 262144000 bytes' when deploying (relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#function-configuration-deployment-and-execution "Deployment package (.zip file archive)"). Additionally, larger packages result in slower cold starts: https://mikhail.io/serverless/coldstarts/aws/ .
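
As a rough illustration of checking a directory against that limit before deploying, here is a small sketch. The function names are hypothetical, the only number taken from this issue is the 262144000 byte figure in the error message, and the check sums apparent file sizes, which can differ slightly from what du reports (du counts disk blocks).

```python
import os
import tempfile

# Limit quoted in the deployment error message above.
UNZIPPED_LIMIT_BYTES = 262_144_000


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_unzipped_limit(root: str, limit: int = UNZIPPED_LIMIT_BYTES) -> bool:
    """Return True if the directory's total file size is within the limit."""
    return directory_size_bytes(root) <= limit


# Small self-contained demo: a temporary directory holding one 1000-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "payload.bin"), "wb") as f:
        f.write(b"\0" * 1000)
    demo_size = directory_size_bytes(tmp)
    demo_fits = fits_unzipped_limit(tmp)

print(demo_size, demo_fits)
```

In practice the root would be the directory passed to pip install --target, such as the everything directory used in the measurements further down.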

There are various ways to provide more code beyond the size limits (layers or docker images), but this provides some context for why someone might care about the size of a package and its dependencies. (Those methods are fiddly enough, and the cold start impact large enough, that we've actually switched away from using this client library for now.)

Package size details

Here are some commands I used to investigate the size impact, leveraging pip install --target to install a set of packages to a specific directory:

uname -a # Linux 322c9a327f85 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 GNU/Linux
python --version # Python 3.9.10

pip install --target=everything here-location-services
pip install --target=deps-pandas requests geojson flexpolyline pyhocon requests_oauthlib pandas
pip install --target=deps-no-pandas requests geojson flexpolyline pyhocon requests_oauthlib

du -sh everything # 135M
du -sh deps-pandas # 134M
du -sh deps-no-pandas # 5.1M
du -sh everything/here_location_services # 484K

That is, without pandas, the total installed package size would be 5.1M (deps-no-pandas) + 484K (everything/here_location_services) = ~5.6MB, down from 135MB (everything).
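
The arithmetic behind the ~5.6MB and ~24x figures can be reproduced with a few lines. This is only a sanity check on the du numbers above, assuming du -sh reports sizes in powers of 1024 (so 484K is 484/1024 of 1M); the variable names are made up for this sketch.

```python
# Sizes as reported by `du -sh` above.
deps_no_pandas_mib = 5.1   # deps-no-pandas
client_kib = 484           # everything/here_location_services
everything_mib = 135       # everything

# Assumption: du -h uses 1024-based units, so convert KiB to MiB.
without_pandas_mib = deps_no_pandas_mib + client_kib / 1024
ratio = everything_mib / without_pandas_mib

print(f"{without_pandas_mib:.1f}M without pandas, about {ratio:.0f}x smaller")
```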

Summary of individual packages (reported by du -sh everything/*, ignoring the $package.dist-info directories that are mostly less than 50k anyway):

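| package | size | only required for pandas? |
| --- | --- | --- |
| pandas | 58M | yes |
| numpy.libs | 35M | yes |
| numpy | 33M | yes |
| pytz | 2.8M | yes |
| oauthlib | 1.4M | |
| urllib3 | 872K | |
| dateutil | 748K | yes |
| idna | 496K | |
| here_location_services | 484K | |
| 8 others | 1.5M | |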


gravinci · 2023-06-30T02:19:48Z

Upvoting this as well. We just ran into this with our AWS Lambda too. Are there any alternatives to make this work with Lambda without too much of a workaround?
