This repository has been archived by the owner on Jan 27, 2021. It is now read-only.

support cpu/gpu consumption metrics as well in addition to request count #14

Open
gurvindersingh opened this issue Dec 13, 2018 · 4 comments
Labels
feature request Requests or implements improvements that are, specifically, new features

Comments

@gurvindersingh

Thanks for releasing this useful tool :)

What would you like to be added?
Currently it seems that the only metric supported is request count. Is there any plan to also monitor CPU and GPU consumption, in addition to request count, when deciding whether a given pod is idle?

Why is this needed?
A user might simply submit a job that runs for hours before they check on its status again, as is common in ML model training. Monitoring resource consumption would avoid killing the pod while the analysis is still running.

@krancour
Contributor

We've always assumed that in the future, scaling decisions could be made on a variety of metrics. That being said, Osiris is presently little more than a proof of concept, and this feature isn't among those I imagine making the cut for a minimum viable product.

I would expect a road map to be forthcoming in January.

@gurvindersingh
Author

Thanks @krancour for the info. Do you plan to collect these extra metrics from Prometheus, from Heapster, or via some other custom method?

@krancour
Contributor

krancour commented Dec 14, 2018

We originally used Prometheus for collecting request counts, but eventually decided that a dependency on Prometheus, configured a certain way, possibly in addition to a Prometheus you might already be running for other purposes, was an unacceptably high barrier to entry for something that's supposed to help bring resource utilization down. (You have to ask yourself where the break-even point is: how many workloads do you have to scale to zero to justify the extra components you have to run to make that possible?) We eventually decided that Prometheus was, perhaps, overkill for the one metric we're currently collecting, and we cut it out completely.
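This is not Osiris's actual code, but a minimal sketch of what collecting that single metric in-process (instead of via Prometheus) might look like. The `RequestCounter` class and its method names are hypothetical, chosen only to illustrate the idea of a proxy counting requests itself:

```python
import threading


class RequestCounter:
    """Hypothetical in-process counter a request-proxying component
    could use instead of depending on an external Prometheus setup."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self):
        """Call once per proxied request."""
        with self._lock:
            self._count += 1

    def snapshot_and_reset(self):
        """Return the number of requests seen since the last check,
        then reset, so each scaling decision sees a fresh window."""
        with self._lock:
            count, self._count = self._count, 0
            return count


counter = RequestCounter()
for _ in range(3):
    counter.increment()
print(counter.snapshot_and_reset())  # 3
print(counter.snapshot_and_reset())  # 0 (window was reset)
```

The lock keeps the counter correct when requests are handled concurrently; the snapshot-and-reset pattern means the component making scaling decisions only ever reasons about the most recent interval.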

In the future, making scaling decisions based on something other than request count will likely require that we re-examine how we intend to collect those metrics. To be honest, when we start putting a road map together, I'm not even sure how high a priority that will be, as I can guess that things like support for HTTPS, HTTP/2, and other protocols (which you have also asked about) will probably emerge as more pressing concerns.

@krancour krancour added the feature request Requests or implements improvements that are, specifically, new features label Dec 14, 2018
@krancour
Contributor

Looking at the issue queue here and considering closing this...

I think the scope of this project is now well understood to be narrowly confined to scaling to/from zero in response specifically to HTTP/S requests, or the absence thereof. For workloads serving HTTP/S requests, arbitrary metrics like CPU or memory pressure are not as reliable an indicator of active vs. idle as simply monitoring the traffic to the pods, as we are already doing.
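The traffic-based idle check described above can be sketched as a last-request timestamp plus a timeout. This is not how Osiris actually implements it; `ActivityTracker`, the 300-second window, and the injectable clock are all hypothetical, used only to make the idea concrete and testable:

```python
import time

IDLE_TIMEOUT = 300.0  # hypothetical: seconds without traffic before a pod counts as idle


class ActivityTracker:
    """Tracks the time of the last request; a pod with no recent traffic is idle."""

    def __init__(self, timeout=IDLE_TIMEOUT, clock=time.monotonic):
        self._clock = clock  # injectable clock so the logic can be tested without waiting
        self._timeout = timeout
        self._last_request = clock()

    def record_request(self):
        """Call whenever a request reaches the pod."""
        self._last_request = self._clock()

    def is_idle(self):
        """True once the idle window has elapsed with no requests."""
        return self._clock() - self._last_request >= self._timeout


# Demonstration with a fake clock so no real waiting is needed.
now = [0.0]
tracker = ActivityTracker(timeout=300.0, clock=lambda: now[0])
tracker.record_request()
now[0] = 299.0
print(tracker.is_idle())  # False: traffic seen within the window
now[0] = 301.0
print(tracker.is_idle())  # True: no requests for over 300 s
```

The appeal of this approach over CPU/GPU metrics, per the comment above, is that for HTTP/S workloads "no traffic for N seconds" maps directly onto "safe to scale to zero", whereas resource pressure can be low even while a pod is mid-request.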
