
[MRG] EHN add a model to fit and predict heart-rate from power #6

Open · wants to merge 14 commits into master

Conversation

@glemaitre (Contributor) commented May 27, 2018

Partially addressing #5

Implement a scikit-learn like regressor to fit and predict heart-rate data from power.

@glemaitre glemaitre mentioned this pull request May 27, 2018
codecov bot commented May 27, 2018

Codecov Report

Merging #6 into master will increase coverage by 0.5%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master       #6     +/-   ##
=========================================
+ Coverage   97.04%   97.54%   +0.5%     
=========================================
  Files          23       25      +2     
  Lines         440      530     +90     
  Branches       48       52      +4     
=========================================
+ Hits          427      517     +90     
  Misses          6        6             
  Partials        7        7
Impacted Files Coverage Δ
sksports/model/heart_rate.py 100% <100%> (ø)
sksports/model/__init__.py 100% <100%> (ø) ⬆️
sksports/model/tests/test_heart_rate.py 100% <100%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a80a679...af0c989.

@glemaitre (Contributor Author)

@AartGoossens I modified the function that was doing the estimation and prediction of the heart rate so that it follows the scikit-learn API. I have a couple of suggestions:

  • Assuming that a user passes a dataframe or series, I think it would be useful to return the same type at predict. It can easily be done by reusing the index of X and creating the dataframe or series on the fly.
  • We should make a sanity check and decide on a policy regarding sampling-time consistency. At fit we should check that X and y are sampled in the same manner. At predict time, we need to check that we have the same sampling rate as at fit. This is only possible when given a datetime index with a series or dataframe, though. Then, we need to decide whether we make a conversion or raise an error.

I also want to improve the current tests. Strictly checking floating-point values is a bit flaky. I still don't have a clear idea, though.

Let me know if you have any comments.

@glemaitre (Contributor Author)

I forgot to ping @sladkovm as well, sorry.

@AartGoossens

Nice work! And good to see that you're picking up this model first, because I think there are many interesting use cases for it (thinking e.g. of OpenData).

  • @sladkovm and I had discussions about input and return types before (ref1, ref2). I'm not in favor of explicitly handling multiple input types (but Numpy and probably Scipy accept both lists/ndarrays in some cases) and am even less in favor of converting the return type to the original input type. I think the interface of the library (and therefore the function signature) should be as concise as possible, as this results in the most predictable behavior for the user of the library. Maybe you can explain how Scipy is handling this? There must be some good use cases for this.
  • Good point. I have thought about this before but haven't come to a satisfactory solution. I think that for now I would be happy with just accepting 1Hz sampled data, because I'm not sure what other sampling frequencies and irregularly sampled data would mean for the validity of the model. Another option could be to allow specifying the sampling frequency with an optional input argument (with a default of 1); a possible shape for that is sketched after this list.
  • About testing: I really have no idea what would be the best way to do this and I agree that checking floats is flaky. pytest.approx might be a step in the right direction.
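
A minimal sketch of that last option, assuming a hypothetical `freq` constructor parameter (the name and the 1 Hz-only check are illustrative, not part of this PR):

```python
# Sketch only: expose the sampling frequency explicitly instead of
# guessing it from the data. The ``freq`` parameter is hypothetical.
class HeartRateRegressor:

    def __init__(self, freq=1):
        # Sampling frequency of the power/heart-rate data, in Hz.
        self.freq = freq

    def fit(self, X, y):
        # For now, only 1 Hz data is considered valid for the model.
        if self.freq != 1:
            raise ValueError(
                f"only 1 Hz sampled data is supported, got {self.freq} Hz")
        ...
        return self
```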

@glemaitre (Contributor Author)

but Numpy and probably Scipy accept both lists/ndarrays in some cases

check_array will convert list/ndarray/dataframe/series into ndarray such that we can work directly with numpy and scipy.
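
For reference, a small illustration of what scikit-learn's `check_array` does with the different containers:

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Lists, ndarrays, and pandas containers all come out as a 2D ndarray.
from_list = check_array([[100], [240], [160]])
from_frame = check_array(pd.DataFrame({'power': [100, 240, 160]}))

assert isinstance(from_frame, np.ndarray)
assert from_list.shape == from_frame.shape == (3, 1)
```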

Regarding the output type, I really think it would be a plus to return the same type as the input, since we just need to reuse the index. It comes almost for free and saves the user from doing that conversion outside.
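
A minimal sketch of the idea, assuming `predict` has already computed an ndarray `y_pred` (the helper name `_restore_type` is hypothetical):

```python
import numpy as np
import pandas as pd

def _restore_type(y_pred, X):
    # Reuse the index of the input to rebuild a pandas object on the
    # fly; plain ndarrays pass through unchanged.
    if isinstance(X, (pd.Series, pd.DataFrame)):
        return pd.Series(np.asarray(y_pred), index=X.index,
                         name='heart-rate')
    return y_pred
```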

I think that for now I would be happy with just accepting 1Hz sampled data because I'm not sure what other sampling frequencies and irregular sampled data would mean for the validity of the model.

I see. We could resample the data using pandas when we get a dataframe. In the numpy case we just assume that it is 1Hz. It should be explained in the docstring, though.
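
The pandas side of that is a one-liner; a small illustration with 2 Hz toy data:

```python
import pandas as pd

index = pd.date_range('2018-05-27 14:00:00', periods=4, freq='500ms')
power = pd.DataFrame({'power': [100, 110, 240, 250]}, index=index)

# Down-sample the 2 Hz toy data to the 1 Hz rate the model assumes.
power_1hz = power.resample('1s').mean()
# 14:00:00 -> 105.0, 14:00:01 -> 245.0
```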

return np.sum(_model_residual(params, heart_rate, power) ** 2)


class HeartRateRegressor(BaseEstimator, RegressorMixin):

@AartGoossens:

Can you explain the benefit of BaseEstimator and RegressorMixin? I don't see them used anywhere in the fit and predict methods. Is there other functionality of these classes that can be useful?
Either way, I like implementing this model as a regressor class; it makes the API much clearer.

@glemaitre (Contributor Author):

Basically, it allows the estimator to be used with scikit-learn pipelines, grid-search, and cross-validation.
The mixin just defines a score function, which is the R^2 score, while BaseEstimator provides get_params and set_params to deal with the parameters.
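
For illustration, assuming `X` (power) and `y` (heart rate) are available, this is roughly what those base classes enable:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score

reg = HeartRateRegressor()

# From BaseEstimator: hyper-parameter introspection, used for cloning.
reg.get_params()          # {'hr_rest': 75, 'solver': 'Nelder-Mead', ...}
reg.set_params(hr_rest=60)

# From RegressorMixin: a default R^2 ``score``, which is what the
# scikit-learn model-selection tools rely on.
scores = cross_val_score(reg, X, y, cv=3)
search = GridSearchCV(reg, param_grid={'solver': ['Nelder-Mead', 'leastsq']})
search.fit(X, y)
```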

@glemaitre (Contributor Author):

Either way, I like implementing this model as a regressor class; it makes the API much clearer.

We can also make the model public to the user. We just need a good name.

params_init = [self.hr_rest, self.hr_max, self.hr_slope, self.hr_drift,
               self.rate_growth, self.rate_decay]

if self.solver == 'leastsq':

@AartGoossens:

Can you explain this if-else? To me it seems the flow for self.solver == 'leastsq' is redundant, i.e. can't minimize also be used for least-squares?

@glemaitre (Contributor Author):

'leastsq' is actually using the Levenberg-Marquardt algorithm. I was just playing with the different ways of minimizing the problem. I am not really sure what the best solution is there. We can of course fix the solver as well.

NB: Levenberg-Marquardt will use the gradient, while an algorithm such as the simplex (Nelder-Mead) will not. So depending on the optimization problem and its convexity, one or the other would be more appropriate.
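
For context, a toy illustration of the two SciPy routes (the residual function here is a stand-in, not the actual heart-rate model):

```python
import numpy as np
from scipy.optimize import least_squares, minimize

def residuals(params, x, y):
    # Toy linear residual standing in for _model_residual.
    a, b = params
    return y - (a * x + b)

x = np.arange(10.0)
y = 2.0 * x + 1.0

# Levenberg-Marquardt: works on the residual vector, uses gradients.
lm = least_squares(residuals, x0=[1.0, 0.0], args=(x, y), method='lm')

# Nelder-Mead: gradient-free, minimizes the scalar sum of squares.
nm = minimize(lambda p: np.sum(residuals(p, x, y) ** 2),
              x0=[1.0, 0.0], method='Nelder-Mead')
```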

self.rate_growth = rate_growth
self.rate_decay = rate_decay

def fit(self, X, y):

@AartGoossens commented May 28, 2018:

Side note: It still itches when I see Python variables starting with a capital... 😅 (but I recognize that it's accepted for these cases)

@glemaitre (Contributor Author):

Yep, this is kind of imposed by scikit-learn estimators. power and heart_rate would be much better.

def __init__(self, solver='Nelder-Mead', hr_rest=75, hr_max=200,
             hr_slope=0.30, hr_drift=3e-5, rate_growth=24, rate_decay=30):
self.solver = solver
self.hr_rest = hr_rest

@AartGoossens:

Minor: I'd like to propose renaming hr_rest to hr_start or hr_0 because this model parameter is not related to the actual resting heart rate of the athlete and therefore might cause confusion.

@glemaitre (Contributor Author):

Good point, I was confused :) hr_start seems like a good idea.

@glemaitre
Copy link
Contributor Author

On a simple example (which will be in the tests) it seems that Nelder-Mead is the only solver leading to the right parameters. I think that we could simplify and keep the solver that you proposed at the beginning.


"""
if y is None:
is_df = True if hasattr(X, 'loc') else False

@AartGoossens:

Although this is valid in most cases, why not just go for isinstance(X, pd.DataFrame)?

@glemaitre (Contributor Author):

It was to duck-type. It usually makes it possible to not import the library (which we do anyway) and to accept data structures that inherit from it.
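
A small illustration of the duck-typing check (sketch only):

```python
import pandas as pd

def is_pandas_like(X):
    # Anything exposing ``.loc`` is treated as pandas-like, without
    # importing pandas or requiring inheritance from pd.DataFrame.
    return hasattr(X, 'loc')

assert is_pandas_like(pd.DataFrame({'power': [100, 240]}))
assert is_pandas_like(pd.Series([100, 240]))  # a Series has .loc too
assert not is_pandas_like([100, 240])
```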

"""
if y is None:
is_df = True if hasattr(X, 'loc') else False
X_ = X.resample('1s').mean() if is_df else X

@AartGoossens:

I think this raises a TypeError for a df with a non datetime-like index. Should we catch that here or add a check for a datetime-like index?

@glemaitre (Contributor Author):

Yes, checking for a dataframe is not enough; we should also check the index.
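
A possible guard, as an illustrative sketch (the helper name is hypothetical):

```python
import pandas as pd

def _check_index(X):
    # Only datetime-like indexes support ``.resample``; fail early
    # with an explicit message for anything else.
    if not isinstance(X.index, (pd.DatetimeIndex, pd.TimedeltaIndex)):
        raise TypeError(
            "expected a datetime-like index to resample the data, "
            f"got {type(X.index).__name__} instead")
```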

is_df = True if hasattr(X, 'loc') else False
X_ = X.resample('1s').mean() if is_df else X
X_ = check_array(X_)
return X_.ravel(), is_df

@AartGoossens:

You still need to return y here as the second return argument, right? Otherwise line 193 will raise a ValueError (unpacking error).

@glemaitre (Contributor Author):

Since it is confusing, I should always return X_.ravel(), y, is_df and ignore y when it is None.
Right now, I am returning different things depending on whether it is fit or predict.

@AartGoossens:

Ah, now I get it. That could work indeed, but I think it's less confusing if a method always returns the same number of arguments.
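
A sketch of the consistent three-value version discussed above (based on the snippet under review):

```python
from sklearn.utils import check_array

def _check_inputs(self, X, y=None):
    # Always return three values; predict simply ignores the second.
    is_df = hasattr(X, 'loc')
    X_ = X.resample('1s').mean() if is_df else X
    X_ = check_array(X_)
    return X_.ravel(), y, is_df
```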

"""
check_is_fitted(self, ['hr_start_', 'hr_max_', 'hr_slope_',
                       'hr_drift_', 'rate_growth_', 'rate_decay_'])
power, is_df = self._check_inputs(X)

@AartGoossens:

I think self._check_inputs returns 3 items

@glemaitre (Contributor Author):

only 2 at predict

from sksports.model import exp_heart_rate_model
from sksports.model import HeartRateRegressor

POWER = np.array([100] * 10 + [240] * 100 + [160] * 100)

@AartGoossens:

You can consider moving these constants to a pytest fixture. It's a matter of preference, but I think these fixtures are really nice to work with and also easily reusable in other tests. But I'm fine with doing improvements like this later.

@glemaitre (Contributor Author):

Good point. I was playing with the tests for the moment :)

reg = HeartRateRegressor()
reg.fit(power, heart_rate)

assert reg.hr_start_ == pytest.approx(100, 1.0)

@AartGoossens:

More of a question than a comment: what's best practice for models like this for deciding how and which model parameters to test? Do you have any examples from scikit-learn?

@glemaitre (Contributor Author):

Usually we run a classifier or a regressor on a known dataset and check the score. It is non-trivial to actually test parameters. Here I think that we can check those 2 parameters with a large enough tolerance to be sure that the optimization (or any future change) does not lead to complete nonsense.

Basically, I think that we could make something like:

assert reg.score(X, y) > 0.9

to be sure that we are fitting the data properly.
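
Putting both ideas together, a test could look like this sketch (the `power`/`heart_rate` fixtures and the tolerances are illustrative):

```python
import pytest

def test_heart_rate_regressor(power, heart_rate):
    reg = HeartRateRegressor()
    reg.fit(power, heart_rate)

    # Loose checks on a couple of fitted parameters...
    assert reg.hr_start_ == pytest.approx(100, abs=5)
    # ...plus a global goodness-of-fit check through the R^2 score.
    assert reg.score(power, heart_rate) > 0.9
```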

@AartGoossens

On a simple example (which will be in the tests) it seems that Nelder-Mead is the only solver leading to the right parameters. I think that we could simplify and keep the solver that you proposed at the beginning.

This is really interesting. What kind of results did you get? I expected other solvers to work at least reasonably well. I'm already looking forward to using OpenData to play around with different solvers... (it's time for me to put some more work into the OpenData tooling)

@glemaitre (Contributor Author)

What kind of results did you get?

[table of fitting results per solver]

return power


@pytest.fixture(params=[

@AartGoossens:

This looks interesting! If I'm correct you're not parametrizing the test but the fixture?...

@glemaitre (Contributor Author):

Yes. I would think that fixtures are good for data (parametrizing them allows building the different supported types). Parametrizing the test is then useful for passing some parameters to the function or something specific to the test. This is just my rule of thumb and I'm not sure it is the best ;)
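
For illustration, a parametrized fixture in this spirit (values borrowed from the POWER constant above; the exact shape is a sketch):

```python
import numpy as np
import pandas as pd
import pytest

@pytest.fixture(params=['ndarray', 'series', 'dataframe'])
def power(request):
    # One fixture, three supported container types.
    data = np.array([100] * 10 + [240] * 100 + [160] * 100)
    if request.param == 'series':
        return pd.Series(data, name='power')
    if request.param == 'dataframe':
        return pd.DataFrame({'power': data})
    return data

def test_accepts_all_types(power):
    # Collected three times, once per fixture parameter.
    assert len(power) == 210
```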

@pep8speaks

pep8speaks commented Jun 1, 2018

Hello @glemaitre! Thanks for updating the PR.

Line 19:1: E402 module level import not at top of file
Line 26:1: E402 module level import not at top of file
Line 27:1: E402 module level import not at top of file
Line 38:1: E402 module level import not at top of file
Line 53:1: E402 module level import not at top of file

Comment last updated on June 03, 2018 at 09:43 Hours UTC

@glemaitre glemaitre changed the title [WIP] EHN add a model to fit and predict heart-rate from power [MRG] EHN add a model to fit and predict heart-rate from power Jun 2, 2018
@glemaitre glemaitre force-pushed the master branch 6 times, most recently from ab4c1e9 to ab457a0 Compare June 2, 2018 20:31
15, 40)
data['heart-rate'] = heart_rate
data.plot(legend=True)
plt.title('Simulated power of a cyclist and inferred heart-rate.')

@glemaitre (Contributor Author):

remove final dot

power.plot(legend=True)
heart_rate.plot(legend=True)
heart_rate_initial.plot(legend=True)
heart_rate_pred.plot(legend=True)

@glemaitre (Contributor Author):

add title

@glemaitre (Contributor Author)

@AartGoossens I was checking the results of the optimization using Nelder-Mead and it can lead to fitted parameters without any meaning even if the fit is good (e.g. hr_max could be 300). I was thinking of making a minimization with bounds, but L-BFGS is really not stable.

I found that the best option for the moment is to use Nelder-Mead first and then refine with L-BFGS using bounds.

One of the drawbacks is that we make the fitting slower (i.e. we are minimizing twice). I am also not 100% sure whether combining both methods will always be stable, or whether L-BFGS can give erroneous results as it does when used alone.
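
As an illustration of the two-stage scheme (toy objective and bounds, not the actual model):

```python
import numpy as np
from scipy.optimize import minimize

def objective(params):
    # Stand-in for the sum of squared heart-rate residuals.
    return np.sum((params - np.array([90.0, 180.0])) ** 2)

x0 = np.array([75.0, 200.0])

# Stage 1: gradient-free Nelder-Mead to get close to the optimum.
stage1 = minimize(objective, x0, method='Nelder-Mead')

# Stage 2: refine with L-BFGS-B, whose bounds keep the parameters
# physiologically plausible (e.g. hr_max cannot drift to 300 bpm).
bounds = [(40.0, 120.0), (120.0, 220.0)]
stage2 = minimize(objective, stage1.x, method='L-BFGS-B', bounds=bounds)
```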

@glemaitre (Contributor Author)

Also, HeartRateRegressor is maybe not the best name. If we add other methods doing the same job, then we will have a naming issue.

@AartGoossens

That's very interesting indeed. Is combining optimizations common practice? I haven't heard of it before.

If this is only about a better fit for hr_max I want to mention this: I do not expect the hr_max model parameter to be really interesting, for 2 reasons:

  1. I expect it to properly minimize only for activities where there's enough high-power data (i.e. where hr_lin > hr_max)
  2. Heart rate and power are only approximately linearly correlated for intensities where almost all work is done aerobically (I think this is also mentioned in the article of De Smet). As a result the current model will overestimate heart rate for higher intensities, and of course this will result in a worse model fit.

In conclusion: in order to properly fit/minimize the hr_max we need activities with high power (1) but we also know that for these activities the model fit is not that good (2)...

Either way, if the fit is improved by combining these optimizations it's worth it of course.
By the way, in my opinion the performance hit is not an issue.
