erroneous results when using categorical variables with linear regression algorithm #12

Closed
mb52089 opened this issue Dec 3, 2019 · 19 comments


mb52089 commented Dec 3, 2019

We have a categorical variable for day_of_week as one of 4 independent variables in our model. The LightGBM algorithm works correctly but when I force the model to use the linear regression algorithm, the resultant prediction is incorrect. If I subsequently remove the categorical variable, the linear regression algorithm gives an accurate prediction. Here's an example of what our data set looks like:

{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.714285714285714, :block_minutes=>420.0, :week_day=>"Fri"},
{:day_of_service_util=>0.69047619047619, :day_in_advance_util=>0.214285714285714, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"}

day_of_service_util is the Target dependent variable.

Thanks for this great gem!

Author

mb52089 commented Dec 3, 2019

correction: 3 independent variables and 1 dependent variable, not 4 independent variables.

Owner

ankane commented Dec 4, 2019

Hey @mb52089, thanks for the report. Can you give more details about what you mean by an "incorrect" prediction? Is it different from linear regression in another language, or is the error high?

Author

mb52089 commented Dec 4, 2019

Thanks Andrew. We're predicting the % utilization of a resource on the day of service based on the % utilization x days in advance, the duration of the resource in minutes, and the day of the week. The predicted value should be between 0 and 1. In the particular test example we're using, the predicted value should be around 66%. We get that value when we use the LightGBM algorithm, but when we use linear regression we get -1.4, which doesn't make sense given the context and the training data. However, if I remove the "day of week" categorical variable and re-run the prediction using the linear regression algorithm, I get a prediction in range. I wasn't sure if the gem deals with categorical variables differently in linear regression than in the LightGBM algorithm. The data set has around 150 rows.

Author

mb52089 commented Dec 4, 2019

And this is all done in Ruby/Rails.

Owner

ankane commented Dec 4, 2019

If it's not too sensitive, can you paste the model summary and PMML here, or send them to me over email (the address is on my GitHub profile)?

puts model.summary
puts model.to_pmml

Author

mb52089 commented Dec 4, 2019

I just ran the model summary for the error condition:

Math::DomainError: Numerical argument is out of domain - "sqrt"
from /Users/michaelburke/.rvm/gems/ruby-2.6.5@copient_health_rails6/bundler/gems/eps-509da754d6e9/lib/eps/linear_regression.rb:186:in `sqrt'
[4] pry(main)>

Author

mb52089 commented Dec 4, 2019

The model summary after I remove the categorical variable week_day:
=> "Validation RMSE: 0.14\n\n coef p\n_intercept 0.42 0.094\nday_in_advance_util 0.54 0.000\nblock_minutes -0.00 0.932\n\nadjusted r2: 0.330\n"

Author

mb52089 commented Dec 4, 2019

Just sent it to your Chartkick email. I didn't know you were the author of Chartkick. It's great too!

Owner

ankane commented Dec 4, 2019

To close the loop: the issue was likely related to multicollinearity, which can produce an unstable solution (the link provides a good explanation). One way to counteract this is to use GSL, which uses a different algorithm to produce a more stable solution.
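
For readers following along, here is a minimal sketch of one common way a categorical variable introduces multicollinearity: if every level of week_day gets its own 0/1 column and the model also has an intercept, the dummy columns sum to the intercept column, so the design matrix loses rank. The rows, the `design_matrix` helper, and the integer-percent encoding are all made up for illustration, and the thread does not say whether Eps encodes categories this way internally.

```ruby
require "matrix"

# Hypothetical rows shaped like the data in this thread. Values are made up,
# and utilization is given in whole percent so the rank computation is exact.
ROWS = [
  { day_in_advance_pct: 71, block_minutes: 420, week_day: "Fri" },
  { day_in_advance_pct: 21, block_minutes: 420, week_day: "Mon" },
  { day_in_advance_pct: 24, block_minutes: 480, week_day: "Mon" },
  { day_in_advance_pct: 50, block_minutes: 300, week_day: "Tue" },
  { day_in_advance_pct: 35, block_minutes: 480, week_day: "Tue" },
  { day_in_advance_pct: 60, block_minutes: 300, week_day: "Fri" }
].freeze

LEVELS = ROWS.map { |row| row[:week_day] }.uniq.freeze # ["Fri", "Mon", "Tue"]

# Hypothetical helper: intercept, numeric features, then one 0/1 column
# for each week_day level listed in `levels`.
def design_matrix(rows, levels)
  data = rows.map do |row|
    dummies = levels.map { |level| row[:week_day] == level ? 1 : 0 }
    [1, row[:day_in_advance_pct], row[:block_minutes], *dummies]
  end
  Matrix.rows(data)
end

full    = design_matrix(ROWS, LEVELS)         # one dummy column per level
reduced = design_matrix(ROWS, LEVELS.drop(1)) # first level used as the baseline

puts "full:    #{full.column_count} columns, rank #{full.rank}"
puts "reduced: #{reduced.column_count} columns, rank #{reduced.rank}"
```

With these rows the full matrix has 6 columns but rank 5, so the normal equations have no unique solution; dropping one level gives 5 columns with rank 5. Note that a column that never varies, such as block_minutes being 420.0 in every sample row above, is collinear with the intercept in the same way.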

@ankane ankane closed this as completed Dec 4, 2019
Owner

ankane commented Dec 5, 2019

Going to reopen this until the model.summary error is fixed. @mb52089, can you paste the output of:

model.send(:diagonal)

for a model where you're seeing Math::DomainError: Numerical argument is out of domain - "sqrt"?

@ankane ankane reopened this Dec 5, 2019
Author

mb52089 commented Dec 5, 2019 via email

Owner

ankane commented Dec 5, 2019

My bad, it should be:

model.instance_variable_get("@estimator").send(:diagonal)

Author

mb52089 commented Dec 5, 2019 via email

Owner

ankane commented Dec 5, 2019

Thanks. This is from the model that errors on the summary? I'm unable to reproduce with those numbers.

Author

mb52089 commented Dec 5, 2019

Now that I have installed GSL, I can't seem to reproduce the error when I do the linear regression. Do you want me to uninstall GSL and see if I can reproduce?

Owner

ankane commented Dec 5, 2019

Yeah, GSL changes the code path, so you'll want to recreate the initial conditions.

Author

mb52089 commented Dec 6, 2019

Here you go. After removing the gsl gem and re-bundling, I ran @model.instance_variable_get("@estimator").send(:diagonal) from a model that generated the following error when running @model.summary: Math::DomainError: Numerical argument is out of domain - "sqrt". Here's the output:

[-666372359695044.8, 1.0761875986711336, -3777621086.706599, 0.19673979554666882, -339985897803588.5, 2.390797741078714]

@ankane ankane closed this as completed in 2bfe901 Dec 6, 2019
Owner

ankane commented Dec 6, 2019

Thanks @mb52089, fixed the error message for unstable solutions. Pushing out a new release in a few with all the fixes we discussed. Thanks for the help!
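
As a rough illustration of why the diagonal values reported above end in Math::DomainError, and what a clearer failure could look like: the stack trace points at a `sqrt` call, so this sketch assumes the diagonal entries are multiplied by a mean squared error and square-rooted to get standard errors. That formula, the `standard_errors` helper, the `UnstableSolutionError` class, and the 0.02 value are all assumptions for illustration; this is not the code from commit 2bfe901.

```ruby
# Diagonal values reported earlier in this thread.
DIAGONAL = [
  -666372359695044.8, 1.0761875986711336, -3777621086.706599,
  0.19673979554666882, -339985897803588.5, 2.390797741078714
].freeze

# Hypothetical error class, not part of Eps.
class UnstableSolutionError < StandardError; end

# Hypothetical helper: standard error of each coefficient computed as
# sqrt(mse * diagonal entry). Negative entries are rejected up front,
# because Math.sqrt raises Math::DomainError on a negative float.
def standard_errors(diagonal, mse)
  negative = diagonal.each_index.select { |i| diagonal[i] < 0 }
  unless negative.empty?
    raise UnstableSolutionError,
          "negative variance terms at positions #{negative.inspect}; " \
          "the solution is likely unstable (check for multicollinearity)"
  end
  diagonal.map { |value| Math.sqrt(mse * value) }
end

begin
  standard_errors(DIAGONAL, 0.02) # 0.02 is a made-up mean squared error
rescue UnstableSolutionError => e
  puts e.message
end
```

Variance terms cannot be negative in an exact solution, so the three huge negative entries are themselves a sign that the fit broke down numerically, which fits the multicollinearity explanation earlier in the thread.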

Author

mb52089 commented Dec 6, 2019

No problem at all. Thanks for all the great gems!
