Effect of Standardizing Variables #317

krn-hov · 2023-04-25T08:56:32Z

krn-hov
Apr 25, 2023

I've noticed that PySR works much more efficiently when the appropriate variables are standardized by mapping them to the range [0,1]. I am using a synthetic dataset where an exact solution exists whether or not the variables are standardized. I am configuring the model quite conservatively (6 cores, 50 iterations), since the regressor would work well either way if given more resources. With this budget, the model reliably finds the exact solution only when the variables are standardized.

I'd like an intuitive explanation for this. One thought is that the perturbations PySR uses to optimize constants are more effective over the smaller range, so it finds the global optimum more efficiently. I am curious to hear others' ideas on this!
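
For readers who want to reproduce the comparison, here is a minimal sketch of the [0,1] mapping described above, using only NumPy. The helper name `minmax_scale` and the synthetic data are assumptions made for illustration; they are not taken from the original experiment.

```python
import numpy as np


def minmax_scale(X):
    """Map each column of X to [0, 1]; also return the offsets and ranges used."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # leave constant columns unchanged instead of dividing by zero
    return (X - lo) / span, lo, span


rng = np.random.default_rng(0)
# Hypothetical synthetic data: two features on very different scales.
X = np.column_stack([rng.uniform(0, 1e4, 200), rng.uniform(-5, 5, 200)])

X_scaled, lo, span = minmax_scale(X)

# Undo the mapping to recover the original units.
X_restored = X_scaled * span + lo
```

`X_scaled` would then be passed to the regressor in place of `X`; keeping `lo` and `span` around makes it possible to translate a discovered equation back to the original units.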

MilesCranmer · 2023-04-25T11:41:21Z

MilesCranmer
Apr 25, 2023
Maintainer

Very interesting. It could be because a new random constant in an expression is sampled from a Gaussian with scale=1? https://github.com/MilesCranmer/SymbolicRegression.jl/blob/61ecbed3aa25c73aeaf9ecc2846023979980a741/src/MutationFunctions.jl#L153

        return Node(; val=randn(T))

Here T is typically Float32.

The perturbations themselves are multiplicative rather than additive (https://github.com/MilesCranmer/SymbolicRegression.jl/blob/61ecbed3aa25c73aeaf9ecc2846023979980a741/src/MutationFunctions.jl#L68-L72), so the scale-dependence must come from the initialization?
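
To get a rough feel for why initialization at unit scale could matter when perturbations are multiplicative, the sketch below counts how many multiplicative steps a constant of magnitude 1 would need to reach a given magnitude. The per-step bound `max_factor=2.0` is an assumed illustrative value, not the factor used in SymbolicRegression.jl, and the sketch ignores any other mechanism that adjusts constants.

```python
import math


def steps_to_reach(target, start=1.0, max_factor=2.0):
    """Fewest multiplicative steps, each changing the magnitude by at most
    max_factor, needed to move |start| to |target| (growing or shrinking)."""
    ratio = abs(target) / abs(start)
    return math.ceil(abs(math.log2(ratio)) / math.log2(max_factor))


# A constant drawn near magnitude 1 that must end up near 1e4 or 1e-4:
steps_large = steps_to_reach(1e4)
steps_small = steps_to_reach(1e-4)
# A constant whose target is already of order 1:
steps_unit = steps_to_reach(1.5)
```

Under these assumptions a target of order 1 is one step away, while a target of order 1e4 needs more than a dozen well-chosen steps, which is consistent with data at unit scale being easier to fit on a small iteration budget.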

Side-question: I wonder if there is a way to make PySR scale-invariant, perhaps by having the scale of random constants depend on the standard deviation in a given variable... what do you think?
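
One way to picture that idea, written in Python rather than Julia purely for illustration: draw new constants from a Gaussian whose width is tied to the spread of the data. The function `random_constant` and its `scale` argument are hypothetical and do not exist in PySR or SymbolicRegression.jl.

```python
import numpy as np


def random_constant(rng, scale=1.0):
    """Draw a new constant from a Gaussian with the given scale.
    scale=1.0 mirrors the unit-scale initialization discussed above."""
    return scale * rng.standard_normal()


rng = np.random.default_rng(0)
# Hypothetical feature with a large spread.
x = rng.uniform(0, 1e4, 1000)

# Unit-scale constants versus constants scaled by the feature's standard deviation.
unit_constants = np.array([random_constant(rng) for _ in range(5000)])
scaled_constants = np.array(
    [random_constant(rng, scale=x.std()) for _ in range(5000)]
)
```

Whether the scale should follow the inputs, the target, or some combination is left open here; the sketch only shows the mechanics of a data-dependent scale.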

0 replies

krn-hov · 2023-04-25T12:32:14Z

krn-hov
Apr 25, 2023
Author

Thanks for the reply! Thinking about it more... if the variance of the response variable is large, doesn't that mean there will inherently be more ways of explaining it (incorrectly or poorly, but nevertheless) than if it were small? And so the regressor goes through more intermediate mutations/steps in general?

Also, even if the variables are mapped to [0,1], the constants of the equation are not necessarily affected by the scaling and may follow different, unknown distributions. So I'm not sure how changing the initialization would affect things.

3 replies

MilesCranmer Apr 25, 2023
Maintainer

Not sure I follow your first question here. By response variable do you mean y? Variance just tracks the scale of y (it grows with the square of the scale), so I'm not sure how the variance would indicate anything other than scale, no?
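
A quick numerical check of how variance and standard deviation respond to rescaling y, using made-up data: multiplying y by a factor c multiplies the standard deviation by c and the variance by c squared, so neither carries information beyond the scale.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # made-up response values
c = 100.0                  # arbitrary rescaling factor

var_ratio = np.var(c * y) / np.var(y)  # equals c**2 up to rounding
std_ratio = np.std(c * y) / np.std(y)  # equals c up to rounding
```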

krn-hov Apr 25, 2023
Author

No, it wouldn't. I was wondering more whether the effect of the scale is inherent, and I'm not really sure how to mitigate it aside from scaling the variables beforehand.

MilesCranmer Apr 25, 2023
Maintainer

I wonder if I should just have an option in PySRRegressor to automatically normalize the data as part of the fit... e.g.,

model = PySRRegressor(
    ...,
    normalize=True
)

But the potential difficulty is that people might write down the printed equations without realizing there is an extra factor attached to every variable. So perhaps it's better to just recommend that people use a StandardScaler or a similar preprocessing step before passing in the data.
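
The "extra factor" concern can be made concrete with a small NumPy sketch. It standardizes one feature by hand (scikit-learn's StandardScaler stores the same two quantities as `mean_` and `scale_`) and shows how an equation expressed in the standardized variable translates back to original units. The data and the equation `y = a*z + b` are invented for illustration and are not PySR output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1e4, 500)  # hypothetical raw feature

# Standardize: zero mean, unit (population) standard deviation.
mean, std = x.mean(), x.std()
z = (x - mean) / std

# Suppose the search reports an equation in the standardized variable z.
a, b = 2.0, 1.0
y_from_z = a * z + b

# The same equation in the original variable x carries the hidden factors:
#   y = (a / std) * x + (b - a * mean / std)
slope = a / std
intercept = b - a * mean / std
y_from_x = slope * x + intercept
```

Anyone copying `2.0 * x0 + 1.0` from the printed output and applying it to raw data would get the wrong answer, which is why doing the scaling explicitly, outside the regressor, keeps the transformation visible.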
