Outliers #64
Comments
Hi Mislav!
Cross-section: take a fixed time and a fixed characteristic, say, size (market cap). What you want is to make sure that you don't have a firm that has a ...
Time-series: here the purpose is to detect errors in the data. Imagine the series of Apple, which, for some dates, is divided by 1,000. This is a big deal, and it can be important to make sure that market cap figures behave relatively smoothly over time (monthly variations, except in bankruptcy scenarios and taking splits into account, should not lie outside the range of -80% to +200%). However, you should NOT winsorize in the time-series! So in your code, the first line is OK, but not the second. Time-series outliers must be checked beforehand, and the purpose is to make sure you are confident in your values. In this case, if you have "crazy" values, you should replace them with the last "correct" value.
Hi Mislav, yes, this often happens when there are "atoms" at the beginning and the end of the distribution. I guess in your case it comes from the prior winsorization.
Thanks, all clear now.
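The time-series check described above can be sketched roughly as follows. This is only an illustration in plain Python, not code from the book or from this thread: the function name `clean_series`, the sample figures, and the choice to trust the first observation are all assumptions. Each value is compared with the last accepted value (not the previous raw value), so the return to normal levels after a bad point is not itself flagged.

```python
def clean_series(values, lower=-0.80, upper=2.00):
    """Carry the last accepted value forward whenever the change versus that
    value falls outside [lower, upper]. Assumes strictly positive values
    (e.g. market cap) and that the first observation is correct."""
    cleaned = []
    last_good = None
    for v in values:
        if last_good is None:
            # Assumption: the first observation is trusted as-is.
            last_good = v
        else:
            change = v / last_good - 1.0
            if lower <= change <= upper:
                last_good = v
            # Otherwise v is treated as an error and last_good is kept.
        cleaned.append(last_good)
    return cleaned


# Hypothetical monthly market caps; the third value was divided by 1,000.
market_cap = [2000.0, 2100.0, 2.2, 2300.0, 2250.0]
print(clean_series(market_cap))
```

Genuine events such as bankruptcies or splits would also trip this filter, so flagged dates are worth inspecting by hand before overwriting them.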
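To illustrate what "atoms" means here: winsorization maps every value beyond the bounds onto the bounds themselves, so several observations end up sharing exactly the same value at each end of the distribution. The snippet below is a small sketch of that effect; the helper `winsorize` uses a simplified floor-rank quantile and 10%/90% bounds chosen purely for illustration.

```python
from collections import Counter


def winsorize(values, lower_q=0.10, upper_q=0.90):
    """Clip values to empirical quantile bounds (simplified floor-rank quantile)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


values = list(range(1, 21))  # 20 distinct values, no ties
clipped = winsorize(values)
counts = Counter(clipped)

# Values that now occur more than once are the "atoms" at both ends.
print({v: c for v, c in counts.items() if c > 1})
```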
I am trying to set up data for analysis using FMP cloud data. I am not sure how to remove outliers from my data.
In the book you recommend winsorization:
"The winsorization stage must be performed on a feature-by-feature and a date-by-date basis. However, keeping a time series perspective is also useful."
If I understand correctly, using data.table this procedure implies:
But I am not sure this is the right way. For example, let's say we have a market cap feature. There is always one firm with the highest market cap. If we winsorize, we will always replace the market cap of the biggest firm with the market cap of the firm at the 99th percentile. But this is not due to incorrect data or outliers.
A similar conclusion arises in the time dimension. If EPS, for example, grows through time, we would replace the highest EPS with the 99th percentile even if the data is not wrong.
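The cross-sectional concern can be made concrete with a small sketch. This is not the data.table code referred to above; it is a plain-Python illustration in which the panel layout (`{date: {firm: value}}`), the firm names, the figures, and the simplified floor-rank quantile are all assumptions. With only five firms per date, the 99% bound falls on the second-largest value, so the largest firm is clipped on every date even though its data is valid.

```python
def clip_to_quantiles(values, lower_q=0.01, upper_q=0.99):
    """Clip values to empirical quantile bounds (simplified floor-rank quantile)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def winsorize_by_date(panel, lower_q=0.01, upper_q=0.99):
    """Winsorize one feature separately within each date's cross-section."""
    out = {}
    for date, cross_section in panel.items():
        firms = list(cross_section)
        clipped = clip_to_quantiles(
            [cross_section[f] for f in firms], lower_q, upper_q
        )
        out[date] = dict(zip(firms, clipped))
    return out


# Hypothetical market caps: firm E is genuinely much larger than the rest.
panel = {
    "2020-01-31": {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0, "E": 2000.0},
    "2020-02-29": {"A": 11.0, "B": 19.0, "C": 33.0, "D": 41.0, "E": 2100.0},
}
result = winsorize_by_date(panel)
print(result["2020-01-31"]["E"], result["2020-02-29"]["E"])
```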