Data for ML factor analysis #2
Dear Mislav, sadly good-quality data is not free. At the very beginning of Section 5.1, we list a few data providers, but their services can be quite expensive. Also, because of that, they do not allow their subscribers to download data and give it away for free. This is why the data we provide is anonymous: so that it is impossible to trace its origin - we don't want any problems!

Price data is easy to obtain via Yahoo Finance (e.g., with the quantmod package) or other providers like Alphavantage and Tiingo, for which Matt Dancho has created R interfaces. But for more "exotic" information, I do not know of publicly available data, though the R package edgar may be one way to circumvent this issue (via SEC filings). I know of one other (bigger) dataset and you can find it here: https://dachxiu.chicagobooth.edu (search for "empirical data"), but the returns are not given and the subscription to the corresponding service is prohibitively priced. One option is to scrape the web for quarterly statements; I've seen a few tutorials both in R and Python.

In short: I'm sorry to disappoint, but there is no simple, cheap solution.
Could you explain what kind of data we need for factor analysis? If I got it right, we need quarterly financial statements and monthly price data (from which we can additionally calculate all kinds of indicators). Then we can calculate all kinds of ratios like P/E, P/B, EPS, etc. But some ratios are only available on a quarterly basis (like financial leverage). So, if we want monthly data, we can use ratios that include prices; if we want quarterly data, we can use additional ratios from financial statements that don't include prices (like financial leverage). Is that correct? I got annual data from IB, but it's not enough. I will try to contact them to see if they can provide historical quarterly data (fundamental ratios and financial statements). How far back in the past do we have to go if we have a large stock universe (let's say 2,000 stocks)?
The more data, the better, obviously - roughly speaking. You can also add risk measures, like volatility; that's not too hard to get. Market beta likewise. Of course, the deeper the dataset, the better. Basically, you are going to need a minimum of 5 years to train your first model, but ideally more (10 years). And then you will roll (or expand) it forward, at least for 10 years, to have sufficient testing history. So my advice is 15-20Y minimum in total, preferably more. The problem is that the further you go into the past, the scarcer the data becomes. This is a tradeoff: I prefer having a higher proportion of well-defined features (so that data imputation is less intensive) and not going back 40 years into the past.
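The roll-forward (or expanding-window) protocol described above can be sketched as follows. This is an illustrative Python sketch, not code from the book; the year ranges and the 10-year initial training window are the ones discussed in this thread.

```python
# Expanding-window backtest schedule: train on everything up to the test year,
# then move the one-year test window forward. All parameters are illustrative.

def expanding_splits(first_year, last_year, min_train_years, test_years=1):
    """Yield (train_years, test_years) pairs for an expanding-window backtest."""
    splits = []
    test_start = first_year + min_train_years
    while test_start + test_years - 1 <= last_year:
        train = list(range(first_year, test_start))       # expanding window
        test = list(range(test_start, test_start + test_years))
        splits.append((train, test))
        test_start += test_years
    return splits

# Example: a 2004-2019 sample (as mentioned in the thread), 10-year initial window.
splits = expanding_splits(2004, 2019, min_train_years=10)
for train, test in splits:
    print(f"train {train[0]}-{train[-1]} -> test {test[0]}")
```

With these settings you get six splits, from "train 2004-2013, test 2014" up to "train 2004-2018, test 2019" - i.e., roughly the 10 years of training plus several years of testing history advised above.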
Thank you for such a detailed answer. It's very helpful. It wouldn't hurt to copy-paste this answer into the book (maybe as a Q&A or something similar). As you said, I will try to get data for the span 2004-2019 at a minimum, or 1999-2019 at best. I don't think I will find data for older periods. Additionally, for a long t, the panel would probably be very unbalanced, so maybe it wouldn't be so helpful, but I'm not sure. I am reading chapter 4 now. I feel like I would need 10 years to read all these articles in depth :) Great review, especially because it is up to date; new papers are included. Maybe I will come back with additional questions. I think this can be closed.
@shokru, I am trying to calculate all the data from chapter 17 from simfin+ subscription data. For price momentum 12 - 1 month in USD: does that mean the close price in, say, December minus the close price in January?
Hi Mislav, first a bit of history. The original paper on momentum is called "Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency" and it has had a huge impact on both the literature and practitioners (it is easy to find online). The idea is that past performance has some kind of persistence. Subsequently, other researchers started digging into what "past performance" really is, that is: over which period of time do you compute the returns (see "Is Momentum Really Momentum?" by Novy-Marx). The most common (accepted) definition of momentum is that you take the return computed as the price one month ago divided by the price twelve months ago, minus one - i.e., the return from month t-12 to month t-1, which skips the most recent month.
Thanks for the detailed explanation. So, for monthly data, that would be: (p[t-1] / p[t-12] - 1)?
Is Vol1Y_Usd the volatility computed from monthly or daily observations? Is it a rolling-window volatility?
Yes, the formula seems correct, though I guess the overarching brackets are superfluous. |
I have daily data for all stocks, so I can calculate volatility from daily data. I can send you the code when I finish. It could serve as one example of how to construct data for factor ML analysis.
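A one-year volatility from daily data, as discussed above, could be sketched as follows. This assumes Vol1Y-style volatility means the standard deviation of daily returns over a trailing ~252-trading-day window, annualized by sqrt(252) - a common convention, but the book's exact definition may differ.

```python
import math
import statistics

def rolling_annualized_vol(daily_returns, window=252):
    """Trailing-window annualized volatility, one value per date with
    at least `window` observations of history."""
    out = []
    for t in range(window, len(daily_returns) + 1):
        sd = statistics.stdev(daily_returns[t - window:t])  # sample std dev
        out.append(sd * math.sqrt(252))                     # annualize
    return out

# Synthetic daily returns (260 observations, just over one trading year).
daily = [0.012, -0.008, 0.005, -0.011] * 65
vols = rolling_annualized_vol(daily)
```

In practice one would vectorize this (e.g., a rolling standard deviation in pandas or R's zoo/RcppRoll), but the loop makes the windowing explicit.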
My question is related to the uniformization of data. Unfortunately, I do not grasp the concept of regularization in the cross-section. How does it actually work? Is it that, for a specific date, we uniformize the indicator (e.g., the price-to-book ratio) across different firms? The credit spread example you gave in the solution to chapter 4 works by grouping the data by dates. I have normalized my own collected data as per your suggestion, but dates and stock IDs are also among the significant features chosen by the penalized regression. According to my understanding, that is not conceptually correct in a cross-sectional study. If you could please elaborate with a working example, that would be great!
Let's say you want to explain future returns with 2 variables: market capitalization and past returns (momentum). So first of all: dates & stock IDs are NOT predictors, so you should probably leave them out. If you keep them, basically, you are using a panel approach, which means that you allow for trends in dates and trends in asset IDs. In a simple regression (time & stock indices are omitted):

future_return = a + b*date + c*stock_id + d*mkt_cap + f*past_return + error

this means that the dates & stock IDs impact future returns, which does not make any sense econometrically. Though of course you could build one model for each stock separately, but that is another issue. In the book, models are common to all stocks.

Ok, so now we are in the setting of the book, where the model would be something like (in linear regression format):

future_return = a + b*mkt_cap + c*past_return + error

Now, there is a big issue because mkt_cap and past returns really don't have the same scales. The first is measured in billions of $, while the second does not have units. We thus need to homogenise them because some models (e.g., neural networks) behave much better when predictors have similar (small) scales. The simplest way to do that is to rank firms. So, at each date, we process the data so that mkt_cap is equal to 0 for the smallest firm and to 1 for the largest firm, and same thing for past returns: 0 for the smallest return and 1 for the largest one. This is equivalent to computing the rank for each stock (starting at zero) and then dividing by the number of stocks. Some people normalize between -0.5 and +0.5, or -1 and +1. After this step, all variables are comparable in magnitude, so we can proceed. In the exercise, the credit spread at date t is multiplied with all other predictor values; this changes the distribution of values, so we normalise again after the product.

This will ensure that, again, predictors, at each date, have a uniform distribution across all stocks (between 0 for the smallest value and 1 for the largest value).
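The cross-sectional uniformization described above can be sketched in plain Python as follows. The record layout and field names ("date", "mkt_cap") are illustrative, and the rescaling uses rank/(n-1) so that the smallest value maps exactly to 0 and the largest to 1, matching the description above; the book's exact convention may differ slightly.

```python
# At each date, replace a predictor's value by its cross-sectional rank,
# rescaled to [0, 1]. Ties and missing values are ignored for simplicity.
def uniformize(records, field):
    """records: list of dicts, each with a 'date' key and the chosen field.
    Returns a new list where `field` is rank/(n-1) within each date."""
    by_date = {}
    for r in records:
        by_date.setdefault(r["date"], []).append(r)
    out = []
    for group in by_date.values():
        ranked = sorted(group, key=lambda r: r[field])
        n = len(ranked)
        for rank, r in enumerate(ranked):       # rank starts at zero
            new = dict(r)
            new[field] = rank / (n - 1) if n > 1 else 0.5
            out.append(new)
    return out

# Three firms on the same date: smallest cap -> 0.0, median -> 0.5, largest -> 1.0.
sample = [
    {"date": "2019-01", "mkt_cap": 10.0},
    {"date": "2019-01", "mkt_cap": 30.0},
    {"date": "2019-01", "mkt_cap": 20.0},
]
uniform = uniformize(sample, "mkt_cap")
```

In practice the same thing is a one-liner with grouped percentage ranks (e.g., a grouped rank in dplyr or pandas), but the explicit loop shows that the operation is per-date, across firms, never across time.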
Thank you for your detailed answer! I will look into the package. Thank you for the recommendation! |
I started to read your book. I have finished chapters 1 and 2. In the book, you use the following data:
This data is anonymous. We don't know which stock is represented by id.
It would be very helpful if you could give some tips in the book on how to get data in the first place, for beginners in ML factor analysis (like me) who don't have data yet. This would be the first step if we would like to follow your analysis with real stocks.
I even have a subscription to Interactive Brokers, but they don't have data on quarterly financial statements, only annual financial statements.
In a nutshell, do you have any suggestions on how to obtain good data for ML factor analysis (good quality, as cheap as possible, going as far back as possible)?