diff --git a/docs/get_data.md b/docs/get_data.md
index 8fcd4c2..e6e1af6 100644
--- a/docs/get_data.md
+++ b/docs/get_data.md
@@ -2,14 +2,22 @@
-This documentation provides details of the Python codes needed to generate the data in [data/raw](https://github.com/gimseng/game_stats/blob/master/data/raw).
+This documentation provides details of the Python code needed to generate the data in [data/raw](https://github.com/gimseng/game_stats/blob/master/data/raw).
 
-There are three steps to the process:
+There are four steps to the process:
 
 1. [get_game_list.py](https://github.com/gimseng/game_stats/blob/master/src/data/get_game_list.py):
 
-The website contains a search function whereby if no search text is enetered, it will generate all the game with their summary info. This code makes use of this to allow us to obtain their game ID and store them (or rather the link to each of the game's page) in a list (which is in the [list_game_url.csv](https://github.com/gimseng/game_stats/blob/master/data/interim/list_game_url.csv) file). These game links will need to be scrapped in the next stage. The most challenging part for me was to figure out how to deal with AJAX. Eventually, after understanding it, it seems rather straighforward to pass headers and payload data to request the dynamically generated pages.
+The website has a search function whereby, if no search text is entered, it returns all the games with their summary info. This code makes use of that to obtain each game's ID and store it (or rather the link to each game's page) in a list, which is saved in the [list_game_url.csv](https://github.com/gimseng/game_stats/blob/master/data/interim/list_game_url.csv) file. These game links will be scraped in the next stage. The most challenging part for me was figuring out how to deal with AJAX; once understood, it is rather straightforward to pass headers and payload data to request the dynamically generated pages (a sketch of this request pattern is given after step 2 below).
 
 2. [get_game_info.py](https://github.com/gimseng/game_stats/blob/master/src/data/get_game_info.py):
 
-We now scrap each game link generated in the first step. This is a rather time-consuming task, since it has to scrap through order of thousands of webpages. It took about 3 hours when I ran it. The scraping is rather straighforward but sometimes the data are not always consistent. For e.g. time played might be entere '13h' vs '13 hours' and sometimes for never-ending games, it has entries like '200-400 hours'. At this point, there is some minimal data cleaning which have been done, but I will leave the majority of data cleaning to the later stage of the project. A majority of the work involves figuring out where the texts of relevance are stored in the html hierarchy. A bit of patient and a lot of browser developer's inspect were needed to construct the codes. The output of the code is a list of games with their information, stored in [all_game.csv](https://github.com/gimseng/game_stats/blob/master/data/raw/all_game.csv)
+We now scrape each game link generated in the first step. This is a rather time-consuming task, since it has to work through on the order of thousands of webpages; it took about 3 hours when I ran it. The scraping is rather straightforward, but the data are not always consistent: for example, time played might be entered as '13h' versus '13 hours', and never-ending games sometimes have entries like '200-400 hours'. Some minimal data cleaning is done at this point, but the majority of the cleaning is left to a later stage of the project (see the playtime parsing sketch below). Most of the work is figuring out where the relevant text sits in the HTML hierarchy; a bit of patience and a lot of the browser's developer inspector were needed to construct the code. The output is a list of games with their information, stored in [all_game.csv](https://github.com/gimseng/game_stats/blob/master/data/raw/all_game.csv).
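+
+The actual endpoint, headers and payload used by get_game_list.py are not reproduced in this documentation. The snippet below is only a minimal sketch, with a hypothetical AJAX endpoint and made-up field names, of the general pattern of posting headers and payload data with `requests` to fetch the dynamically generated search results:
+
+```python
+# Minimal sketch of the AJAX request pattern described in step 1.
+# The endpoint, headers and payload fields are placeholders; the real
+# values come from inspecting the site's network traffic in the browser.
+import requests
+from bs4 import BeautifulSoup
+
+AJAX_URL = "https://example.com/ajax/search"    # hypothetical endpoint
+HEADERS = {
+    "User-Agent": "Mozilla/5.0",
+    "X-Requested-With": "XMLHttpRequest",       # marks the request as AJAX
+}
+
+def get_game_links(page):
+    """Fetch one dynamically generated results page and return its game links."""
+    payload = {"search": "", "page": page}      # empty search text returns all games
+    resp = requests.post(AJAX_URL, headers=HEADERS, data=payload)
+    resp.raise_for_status()
+    soup = BeautifulSoup(resp.text, "html.parser")
+    # assumed markup: each result row links to the game's page
+    return [a["href"] for a in soup.select("a.game-link")]
+
+all_links = []
+for page in range(1, 5):                        # however many result pages exist
+    all_links.extend(get_game_links(page))
+```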
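+
+The inconsistent playtime strings mentioned above ('13h', '13 hours', '200-400 hours') are the kind of thing the later cleaning stage has to handle. As a rough sketch (the actual cleaning rules used in the project may differ), they could be normalized to a single number of hours like this:
+
+```python
+# Rough sketch of normalizing inconsistent playtime strings; the actual
+# cleaning rules applied later in the project may differ.
+import re
+
+def parse_hours(text):
+    """Return playtime in hours, averaging ranges such as '200-400 hours'."""
+    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
+    if not numbers:
+        return None                      # unrecognized entry, left for later cleaning
+    return sum(numbers) / len(numbers)   # a range becomes its midpoint
+
+print(parse_hours("13h"))             # 13.0
+print(parse_hours("13 hours"))        # 13.0
+print(parse_hours("200-400 hours"))   # 300.0
+```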
+
+3. [get_play_info.py](https://github.com/gimseng/game_stats/blob/master/src/data/get_play_info.py):
+
+Once you have understood the first two steps, this one is similar: for each game URL, it scrapes the playtime entries. For popular games these are enormous lists of user-logged data; most importantly, each entry contains the platform, the time played and the rating. This is the key data for the analytics in the later stages. The logic of the code is no different from the two steps above, though the page layout needs some inspection to figure out where the relevant data are stored. The code outputs the file [all_play.csv](https://github.com/gimseng/game_stats/blob/master/data/raw/all_play.csv).
+
+4. [get_user_info.py](https://github.com/gimseng/game_stats/blob/master/src/data/get_user_info.py):
+
+Some (but not all) playtime entries are logged by registered users. Each user's page contains general demographic information, which is straightforwardly scraped to obtain the user's gender, age and location. These provide interesting data for clustering gamers or studying their behaviors. The output is the file [all_user.csv](https://github.com/gimseng/game_stats/blob/master/data/raw/all_user.csv).
 
-3.
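+As with the earlier steps, the user-page scraping comes down to locating the fields in the HTML. The sketch below is illustrative only: the CSS selectors are assumptions, not the site's actual markup, which has to be found with the browser's inspector:
+
+```python
+# Illustrative sketch of the user-page scraping described in step 4.
+# The CSS selectors are assumptions; the real ones come from inspecting
+# the site's user pages in the browser developer tools.
+import requests
+from bs4 import BeautifulSoup
+
+def get_user_info(user_url):
+    """Scrape gender, age and location from a user's profile page."""
+    resp = requests.get(user_url, headers={"User-Agent": "Mozilla/5.0"})
+    resp.raise_for_status()
+    soup = BeautifulSoup(resp.text, "html.parser")
+
+    def field(selector):
+        tag = soup.select_one(selector)
+        return tag.get_text(strip=True) if tag else None
+
+    return {
+        "gender": field(".profile .gender"),      # assumed selector
+        "age": field(".profile .age"),            # assumed selector
+        "location": field(".profile .location"),  # assumed selector
+    }
+```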