Create two new folders that will serve as targets for files generated by most of these scrapers:
$ mkdir data && mkdir content
To start, we scrape high-level data about all projects with the following request:
curl -i -X GET "https://devpost.com/software/search?query=&page=0"
This simple request returns page 0 (the first page) of projects for an empty search query, i.e. a search across ALL projects, sorted by date posted in descending order.
DevPost inspects the client's User-Agent, recognizes that the request is NOT coming from a browser, and returns a JSON object describing all of the projects on that page. We can iterate through pages until the returned array is empty (i.e. the page doesn't exist).
An example:
{
  "software": [
    {
      "class_name": "Software",
      "name": "PostDev",
      "tagline": "Analyze DevPost - by hackers for hackers",
      "slug": "postdev-k286c0",
      "url": "https://devpost.com/software/postdev-k286c0",
      "members": [
        "jayrav13",
        "nsamarin",
        "otmichael",
        "snowiswhite"
      ],
      "tags": [
        "python",
        "flask",
        "react",
        "redux",
        "mdl",
        "markovify",
        "machine-learning",
        "ibm-watson",
        "amazon-ec2"
      ],
      "winner": false,
      "photo": null,
      "has_video": false,
      "like_count": 2,
      "comment_count": 0
    }
  ],
  "total_count": 1
}
Executing data.py from the root of the project will retrieve all of these pages for you and save them in data/ (each page gets its own file).
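For reference, the core of that pagination loop can be sketched as follows (a minimal sketch assuming the requests library; the per-page file names are my own choice, not necessarily what data.py uses):

import json
import os

import requests

BASE_URL = "https://devpost.com/software/search?query=&page={page}"

os.makedirs("data", exist_ok=True)

page = 0
while True:
    # requests' default User-Agent is not a browser, so DevPost answers with JSON.
    payload = requests.get(BASE_URL.format(page=page)).json()

    # Stop once the "software" array comes back empty (the page doesn't exist).
    if not payload.get("software"):
        break

    # One file per page: data/page_0.json, data/page_1.json, ...
    with open(os.path.join("data", f"page_{page}.json"), "w") as f:
        json.dump(payload, f)

    page += 1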
Now we want to grab the hacker-written description of each of these projects. To do so, we execute content.py to begin the scraping process. This script collects all projects from the files in data/ and generates one file per project in content/, each containing the full HTML page of that project in the following format:
{
  "url": "https://devpost.com/software/trip-py-planner",
  "content": "..."
}
Here, the "content" key's value is the HTML of the project page at the time of the request. I stored the raw HTML so that I could handle the lxml parsing later and keep the page's metadata around for later use.
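As an illustration, writing one of these files could look like this (a sketch assuming the requests library; naming the file after the project's slug is my assumption, not necessarily what content.py does):

import json
import os

import requests

def save_project_page(url):
    # Fetch the project's full HTML page and store it as a {"url", "content"} record.
    html = requests.get(url).text
    record = {"url": url, "content": html}

    # Assumption: name the file after the last URL segment (the project slug).
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    with open(os.path.join("content", f"{slug}.json"), "w") as f:
        json.dump(record, f)

save_project_page("https://devpost.com/software/trip-py-planner")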
The biggest challenge was making the 65,000+ requests in an acceptable amount of time. My solution came from this StackOverflow response, which provided direction on how to make a large number of requests concurrently in Python.
NOTE: Be wary of how many threads you make available to this program in content.py. I used 8 threads and completed the run successfully in a few hours. I also learned the hard way that using 100 threads is a bad idea.
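content.py's exact concurrency setup comes from that StackOverflow answer; a comparable sketch using the standard library's concurrent.futures (my own substitution, not necessarily the mechanism content.py uses) looks roughly like this:

import glob
import json
from concurrent.futures import ThreadPoolExecutor

import requests

# Keep the pool small: 8 workers got through the requests in a few hours,
# while 100 workers turned out to be a bad idea.
MAX_WORKERS = 8

def fetch_and_save(url):
    try:
        html = requests.get(url, timeout=30).text
    except requests.RequestException as exc:
        # Skip failures so one bad request doesn't stall the whole run.
        print(f"failed {url}: {exc}")
        return
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    with open(f"content/{slug}.json", "w") as f:
        json.dump({"url": url, "content": html}, f)

# Gather every project URL from the page files saved in data/.
urls = []
for path in glob.glob("data/*.json"):
    with open(path) as f:
        urls.extend(project["url"] for project in json.load(f)["software"])

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    # map() blocks until every URL has been processed by the pool.
    list(pool.map(fetch_and_save, urls))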
Execute the last file, consolidate.py, to combine the data from data/ and content/ into a final file, data.json, at the root of the project. The JSON structure is the SAME as what we saw before, except that each project now has a new key: description.
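A sketch of that consolidation step, assuming lxml is used for the HTML parsing as mentioned above (the XPath below is only a placeholder for wherever DevPost keeps the description; the real selector lives in consolidate.py):

import glob
import json

from lxml import html as lxml_html

def extract_description(page_html):
    # Placeholder XPath: the real selector depends on DevPost's markup.
    tree = lxml_html.fromstring(page_html)
    nodes = tree.xpath('//div[@id="app-details-left"]')
    return nodes[0].text_content().strip() if nodes else None

# Index the saved HTML pages by project URL.
pages = {}
for path in glob.glob("content/*.json"):
    with open(path) as f:
        record = json.load(f)
    pages[record["url"]] = record["content"]

# Re-attach a "description" key to every project entry from data/.
projects = []
for path in glob.glob("data/*.json"):
    with open(path) as f:
        for project in json.load(f)["software"]:
            page_html = pages.get(project["url"])
            project["description"] = extract_description(page_html) if page_html else None
            projects.append(project)

# Assumption: keep the same shape as the paginated responses, with "description" added.
with open("data.json", "w") as f:
    json.dump({"software": projects, "total_count": len(projects)}, f)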
To list the number of files in a directory (note that ls -al adds a "total" line plus the . and .. entries, so the count runs three high):
ls -al | wc -l
To check a directory's size:
du -h -c . -l
...where . is the directory you want to calculate the size of.