Added the first custom crawler documentation files #25
base: master
Conversation
Added folder / files for custom crawls
This is a response to #19
Hey @ebenp 🎉 It looks like you deleted the main README in this PR, are you able to revert that?
README.md
Outdated
@@ -1,74 +0,0 @@
# Data Together Learning Materials
We still need the main README, can you revert to have it in this commit?
Dropped some comments in!
custom-crawls/README.md
Outdated
## Lessons

1. What is custom crawling?
  * Why do some websites need custom crawls?
We use 2 spaces not tabs, NBD, but nice to be consistent
custom-crawls/README.md
Outdated
## Prerequisites

* You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading
I think we could cut out "You care about a dataset that exists on the web and..."
As the prereq on this is probably more complex (e.g., you have a dataset that can't be downloaded automatically.)
We may want to revisit the DR guides we prepared and pull in some language:
https://edgi-govdata-archiving.github.io/guides/
custom-crawls/README.md
Outdated
## Prerequisites

* You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading
s/benifit/benefit
custom-crawls/README.md
Outdated
After going through this tutorial you will know

* What a custom crawler is and why some websites need one
* What should your custom crawler extract from a webpage?
Can we have this second point phrased not as a question, so
something like: "What your custom crawler needs to extract from a web page"
custom-crawls/README.md
Outdated
* What a custom crawler is and why some websites need one
* What should your custom crawler extract from a webpage?
* How to write a custom crawler that works with DataTogether
s/DataTogether/Data Together
custom-crawls/README.md
Outdated
## Key Concepts

* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
s/auotmated/automated
s/DataTogether/Data Together
custom-crawls/README.md
Outdated
## Key Concepts

* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
* Morph.io: An online service that automates and saves user created scripts.
Link to morph.io?
custom-crawls/README.md
Outdated
* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
* Morph.io: An online service that automates and saves user created scripts.
* Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
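To make the "script file written specifically for a dataset" idea concrete, here is a minimal sketch of the page-specific extraction step a custom crawler performs. This is a hypothetical illustration only: `DatasetLinkParser` and `find_dataset_links` are invented names, not part of the archivertools API, and it uses nothing but the Python standard library.

```python
# Hypothetical sketch: collect links to downloadable data files from one
# page of a target site. A real custom crawler would be tailored to the
# dataset's actual page structure and would hand results to an uploader
# (e.g. the Archiver class), which is omitted here.
from html.parser import HTMLParser


class DatasetLinkParser(HTMLParser):
    """Collects href values that look like downloadable data files."""

    DATA_EXTENSIONS = (".csv", ".json", ".zip", ".xlsx")

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(self.DATA_EXTENSIONS):
                self.links.append(value)


def find_dataset_links(html_text):
    """Return hrefs from html_text that end in a known data-file extension."""
    parser = DatasetLinkParser()
    parser.feed(html_text)
    return parser.links
```

On a site whose data is spread across many generated pages, a loop over `find_dataset_links` plus a download step is roughly what the "automation of downloading" in the Prerequisites refers to.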
delete extra space
s/DataTogether/Data Together
wondering if you want to link to archivertools? (And how is this package published?)
custom-crawls/README.md
Outdated
3. Some example custom crawls scripts and implementation

## Next Steps
Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets
s/DataTogether/Data Together
s/bakground/background
fixed! Thanks!
custom-crawls/lessons/lesson-name.md
Outdated
@@ -0,0 +1,11 @@
# Lesson: Name
You can delete this if you aren't using separate lessons for now!
Sounds good! Will do.
Updated the text and files.
Bump. Any thoughts?
I think this looks great @ebenp 🎉 just made a minor formatting tweak -- my suggestion is that this is ready to flesh out during the sprint, but we could decide on the Thursday call :)
Sounds great! Thanks @dcwalk
Looks good to me!
Okay, so can I merge this?
Yup!
Okay, just realized this is only the table of contents, without the tutorial. I'd like to wait and add them all together. Maybe in the coming week I can help with review/writing of that content. (Or am I missing something and it is there?)
So I think initially this was to provide some structure to start writing, and the tutorial was started in the readme. I'm fine with adding a tutorial.md and standardizing the readme and contributing docs here in the PR if you want to take that on. Or feel free to close out this structural one and start a new one with the standard documentation and a new tutorial file with what's in the readme. Sorry, I think this is a product of the PR sitting for a while and documentation changing.
Splitting out lesson stubs
ok, I'm (slowly) moving on this. Split out the topic areas and will begin to flesh them out
Here's some structure for the custom crawlers documentation