This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

Added the first custom crawler documentation files #25

Open · ebenp wants to merge 10 commits into master

Conversation

@ebenp commented Sep 3, 2017

Here's some structure for the custom crawlers documentation

@ebenp (Author) commented Sep 3, 2017

This is a response to #19
Also, my first ever pull request!

cc @jeffreyliu @dcwalk @weatherpattern

@dcwalk (Member) commented Sep 3, 2017

Hey @ebenp 🎉

It looks like you deleted the main README in this PR, are you able to revert that?

dcwalk self-requested a review on Sep 3, 2017 at 21:56
README.md (Outdated)
@@ -1,74 +0,0 @@
# Data Together Learning Materials

@dcwalk (Member) commented:
We still need the main README; can you revert so it's included in this commit?

@dcwalk (Member) left a review:

Dropped some comments in!

## Lessons

1. What is custom crawling?
* Why do some websites need custom crawls?
@dcwalk (Member) commented:

We use 2 spaces not tabs, NBD, but nice to be consistent


## Prerequisites

* You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading
@dcwalk (Member) commented:

I think we could cut out "You care about a dataset that exists on the web and...", since the actual prereq is probably more complex (e.g., you have a dataset that can't be downloaded automatically).

We may want to revisit the DR guides we prepared and pull in some language:
https://edgi-govdata-archiving.github.io/guides/


@dcwalk (Member) also commented on the quoted line:
s/benifit/benefit

After going through this tutorial you will know

* What a custom crawler is and why some websites need one
* What should your custom crawler extract from a webpage?
@dcwalk (Member) commented:

Can we have this second point phrased not as a question, so
something like: "What your custom crawler needs to extract from a web page"


* How to write a custom crawler that works with DataTogether

@dcwalk (Member) commented:
s/DataTogether/Data Together


## Key Concepts

* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
@dcwalk (Member) commented:

s/auotmated/automated
s/DataTogether/Data Together
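
To make the "script file written specifically for a dataset" idea concrete, here is a minimal custom-crawler sketch in Python 3 using only the standard library. The index URL and the ".csv" link filter are illustrative assumptions, not part of the docs under review:

```python
# Minimal custom-crawler sketch (Python 3, standard library only).
# The index URL and the ".csv" link filter are illustrative assumptions.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of links to .csv files on an index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".csv"):
                    self.links.append(value)

index_url = "https://example.com/datasets/"  # hypothetical dataset index page
html = urllib.request.urlopen(index_url).read().decode("utf-8")

collector = LinkCollector()
collector.feed(html)

for link in collector.links:
    file_url = urllib.parse.urljoin(index_url, link)
    filename = file_url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(file_url, filename)  # save next to the script
```

A real crawler would add error handling, rate limiting, and whatever site-specific parsing the dataset requires.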

* Morph.io: An online service that automates and saves user created scripts.
@dcwalk (Member) commented:

Link to morph.io? (https://morph.io)
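
For readers who do follow that link: a Morph.io scraper is just a script that the service runs for you, and by Morph.io convention it saves its results to a SQLite file named data.sqlite in the working directory. A minimal sketch, assuming a hypothetical target page and a simple one-table layout:

```python
# Sketch of a Morph.io-style scraper (Python 3, standard library only).
# Morph.io conventionally expects results in a SQLite file named
# data.sqlite; the URL and table columns here are illustrative.
import sqlite3
import urllib.request

url = "https://example.com/report.html"  # hypothetical page to scrape
html = urllib.request.urlopen(url).read().decode("utf-8")

conn = sqlite3.connect("data.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, content TEXT)")
conn.execute("INSERT OR REPLACE INTO data VALUES (?, ?)", (url, html))
conn.commit()
conn.close()
```

Morph.io then makes the saved rows available for download, so the crawl results can be picked up by other tools later.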


* Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
@dcwalk (Member) commented:

delete extra space
s/DataTogether/Data Together

wondering if you want to link to archivertools? (And how is this package published?)
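
On linking to archivertools: a short usage sketch might also help readers. Only the class name Archiver comes from the bullet above; every constructor argument and method name below is an assumption for illustration, not the package's confirmed API.

```python
from archivertools import Archiver

# Hypothetical sketch only: the Archiver class name comes from the bullet
# above, but the arguments and method names here are assumptions, not the
# package's documented API.
archiver = Archiver(url="https://example.com/data.csv")  # page being archived
page = archiver.download()                # assumed: fetch the target page
archiver.add_file("data.csv")             # assumed: register a downloaded file
archiver.commit("nightly custom crawl")   # assumed: hand results to Morph.io / Data Together
```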

3. Some example custom crawls scripts and implementation

## Next Steps
Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets
@dcwalk (Member) commented:

s/DataTogether/Data Together
s/bakground/background

@ebenp (Author) replied:

fixed! Thanks!

@@ -0,0 +1,11 @@
# Lesson: Name
@dcwalk (Member) commented:

You can delete this if you aren't using separate lessons for now!

@ebenp (Author) replied:

Sounds good! Will do.

@ebenp (Author) commented Sep 7, 2017

Updated the text and files.


@ebenp (Author) commented Oct 12, 2017

Bump. Any thoughts?

@dcwalk (Member) commented Oct 24, 2017

I think this looks great, @ebenp 🎉 I just made a minor formatting tweak -- my suggestion is that this is ready to flesh out during the sprint, but we could decide on the Thursday call :)

@ebenp (Author) commented Oct 25, 2017

Sounds great! Thanks @dcwalk

@jeffreyliu (Contributor) left a review:

Looks good to me!

@dcwalk (Member) commented Oct 27, 2017

Okay, so can I merge this?

@jeffreyliu (Contributor) commented Oct 27, 2017 via email

@dcwalk (Member) commented Oct 27, 2017

Okay, I just realized this is only the table of contents, without the tutorial. I'd like to wait and add them all together. Maybe in the coming week I can help with review/writing of that content.

(or am I missing something and it is there?)

@ebenp (Author) commented Oct 27, 2017

So I think initially this was to provide some structure to start writing, and the tutorial was started in the readme. I'm fine with adding a tutorial.md and standardizing the readme and contributing docs here in the PR if you want to take that on. Or feel free to close out this structure-only PR and start a new one with the standard documentation and a new tutorial file containing what's in the readme. Sorry, I think this is a product of the PR sitting for a while and the documentation changing.

Splitting out lesson stubs
@ebenp (Author) commented Nov 2, 2017

ok, I'm (slowly) moving on this. I've split out the topic areas and will begin to flesh them out.
