From 64415c929ceec87fc56e9b4b0fb52c622b896885 Mon Sep 17 00:00:00 2001 From: ebenp Date: Sun, 3 Sep 2017 11:20:04 -0400 Subject: [PATCH 01/10] Added custom crawler files --- README.md | 92 ++++++++++++++++---------------------------------- lesson-name.md | 11 ++++++ 2 files changed, 41 insertions(+), 62 deletions(-) create mode 100644 lesson-name.md diff --git a/README.md b/README.md index 9500127..d89883c 100644 --- a/README.md +++ b/README.md @@ -1,74 +1,42 @@ -# Data Together Learning Materials +# Tutorial: Custom Crawlers -This primer introduces key concepts for community-based data stewardship and contains a series of tutorials explaining Data Together and showing how to add content to the network, annotate content that’s already on the network, and reinforce content that is already stored on the network. -As a [GitBook](https://www.gitbook.com/), it can be read in many different formats. +_Note: This tutorial is a work in progress. Please add your feedback to [datatogether/learning](https://github.com/datatogether/learning/issues)!_ -## First Steps +## Prerequisites -Check out the [Table of Contents](SUMMARY.md) or the sidebar on the left. Topics are broken down into _Tutorials_ with distinct _Lessons_ within each! +* You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading -The [first tutorial](add-dataset/) covers adding a dataset to Data Together and reviews how Data Together is different from other forms of preserving data. +## Learning Objectives -## Getting Help +After going through this tutorial you will know -You can get help by any of the following methods: +* What a custom crawler is and why some websites need one +* What should your custom crawler extract from a webpage? +* How to write a custom crawler that works with DataTogether -- add a question to the [datatogether/learning](https://github.com/datatogether/learning/issues) issue tracker -- speak with us on the [Archivers Slack](https://slack.archivers.space) +## Key Concepts ---- +* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset. +* Morph.io: An online service that automates and saves user created scripts. +* Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3. -## Contributing -We welcome your input! If you notice any errors, would like to submit changes, or add any content, you can contribute improvements to this documentation on [GitHub](https://github.com/datatogether/learning): [github.com/datatogether/learning](https://github.com/datatogether/learning). +## Lessons -### Cloning this Repo +1. What is custom crawling? + * Why do some websites need custom crawls? + * What should your custom crawler extract from the webpage? + * Examples of sites needing custom crawlers +2. Introduction/tutorial for Morph + * What is Morph.io? + * How to setup a Morph.io account + * Getting a DataTogether API key, and making sure Morph can access it +2. A tutorial for Archivertools package + * What does it do? + * Installing the package + * + * Using the Archiver class +3. 
Some example custom crawls scripts and implementation -You can clone a copy of this repository using the following command line: - -```bash -$ git clone git@github.com:datatogether/learning.git -``` - -### Installing Dependencies - -To install GitBook, you will need [Node.js](https://nodejs.org/en/) (v4.0.0 or above) on your system and you must be running Windows, Mac OS X, Linux, or Unix. - -It is easiest to install `gitbook-cli` with [npm](https://www.npmjs.com/), the Node.js package manager. From your terminal, run the following command: - -```bash -$ npm install gitbook-cli -g -``` - -Additional instructions for setting up and installing GitBook can be found in the [GitBook Toolchain Documentation](https://toolchain.gitbook.com/setup.html) - -### Running Locally - -Once you make changes to the contents, you can preview them by running a local GitBook server: - -```bash -$ gitbook serve -``` - -After starting the server using the command above, visit `http://localhost:4000` (or whatever address was indicated by the `gitbook serve` command) in your web browser. - -### Deploying - -The [`scripts/`](scripts/) folder has all you need to rebuild the GitBook materials in multiple formats and publish to `gh-pages` and [datatogether.github.io/learning](https://datatogether.github.io/learning/): - -```bash -$ bash scripts/build_formats.sh -$ bash scripts/publish_gh-pages.sh -``` - -You may need to install Calibre's ebook-convert cli tools. For Mac OS X, these can be copied from the Calibre application: - -```bash -$ ln -s /Applications/calibre.app/Contents/MacOS/ebook-convert /usr/local/bin -``` - -## License - -Data Together Learning Materials are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. - -See the [`LICENSE`](/LICENSE) file for details. 
+## Next Steps +Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets \ No newline at end of file diff --git a/lesson-name.md b/lesson-name.md new file mode 100644 index 0000000..b654fd8 --- /dev/null +++ b/lesson-name.md @@ -0,0 +1,11 @@ +# Lesson: Name + +## Goals + +## Steps + +### Step 1: + +### Step N: + +## Next Steps From f3718565afdfe89992ee6e5714d0302eb00b40f0 Mon Sep 17 00:00:00 2001 From: ebenp Date: Sun, 3 Sep 2017 11:22:01 -0400 Subject: [PATCH 02/10] Added custom crawl files Added folder / files for custom crawls --- .DS_Store | Bin 0 -> 6148 bytes custom-crawls/.DS_Store | Bin 0 -> 6148 bytes README.md => custom-crawls/README.md | 0 .../lessons/lesson-name.md | 0 4 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 .DS_Store create mode 100644 custom-crawls/.DS_Store rename README.md => custom-crawls/README.md (100%) rename lesson-name.md => custom-crawls/lessons/lesson-name.md (100%) diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..cf32db0ca20fdd58c74b42cff5b84b80e80f6aa6 GIT binary patch literal 6148 zcmeHK&1%~~5T141WQ7U!p^zN+(whr54#_DXT=x*jp`g$n+LB0?i3O6@SW-mlIOsLn zH|iy~kf#Y`_QyfNts$2Z!VJuQv!j`n_FJ>#0RT2A!X7{y0640`hJ(XjL_Jr!A`R== zL}Ypm2?>0F5zN3}i6+NOWI&zWHlA1k8RSquzY=1|f-;*1u?#TkPcV(6Ebn&jSYxC4 zW^;>k-s0P>AKpZkUhd~bHt=WP^ehU;fwfPhj2~@3_!5oF)AqZMGR^%c9nWgv-@is_ zD9eE?(lFEOeD?DR<~okUozwR2eBST(JEFJ0x9EuZ!9lkp_V&Ar1?M~OdxuA7$>q=a z)#Cb>;hEu2Qp;P5OZbh(nk~oYJ#QqF=`wtMbRvhtu*gRMw>QW4I%mz=i430+J)Jz~ zOD?=w8q4$?zhSxa+7`$p28aP-pw583CU~pvB%2r@28aP!t~p5eWAmF zxdv$@28e;D3~ZQbTh;$h|L*^vCee%-AO>C)1FSjph6CJ^t*tAYQ?*v0o}nsHah1Wx l6f|@x##mK~cTu&VUnm37wOAQM4+{Me&@|9M4E$FH9sq6ged7QC literal 0 HcmV?d00001 diff --git a/custom-crawls/.DS_Store b/custom-crawls/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..c5cdd560d7b8bece1c104415ba36a095c9377bfc GIT binary patch literal 6148 zcmeHKJFdb&477PjAkn0x+$(T{6@n9Rfu{#_5h)`2RGf>WG5!<}1v)4|+NFyS1xS?EZ7@O^zk8F|=1;TMgCnxC-{l5L{M%j-8#+^wkN9pnW<9DrX z(Wn3wpaN8Y3Q&O;E06_tzIgGuJdO%bfzMaK?uP<5tcgRQUmX~{1prPFcEj9z31G1R zuqF=`Eo-SGgIZ^>CaIe5mEE}u; zTlkay|DMDZ6`%rNrGQR1yUiM(l(lvAaaL;!{1i80h5~3oFM#>1>UW|4VV!Xwg3PC literal 0 HcmV?d00001 diff --git a/README.md b/custom-crawls/README.md similarity index 100% rename from README.md rename to custom-crawls/README.md diff --git a/lesson-name.md b/custom-crawls/lessons/lesson-name.md similarity index 100% rename from lesson-name.md rename to custom-crawls/lessons/lesson-name.md From d472687984952538ea80916d3b0c0f713debaa45 Mon Sep 17 00:00:00 2001 From: ebenp Date: Sun, 3 Sep 2017 11:30:27 -0400 Subject: [PATCH 03/10] Delete .DS_Store --- .DS_Store | Bin 6148 -> 0 bytes 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 .DS_Store diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index cf32db0ca20fdd58c74b42cff5b84b80e80f6aa6..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHK&1%~~5T141WQ7U!p^zN+(whr54#_DXT=x*jp`g$n+LB0?i3O6@SW-mlIOsLn zH|iy~kf#Y`_QyfNts$2Z!VJuQv!j`n_FJ>#0RT2A!X7{y0640`hJ(XjL_Jr!A`R== zL}Ypm2?>0F5zN3}i6+NOWI&zWHlA1k8RSquzY=1|f-;*1u?#TkPcV(6Ebn&jSYxC4 zW^;>k-s0P>AKpZkUhd~bHt=WP^ehU;fwfPhj2~@3_!5oF)AqZMGR^%c9nWgv-@is_ zD9eE?(lFEOeD?DR<~okUozwR2eBST(JEFJ0x9EuZ!9lkp_V&Ar1?M~OdxuA7$>q=a z)#Cb>;hEu2Qp;P5OZbh(nk~oYJ#QqF=`wtMbRvhtu*gRMw>QW4I%mz=i430+J)Jz~ zOD?=w8q4$?zhSxa+7`$p28aP-pw583CU~pvB%2r@28aP!t~p5eWAmF zxdv$@28e;D3~ZQbTh;$h|L*^vCee%-AO>C)1FSjph6CJ^t*tAYQ?*v0o}nsHah1Wx 
l6f|@x##mK~cTu&VUnm37wOAQM4+{Me&@|9M4E$FH9sq6ged7QC From 17f33821e259a2c52a705197dcd41a0550f73b9c Mon Sep 17 00:00:00 2001 From: ebenp Date: Sun, 3 Sep 2017 11:30:43 -0400 Subject: [PATCH 04/10] Delete .DS_Store --- custom-crawls/.DS_Store | Bin 6148 -> 0 bytes 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 custom-crawls/.DS_Store diff --git a/custom-crawls/.DS_Store b/custom-crawls/.DS_Store deleted file mode 100644 index c5cdd560d7b8bece1c104415ba36a095c9377bfc..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHKJFdb&477PjAkn0x+$(T{6@n9Rfu{#_5h)`2RGf>WG5!<}1v)4|+NFyS1xS?EZ7@O^zk8F|=1;TMgCnxC-{l5L{M%j-8#+^wkN9pnW<9DrX z(Wn3wpaN8Y3Q&O;E06_tzIgGuJdO%bfzMaK?uP<5tcgRQUmX~{1prPFcEj9z31G1R zuqF=`Eo-SGgIZ^>CaIe5mEE}u; zTlkay|DMDZ6`%rNrGQR1yUiM(l(lvAaaL;!{1i80h5~3oFM#>1>UW|4VV!Xwg3PC From b5847eae7dae6ff8f79f49db763a33f127166349 Mon Sep 17 00:00:00 2001 From: ebenp Date: Thu, 7 Sep 2017 19:06:44 -0400 Subject: [PATCH 05/10] added back readme.md --- README.md | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..9500127 --- /dev/null +++ b/README.md @@ -0,0 +1,74 @@ +# Data Together Learning Materials + +This primer introduces key concepts for community-based data stewardship and contains a series of tutorials explaining Data Together and showing how to add content to the network, annotate content that’s already on the network, and reinforce content that is already stored on the network. +As a [GitBook](https://www.gitbook.com/), it can be read in many different formats. + +## First Steps + +Check out the [Table of Contents](SUMMARY.md) or the sidebar on the left. Topics are broken down into _Tutorials_ with distinct _Lessons_ within each! + +The [first tutorial](add-dataset/) covers adding a dataset to Data Together and reviews how Data Together is different from other forms of preserving data. + +## Getting Help + +You can get help by any of the following methods: + +- add a question to the [datatogether/learning](https://github.com/datatogether/learning/issues) issue tracker +- speak with us on the [Archivers Slack](https://slack.archivers.space) + +--- + +## Contributing + +We welcome your input! If you notice any errors, would like to submit changes, or add any content, you can contribute improvements to this documentation on [GitHub](https://github.com/datatogether/learning): [github.com/datatogether/learning](https://github.com/datatogether/learning). + +### Cloning this Repo + +You can clone a copy of this repository using the following command line: + +```bash +$ git clone git@github.com:datatogether/learning.git +``` + +### Installing Dependencies + +To install GitBook, you will need [Node.js](https://nodejs.org/en/) (v4.0.0 or above) on your system and you must be running Windows, Mac OS X, Linux, or Unix. + +It is easiest to install `gitbook-cli` with [npm](https://www.npmjs.com/), the Node.js package manager. 
From your terminal, run the following command:
+
+```bash
+$ npm install gitbook-cli -g
+```
+
+Additional instructions for setting up and installing GitBook can be found in the [GitBook Toolchain Documentation](https://toolchain.gitbook.com/setup.html)
+
+### Running Locally
+
+Once you make changes to the contents, you can preview them by running a local GitBook server:
+
+```bash
+$ gitbook serve
+```
+
+After starting the server using the command above, visit `http://localhost:4000` (or whatever address was indicated by the `gitbook serve` command) in your web browser.
+
+### Deploying
+
+The [`scripts/`](scripts/) folder has all you need to rebuild the GitBook materials in multiple formats and publish to `gh-pages` and [datatogether.github.io/learning](https://datatogether.github.io/learning/):
+
+```bash
+$ bash scripts/build_formats.sh
+$ bash scripts/publish_gh-pages.sh
+```
+
+You may need to install Calibre's ebook-convert cli tools. For Mac OS X, these can be copied from the Calibre application:
+
+```bash
+$ ln -s /Applications/calibre.app/Contents/MacOS/ebook-convert /usr/local/bin
+```
+
+## License
+
+Data Together Learning Materials are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
+
+See the [`LICENSE`](/LICENSE) file for details.

From a9598bd9ddcaff0d3d767b868a1cb5151df1f1b5 Mon Sep 17 00:00:00 2001
From: ebenp
Date: Thu, 7 Sep 2017 19:10:55 -0400
Subject: [PATCH 06/10] Deleted lesson-name.md

---
 custom-crawls/lessons/lesson-name.md | 11 -----------
 1 file changed, 11 deletions(-)
 delete mode 100644 custom-crawls/lessons/lesson-name.md

diff --git a/custom-crawls/lessons/lesson-name.md b/custom-crawls/lessons/lesson-name.md
deleted file mode 100644
index b654fd8..0000000
--- a/custom-crawls/lessons/lesson-name.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Lesson: Name
-
-## Goals
-
-## Steps
-
-### Step 1:
-
-### Step N:
-
-## Next Steps

From 41c1b4a34c841f8f1cba9de8987d3cfa63c1cd9f Mon Sep 17 00:00:00 2001
From: ebenp
Date: Thu, 7 Sep 2017 19:36:14 -0400
Subject: [PATCH 07/10] Updated readme.md to address PR comments

---
 custom-crawls/README.md | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/custom-crawls/README.md b/custom-crawls/README.md
index d89883c..7630015 100644
--- a/custom-crawls/README.md
+++ b/custom-crawls/README.md
@@ -4,39 +4,39 @@ _Note: This tutorial is a work in progress. Please add your feedback to [datatog
 
 ## Prerequisites
 
-* You care about a dataset that exists on the web and downloading the data could be benifit from some automation of downloading
+* You would like to provide a custom representation of data on a website. This can include difficult-to-scrape dynamic content such as database views, web application or search form results, but can also include "crawlable" content that may be useful in a different data representation (e.g., a CSV version of an HTML table).
 
 ## Learning Objectives
 
 After going through this tutorial you will know
 
 * What a custom crawler is and why some websites need one
-* What should your custom crawler extract from a webpage?
-* How to write a custom crawler that works with DataTogether
+* What your custom crawler needs to extract from a webpage
+* How to write a custom crawler that works with Data Together
 
 ## Key Concepts
 
-* Custom Crawler: An auotmated way to download data and prepare it for upload into the DataTogether network. This is usually a script file that is written specifically for a dataset.
-* Morph.io: An online service that automates and saves user created scripts.
-* Archivertools: An Python package to aid in accessing Morph.io and DataTogether APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
+* Custom Crawler: An automated way to download data and prepare it for upload into the Data Together network. This is usually a script file that is written specifically for a dataset.
+* [Morph.io](https://morph.io/): An online service that automates and saves user-created scripts.
+* Archivertools: A Python package to aid in accessing Morph.io and Data Together APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
 
 
 ## Lessons
 
 1. What is custom crawling?
- * Why do some websites need custom crawls?
- * What should your custom crawler extract from the webpage?
- * Examples of sites needing custom crawlers
+    * Why do some websites need custom crawls?
+    * What should your custom crawler extract from the webpage?
+    * Examples of sites needing custom crawlers
 2. Introduction/tutorial for Morph
- * What is Morph.io?
- * How to setup a Morph.io account
- * Getting a DataTogether API key, and making sure Morph can access it
+    * What is Morph.io?
+    * How to set up a Morph.io account
+    * Getting a Data Together API key, and making sure Morph can access it
 2. A tutorial for Archivertools package
- * What does it do?
- * Installing the package
- * 
- * Using the Archiver class
+    * What does it do?
+    * Installing the package
+    * 
+    * Using the Archiver class
 3. Some example custom crawls scripts and implementation
 
 ## Next Steps
-Look at the other resources under DataTogether for more bakground on DataTogether and storing datasets
\ No newline at end of file
+Look at the other resources under Data Together for more background on Data Together and storing datasets.

From 295a126b787b66b932561c05fd753d1e5e390e3d Mon Sep 17 00:00:00 2001
From: ebenp
Date: Thu, 7 Sep 2017 19:37:50 -0400
Subject: [PATCH 08/10] Added archivertools links

---
 custom-crawls/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/custom-crawls/README.md b/custom-crawls/README.md
index 7630015..5734753 100644
--- a/custom-crawls/README.md
+++ b/custom-crawls/README.md
@@ -18,7 +18,7 @@ After going through this tutorial you will know
 
 * Custom Crawler: An automated way to download data and prepare it for upload into the Data Together network. This is usually a script file that is written specifically for a dataset.
 * [Morph.io](https://morph.io/): An online service that automates and saves user-created scripts.
-* Archivertools: A Python package to aid in accessing Morph.io and Data Together APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
+* [Archivertools](https://github.com/datatogether/archivertools): A Python package to aid in accessing Morph.io and Data Together APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
 
 
 ## Lessons

From b8ecb2f83b7af63325536a5486f8c3f71b936934 Mon Sep 17 00:00:00 2001
From: dcwalk
Date: Mon, 23 Oct 2017 23:21:50 -0400
Subject: [PATCH 09/10] minor formatting tweaks

---
 custom-crawls/README.md | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/custom-crawls/README.md b/custom-crawls/README.md
index 5734753..ccb3ef9 100644
--- a/custom-crawls/README.md
+++ b/custom-crawls/README.md
@@ -20,23 +20,23 @@ After going through this tutorial you will know
 * [Morph.io](https://morph.io/): An online service that automates and saves user-created scripts.
 * [Archivertools](https://github.com/datatogether/archivertools): A Python package to aid in accessing Morph.io and Data Together APIs using an Archiver class. This package also contains some common scraping functions. Currently written in Python 3.
 
-
 ## Lessons
 
 1. What is custom crawling?
-    * Why do some websites need custom crawls?
-    * What should your custom crawler extract from the webpage?
-    * Examples of sites needing custom crawlers
-2. Introduction/tutorial for Morph
-    * What is Morph.io?
-    * How to set up a Morph.io account
-    * Getting a Data Together API key, and making sure Morph can access it
-2. A tutorial for Archivertools package
-    * What does it do?
-    * Installing the package
-    * 
-    * Using the Archiver class
-3. Some example custom crawls scripts and implementation
+   * Why do some websites need custom crawls?
+   * What should your custom crawler extract from the webpage?
+   * Examples of sites needing custom crawlers
+1. Introduction/tutorial for Morph
+   * What is Morph.io?
+   * How to set up a Morph.io account
+   * Getting a Data Together API key, and making sure Morph can access it
+1. A tutorial for Archivertools package
+   * What does it do?
+   * Installing the package
+   * 
+   * Using the Archiver class
+1. Some example custom crawl scripts and implementation
 
 ## Next Steps
+
 Look at the other resources under Data Together for more background on Data Together and storing datasets.

From 01805db9b38077259f524ff4ac674927fbb70f8c Mon Sep 17 00:00:00 2001
From: ebenp
Date: Thu, 2 Nov 2017 19:24:51 -0400
Subject: [PATCH 10/10] split out lesson stubs

Splitting out lesson stubs
---
 .gitignore                               | 1 +
 custom-crawls/archivertools Tutorial.md  | 7 +++++++
 custom-crawls/morph.io Tutorial.md       | 4 ++++
 custom-crawls/what is custom crawling.md | 4 ++++
 4 files changed, 16 insertions(+)
 create mode 100644 custom-crawls/archivertools Tutorial.md
 create mode 100644 custom-crawls/morph.io Tutorial.md
 create mode 100644 custom-crawls/what is custom crawling.md

diff --git a/.gitignore b/.gitignore
index 4cb12d8..de77a57 100644
--- a/.gitignore
+++ b/.gitignore
@@ -14,3 +14,4 @@ _book
 *.epub
 *.mobi
 *.pdf
+.DS_Store
diff --git a/custom-crawls/archivertools Tutorial.md b/custom-crawls/archivertools Tutorial.md
new file mode 100644
index 0000000..8884d24
--- /dev/null
+++ b/custom-crawls/archivertools Tutorial.md
@@ -0,0 +1,7 @@
+1. A tutorial for Archivertools package
+   * What does it do?
+   * Installing the package
+   * 
+   * Using the Archiver class
+1. Some example custom crawl scripts and implementation
+
diff --git a/custom-crawls/morph.io Tutorial.md b/custom-crawls/morph.io Tutorial.md
new file mode 100644
index 0000000..7d20f93
--- /dev/null
+++ b/custom-crawls/morph.io Tutorial.md
@@ -0,0 +1,4 @@
+1. Introduction/tutorial for Morph
+   * What is Morph.io?
+   * How to set up a Morph.io account
+   * Getting a Data Together API key, and making sure Morph can access it
\ No newline at end of file
diff --git a/custom-crawls/what is custom crawling.md b/custom-crawls/what is custom crawling.md
new file mode 100644
index 0000000..0a8d3a1
--- /dev/null
+++ b/custom-crawls/what is custom crawling.md
@@ -0,0 +1,4 @@
+1. What is custom crawling?
+   * Why do some websites need custom crawls?
+   * What should your custom crawler extract from the webpage?
+   * Examples of sites needing custom crawlers
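
---

To make the custom-crawler idea concrete while the lesson stubs above are still being filled in, here is a minimal sketch of the kind of Morph.io `scraper.py` this tutorial is working toward. It is a sketch under stated assumptions, not a definitive implementation: the dataset URL is a placeholder, the `MORPH_DT_API_KEY` environment-variable name is made up, and the `Archiver` constructor and `add_file`/`commit` calls are guesses at the archivertools interface rather than its confirmed API; check the [archivertools repository](https://github.com/datatogether/archivertools) for the real method names. Only the `requests`, `BeautifulSoup`, and standard-library calls are used as documented.

```python
# scraper.py -- minimal sketch of a custom crawler for a single dataset page.
# The archivertools calls near the end are assumptions, flagged inline.
import csv
import io
import os

import requests
from bs4 import BeautifulSoup

URL = "https://example.gov/some-dataset"  # placeholder, not a real dataset


def scrape_table(url):
    """Download one page and flatten its first HTML table into CSV bytes."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> on this page; it needs a different parser")

    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return response.content, buffer.getvalue().encode("utf-8")


if __name__ == "__main__":
    raw_html, table_csv = scrape_table(URL)

    # Hypothetical hand-off to Data Together: the constructor arguments and
    # the add_file/commit method names are guesses at the archivertools
    # interface, kept here only to show the intended shape of a crawl.
    from archivertools import Archiver

    archiver = Archiver(url=URL, api_key=os.environ["MORPH_DT_API_KEY"])
    archiver.add_file(raw_html, "page.html")   # original page, for provenance
    archiver.add_file(table_csv, "table.csv")  # cleaned CSV representation
    archiver.commit("nightly crawl")
```

On Morph.io the Data Together API key would live in the scraper's secret settings; Morph.io exposes such secrets to the running script as environment variables whose names begin with `MORPH_`, which is why the sketch reads the hypothetical `MORPH_DT_API_KEY` from the environment.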