Adding in non Archiver scraping web links #13

ebenp · 2017-09-14T23:19:10Z

While looking up code syntax, I found the following blog post and the GitHub repo it references. I wondered whether links such as the examples below should be tracked as non-archiving scraping web links under research/web_scraping.

I'm not quite sure what the best format is for folks to add links, comment, and edit, and I don't have any sense of how frequently such a resource would be updated.

I'm interested in people's thoughts on 1) if this belongs in research/web_scraping or somewhere else and 2) how to go about a useful PR on the topic including preferred tracking format and any document organization.
cc @b5 @jeffreyliu @weatherpattern @mhucka

Example links:
http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/
https://github.com/stanfordjournalism/search-script-scrape

jeffreyliu · 2017-09-14T23:28:05Z

Hm, so I'm not clear on what the distinction between this and web_scraping would be? I'm also not entirely clear on what non-archiving scraping means in this context. Could you clarify those points? I think these resources should definitely be included, but perhaps just under the web_scraping section.

Perhaps for each category, there should be an issue for suggesting links to include, and those that we think are good resources should be added via PR?

ebenp · 2017-09-14T23:39:08Z

I don't think there should be a distinction between this and web_harvesting.
Sorry, I just checked, and this is the location I meant to mention for saving these links:
https://github.com/datatogether/research/tree/master/web_harvesting

I was thinking this would be a readme or a google sheet link inside the web_harvesting folder as the location to save this if that makes sense.

The only scraping distinction I had in mind is between links such as these and the scraping we do for Data Together archiving, which uses archivertools and morph.io. I think those examples should be kept out of this research repo.

mhucka · 2017-09-18T05:25:57Z

You're right, this needs clarification. I hope to get back to this this week.

mhucka · 2017-09-21T06:18:10Z

@ebenp @jeffreyliu Finally looking at this, I now remember the original idea behind the two directories. One is for cataloging software systems that do web archiving/scraping/etc., and the other is meant to be research on approaches to doing that (i.e., overall approach, algorithms, examples of software that does it, etc.). I struggled with how to name the directories, and clearly failed badly.

What if web_harvesting were renamed to harvesting_approaches or something similar?

Regarding the distinction between scraping and archiving, I might be wrong, but I think there is a difference, because a system to scrape web pages does not necessarily have to archive or store the results. For example, I've written a system that scrapes pages to get info and store specific bits of info in a custom database, but it doesn't archive the whole page or harvest the page/site in the way that we talk about those things in Archivers & Data Together.

IMHO, the term "harvesting" could mean either scraping or archiving, although looking around, I now see that Wikipedia basically makes "web scraping" synonymous with "web harvesting" and "web data extraction", so I guess it's closer to the meaning of scraping.
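
As an illustration only (this is not code from the thread or from any Data Together tool), the scraping/archiving distinction described above could be sketched roughly as follows. Everything here is hypothetical: the sample HTML, the `scrape`, `store_scraped`, and `archive` helpers, and the `pages` table. Scraping pulls out a few specific fields and stores them in a custom database; archiving keeps the whole page unchanged.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content, used instead of a live HTTP request.
SAMPLE_HTML = (
    "<html><head><title>Example Dataset</title></head>"
    "<body><a href='/data.csv'>CSV</a><a href='/about'>About</a></body></html>"
)


class TitleAndLinks(HTMLParser):
    """Collects only the page title and link targets."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def scrape(html):
    """Scraping: extract specific bits of info and discard the rest."""
    parser = TitleAndLinks()
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}


def store_scraped(record, conn):
    """Store the extracted fields in a custom (hypothetical) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (title TEXT, link_count INTEGER)"
    )
    conn.execute(
        "INSERT INTO pages VALUES (?, ?)",
        (record["title"], len(record["links"])),
    )
    conn.commit()


def archive(html, store):
    """Archiving: keep the full page content, unmodified."""
    store["page.html"] = html


record = scrape(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
store_scraped(record, conn)

archived = {}
archive(SAMPLE_HTML, archived)
```

In this sketch the database row holds only a title and a link count, so the original page cannot be reconstructed from it, whereas the archived copy is the page itself.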

ebenp · 2017-09-23T14:27:17Z

Web harvesting makes sense to me, and I also really like the detail given above about what harvesting and archiving mean in terms of Data Together. Maybe those definitions could end up in the directory readme.
