Adding in non Archiver scraping web links #13

Open
ebenp opened this issue Sep 14, 2017 · 5 comments
Comments

ebenp commented Sep 14, 2017

While looking up code syntax, I found the blog post below and the GitHub repo it references. I wondered whether links like these should be tracked as non-archiving web scraping links under research/web_scraping.

I'm not sure what the best format is for folks to add, comment on, and edit links, and I don't have any sense of how frequently such a resource would be updated.

I'm interested in people's thoughts on 1) whether this belongs in research/web_scraping or somewhere else, and 2) how to put together a useful PR on the topic, including the preferred tracking format and any document organization.
cc @b5 @jeffreyliu @weatherpattern @mhucka

Example links:
http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/
https://github.com/stanfordjournalism/search-script-scrape

@jeffreyliu
Collaborator

Hm, so I'm not clear on what the distinction between this and web_scraping would be? I'm also not entirely clear on what non-archiving scraping means in this context. Could you clarify those points? I think these resources should definitely be included, but perhaps just under the web_scraping section.

Perhaps for each category, there should be an issue for suggesting links to include, and those that we think are good resources should be added via PR?

Author

ebenp commented Sep 14, 2017

I don't think there should be a distinction between this and web_harvesting.
Sorry, I just looked and realized this is the location I meant to mention for saving the links:
https://github.com/datatogether/research/tree/master/web_harvesting

I was thinking the links would live in a README, or in a Google Sheet linked from inside the web_harvesting folder, if that makes sense.

The only scraping distinction I was thinking of is between links such as these and the scraping we do for Data Together archiving, which uses archivertools and morph.io. I think those examples should be kept out of this research repo.

Member

mhucka commented Sep 18, 2017

You're right, this needs clarification. I hope to get back to this this week.

Member

mhucka commented Sep 21, 2017

@ebenp @jeffreyliu Finally looking at this, I now remember the original idea behind the two directories. One is for cataloging software systems that do web archiving/scraping/etc., and the other is meant to be research on approaches to doing that (i.e., overall approach, algorithms, examples of software that does it, etc.). I struggled with how to name the directories, and clearly failed badly.

What if web_harvesting were renamed to harvesting_approaches or something similar?

Regarding the distinction between scraping and archiving, I might be wrong, but I think there is a difference, because a system that scrapes web pages does not necessarily have to archive or store the results. For example, I've written a system that scrapes pages to get info and stores specific bits of info in a custom database, but it doesn't archive the whole page or harvest the page/site in the way that we talk about those things in Archivers & Data Together.
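
To make that distinction concrete, here is a minimal Python sketch. It is only an illustration and not the system mentioned above: the URL, the sample HTML, the `facts` table, and the `scrape`/`archive` function names are all made up, and a hard-coded page stands in for a real fetch.

```python
import hashlib
import sqlite3
from html.parser import HTMLParser

# Hypothetical page; a real tool would fetch this over HTTP.
SAMPLE_URL = "http://example.org/report"
SAMPLE_HTML = (
    "<html><head><title>Annual Report</title></head>"
    "<body><p>Budget: 42</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element and ignores the rest."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(url, html, db):
    """Scraping: keep one specific bit of info, discard the page itself."""
    parser = TitleParser()
    parser.feed(html)
    db.execute("INSERT INTO facts (url, title) VALUES (?, ?)", (url, parser.title))
    return parser.title


def archive(url, html, store):
    """Archiving: keep the whole page, keyed by a hash of its content."""
    raw = html.encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    store[digest] = {"url": url, "content": raw}
    return digest


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (url TEXT, title TEXT)")
store = {}  # stand-in for an archive backend

title = scrape(SAMPLE_URL, SAMPLE_HTML, db)
digest = archive(SAMPLE_URL, SAMPLE_HTML, store)
```

After `scrape`, only the extracted title exists in the database and the page cannot be reconstructed from it; after `archive`, the full page content can be read back unchanged.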

IMHO, the term "harvesting" could mean either scraping or archiving, although looking around, I now see that Wikipedia basically makes "web scraping" synonymous with "web harvesting" and "web data extraction", so I guess it's closer to the meaning of scraping.

Author

ebenp commented Sep 23, 2017

Web harvesting makes sense to me, and I also really like the detail given above about what harvesting and archiving mean in terms of Data Together. Maybe those definitions could end up in the directory readme.

3 participants