
Add design doc #54

Open · VKorelsky wants to merge 4 commits into master

Conversation

@VKorelsky VKorelsky commented Apr 10, 2021

Scraping Phase 1 - Design Doc Proposal

Scraping pipeline MVP


![top_20_sources](../images/top_20_sources_osvp_excel.png)

This suggests that we should start writing a scraper for [El Pitazo](https://elpitazo.net/) first.
@marianelamin marianelamin (Collaborator) Apr 11, 2021

There was some work already started for El Pitazo on #48; I copied that into a Google Colab for better review. Might be worth taking a look.

Collaborator

> There was some work already started for El Pitazo on #48; I copied that into a Google Colab for better review. Might be worth taking a look.

While the code from the PRs/notebooks works, it is not a scraping module that can be imported or used by users of this library.

My guess is that writing a more "formal" scraper for the library (or for another, scraping-specific library) is a new task to consider.

There are many benefits to building simple but decoupled modules. I think we should strive to separate the following topics into different repositories, or at least into distinct modules inside c4v-py.

  • Information Extraction (IE)
    • Crawling (e.g. link discovery)
    • Scraping (e.g. HTML parsing, RSS feed scraper, or data cleaning utils)
  • Classification

In this case, we focus on IE.

Information Extraction

The information extraction library/module.

As said earlier, it can live in its own repository (e.g. c4v-ie, for IE), or it can be its own module inside this one, for example from c4vpy.information_extraction import scraper or something along those lines.

@Edilmo said that all the work should be inside this library. I am all for it.

So, to resolve this conversation, should we create a task to build a first scraping module, using El Pitazo as the first example? After Luis (the intern) practices using Scrapy, this can be a good first issue.

Collaborator

@m4r4c00ch0 I agree. Just wanted to clarify, these notebooks I have been mentioning are just a proof of concept. I suggest we take what we can use from them (copy/paste with some modifications) and make it part of a formal module that belongs to the library.

Member

@m4r4c00ch0 I was planning on adding the requests + beautifulsoup solution (#48) into a module within src/ and adding its unit test. However, I like the idea of Luis creating a module + test using Scrapy better.

I will merge #48; only the notebook will be merged into notebooks/, so the code serves as a snippet showing how to scrape El Pitazo.
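
For reference, a minimal sketch of what such a requests + beautifulsoup module could look like; the function name and the selectors (plain h1 / p tags) are placeholders, not the actual code from #48:

# Hypothetical El Pitazo scraper sketch; selectors are placeholders, not the ones from #48.
import requests
from bs4 import BeautifulSoup


def scrape_elpitazo_article(url: str) -> dict:
    """Fetch an El Pitazo article and return its title and body text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1")
    paragraphs = soup.find_all("p")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "text": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }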

@VKorelsky VKorelsky (Collaborator, Author) Apr 13, 2021

> So, to resolve this conversation, should we create a task to build a first scraping module, using El Pitazo as the first example? After Luis (the intern) practices using Scrapy

Yep @m4r4c00ch0, that sounds good.
I was chatting with @LDiazN today and we discussed that this would be a first task.
I think he'll reach out to you to discuss specifics, since you've implemented scraping in Python before.

@asciidiego asciidiego (Collaborator) Apr 15, 2021

> @m4r4c00ch0 I was planning on adding the requests + beautifulsoup solution (#48) into a module within src/ and adding its unit test. However, I like the idea of Luis creating a module + test using Scrapy better.
>
> I will merge #48; only the notebook will be merged into notebooks/, so the code serves as a snippet showing how to scrape El Pitazo.

Can you, instead, copy the snippet and paste it into a folder named snippets or something? That would be much easier to try! For example, I do not have Jupyter, and I cannot easily see the code on GitHub when you push notebooks. Open the PR with the code and, since it is snippet-like code, the review should be fast.

If you create the PR, could you please resolve this thread, @dieko95?

Collaborator Author

Regarding the notebook stuff @m4r4c00ch0, I think it'd be good if you installed Jupyter - I can see that the cookiecutter data science repo structure suggests keeping notebooks in the repo as well.

This seems to imply that reviewing these will require checking out the branch locally and opening the notebook there to have a look. I agree that it's not a great review experience.

We can also maybe consider tools like this one, which could actually help a lot with this: https://www.reviewnb.com/

@marianelamin marianelamin (Collaborator) left a comment

LGTM

@asciidiego asciidiego (Collaborator) left a comment

I proposed a subtle change to the document and a new task related to the scraper.

I also added more context which, if you feel like it'd be useful, could be added to the design document itself.

A sample design (sketched below) can simply be to:
* Define a scraper interface, which takes in a URL (+ anything else needed) and outputs scraped data.
* Implement a top-level method, which contains a map of URL domains to scraper implementations.
* Output is written to data frame
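
To make that concrete, here is a rough sketch of that shape; every name below (ElPitazoScraper, SCRAPERS, scrape) is illustrative, not existing c4v-py code:

# Rough sketch of the sample design; class/function names are illustrative only.
from urllib.parse import urlparse

import pandas as pd


class ElPitazoScraper:
    """Scraper implementation for elpitazo.net (parsing details omitted)."""

    def scrape(self, url: str) -> dict:
        # A real implementation would fetch and parse the page here.
        return {"url": url, "text": "..."}


# Top-level map of URL domains to scraper implementations.
SCRAPERS = {"elpitazo.net": ElPitazoScraper()}


def scrape(urls):
    """Dispatch each URL to the scraper registered for its domain."""
    rows = []
    for url in urls:
        domain = urlparse(url).netloc
        if domain.startswith("www."):
            domain = domain[len("www."):]
        scraper = SCRAPERS.get(domain)
        if scraper is not None:
            rows.append(scraper.scrape(url))
    return pd.DataFrame(rows)  # output is written to a data frame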
Collaborator

Suggested change:
- * Output is written to data frame
+ * Output is a string

To make things even simpler, since the output is just text content, I propose we write a Str -> Str interface (URL -> Text Content). That way, if we want a data_frame (or any other kind of Python object), we can write simple adapters. These are easier to test, decoupled from libraries, and extendable.
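
As a concrete illustration of that contract (scrape_text and as_dataframe are hypothetical names, and the body of scrape_text is just a placeholder):

# Hypothetical sketch of the Str -> Str contract plus a data-frame adapter on top of it.
import pandas as pd


def scrape_text(url: str) -> str:
    """URL in, text content out. Placeholder body; a real scraper would fetch and parse here."""
    return f"scraped text for {url}"


def as_dataframe(urls):
    """Adapter: builds a data frame on top of the plain Str -> Str interface."""
    urls = list(urls)
    return pd.DataFrame({"url": urls, "text": [scrape_text(u) for u in urls]})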

Collaborator

For example, to tackle the point in line 35, if Angostura requires data_frames, we (or the Angostura team) can write those adapters without much hassle, test them, and they can become plugins of the c4v-py library. Something along the lines of c4v-py.community.angostura or c4v-py.contrib.angostura.

I've seen this approach used many times by projects: if a certain package inside the community packages proves itself robust or core-ish enough, it becomes part of the core modules.

@VKorelsky VKorelsky (Collaborator, Author) Apr 13, 2021

Yeah, I like that approach 👍🏽
The one thing I'm thinking about, though, is that if that top-level method takes all the URLs, then it would be better if the resulting data structure keeps the relationship of URL -> scraped_data.

That way we know what to persist for each URL.

I was thinking we could maybe have an internal object in our lib that is used in this top-level loop.
Then the adapter is just a wrapper around that data frame that implements this interface.

wdyt?

Collaborator

I would make it the other way, but I like your point on persistence. It's important to discuss that.

Ideally, our "object" should be agnostic of the persistence mechanisms. In fact, it should not be an object at all. After all, it is just data. Objects imply behavior.

As a core library, we have to think about our users. They might (and probably will!) be interested in saving the data as CSV, data frames, XML, JSON, or Excel files. We need flexibility. This is why our data should be a super simple data structure (in the strict computer-science sense of the word).

For example, a (Hash)Map—a dictionary in Python. The keys could be the URLs themselves—or a combination of parameters, such as the URL and timestamp—or a custom hashed computation. In the case of the values, we would have the scraped data itself. To give an example, another map.

This approach is already similar to how one would interact with a NoSQL database, but we are only using Plain Old Python Data Structures.

The problem when one creates their own objects, whether or not they use a 3rd-party library inside, is that many libraries know how to deal with Python data structures (e.g. Dict or OrderedDict), but have no idea about custom objects. For instance, CodeForVenezuelaData or CodeForVenezuelaDataFrame, or even pandas data frames themselves. So, to use those libraries, we would be forced to write unnecessary glue code.

Thus, to tackle the persistence problem, we must decouple the "business" logic (should I call it non-profit logic? hehe) from the cross-cutting concerns (i.e. persistence), and not couple ourselves to a framework, in this case pandas.

Below I propose a sample architecture:

# follows the contract of URL -> DATA (whatever DATA might be, probably yet another data structure)
scraped_data = {
    "url1": { ... },
    "url2": { ... },
}

Then, in our wrapper module, we have trivial transformational code:

# wrappers/pandas.py
from pandas import DataFrame

def scrapedDataAsPandasDataFrame(scraped_data):
    # orient="index" turns each URL key into a row of the resulting data frame
    return DataFrame.from_dict(scraped_data, orient="index")

Obviously, you cannot put just any map into a data frame, so our contract would have to take that into account.

Then, if you NEED data frames for any reason to persist the data, you can just have something like...

# persistence/whatever.py
from wrappers.pandas import scrapedDataAsPandasDataFrame

class BigQueryRepository:
    def persist(self, data):
        df = scrapedDataAsPandasDataFrame(data)
        ...  # write df to BigQuery here

Besides separating concerns, the testability of this system is higher, since we would only have functions with signature Dict -> ?, instead of 3rdParty -> ?. If we must couple to something, I'd rather it be the language itself.
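
To illustrate that point, a test for the Dict -> DataFrame wrapper sketched above could look like this (the test name and sample data are made up):

# test_wrappers.py -- illustrative test of the Dict -> DataFrame adapter sketched above.
from wrappers.pandas import scrapedDataAsPandasDataFrame


def test_scraped_data_becomes_one_row_per_url():
    scraped_data = {
        "https://elpitazo.net/some-article": {"title": "t1", "text": "body 1"},
        "https://elpitazo.net/other-article": {"title": "t2", "text": "body 2"},
    }
    df = scrapedDataAsPandasDataFrame(scraped_data)
    assert len(df) == 2
    assert set(df.columns) == {"title", "text"}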

HAVING SAID ALL OF THAT

I still agree with @Edilmo that we should FIRST make it work as fast as we can, and then we can start doing the relevant refactors. That is precisely what I told @LDiazN: get working code ASAP, and then we can start separating concerns and testing them appropriately. However, it's great that we can discuss all of this while Luis is working on the prototype! As the saying goes... when is the best time to change the code? Before you even wrote it! (If you are wondering where that saying came from, I think it was from Raymond H., but I don't recall atm.)

First, make it possible. Then, make it beautiful. Then, make it fast.
Source

Collaborator

Made a note of this. It is not complicated to add these conversions. I opened issue #66 to define the scraper interface.

Collaborator Author

I'm going to add the scraper interface design to the tasks board. Will pair on it with @LDiazN possibly tomorrow when we chat.

@m4r4c00ch0 I agree with your points. A simple map + wrappers will do just fine. I'll work on this with Luis tomorrow.
For now we should aim to just get it working and running in an Airflow job, over the test data. That will give us the foundation for everything else and clear up any unknown unknowns associated with hooking it up to Airflow. Once the full thing is working end to end, we can start expanding, etc.
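
Purely as a sketch of that wiring, assuming Airflow 2.x; the DAG id, schedule, and run_scraper body are placeholders, not the actual pipeline:

# Hypothetical Airflow DAG; ids, schedule, and the callable body are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_scraper():
    # Placeholder: call the scraping entry point over the test data
    # and persist the output (e.g. via the BigQuery adapter discussed above).
    pass


with DAG(
    dag_id="c4v_scraping_mvp",
    start_date=datetime(2021, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_and_persist",
        python_callable=run_scraper,
    )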


@VKorelsky VKorelsky marked this pull request as ready for review April 18, 2021 18:50

2. The python function exposed by the module is called with these rows passed in.
3. Scraping happens.
4. Scraping output is persisted.
Collaborator

@VKorelsky maybe we should have another step where the persisted data goes through a cleaning/filtering process into a new column or another table. TBD.

What would be some of the things we would not want to pass to the classifier?

  • swearing? certain webpages? (in phase 2 there will be an improvement here)
  • data we identify as unrelated
  • data we know for sure does not provide value

Some context:
When we started looking at Twitter posts (a year ago), we noticed that some businesses liked to use trending hashtags in their posts just to get visibility (a free booster for their ads), and we would see posts like the following:
#sinluz #sinagua ven y gana un premio este sabado en el hipodromo de caracas ("#nopower #nowater come and win a prize this Saturday at the Caracas racetrack")

This post has nothing to do with our public-services issues and ends up being noise for our classifier. Unfortunately, this issue was so frequent that it could not be ignored.
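
A rough sketch of what such a pre-classification filter could look like; the marker phrases and the row shape are invented for illustration:

# Illustrative pre-classification filter; marker phrases and row shape are made up.
SPAM_MARKERS = ["gana un premio", "sorteo", "promo"]  # hypothetical ad/giveaway phrases


def looks_like_ad(text: str) -> bool:
    """Flag posts that ride trending hashtags just to advertise."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def filter_for_classifier(rows):
    """Keep only the rows we actually want to pass to the classifier."""
    return [row for row in rows if not looks_like_ad(row["text"])]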

@marianelamin marianelamin self-requested a review April 23, 2021 02:38
@Edilmo Edilmo mentioned this pull request Oct 30, 2021