
Add design doc #54

Open · VKorelsky wants to merge 4 commits into master

Conversation

@VKorelsky VKorelsky commented Apr 10, 2021

Scraping Phase 1 - Design Doc Proposal

Scraping pipeline MVP


![top_20_sources](../images/top_20_sources_osvp_excel.png)

This suggests that we should start writing a scraper for [El Pitazo](https://elpitazo.net/) first.
@marianelamin marianelamin (Collaborator) Apr 11, 2021

There was some work already started for El Pitazo on #48; I copied that into a Google Colab for better review. Might be worth taking a look.

Collaborator

> There was some work already started for El Pitazo on #48; I copied that into a Google Colab for better review. Might be worth taking a look.

While the code from the PRs/notebooks works, it is not a scraping module that can be imported or used by users of this library.

My guess is that writing a more "formal" scraper for the library (or for another, scraping-specific library) is a new task to consider.

There are many benefits to building simple but decoupled modules. I think we should strive to separate the following topics into different repositories, or at least into distinct modules inside c4v-py.

  • Information Extraction (IE)
    • Crawling (e.g. link discovery)
    • Scraping (e.g. HTML parsing, RSS feed scraper, or data cleaning utils)
  • Classification

In this case, we focus on IE.

Information Extraction

The information extraction library/module.

As said earlier, it can live in its own repository (e.g. c4v-ie, for IE), or it can be its own module inside this one, for example from c4vpy.information_extraction import scraper or something along those lines.

@Edilmo said that all the work should be inside this library. I am all for it.

So, to resolve this conversation, should we create a task to build a first scraping module, using El Pitazo as the first example? After Luis (the intern) practices using Scrapy, this can be a good first issue.

Collaborator

@m4r4c00ch0 I agree. Just wanted to clarify, these notebooks I have been mentioning are just a proof of concept. I suggest we take what we can use from them (copy/paste with some modifications) and make it part of a formal module that belongs to the library.

Member

@m4r4c00ch0 I was planning on adding the requests + beautifulsoup solution (#48) into a module within src/ and adding its unit test. However, I like the idea of Luis creating a module + test using Scrapy better.

I will merge #48; only the notebook will be merged into notebooks/, so the code serves as a snippet showing how to scrape El Pitazo.
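
For reference, a minimal sketch of what such a requests + beautifulsoup module could look like; the function name and the selectors (plain h1 / p tags) are placeholders, not the actual code from #48:

# Hypothetical El Pitazo scraper sketch; selectors are placeholders, not the ones from #48.
import requests
from bs4 import BeautifulSoup


def scrape_elpitazo_article(url: str) -> dict:
    """Fetch an El Pitazo article and return its title and body text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("h1")
    paragraphs = soup.find_all("p")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "text": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }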

@VKorelsky VKorelsky (Collaborator, Author) Apr 13, 2021

> So, to resolve this conversation, should we create a task to build a first scraping module, using El Pitazo as the first example? After Luis (the intern) practices using Scrapy

Yep @m4r4c00ch0, that sounds good.
I was chatting with @LDiazN today and we discussed that this would be a first task.
I think he'll reach out to you to discuss specifics, since you've implemented scraping in Python before.

@asciidiego asciidiego (Collaborator) Apr 15, 2021

> @m4r4c00ch0 I was planning on adding the requests + beautifulsoup solution (#48) into a module within src/ and adding its unit test. However, I like the idea of Luis creating a module + test using Scrapy better.
>
> I will merge #48; only the notebook will be merged into notebooks/, so the code serves as a snippet showing how to scrape El Pitazo.

Can you, instead, copy the snippet and paste it into a folder named snippets or something? That would be much easier to try! For example, I do not have Jupyter, and I cannot easily see the code on GitHub when you push notebooks. Open the PR with the code and, since it is snippet-like code, the review should be fast.

If you create the PR, could you please resolve this thread, @dieko95?

Collaborator Author

Regarding the notebook stuff @m4r4c00ch0, I think it'd be good if you installed Jupyter - I can see that the cookiecutter data science repo structure suggests keeping notebooks in the repo as well.

This seems to imply that reviewing these will require checking out the branch locally and opening the notebook there to have a look. I agree that it's not a great review experience.

We can also maybe consider tools like this one, which could actually help a lot with this: https://www.reviewnb.com/

@marianelamin marianelamin (Collaborator) left a comment

LGTM

@asciidiego asciidiego (Collaborator) left a comment

I proposed a subtle change to the document and a new task related to the scraper.

I also added more context which, if you feel like it'd be useful, could be added to the design document itself.

A sample design (sketched below) can simply be to:
* Define a scraper interface, which takes in a URL (+ anything else needed) and outputs scraped data.
* Implement a top-level method, which contains a map of URL domains to scraper implementations.
* Output is written to data frame
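
To make that concrete, here is a rough sketch of that shape; every name below (ElPitazoScraper, SCRAPERS, scrape) is illustrative, not existing c4v-py code:

# Rough sketch of the sample design; class/function names are illustrative only.
from urllib.parse import urlparse

import pandas as pd


class ElPitazoScraper:
    """Scraper implementation for elpitazo.net (parsing details omitted)."""

    def scrape(self, url: str) -> dict:
        # A real implementation would fetch and parse the page here.
        return {"url": url, "text": "..."}


# Top-level map of URL domains to scraper implementations.
SCRAPERS = {"elpitazo.net": ElPitazoScraper()}


def scrape(urls):
    """Dispatch each URL to the scraper registered for its domain."""
    rows = []
    for url in urls:
        domain = urlparse(url).netloc
        if domain.startswith("www."):
            domain = domain[len("www."):]
        scraper = SCRAPERS.get(domain)
        if scraper is not None:
            rows.append(scraper.scrape(url))
    return pd.DataFrame(rows)  # output is written to a data frame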
Collaborator

Suggested change:
- * Output is written to data frame
+ * Output is a string

To make things even simpler, since the output is just text content, I propose we write a Str -> Str interface (URL -> Text Content). That way, if we want a data_frame (or any other kind of Python object), we can write simple adapters. These are easier to test, decoupled from libraries, and extendable.
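
As a concrete illustration of that contract (scrape_text and as_dataframe are hypothetical names, and the body of scrape_text is just a placeholder):

# Hypothetical sketch of the Str -> Str contract plus a data-frame adapter on top of it.
import pandas as pd


def scrape_text(url: str) -> str:
    """URL in, text content out. Placeholder body; a real scraper would fetch and parse here."""
    return f"scraped text for {url}"


def as_dataframe(urls):
    """Adapter: builds a data frame on top of the plain Str -> Str interface."""
    urls = list(urls)
    return pd.DataFrame({"url": urls, "text": [scrape_text(u) for u in urls]})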

Collaborator

For example, to tackle the point in line 35, if Angostura requires data_frames, we (or the Angostura team) can write those adapters without much hassle, test them, and they can become plugins of the c4v-py library. Something along the lines of c4v-py.community.angostura or c4v-py.contrib.angostura.

I've seen this approach used many times by projects: if a certain package inside the community packages proves itself robust or core-ish enough, it becomes part of the core modules.

@VKorelsky VKorelsky (Collaborator, Author) Apr 13, 2021

Yeah, I like that approach 👍🏽
The one thing I'm thinking about, though, is that if that top-level method takes all the URLs, then it would be better if the resulting data structure keeps the relationship of URL -> scraped_data.

That way we know what to persist for each URL.

I was thinking we could maybe have an internal object in our lib that is used in this top-level loop.
Then the adapter is just a wrapper around that data frame that implements this interface.

wdyt?

Collaborator

I would make it the other way, but I like your point on persistence. It's important to discuss that.

Ideally, our "object" should be agnostic of the persistence mechanisms. In fact, it should not be an object at all. After all, it is just data. Objects imply behavior.

As a core library, we have to think about our users. They might (and probably will!) be interested in saving the data as CSV, data frames, XML, JSON, or Excel files. We need flexibility. This is why our data should be a super simple data structure (in the strict computer-science sense of the word).

For example, a (Hash)Map—a dictionary in Python. The keys could be the URLs themselves—or a combination of parameters, such as the URL and timestamp—or a custom hashed computation. In the case of the values, we would have the scraped data itself. To give an example, another map.

This approach is already similar to how one would interact with a NoSQL database, but we are only using Plain Old Python Data Structures.

The problem when one creates their own objects, whether or not they use a 3rd-party library inside, is that many libraries know how to deal with Python data structures (e.g. Dict or OrderedDict), but have no idea about custom objects. For instance, CodeForVenezuelaData or CodeForVenezuelaDataFrame, or even pandas data frames themselves. So, to use those libraries, we would be forced to write unnecessary glue code.

Thus, to tackle the persistence problem, we must decouple the "business" logic (should I call it non-profit logic? hehe) from the cross-cutting concerns (i.e. persistence), and not couple ourselves to a framework, in this case pandas.

Below I propose a sample architecture:

# follows the contract of URL -> DATA (whatever DATA might be, probably yet another data structure)
scraped_data = {
    "url1": { ... },
    "url2": { ... },
}

Then, in our wrapper module, we have trivial transformational code:

# wrappers/pandas.py
from pandas import DataFrame

def scrapedDataAsPandasDataFrame(scraped_data):
    # orient="index" turns each URL key into a row of the resulting data frame
    return DataFrame.from_dict(scraped_data, orient="index")

Obviously, you cannot put just any map into a data frame, so our contract would have to take that into account.

Then, if you NEED data frames for any reason to persist the data, you can just have something like...

# persistence/whatever.py
from wrappers.pandas import scrapedDataAsPandasDataFrame

class BigQueryRepository:
    def persist(self, data):
        df = scrapedDataAsPandasDataFrame(data)
        ...  # write df to BigQuery here

Besides separating concerns, the testability of this system is higher, since we would only have functions with signature Dict -> ?, instead of 3rdParty -> ?. If we must couple to something, I'd rather it be the language itself.
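
To illustrate that point, a test for the Dict -> DataFrame wrapper sketched above could look like this (the test name and sample data are made up):

# test_wrappers.py -- illustrative test of the Dict -> DataFrame adapter sketched above.
from wrappers.pandas import scrapedDataAsPandasDataFrame


def test_scraped_data_becomes_one_row_per_url():
    scraped_data = {
        "https://elpitazo.net/some-article": {"title": "t1", "text": "body 1"},
        "https://elpitazo.net/other-article": {"title": "t2", "text": "body 2"},
    }
    df = scrapedDataAsPandasDataFrame(scraped_data)
    assert len(df) == 2
    assert set(df.columns) == {"title", "text"}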

HAVING SAID ALL OF THAT

I still agree with @Edilmo that we should FIRST make it work as fast as we can, and then we can start doing the relevant refactors. That is precisely what I told @LDiazN: get working code ASAP, and then we can start separating concerns and testing them appropriately. However, it's great that we can discuss all of this while Luis is working on the prototype! As the saying goes... when is the best time to change the code? Before you even wrote it! (If you are wondering where that saying came from, I think it was from Raymond H., but I don't recall atm.)

First, make it possible. Then, make it beautiful. Then, make it fast.
Source

Collaborator

Made a note of this. It is not complicated to add these conversions. I opened issue #66 to define the scraper interface.

Collaborator Author

I'm going to add the scraper interface design to the tasks board. Will pair on it with @LDiazN possibly tomorrow when we chat.

@m4r4c00ch0 I agree with your points. A simple map + wrappers will do just fine. I'll work on this with Luis tomorrow.
For now we should aim to just get it working and running in an Airflow job, over the test data. That will give us the foundation for everything else and clear up any unknown unknowns associated with hooking it up to Airflow. Once the full thing is working end to end, we can start expanding, etc.
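
Purely as a sketch of that wiring, assuming Airflow 2.x; the DAG id, schedule, and run_scraper body are placeholders, not the actual pipeline:

# Hypothetical Airflow DAG; ids, schedule, and the callable body are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_scraper():
    # Placeholder: call the scraping entry point over the test data
    # and persist the output (e.g. via the BigQuery adapter discussed above).
    pass


with DAG(
    dag_id="c4v_scraping_mvp",
    start_date=datetime(2021, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_and_persist",
        python_callable=run_scraper,
    )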


@VKorelsky VKorelsky marked this pull request as ready for review April 18, 2021 18:50

2. The python function exposed by the module is called with these rows passed in.
3. Scraping happens.
4. Scraping output is persisted.
Collaborator

@VKorelsky maybe we should have another step where the persisted data goes through a cleaning/filtering process into a new column or another table. TBD.

What would be some of the things we would not want to pass to the classifier?

  • swearing? certain webpages? (in phase 2 there will be an improvement here)
  • data we identify as unrelated
  • data we know for sure does not provide value

Some context:
When we started looking at Twitter posts (a year ago), we noticed that some businesses liked to use trending hashtags in their posts just to get visibility (a free booster for their ads), and we would see posts like the following:
#sinluz #sinagua ven y gana un premio este sabado en el hipodromo de caracas ("#nopower #nowater come and win a prize this Saturday at the Caracas racetrack")

This post has nothing to do with our public-services issues and ends up being noise for our classifier. Unfortunately, this issue was so frequent that it could not be ignored.
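
A rough sketch of what such a pre-classification filter could look like; the marker phrases and the row shape are invented for illustration:

# Illustrative pre-classification filter; marker phrases and row shape are made up.
SPAM_MARKERS = ["gana un premio", "sorteo", "promo"]  # hypothetical ad/giveaway phrases


def looks_like_ad(text: str) -> bool:
    """Flag posts that ride trending hashtags just to advertise."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def filter_for_classifier(rows):
    """Keep only the rows we actually want to pass to the classifier."""
    return [row for row in rows if not looks_like_ad(row["text"])]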

@marianelamin marianelamin self-requested a review April 23, 2021 02:38
@Edilmo Edilmo mentioned this pull request Oct 30, 2021