Make a scraper API #109

Closed
ArtskydJ opened this issue Jun 12, 2019 · 1 comment

Comments

@ArtskydJ
Owner

It would be nice to have a documented API for others to write scrapers.

See #86 (comment)
The current status is not nice.

I might need to make 2 interfaces.

  1. Sites like Gocomics, Arcamax, Comics Kingdom, etc. that host multiple strip series.
  2. Sites like Dilbert, XKCD, Freefall, etc. that have just one strip series.
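
A rough sketch of what those two interfaces could look like, written in Python purely for illustration. Every name here (`Strip`, `SingleSeriesScraper`, `MultiSeriesScraper`, `as_multi`, the field names) is hypothetical and not taken from the project:

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Strip:
    # Hypothetical fields; the real project may track different data.
    series: str
    date: str
    image_url: str


class SingleSeriesScraper(Protocol):
    """A site with exactly one strip series (interface 2 above)."""

    def strips(self) -> Iterator[Strip]: ...


class MultiSeriesScraper(Protocol):
    """A site that hosts many strip series (interface 1 above)."""

    def series_names(self) -> Iterator[str]: ...

    def strips(self, series: str) -> Iterator[Strip]: ...


class FakeSingle:
    """Stand-in scraper that yields one made-up strip and does no network I/O."""

    def strips(self) -> Iterator[Strip]:
        yield Strip("demo", "2019-06-12", "https://example.com/1.png")


def as_multi(name: str, scraper: SingleSeriesScraper) -> MultiSeriesScraper:
    """Wrap a single-series scraper so callers only deal with one interface."""

    class _Adapter:
        def series_names(self) -> Iterator[str]:
            yield name

        def strips(self, series: str) -> Iterator[Strip]:
            if series != name:
                raise KeyError(series)
            yield from scraper.strips()

    return _Adapter()


adapted = as_multi("demo", FakeSingle())
```

If an adapter like `as_multi` turns out to be cheap, the two interfaces might collapse into one for callers, while scraper authors still get to write the simpler of the two.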

Goals:

  1. Eliminate some of the gocomics-specific weirdness. (titleAuthorDate, are you serious?)
  2. Look into generators. I don't want to buffer everything in memory. I used to parse the whole gocomics site in memory, and then ran into memory issues. I need to write to disk as I go (for the multiple strip-series sites), but I'd also like to abstract away the writing to disk.
    1. I could pass in a function for them to write to disk.
    2. I could make a generator function.
    3. I could expose writing to the disk as a library.
    4. Other things that I'm not thinking about yet.
    5. Do I write a bunch of stuff to disk when I am in the process? I forget if this is even relevant/necessary...
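
Options 1 and 2 above can be combined. Here is a minimal sketch in Python, used purely for illustration; `scrape_site`, `make_jsonl_writer`, `run`, and the record fields are made-up names, not the project's API. The scraper is a generator that yields one record at a time, and the caller passes in the function that writes to disk:

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]


def scrape_site(pages: Iterable[Record]) -> Iterator[Record]:
    """Yield one record per series instead of building the whole list first."""
    for page in pages:
        # A real scraper would fetch and parse a page here.
        yield {"title": page["title"], "url": page["url"]}


def make_jsonl_writer(path: str) -> Callable[[Record], None]:
    """Return a function that appends one JSON line per record to `path`."""

    def write(record: Record) -> None:
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    return write


def run(records: Iterable[Record], write: Callable[[Record], None]) -> int:
    """Drain the generator, handing each record to the injected writer."""
    count = 0
    for record in records:
        write(record)
        count += 1
    return count


# Usage with fake input, so nothing touches the network.
fake_pages = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "B", "url": "https://example.com/b"},
]
out_path = os.path.join(tempfile.mkdtemp(), "series.jsonl")
written = run(scrape_site(fake_pages), make_jsonl_writer(out_path))
```

Only one record is held in memory at a time, and the scraper never needs to know where its output goes, so the writer could later be swapped for something else.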
@ArtskydJ
Owner Author

I made a scraper API. It has 1 interface. It does not eliminate the gocomics-specific weirdness.

I saw that the current gocomics scraper was actually keeping the whole thing in memory before finally writing to disk, so I kept the same behavior.

After I implement another multi-comics site scraper, I might have a better idea of how to make an API more specific to multi-comics sites.
