artvee-scraper

artvee-scraper is an easy-to-use library for fetching public domain artwork from Artvee.

Overview

Artvee-scraper is a web scraper which concurrently extracts artwork from Artvee. Callbacks are notified asynchronously for each scraped artwork so that user-defined actions may be taken. These actions are typically used to store the artwork, which can subsequently be used for display, machine learning, or other applications.

If you are looking for a command line utility, note that it has moved to a separate project: artvee-scraper-cli. Alternatively, you may still use artvee-scraper 3.0.1.

Installation

Using PyPI

```
$ python -m pip install artvee-scraper
```

Python 3.10+ is officially supported.

Getting Started

Create callbacks (lambda, function, method).

```
# Use a lambda to log the event
log_event = lambda artwork, thrown: logger.info(
    "Processing '%s' by %s", artwork.title, artwork.artist
)

# Write the artwork to a file as JSON
def on_artwork_received(artwork: Artwork, thrown: Exception | None = None) -> None:
    if thrown is None:
        with open(f"/tmp/{artwork.resource}.json", "w", encoding="UTF-8") as fout:
            json.dump(artwork.to_dict(), fout, ensure_ascii=False)
```
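The list above mentions methods as a third kind of callback. A minimal sketch of a bound-method callback follows, assuming only the `(artwork, thrown)` signature shown above. `ArtworkCollector` is a hypothetical helper, not part of artvee-scraper, and the `SimpleNamespace` objects are stand-ins for real `Artwork` instances so the sketch runs without network access. Because callbacks are notified asynchronously, the shared lists are guarded with a lock.

```python
import threading
from types import SimpleNamespace
from typing import Optional


class ArtworkCollector:
    """Collects titles and errors; hypothetical helper, not part of artvee-scraper."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # callbacks may run on worker threads
        self.titles = []
        self.failures = []

    def on_artwork(self, artwork, thrown: Optional[Exception] = None) -> None:
        """Bound method matching the (artwork, thrown) callback signature."""
        with self._lock:
            if thrown is None:
                self.titles.append(artwork.title)
            else:
                self.failures.append(thrown)


collector = ArtworkCollector()

# Stand-in objects; the scraper itself would pass real Artwork instances.
collector.on_artwork(SimpleNamespace(title="Composition", artist="Unknown"))
collector.on_artwork(
    SimpleNamespace(title="Study", artist="Unknown"), ValueError("download failed")
)

print(collector.titles)         # ['Composition']
print(len(collector.failures))  # 1
```

With a real scraper, the method would be registered as `scraper.register_listener(collector.on_artwork)`.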

Initialize the scraper.

```
scraper = ArtveeScraper() # scrapes all categories by default
```

Register callbacks. The callbacks will be notified asynchronously for each event in the order that they are registered.
```
scraper.register_listener(log_event).register_listener(on_artwork_received)
```
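To illustrate why the chained call works and what "in the order that they are registered" means, here is a simplified sketch of the listener pattern. It is not the library's implementation: `ListenerRegistry` and `notify` are hypothetical names, and this version notifies synchronously, whereas the scraper notifies asynchronously.

```python
from typing import Any, Callable, Optional

Listener = Callable[[Any, Optional[Exception]], None]


class ListenerRegistry:
    """Hypothetical stand-in illustrating ordered, chainable registration."""

    def __init__(self) -> None:
        self._listeners = []

    def register_listener(self, listener: Listener) -> "ListenerRegistry":
        self._listeners.append(listener)
        return self  # returning self is what allows call chaining

    def notify(self, artwork: Any, thrown: Optional[Exception] = None) -> None:
        # Listeners are invoked in registration order
        for listener in self._listeners:
            listener(artwork, thrown)


calls = []
registry = ListenerRegistry()
registry.register_listener(lambda a, t: calls.append("first")).register_listener(
    lambda a, t: calls.append("second")
)
registry.notify({"title": "Example"})
print(calls)  # ['first', 'second']
```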
Start scraping. Use either the context manager or join() to block until done.
Example 1 - using context manager
```
with scraper as s:
    s.start() # blocks until done
```
Example 2 - using join()
```
scraper.start()
# ... do other things
scraper.join() # blocks until done
```

Examples

Create app.py

```
import logging
import os

from artvee_scraper.artvee_client import CategoryType
from artvee_scraper.artwork import Artwork
from artvee_scraper.scraper import ArtveeScraper

# Set up logging configuration
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s.%(msecs)03d %(levelname)s [%(threadName)s] %(module)s.%(funcName)s(%(lineno)d) | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
logger = logging.getLogger(__name__)


def handle_event(artwork: Artwork, thrown: Exception | None = None) -> None:
    """A callback for handling the result of an artwork processing event."""

    if thrown is not None:
        # An error occurred; the artwork is partially populated (missing artwork.image.raw)
        logger.error("Failed to process artist=%s, title=%s, url=%s; %s", artwork.artist, artwork.title, artwork.url, thrown)
    else:
        file_path = os.path.expanduser(f"~/Downloads/{artwork.resource}.jpg") # create a unique filename
        logger.info("Writing %s to %s", artwork.title, file_path)

        # Write the raw image bytes to a file.
        with open(file_path, "wb") as fout:
            fout.write(artwork.image.raw)


def main():
    # Choose which categories to scrape. Using `list(CategoryType)` creates a list of all categories.
    categories = [CategoryType.ABSTRACT, CategoryType.DRAWINGS]

    # Initialize the scraper
    scraper = ArtveeScraper(categories=categories)

    # Register listener functions
    scraper.register_listener(handle_event)

    # Start scraping
    with scraper as s:
        s.start() # blocks until done


if __name__ == "__main__":
    main()
```
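The callback in app.py writes to a fixed `~/Downloads` path. As a possible variation, the sketch below builds the callback from a factory so the output directory is configurable and created up front. `make_image_writer` is a hypothetical helper, not part of artvee-scraper; it relies only on the `artwork.resource` and `artwork.image.raw` attributes used above, and the demo uses `SimpleNamespace` stand-ins rather than real `Artwork` objects.

```python
import os
import tempfile
from types import SimpleNamespace
from typing import Callable, Optional


def make_image_writer(output_dir: str) -> Callable[..., None]:
    """Return a callback that saves each image under output_dir (hypothetical helper)."""
    os.makedirs(output_dir, exist_ok=True)  # create the directory up front

    def write_image(artwork, thrown: Optional[Exception] = None) -> None:
        if thrown is not None:
            return  # artwork.image.raw is missing when an error occurred
        file_path = os.path.join(output_dir, f"{artwork.resource}.jpg")
        with open(file_path, "wb") as fout:
            fout.write(artwork.image.raw)

    return write_image


# Exercise the callback with stand-in objects instead of real Artwork instances
demo_dir = os.path.join(tempfile.mkdtemp(), "artwork")
writer = make_image_writer(demo_dir)
writer(SimpleNamespace(resource="example-resource", image=SimpleNamespace(raw=b"\xff\xd8\xff")))
writer(SimpleNamespace(resource="broken", image=None), RuntimeError("download failed"))
print(sorted(os.listdir(demo_dir)))  # ['example-resource.jpg']
```

In app.py this could be registered with `scraper.register_listener(make_image_writer(os.path.expanduser("~/Downloads")))`.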

Run app.py

```
me@linux-desktop:~$ python app.py
2038-01-19 19:36:36.839 DEBUG [MainThread] scraper.start(125) | Starting
2038-01-19 19:36:36.839 DEBUG [Thread-1 (_exec)] scraper._exec(152) | Executing scraper for categories [<CategoryType.ABSTRACT: 'abstract'>, <CategoryType.DRAWINGS: 'drawings'>]
2038-01-19 19:36:36.839 DEBUG [Thread-1 (_exec)] artvee_client.get_page_count(113) | Retrieving page count; category=abstract
2038-01-19 19:36:36.854 DEBUG [Thread-1 (_exec)] connectionpool._new_conn(1051) | Starting new HTTPS connection (1): artvee.com:443
2038-01-19 19:36:37.737 DEBUG [Thread-1 (_exec)] connectionpool._make_request(546) | https://artvee.com:443 "GET /c/abstract/page/1/?per_page=70 HTTP/11" 301 0
2038-01-19 19:36:37.827 DEBUG [Thread-1 (_exec)] connectionpool._make_request(546) | https://artvee.com:443 "GET /c/abstract/?per_page=70 HTTP/11" 200 19573
2038-01-19 19:36:37.955 DEBUG [Thread-1 (_exec)] scraper._exec(160) | Category abstract has 108 page(s)
2038-01-19 19:36:37.955 DEBUG [Thread-1 (_exec)] scraper._exec(166) | Processing category abstract, page (1/108)
2038-01-19 19:36:37.955 DEBUG [Thread-1 (_exec)] artvee_client.get_metadata(152) | Retrieving metadata; category=abstract, page=1
    ...
```

API Reference

API documentation is available on Read the Docs.

zduclos/artvee-scraper