
Trawler

A job scheduler and analysis tool for web-scraping (and other) tasks.


Datasources

Currently, the following datasources are implemented:

  • tiktok: fetch video metadata per hashtag, download the videos, and extract on-screen text with EasyOCR

  • gab (nazi-twitter): crawl posts for a given user

  • onionlist: download the Tor catalogue from onionlist.org

  • google dorking: find interesting files and download them

  • facebook posts and reactions: scrape Facebook posts, comments, and reactions (like, heart, etc.)

Features

  • simple configuration of actions/datasources, also from 3rd-party modules/repos (see the job sketch after this list)
  • job monitoring and scheduling
  • SQLite, CSV, and JSON browser
  • separation of datasets/artifacts (one archive per crawl)
  • scalable number of workers (also on other machines)
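
Since jobs are managed through JSON files (see Worker(s) below), a job definition might look roughly like the sketch here. All field names are illustrative assumptions, not the project's actual schema:

    {
      "id": "tiktok-hashtag-example",
      "datasource": "tiktok",
      "schedule": "0 * * * *",
      "params": { "hashtag": "example", "ocr": true },
      "archive": "artifacts/tiktok-example"
    }

The "archive" field reflects the one-archive-per-crawl separation of artifacts listed above.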

Architecture

Frontend and API

  • GUI to create and schedule jobs
  • Displays pending, running, and completed jobs
  • Displays CSV and SQLite datasets

Worker(s)

  • Can be distributed (workers and command-and-control on different machines/servers)
  • Jobs are managed through JSON files (and can be distributed with an adapter like PouchDB)
  • Multithreaded
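
A minimal sketch of how a worker might claim and consume such a JSON job file. The directory layout and the Job shape are assumptions for illustration, not Trawler's documented internals:

    // worker.ts: minimal file-based job consumer (hypothetical layout)
    import { readdirSync, readFileSync, renameSync } from "fs";
    import { join } from "path";

    interface Job {
      id: string;
      datasource: string;               // one of the datasources above
      params: Record<string, unknown>;  // datasource-specific options
    }

    const PENDING = "jobs/pending";     // assumed directory names
    const RUNNING = "jobs/running";

    for (const file of readdirSync(PENDING)) {
      // Claim the job by moving its file before working on it, so several
      // workers can watch the same directory without processing a job twice.
      const claimed = join(RUNNING, file);
      renameSync(join(PENDING, file), claimed);
      const job: Job = JSON.parse(readFileSync(claimed, "utf8"));
      console.log(`picked up job ${job.id} (${job.datasource})`);
      // ...dispatch to the matching datasource handler here...
    }

Because the queue is just files, a shared or replicated directory (or a sync adapter such as PouchDB, as noted above) is enough to spread workers across machines.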

Install

Using Docker Compose

Build and start all services with Docker Compose by running:

docker-compose up
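
The compose file defines the frontend/API alongside the workers. Assuming the worker service is named worker (check the repository's docker-compose.yml for the actual service name), additional workers can be started with Compose's built-in scaling flag:

docker-compose up --scale worker=4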