This project allows searching for text across the web. A high-level explanation of the method:
- Given a base URL, the program crawls it and all of the URLs found within its page.
- Each webpage's contents are saved in Elasticsearch, allowing for quick searching, filtering, etc.
- For crawling, the program uses Depth-First Search (DFS) to find additional webpages. Three conditions can stop the algorithm: maximum runtime (in seconds), maximum depth, or total URLs visited.
- All visited URLs are saved in Redis, preventing repeat visits and allowing concurrent workers to coordinate.
- Every time a new URL is found, it is added to a Kafka topic, so multiple threads/containers can consume the queue (see the sketch after this list).
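As a rough sketch of one crawl-worker iteration, this is how the pieces above could fit together. The class, topic, and key names here are hypothetical, and the actual project may structure this differently:

```java
import java.time.Duration;
import java.time.Instant;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import redis.clients.jedis.Jedis;

public class CrawlWorkerSketch {

    // Hypothetical stop limits mirroring the three conditions above.
    static final Duration MAX_RUNTIME = Duration.ofSeconds(60);
    static final int MAX_DEPTH = 3;
    static final long MAX_URLS = 1_000;

    /**
     * Processes one URL pulled from the Kafka topic.
     * Returns false when a stop condition is met.
     */
    static boolean processUrl(String url, int depth, Instant startedAt,
                              Jedis redis, KafkaProducer<String, String> producer) throws Exception {
        // Stop conditions: runtime, depth, or total URLs visited.
        if (Duration.between(startedAt, Instant.now()).compareTo(MAX_RUNTIME) > 0) return false;
        if (depth > MAX_DEPTH) return false;
        if (redis.scard("visited") >= MAX_URLS) return false;

        // A Redis set acts as the shared "visited" registry; sadd returns 0
        // if the URL was already present, so concurrent workers skip it.
        if (redis.sadd("visited", url) == 0) return true;

        Document page = Jsoup.connect(url).get();
        // ... index page.title() / page.text() into Elasticsearch here ...

        // Publish every newly discovered link to the Kafka topic so any
        // worker thread/container can pick it up.
        for (Element link : page.select("a[href]")) {
            String next = link.absUrl("href");
            if (!next.isEmpty()) {
                producer.send(new ProducerRecord<>("crawl-urls", next));
            }
        }
        return true;
    }
}
```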
The project exposes an API for creating a new crawl, which returns a crawl ID while the crawl continues on a separate thread. Another request returns the crawl's status, given its ID.
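The endpoints might look roughly like this; the paths, parameter names, and status strings are assumptions for illustration, not the project's actual API:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CrawlControllerSketch {

    private final Map<String, String> statuses = new ConcurrentHashMap<>();

    // Starts a crawl asynchronously and immediately returns its ID.
    @PostMapping("/crawl")
    public String createCrawl(@RequestParam String url) {
        String crawlId = UUID.randomUUID().toString();
        statuses.put(crawlId, "RUNNING");
        new Thread(() -> {
            // ... run the crawl, then record why it stopped ...
            statuses.put(crawlId, "STOPPED: timeout");
        }).start();
        return crawlId;
    }

    // Returns the crawl's current status, given its ID.
    @GetMapping("/crawl/{crawlId}")
    public String getStatus(@PathVariable String crawlId) {
        return statuses.getOrDefault(crawlId, "UNKNOWN");
    }
}
```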
In order to run this project on your machine, you'll need an account in OpenSearch (or Elasticsearch), and you'll need to add your API key and your index to the application properties.
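The property keys below are placeholders only (an assumption, not the project's actual configuration); match them to whatever the project's application properties define:

```properties
# Hypothetical property names -- adjust to the project's actual keys
opensearch.url=https://your-cluster.example.com
opensearch.api-key=YOUR_API_KEY
opensearch.index=your-index
```

Then, clone the project, and from the project's root folder, run the following command: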
```
docker-compose up -d
```
Make sure that all the containers are up and running (Redis, Zookeeper, Kafka):
```
docker ps
```
Once all the containers are running, start the project.
Creating a new crawl on www.cnn.com via Swagger UI (notice the crawl ID in the response body):
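Outside Swagger UI, the same request can be made programmatically. This snippet assumes the hypothetical `/crawl` endpoint sketched above and the default local port:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateCrawlExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // POST the base URL to start a crawl (endpoint path is assumed).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/crawl?url=https://www.cnn.com"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Crawl ID: " + response.body());
    }
}
```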
Getting the crawl's status (notice the stop reason - timeout of 60 seconds):
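And the status check, again with an assumed endpoint path:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlStatusExample {
    public static void main(String[] args) throws Exception {
        String crawlId = args[0]; // the ID returned when the crawl was created
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/crawl/" + crawlId))
                .GET()
                .build();
        // The body should include the status and, once stopped, the stop
        // reason (e.g., the 60-second timeout mentioned above).
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```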
And searching in OpenSearch (Elasticsearch):
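A minimal full-text query against the index might look like this. The `_search` endpoint and `match` query are standard Elasticsearch/OpenSearch APIs, but the cluster URL, index name, `content` field, and auth header are placeholders for your own setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // "your-index" and the "content" field are placeholders for
        // whatever the project actually indexes.
        String query = """
                {"query": {"match": {"content": "breaking news"}}}
                """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://your-cluster.example.com/your-index/_search"))
                .header("Content-Type", "application/json")
                .header("Authorization", "ApiKey YOUR_API_KEY")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```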