
Spring Boot application exposing API for web crawling, creating a Search Engine with Elasticsearch, and utilizing Kafka and Redis


roeishc/search-engine


Search Engine

This project lets you search for text on the web. At a high level, it works as follows:

  1. Given a base URL, the program crawls that page and every URL found on it.
  2. The contents of each webpage are saved in Elasticsearch, which allows quick searching, filtering, etc.

In-depth explanation

  • For crawling, the program uses Depth-First Search (DFS) to find additional webpages. Any of 3 conditions can stop the algorithm: runtime (in seconds), depth, or total URLs visited.
  • All visited URLs are saved in Redis, which prevents visiting them again and allows concurrency.
  • Every time a new URL is found, it is added to a Kafka topic so that multiple threads/containers can access the queue.
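
The crawl loop described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the project's actual code: an in-memory set stands in for Redis, an in-memory stack stands in for the Kafka topic, and the names `CrawlSketch`, `StopReason`, `Result`, `Node`, and `fetchLinks` are hypothetical. Here the depth limit prunes expansion rather than ending the crawl outright, which is one possible reading of the depth condition.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CrawlSketch {
    // Hypothetical stop reasons; the real project may name or report them differently.
    enum StopReason { TIMEOUT, MAX_URLS, FRONTIER_EMPTY }

    record Result(Set<String> visited, StopReason reason) {}

    record Node(String url, int depth) {}

    static Result crawl(String baseUrl, int maxDepth, int maxUrls, long maxSeconds,
                        Function<String, List<String>> fetchLinks) {
        Set<String> visited = ConcurrentHashMap.newKeySet(); // assumption: stand-in for Redis
        Deque<Node> stack = new ArrayDeque<>();              // assumption: stand-in for the Kafka topic
        long deadline = System.nanoTime() + maxSeconds * 1_000_000_000L;
        stack.push(new Node(baseUrl, 0));

        while (!stack.isEmpty()) {
            if (System.nanoTime() >= deadline) {
                return new Result(visited, StopReason.TIMEOUT);
            }
            if (visited.size() >= maxUrls) {
                return new Result(visited, StopReason.MAX_URLS);
            }
            Node node = stack.pop();
            // add() returns false when the URL was already recorded, so it is skipped.
            if (!visited.add(node.url())) {
                continue;
            }
            // The page contents would be sent to the search index at this point.
            if (node.depth() >= maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String link : fetchLinks.apply(node.url())) {
                if (!visited.contains(link)) {
                    stack.push(new Node(link, node.depth() + 1));
                }
            }
        }
        return new Result(visited, StopReason.FRONTIER_EMPTY);
    }

    public static void main(String[] args) {
        // A tiny fake "web" so the sketch runs without network access.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "d"),
            "c", List.of(),
            "d", List.of("e"),
            "e", List.of());
        Result r = crawl("a", 2, 10, 60, url -> web.getOrDefault(url, List.of()));
        System.out.println(r.visited().size() + " " + r.reason());
    }
}
```

With a shared store such as Redis, the same "add and check whether it was new" step is what would let several workers crawl without revisiting a page.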

The project exposes an API for creating a new crawl. The request returns a crawl ID immediately, while the crawl itself continues on a separate thread. A second request returns the crawl's status, given its ID.
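
The "return an ID now, crawl in the background" pattern can be sketched with the standard library alone. This is an illustrative sketch, not the project's Spring controller: `CrawlRegistrySketch`, `Status`, `startCrawl`, and `getStatus` are hypothetical names, and the in-memory map is an assumption about where status might be kept.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlRegistrySketch {
    // Hypothetical status values; the real API reports more detail (e.g. a stop reason).
    enum Status { RUNNING, FINISHED }

    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    // Registers the crawl, hands the work to another thread, and returns the ID right away.
    String startCrawl(Runnable crawlJob) {
        String id = UUID.randomUUID().toString();
        statuses.put(id, Status.RUNNING);
        pool.submit(() -> {
            try {
                crawlJob.run();
            } finally {
                statuses.put(id, Status.FINISHED);
            }
        });
        return id;
    }

    // Returns null when the ID is unknown.
    Status getStatus(String id) {
        return statuses.get(id);
    }

    // Stops accepting new crawls and waits briefly for running ones to finish.
    void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CrawlRegistrySketch registry = new CrawlRegistrySketch();
        String id = registry.startCrawl(() -> { /* crawl work would run here */ });
        System.out.println("crawl id: " + id);
        registry.shutdown();
        System.out.println("status: " + registry.getStatus(id));
    }
}
```

In a multi-container setup the status map would need to live in a shared store rather than in one process's memory.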

To run this project on your machine, you'll need an OpenSearch (or Elasticsearch) account, and you must add your API key and your index to the application properties. Then clone the project and, from the project's root folder, run the following command:

docker-compose up -d

Make sure that all the containers are up and running (Redis, Zookeeper, Kafka):

docker ps

Once all the containers are up and running, run the project.

Example

Creating a new crawl on www.cnn.com via Swagger UI (notice the crawl ID in the response body):




Getting the crawl's status (notice the stop reason - timeout of 60 seconds):




And searching in OpenSearch (Elasticsearch):


