This project allows searching for text across the web. A high-level explanation of the method:
- Given a base URL, the program crawls it and all of the URLs found within its page.
- Each webpage's contents are saved in Elasticsearch, allowing for quick searching, filtering, etc.
- For crawling, the program uses Depth-First Search (DFS) to find additional webpages. Three conditions can stop the algorithm: maximum runtime (in seconds), maximum depth, or total URLs visited.
- All visited URLs are saved in Redis, preventing repeat visits and allowing concurrent workers to coordinate.
- Every time a new URL is found, it is added to a Kafka topic, so multiple threads/containers can consume the queue (see the sketch after this list).
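As a rough sketch of one crawl-worker iteration, this is how the pieces above could fit together. The class, topic, and key names here are hypothetical, and the actual project may structure this differently:

```java
import java.time.Duration;
import java.time.Instant;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import redis.clients.jedis.Jedis;

public class CrawlWorkerSketch {

    // Hypothetical stop limits mirroring the three conditions above.
    static final Duration MAX_RUNTIME = Duration.ofSeconds(60);
    static final int MAX_DEPTH = 3;
    static final long MAX_URLS = 1_000;

    /**
     * Processes one URL pulled from the Kafka topic.
     * Returns false when a stop condition is met.
     */
    static boolean processUrl(String url, int depth, Instant startedAt,
                              Jedis redis, KafkaProducer<String, String> producer) throws Exception {
        // Stop conditions: runtime, depth, or total URLs visited.
        if (Duration.between(startedAt, Instant.now()).compareTo(MAX_RUNTIME) > 0) return false;
        if (depth > MAX_DEPTH) return false;
        if (redis.scard("visited") >= MAX_URLS) return false;

        // A Redis set acts as the shared "visited" registry; sadd returns 0
        // if the URL was already present, so concurrent workers skip it.
        if (redis.sadd("visited", url) == 0) return true;

        Document page = Jsoup.connect(url).get();
        // ... index page.title() / page.text() into Elasticsearch here ...

        // Publish every newly discovered link to the Kafka topic so any
        // worker thread/container can pick it up.
        for (Element link : page.select("a[href]")) {
            String next = link.absUrl("href");
            if (!next.isEmpty()) {
                producer.send(new ProducerRecord<>("crawl-urls", next));
            }
        }
        return true;
    }
}
```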
The project exposes an API for creating a new crawl, which returns a crawl ID while the crawl continues on a separate thread. Another request returns the crawl's status, given its ID.
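The endpoints might look roughly like this; the paths, parameter names, and status strings are assumptions for illustration, not the project's actual API:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CrawlControllerSketch {

    private final Map<String, String> statuses = new ConcurrentHashMap<>();

    // Starts a crawl asynchronously and immediately returns its ID.
    @PostMapping("/crawl")
    public String createCrawl(@RequestParam String url) {
        String crawlId = UUID.randomUUID().toString();
        statuses.put(crawlId, "RUNNING");
        new Thread(() -> {
            // ... run the crawl, then record why it stopped ...
            statuses.put(crawlId, "STOPPED: timeout");
        }).start();
        return crawlId;
    }

    // Returns the crawl's current status, given its ID.
    @GetMapping("/crawl/{crawlId}")
    public String getStatus(@PathVariable String crawlId) {
        return statuses.getOrDefault(crawlId, "UNKNOWN");
    }
}
```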
In order to run this project on your machine, you'll need an account in OpenSearch (or Elasticsearch), and you'll need to add your API key and your index to the application properties.
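The property keys below are placeholders only (an assumption, not the project's actual configuration); match them to whatever the project's application properties define:

```properties
# Hypothetical property names -- adjust to the project's actual keys
opensearch.url=https://your-cluster.example.com
opensearch.api-key=YOUR_API_KEY
opensearch.index=your-index
```

Then, clone the project, and from the project's root folder, run the following command: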
```
docker-compose up -d
```
Make sure that all the containers are up and running (Redis, Zookeeper, Kafka):
```
docker ps
```
Once all the containers are running, start the project.
Creating a new crawl on www.cnn.com via Swagger UI (notice the crawl ID in the response body):
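Outside Swagger UI, the same request can be made programmatically. This snippet assumes the hypothetical `/crawl` endpoint sketched above and the default local port:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateCrawlExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // POST the base URL to start a crawl (endpoint path is assumed).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/crawl?url=https://www.cnn.com"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Crawl ID: " + response.body());
    }
}
```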
Getting the crawl's status (notice the stop reason - timeout of 60 seconds):
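And the status check, again with an assumed endpoint path:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlStatusExample {
    public static void main(String[] args) throws Exception {
        String crawlId = args[0]; // the ID returned when the crawl was created
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/crawl/" + crawlId))
                .GET()
                .build();
        // The body should include the status and, once stopped, the stop
        // reason (e.g., the 60-second timeout mentioned above).
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```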
And searching in OpenSearch (Elasticsearch):
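A minimal full-text query against the index might look like this. The `_search` endpoint and `match` query are standard Elasticsearch/OpenSearch APIs, but the cluster URL, index name, `content` field, and auth header are placeholders for your own setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // "your-index" and the "content" field are placeholders for
        // whatever the project actually indexes.
        String query = """
                {"query": {"match": {"content": "breaking news"}}}
                """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://your-cluster.example.com/your-index/_search"))
                .header("Content-Type", "application/json")
                .header("Authorization", "ApiKey YOUR_API_KEY")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```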