Skip to content

Releases: PhongCT1105/SyntheSearch

SyntheSearch v1.0.0

19 Nov 01:18
98ded1b
Compare
Choose a tag to compare

SyntheSearch

Authors

Phong Cao, Hien Hoang, Doanh Phung, Minh Bui

🚀 Introduction

SyntheSearch is a web application designed to streamline the research process for students and researchers by efficiently locating relevant research papers. Researchers often spend hours sifting through papers, hoping to find the studies that best match their interests. SyntheSearch aims to reduce this time by intelligently suggesting the most relevant papers and generating a synthesis to reveal how the studies interrelate, offering users an insightful overview that saves time and enhances understanding.

🌱 Inspiration

The inspiration for SyntheSearch came from our own experiences as students. Before HackUMass XII, one team member struggled to find research papers on machine-learning applications in cancer detection. The process of locating credible sources was exhausting and time-consuming, even with optimized library search tools. This frustration inspired us to develop a more efficient search engine that leverages Large Language Models (LLM) and vector databases to quickly surface relevant research and summarize findings.

🔨 How We Built the Project

We chose Python for the back end because of its extensive frameworks for AI development. Databricks was used to streamline our machine-learning pipeline. Here’s how we approached building SyntheSearch:

1. Data Collection

We started by scraping data from the CORE collection of open-access research papers.

2. Embedding

Using LangChain, we implemented OpenAI's text-embedding-3-large model to convert paper texts into vector embeddings.

3. Storage

We utilized LanceDB as our vector database, storing the embedded vectors for fast and efficient retrieval.

4. Summarization and Synthesis

We employed OpenAI’s gpt-4o-mini model to generate summaries, suggestions, and synthesized insights.

5. Front-end

We built the user interface using React.js with a TypeScript template, providing a clean and responsive experience for users.

🛠 Technologies Used

  • Backend: Python, FastAPI, LangChain, OpenAI, LanceDB
  • Frontend: React.js, TypeScript, TailwindCSS, Vite
  • Database: LanceDB (for fast vector storage and retrieval)
  • AI Models: OpenAI's GPT-4 mini, text-embedding-3-large

⚠️ Challenges Faced

One major challenge was handling GitHub workflows. We encountered frequent issues with pull request conflicts, which slowed our progress as we resolved merge conflicts. Additionally, our team sometimes struggled with communication, resulting in duplicated work when members unintentionally tackled the same tasks.

🧠 Lessons Learned

This project was an invaluable learning experience. As it was our first LLM project, we gained hands-on experience with GenAI technologies, particularly the power of vector databases. We learned the importance of clear team communication, and we now have a deeper understanding of LLMs and their capabilities in revolutionizing information retrieval.

🚧 Next Steps:

  • Expand the dataset of research papers for broader coverage.
  • Improve the paper recommendation algorithms and synthesis engine.
  • Continue optimizing the UI for a more intuitive user experience.

📦 Built With

  • Core: CORE
  • GenAI: OpenAI, LangChain
  • GitHub: GitHub (for version control)
  • LanceDB: LanceDB (for vector storage)
  • LLM: Large Language Models
  • Python: Python (backend)
  • RAG: Retrieval Augmented Generation (RAG)
  • React: React.js (frontend)
  • TailwindCSS: TailwindCSS (styling)
  • TypeScript: TypeScript (for type safety)
  • Vite: Vite (bundling)

Thank you for checking out SyntheSearch! We hope it helps you save time and enhance your research workflow. Stay tuned for future updates and improvements!