SyntheSearch v1.0.0
SyntheSearch
Authors
Phong Cao, Hien Hoang, Doanh Phung, Minh Bui
🚀 Introduction
SyntheSearch is a web application that streamlines the research process for students and researchers by quickly locating relevant research papers. Researchers often spend hours sifting through papers in search of the studies that best match their interests. SyntheSearch reduces this time by intelligently suggesting the most relevant papers and generating a synthesis that shows how the studies interrelate, giving users an insightful overview that saves time and deepens understanding.
🌱 Inspiration
The inspiration for SyntheSearch came from our own experiences as students. Before HackUMass XII, one team member struggled to find research papers on machine-learning applications in cancer detection. Locating credible sources was exhausting and time-consuming, even with optimized library search tools. That frustration inspired us to build a more efficient search engine that leverages large language models (LLMs) and vector databases to quickly surface relevant research and summarize findings.
🔨 How We Built the Project
We chose Python for the back end because of its extensive ecosystem of AI frameworks, and we used Databricks to streamline our machine-learning pipeline. Here’s how we approached building SyntheSearch:
1. Data Collection
We started by scraping data from the CORE collection of open-access research papers.
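The scraping code itself isn’t part of this release, but a minimal sketch of this step might look like the following, assuming the CORE v3 search API, a hypothetical `CORE_API_KEY` environment variable, and illustrative field names:

```python
import os

import requests

CORE_API_URL = "https://api.core.ac.uk/v3/search/works"

def fetch_papers(query: str, limit: int = 25) -> list[dict]:
    """Fetch open-access papers matching a query from the CORE search API."""
    resp = requests.get(
        CORE_API_URL,
        headers={"Authorization": f"Bearer {os.environ['CORE_API_KEY']}"},
        params={"q": query, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # Keep only the fields we embed later; the key names are assumptions
    # about CORE's JSON response and may need adjusting.
    return [
        {
            "title": hit.get("title") or "",
            "abstract": hit.get("abstract") or "",
            "full_text": hit.get("fullText") or "",
        }
        for hit in resp.json().get("results", [])
    ]
```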
2. Embedding
Using LangChain, we applied OpenAI's text-embedding-3-large model to convert paper texts into vector embeddings.
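A minimal sketch of this step, assuming the `langchain-openai` package and an `OPENAI_API_KEY` set in the environment:

```python
from langchain_openai import OpenAIEmbeddings

# text-embedding-3-large produces 3072-dimensional vectors.
embedder = OpenAIEmbeddings(model="text-embedding-3-large")

def embed_papers(texts: list[str]) -> list[list[float]]:
    """Embed a batch of paper texts in one call."""
    return embedder.embed_documents(texts)
```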
3. Storage
We utilized LanceDB as our vector database, storing the embedded vectors for fast and efficient retrieval.
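A sketch of the storage and retrieval side, assuming a local LanceDB directory and an illustrative `papers` schema rather than the project's actual layout:

```python
import lancedb

db = lancedb.connect("./lancedb")  # on-disk vector database

def index_papers(papers: list[dict], vectors: list[list[float]]):
    """Create (or replace) a table holding one row per embedded paper."""
    records = [
        {"vector": vec, "title": p["title"], "text": p["full_text"]}
        for p, vec in zip(papers, vectors)
    ]
    return db.create_table("papers", data=records, mode="overwrite")

def top_k(table, query_vector: list[float], k: int = 5) -> list[dict]:
    """Return the k nearest papers to the query embedding."""
    return table.search(query_vector).limit(k).to_list()
```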
4. Summarization and Synthesis
We employed OpenAI’s gpt-4o-mini model to generate summaries, suggestions, and synthesized insights.
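A sketch of the synthesis call, assuming LangChain's `ChatOpenAI` wrapper; the actual prompts aren't included in this release, so the wording below is illustrative:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

def synthesize(papers: list[dict]) -> str:
    """Ask the model to summarize the papers and explain how they interrelate."""
    context = "\n\n".join(f"{p['title']}:\n{p['text'][:2000]}" for p in papers)
    prompt = (
        "Briefly summarize each paper, then explain how these studies "
        "interrelate and what overall picture they paint:\n\n" + context
    )
    return llm.invoke(prompt).content
```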
5. Front-end
We built the user interface using React.js with a TypeScript template, providing a clean and responsive experience for users.
🛠 Technologies Used
- Backend: Python, FastAPI, LangChain, OpenAI, LanceDB (see the endpoint sketch after this list)
- Frontend: React.js, TypeScript, TailwindCSS, Vite
- Database: LanceDB (for fast vector storage and retrieval)
- AI Models: OpenAI's gpt-4o-mini, text-embedding-3-large
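To show how these back-end pieces fit together, here is a minimal end-to-end sketch of a search endpoint; the route path, table name, and prompt are assumptions rather than the project's actual API:

```python
import lancedb
from fastapi import FastAPI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pydantic import BaseModel

app = FastAPI()
embedder = OpenAIEmbeddings(model="text-embedding-3-large")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
table = lancedb.connect("./lancedb").open_table("papers")  # assumed table name

class SearchRequest(BaseModel):
    query: str
    k: int = 5

@app.post("/search")
def search(req: SearchRequest):
    """Embed the query, retrieve the nearest papers, and synthesize them."""
    hits = table.search(embedder.embed_query(req.query)).limit(req.k).to_list()
    context = "\n\n".join(f"{h['title']}:\n{h['text'][:2000]}" for h in hits)
    synthesis = llm.invoke(
        "Briefly summarize each paper, then explain how these studies "
        "interrelate:\n\n" + context
    ).content
    return {"papers": [h["title"] for h in hits], "synthesis": synthesis}
```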
⚠️ Challenges Faced
One major challenge was managing our GitHub workflow: frequent merge conflicts on pull requests slowed our progress. We also sometimes struggled with team communication, which led to duplicated work when members unintentionally tackled the same tasks.
🧠 Lessons Learned
This project was an invaluable learning experience. As it was our first LLM project, we gained hands-on experience with GenAI technologies, particularly the power of vector databases. We learned the importance of clear team communication, and we now have a deeper understanding of LLMs and their capabilities in revolutionizing information retrieval.
🚧 Next Steps
- Expand the dataset of research papers for broader coverage.
- Improve the paper recommendation algorithms and synthesis engine.
- Continue optimizing the UI for a more intuitive user experience.
📦 Built With
- CORE (open-access paper collection)
- OpenAI and LangChain (GenAI)
- GitHub (version control)
- LanceDB (vector storage)
- Large Language Models (LLMs)
- Python (backend)
- Retrieval-Augmented Generation (RAG)
- React.js (frontend)
- TailwindCSS (styling)
- TypeScript (type safety)
- Vite (bundling)
Thank you for checking out SyntheSearch! We hope it helps you save time and enhance your research workflow. Stay tuned for future updates and improvements!