Empowering Vision Through Voice. Revolutionizing indoor mobility with real-time, adaptive AI-enabled guidance for seamless navigation in complex spaces.
Devpost: https://devpost.com/software/pathsense-athy2r
Winner of Hack the North 2024 (Grand Prize) and 1st Place for "Best Use of Cohere"
This application enhances mobility and independence for visually impaired individuals by providing a voice-first indoor navigation solution. It combines computer vision, semantic search, indoor mapping, and speech processing to offer accessible, spoken navigation guidance in indoor spaces.
- The user interacts with the system through voice commands.
- Voiceflow manages the conversation flow, using its decision tree structure to understand the user's intent (space or object) and desired destination.
- Multiple TAPO cameras stream 1080p video of the environment over Wi-Fi.
- The CV pipeline (Detectron, DPT, GPT-4o mini) processes the video streams in real time:
- Detecting objects
- Estimating depth
- Breaking down images into sub-bounding boxes
- Generating parallel narrations for each sub-box
- Combining narrations for a comprehensive scene description
- This CV data, along with MappedIn SDK floor plan information, is stored in JSON format in the Convex database.
- Cohere's reranking is used to find the most relevant information from both CV tags and mapping data based on the user's query.
- MappedIn SDK generates a route from the current location to the desired destination or object.
- Voiceflow, integrated with Cohere, uses the reranked information and MappedIn route to generate appropriate responses and navigation instructions.
- These text instructions are converted to speech using Unreal Engine's text-to-speech capabilities.
- The user receives spoken guidance, which is continuously updated based on their movement and changes in the environment.
This integrated system provides a comprehensive, real-time, and adaptive navigation solution specifically designed for visually impaired users, leveraging cutting-edge AI and computer vision technologies to enhance accessibility in indoor spaces.
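The end-to-end flow can be summarized in a short sketch. This is a hypothetical outline, not code from the repository: every function below is a stub standing in for the service named in its comment.

```python
# Hypothetical outline of one interaction turn. Each stub stands in for a
# real service in the stack; none of these names come from the repository.

def transcribe(audio: bytes) -> str: ...                  # Whisper served via Groq
def rerank_context(query: str) -> list[str]: ...          # Cohere Rerank over Convex docs
def build_route(start: str, destination: str) -> list[str]: ...  # MappedIn SDK
def generate_response(query: str, docs: list[str], route: list[str]) -> str: ...  # Voiceflow + Cohere
def speak(text: str) -> None: ...                         # text-to-speech

def handle_turn(audio: bytes, current_location: str) -> None:
    query = transcribe(audio)                       # 1. voice command -> text
    docs = rerank_context(query)                    # 2. most relevant tags/spaces
    route = build_route(current_location, docs[0])  # 3. path to the best match
    speak(generate_response(query, docs, route))    # 4. spoken guidance back
```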
---
Computer Vision (CV) Pipeline
- Object Detection: Detectron is used to identify and locate objects within the video streams.
- Depth Estimation: DPT (Dense Prediction Transformer) provides depth information for each pixel in the image.
- Image Processing and Scene Analysis:
- The system breaks each frame into multiple sub-bounding boxes for detailed analysis.
- GPT-4o mini generates narrations for each sub-bounding box in parallel.
- The narrations for all sub-boxes are then combined to provide a comprehensive description of the scene.
- Data Integration: All CV data is serialized to JSON and stored in the Convex database alongside floor plan information from the MappedIn SDK (a sketch of the full pipeline follows below).
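A minimal sketch of this pipeline, assuming Detectron2 for detection, the Hugging Face `depth-estimation` pipeline with a DPT checkpoint, and the OpenAI API for narration; the model choices, prompt, and score threshold are illustrative, not taken from the repository:

```python
import base64
import concurrent.futures

import cv2
from PIL import Image
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from openai import OpenAI
from transformers import pipeline

# Object detection: a stock COCO Faster R-CNN from the Detectron2 model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # illustrative threshold
detector = DefaultPredictor(cfg)

# Monocular depth estimation with a DPT checkpoint.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

openai_client = OpenAI()  # assumes OPENAI_API_KEY in the environment


def narrate_crop(crop_jpeg: bytes) -> str:
    """Ask GPT-4o mini for a one-sentence narration of one sub-bounding box."""
    b64 = base64.b64encode(crop_jpeg).decode()
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this region for a blind user in one sentence."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def describe_frame(frame_bgr) -> dict:
    """Detect objects, estimate depth, narrate each sub-box in parallel, combine."""
    detections = detector(frame_bgr)["instances"].to("cpu")
    rgb = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    depth = depth_estimator(rgb)["depth"]

    crops = []
    for x0, y0, x1, y1 in detections.pred_boxes.tensor.numpy().astype(int):
        ok, jpeg = cv2.imencode(".jpg", frame_bgr[y0:y1, x0:x1])
        if ok:
            crops.append(jpeg.tobytes())

    # Narrate all sub-boxes in parallel, then join into one scene description.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        narrations = list(pool.map(narrate_crop, crops))
    return {"scene": " ".join(narrations), "depth_map": depth}
```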
---
Semantic Search and Reranking (Cohere)
- Cohere's Rerank API endpoint is used for powerful semantic search capabilities.
- Function: Given a `query` and a list of `documents`, Rerank ranks the documents from most to least semantically relevant to the query.
- Data Source: Documents are composed of two types of data stored in the Convex database:
- Tags generated from the computer vision pipeline
- Floor plan information from the MappedIn SDK
- Reranking Process: The system reranks across both document types, connecting CV and mapping data into a single set of search results (see the sketch below).
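A sketch of the reranking step, assuming the Cohere Python SDK and the Convex Python client; the Convex function names (`tags:list`, `spaces:list`), the document shape, and the model string are hypothetical:

```python
import cohere
from convex import ConvexClient

co = cohere.Client("YOUR_COHERE_API_KEY")
convex = ConvexClient("https://<deployment>.convex.cloud")

# Pull both document types; these Convex function names are hypothetical.
cv_tags = convex.query("tags:list")       # tags produced by the CV pipeline
floor_plan = convex.query("spaces:list")  # floor plan entries from MappedIn
documents = [doc["text"] for doc in cv_tags + floor_plan]

ranked = co.rerank(
    model="rerank-english-v3.0",  # illustrative model choice
    query="Where is the nearest sponsor booth?",
    documents=documents,
    top_n=3,
)
for hit in ranked.results:
    print(round(hit.relevance_score, 3), documents[hit.index])
```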
---
Voice Interaction and Workflow Management (Voiceflow)
- Decision Tree Structure: Programmatically created with numerous steps, loops, and cases for complex voice interactions.
- User Intent Differentiation: The decision tree structure distinguishes between user requests for:
- General "spaces" (e.g., rooms, areas)
- Specific "objects" (e.g., persons, desks, sponsor booths)
- Goal: To understand the user's final destination or object of interest through a series of questions and interactions.
- Integration: Voiceflow integrates with Cohere's reranking to understand relevant tags created from computer vision models and mapping data.
- Scope: Handles text-to-text responses and creates complex workflows for answering questions; it does not handle speech-to-text or text-to-speech conversion (an example API call follows below).
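A sketch of one text-to-text turn against Voiceflow's Dialog Manager API; the user ID, API key, and trace handling here are illustrative:

```python
import requests

VOICEFLOW_API_KEY = "VF.DM.xxxx"  # placeholder Dialog Manager key
USER_ID = "demo-user"             # conversation state is keyed per user

def voiceflow_turn(text: str) -> list[str]:
    """Send one text turn to Voiceflow and collect its text replies."""
    resp = requests.post(
        f"https://general-runtime.voiceflow.com/state/user/{USER_ID}/interact",
        headers={"Authorization": VOICEFLOW_API_KEY},
        json={"request": {"type": "text", "payload": text}},
        timeout=30,
    )
    resp.raise_for_status()
    # The response is a list of traces; keep only the spoken/text messages.
    return [
        trace["payload"]["message"]
        for trace in resp.json()
        if trace.get("type") in ("speak", "text")
    ]

print(voiceflow_turn("Take me to the nearest sponsor booth"))
```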
---
Speech Processing
- Speech-to-Text: Uses OpenAI's Whisper model served by Groq, chosen for its very low latency, which is crucial for real-time speech processing (example call below).
- Text-to-Speech: Implemented with Unreal Engine's text-to-speech, which does not disclose its underlying model but returns audio very quickly.
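A minimal transcription call, assuming the Groq Python SDK with a Whisper checkpoint; the file name and model string are illustrative:

```python
from groq import Groq

client = Groq()  # assumes GROQ_API_KEY is set in the environment

with open("command.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # Whisper checkpoint served by Groq
        file=audio,
    )
print(transcript.text)
```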
---
Indoor Mapping and Navigation (MappedIn SDK)
- Route Generation: Provides a route from the current location to the final destination.
- Detailed Navigation: Once a space is selected, the system offers more specific navigation guidance.
- Depth-Based Guidance: Utilizes depth information from the CV pipeline to provide more accurate positioning.
- Object-Relative Navigation: Uses the relative positioning between objects and depth data to guide the user more precisely (illustrative sketch below).
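The last two points can be illustrated with a small function that turns a detection box and the DPT depth map into a relative instruction. The thresholds are illustrative, and depth values are treated as approximate distances for the sake of the example:

```python
def relative_guidance(box, depth_map, frame_width):
    """Convert one detection into a left/center/right hint with rough distance.

    box: (x0, y0, x1, y1) in pixels; depth_map: 2-D per-pixel depth estimates
    from the DPT stage. Thresholds here are illustrative only.
    """
    x_center = int((box[0] + box[2]) / 2)
    y_center = int((box[1] + box[3]) / 2)
    depth = float(depth_map[y_center][x_center])

    third = frame_width / 3
    if x_center < third:
        direction = "to your left"
    elif x_center > 2 * third:
        direction = "to your right"
    else:
        direction = "straight ahead"

    distance = "a few steps away" if depth < 3.0 else "farther ahead"
    return f"The object is {direction}, {distance}."
```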
---
Setup
```bash
git clone https://github.com/akjadhav/hackthenorth-2024.git && cd hackthenorth-2024
npm install
npm run dev
```