From e0021f645f546c6fd66cfe7ef29d65d1fd827ff7 Mon Sep 17 00:00:00 2001
From: Bart Trzynadlowski
Date: Mon, 5 Feb 2024 15:01:21 -0800
Subject: [PATCH] Update README.md

---
 README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 79f68243..6e3130ce 100644
--- a/README.md
+++ b/README.md
@@ -9,29 +9,29 @@

## Source Code Tour

-To help orient newcomers to the code base, we will trace the complete path that data takes through the system, from speech to displayed summary, when the system is up and running.
+To help orient newcomers to the code base, we will trace the complete path that data takes through the system, from speech to displayed summary.

### Streaming Bluetooth Device Example: Xiao ESP32S3 Sense

Bluetooth-based devices, like the Xiao ESP32S3 Sense board in this example, connect to the iOS client application (`clients/ios`) and communicate with it continuously.

-1. Audio is continuously picked up by the Sense board's microphone at 16 KHz and encoded to AAC. This reduces its size to about a third of its original size, which is important because transmission consumes the most power. Packets are broadcast via BLE as fast as they are recorded in the board firmware's `loop()` function found in `clients/xiao-esp32s3-sense/firmware/src/main.cpp`.
+1. Audio is continuously picked up by the Sense board's microphone at 16 kHz and encoded to AAC. This reduces packets to about a third of their original size, which is important because transmission consumes the most power. Packets are broadcast via BLE as fast as they are recorded in the board firmware's `loop()` function found in `clients/xiao-esp32s3-sense/firmware/src/main.cpp`.

-2. Packets enter the iOS app in `peripheral(_:,didUpdateValueFor:,error:)` in `clients/ios/UntitledAI/Services/BLEManager.swift`. The iOS app passes complete frames to the server via a socket. *Frame* here refers to an AAC frame and there is a sequence numbering mechanism used to detect dropped BLE packets. AAC frames can be independently decoded, so we simply drop incomplete frames that woudl cause downstream transcription models to choke. The iOS app passes AAC frames to the server via a socket.
+2. Packets enter the iOS app in `peripheral(_:didUpdateValueFor:error:)` in `clients/ios/UntitledAI/Services/BLEManager.swift`. The iOS app passes complete frames to the server via a socket. *Frame* here refers to an AAC frame, and there is a sequence numbering mechanism used to detect dropped BLE packets. AAC frames can be decoded independently, allowing us to drop incomplete frames that would cause downstream transcription models to choke.

3. Frames enter the server socket in `on_audio_data()` in `untitledai/server/capture_socket.py`. The `CaptureSocketApp` object is created with the FastAPI server in `main.py`. The capture session's UUID is used to look up the appropriate `StreamingCaptureHandler` and the data is forwarded there.

-4. In `untitledai/server/streaming_capture_handler.py`, the audio data is appended to a file on disk and then passed along to a transcription service for real-time transcription and conversation endpoint detection. The `CaptureFile` object describes the location on disk to which the entire capture is written. There is also a `CaptureSegmentFile`, which stores the audio for the current conversation only. You can think of these as "children" of the parent capture file and a new one is created each time a conversation endpoint is detected.
+4. In `untitledai/server/streaming_capture_handler.py`, the audio data is appended to a file on disk and then passed along to a transcription service for real-time transcription and conversation endpoint detection. The `CaptureFile` object describes the location to which the entire capture is written. There is also a `CaptureSegmentFile`, which stores audio for the current conversation only. You can think of these as "children" of the parent capture file. A new one is created each time a conversation endpoint is detected.
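To make the parent/child relationship between `CaptureFile` and `CaptureSegmentFile` concrete, here is a minimal, purely illustrative sketch. The field and method names below (`filepath`, `segments`, `begin_new_segment()`) are assumptions based only on the description in step 4, not the actual classes in `untitledai/server/streaming_capture_handler.py`:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class CaptureSegmentFile:
    """Audio for a single detected conversation (a "child" of the parent capture)."""
    filepath: Path


@dataclass
class CaptureFile:
    """Describes where the entire capture is written; owns one segment per conversation."""
    capture_uuid: str
    filepath: Path
    segments: List[CaptureSegmentFile] = field(default_factory=list)

    def begin_new_segment(self, segment_dir: Path) -> CaptureSegmentFile:
        # Hypothetical helper: called when the endpointing service detects a
        # conversation boundary, so subsequent audio lands in a fresh segment file.
        segment = CaptureSegmentFile(
            filepath=segment_dir / f"{self.capture_uuid}_{len(self.segments)}.aac"
        )
        self.segments.append(segment)
        return segment
```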
-5. The transcription service uses a streaming transcription model (Deepgram at the time of this writing, with a local option planned) that delivers utterances to `handle_utterance()`. This in turn passes the utterance, which includes timestamps, to the endpointing service. When the endpointing service determines a conversation has ended, `on_endpoint()` is invoked. Now a conversation is finished and the completed conversation segment file can be transcribed more thoroughly and summarized. A task is created and dispatched to the server's async background processing queue, which is processed continuously in `main.py` (`process_queue()`). The task, still in `streaming_capture_handler.py` simply calls the `process_conversation_from_audio()` on `ConversationService`, an instance of which is created as part of the server app's shared state (`AppState`).
+5. The transcription service uses a streaming transcription model (Deepgram at the time of this writing, with a local option planned) that delivers utterances to `handle_utterance()`. This in turn passes the utterance, which includes timestamps, to the endpointing service. When the endpointing service determines a conversation has ended, `on_endpoint()` is invoked. The completed conversation segment file can then be transcribed more thoroughly and summarized. A task is created and dispatched to the server's async background processing queue, which is processed continuously in `main.py` (`process_queue()`). The task, still in `streaming_capture_handler.py`, simply calls `process_conversation_from_audio()` on `ConversationService`, an instance of which is created as part of the server app's shared state (`AppState`).

-6. `ConversationService` in `untitledai/services/conversation/conversation_service.py` transcribes the conversation audio using a non-streaming model, creates summaries, and associates a location with the conversation based on location data that has been sent to server from the iOS app. All of this information is committed to a local database as well as to the local capture directory in the form of JSON files for easy inspection. Finally, a notification is sent via `send_notification()` on a `NotificationService` instance (defined in `untitled/services/notification/notification_service.py`). This uses the socket connection to push the newly-created conversation to the iOS app.
+6. `ConversationService` in `untitledai/services/conversation/conversation_service.py` transcribes the conversation audio using a non-streaming model, creates summaries, and associates a location with the conversation based on location data sent to the server from the iOS app. All this is committed to a local database as well as to the local capture directory in the form of JSON files for easy inspection. Finally, a notification is sent via `send_notification()` on a `NotificationService` instance (defined in `untitledai/services/notification/notification_service.py`). This uses the socket connection to push the newly-created conversation to the iOS app.
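The queue-and-worker pattern described in step 5 can be pictured with a small sketch. It assumes an `asyncio.Queue` held on the shared `AppState` and task objects exposing a `run()` coroutine; the real task interface in the repository may differ:

```python
import asyncio
import logging
from dataclasses import dataclass, field


@dataclass
class AppState:
    # Shared server state; assumed here to carry the background task queue.
    task_queue: asyncio.Queue = field(default_factory=asyncio.Queue)


async def process_queue(app_state: AppState) -> None:
    # Runs for the lifetime of the server, consuming tasks as they are enqueued.
    while True:
        task = await app_state.task_queue.get()
        try:
            # A conversation task would end up calling something like
            # ConversationService.process_conversation_from_audio() (see step 6).
            await task.run(app_state)
        except Exception:
            logging.exception("Background task failed")
        finally:
            app_state.task_queue.task_done()
```

Keeping heavy work such as re-transcription and summarization on a single background queue lets the socket handler return immediately, so audio ingestion is never blocked by post-processing.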
7. Back in the iOS app: `ConversationsViewModel` in `clients/ios/UntitledAI/ViewModels/ConversationsViewModel.swift` subscribes to conversation messages and updates a published property whenever they arrive. The view model object is instantiated in `ContentView`, the top-level SwiftUI view, and handed to `ConversationsView`.

8. `ConversationsView` observes the view model and updates a list view whenever it changes, thereby displaying conversations to the user.

-That is the end-to-end process beginning from a capture device client, transiting through the server, and ending at the iOS client.
+That sums up the end-to-end process, which begins in a capture device client, transits through the server, and ends at the iOS client.

### Chunked and Spooled Audio Example: Apple Watch

@@ -51,4 +51,4 @@ The server also supports non-real time syncing of capture data in chunks, which

7. Lastly, if the capture session ends and the `/process_capture` route is used, a final `ProcessAudioChunkTask` is submitted to finalize any remaining conversation that may have been ongoing in the final chunk.

-Chunked uploads enter the server differently than streaming audio, use a different conversation endpointing method, but then follow the same path back to the iOS app.
\ No newline at end of file
+Chunked uploads enter the server differently than streaming audio and use a different conversation endpointing method, but then follow the same path back to the iOS app.
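For readers experimenting with the chunked path, a rough, hypothetical client-side sketch of closing out a capture session follows. Only the `/process_capture` route name comes from the text above; the HTTP method, parameter name, and server address are assumptions:

```python
import requests

SERVER_URL = "http://localhost:8000"  # assumed address of the server


def finalize_capture(capture_uuid: str) -> None:
    # Signal that the capture session has ended so the server can submit the final
    # ProcessAudioChunkTask and wrap up any conversation still open in the last chunk.
    response = requests.post(
        f"{SERVER_URL}/process_capture",
        data={"capture_uuid": capture_uuid},  # parameter name is a guess
    )
    response.raise_for_status()
```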