High CPU Usage (100% per stream) in AWS Transcribe Streaming #3257

blundercode opened this issue Jan 17, 2025 · 1 comment
Labels
bug This issue is a bug. needs-triage This issue or PR still needs to be triaged.

blundercode commented Jan 17, 2025

Describe the bug

The AWS Transcribe Streaming client in the AWS SDK for C++ consumes excessive CPU when processing audio streams. Each stream consumes approximately 100% CPU, and usage scales linearly with the number of streams (e.g., 3 streams = 300% CPU usage). This seems inefficient for an operation that should primarily be transmitting audio data to the AWS Transcribe service.

I have also tested the CRT-HTTP version and get similar results. I will follow up with a CRT-HTTP Docker version if requested.

CPU usage fluctuates slightly but mostly stays around 100%. I have tested on a MacBook M1 running Docker and on multiple Linux EC2 instance types, with the same results.

Is this performance intended/expected?

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

  1. Minimal CPU usage for streaming audio to AWS Transcribe service
  2. Efficient handling of multiple concurrent streams without linear CPU scaling
  3. CPU usage should primarily be focused on audio data transmission rather than processing

Current Behavior

  1. Each individual stream consumes 100% CPU
  2. Multiple streams scale linearly (e.g., 3 streams = 300% CPU)
  3. CPU usage, monitored with the top command, shows excessive utilization
  4. The high CPU usage persists throughout the entire streaming session
  5. Behavior is consistent across multiple test runs
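For anyone who wants a number instead of eyeballing top, here is a minimal sketch that samples a process's CPU time from /proc/<pid>/stat on Linux and converts it to a top-style percentage, where 100 means one fully busy core. The function names are hypothetical helpers for illustration and are not part of the SDK. The figure covers all threads in the process, so it will not say which thread is busy.

```cpp
#include <chrono>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>
#include <unistd.h>

// Parse utime + stime (fields 14 and 15, in clock ticks) from one line of
// /proc/<pid>/stat. The command name (field 2) is wrapped in parentheses and
// may contain spaces, so parsing starts after the last ')'.
// Returns -1 if the line cannot be parsed.
long long cpu_ticks_from_stat_line(const std::string& line) {
    const std::size_t close = line.rfind(')');
    if (close == std::string::npos) return -1;
    std::istringstream rest(line.substr(close + 1));
    std::string field;
    // Skip fields 3 (state) through 13 (cmajflt).
    for (int i = 3; i < 14; ++i) {
        if (!(rest >> field)) return -1;
    }
    long long utime = 0;
    long long stime = 0;
    if (!(rest >> utime >> stime)) return -1;
    return utime + stime;
}

// Convert a tick delta over a wall-clock interval into a percentage,
// where 100 means one core was busy for the whole interval.
double cpu_percent(long long tick_delta, long ticks_per_second, double seconds) {
    if (ticks_per_second <= 0 || seconds <= 0.0) return 0.0;
    return 100.0 * static_cast<double>(tick_delta) / ticks_per_second / seconds;
}

// Sample a process twice, `interval` apart. Returns a negative value if
// /proc/<pid>/stat could not be read or parsed.
double sample_cpu_percent(int pid, std::chrono::milliseconds interval) {
    auto read_ticks = [pid]() -> long long {
        std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
        std::string line;
        if (!std::getline(in, line)) return -1;
        return cpu_ticks_from_stat_line(line);
    };
    const long long before = read_ticks();
    std::this_thread::sleep_for(interval);
    const long long after = read_ticks();
    if (before < 0 || after < 0) return -1.0;
    return cpu_percent(after - before, sysconf(_SC_CLK_TCK),
                       interval.count() / 1000.0);
}
```

Calling sample_cpu_percent(pid, std::chrono::milliseconds(1000)) with the PID of a running get_transcript process should give a value comparable to the %CPU column in top.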

Reproduction Steps

Here are minimal reproduction steps in a single Dockerfile, using the sample code.

Dockerfile

FROM public.ecr.aws/lts/ubuntu:22.04_stable

RUN apt-get update && \
  apt-get install build-essential cmake git libcurl4-openssl-dev zlib1g-dev libssl-dev curl ffmpeg -y

#Build sdk from source
RUN git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" -DBUILD_ONLY="transcribestreaming;transcribe" && \
    make install

#Build transcribe samples
RUN git clone https://github.com/awsdocs/aws-doc-sdk-examples.git && \
    cd aws-doc-sdk-examples/cpp/example_code/transcribe-streaming && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" && \
    make

# Download and convert the test file
RUN cd /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/.media && \
    rm -f transcribe-test-file.wav && \
    curl -L "https://ia800202.us.archive.org/26/items/desophisticiselenchis/desophisticiselenchis_01_aristotle_pdf557.wav" -o original.wav && \
    ffmpeg -i original.wav -ar 8000 transcribe-test-file.wav && \
    rm original.wav

Please note:

  • Test file: Using a longer audio file from archive.org (converted to match original specs)
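To put the data rate in perspective, here is a rough sketch of the arithmetic for how much wall-clock time one chunk of raw audio represents. The 8000 Hz sample rate comes from the ffmpeg command above; 16-bit mono PCM and the 1024-byte chunk size are assumptions for illustration, and chunk_duration is a hypothetical helper, not an SDK function.

```cpp
#include <chrono>
#include <cstddef>

// Wall-clock duration represented by one chunk of raw PCM audio.
// Returns zero if the format parameters are not positive.
std::chrono::milliseconds chunk_duration(std::size_t chunk_bytes,
                                         int sample_rate_hz,
                                         int bytes_per_sample,
                                         int channels) {
    const long long bytes_per_second =
        static_cast<long long>(sample_rate_hz) * bytes_per_sample * channels;
    if (bytes_per_second <= 0) return std::chrono::milliseconds(0);
    return std::chrono::milliseconds(
        static_cast<long long>(chunk_bytes) * 1000 / bytes_per_second);
}
```

Under those assumptions the stream carries 16000 bytes per second, so chunk_duration(1024, 8000, 2, 1) is 64 ms. A sender paced at real time would therefore only need to wake up a handful of times per second per stream.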

Steps:

  1. Build the Docker container using the provided Dockerfile:
docker build -t transcribe-cpu-test-example .
  2. Run the container with AWS credentials:
docker run -d \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_SESSION_TOKEN=<token> \
--name transcribe-container \
transcribe-cpu-test-example \
tail -f /dev/null
  3. In the first terminal, run:
docker exec -it transcribe-container bash
top  # Keep this running to monitor CPU
  4. In a second terminal, execute:
docker exec -it transcribe-container bash
/aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/build/get_transcript

Repeat step 4 in additional terminals to observe CPU scaling with multiple streams.

You will notice high CPU usage.

Possible Solution

Potential memory leaks or inefficient resource handling in the streaming implementation.
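One pattern that typically produces a steady 100% of one core per stream is a thread polling in a tight loop instead of blocking until there is work. I have not confirmed that the SDK does this. The sketch below only illustrates the difference, using a hypothetical ChunkQueue class that is not part of the SDK: looping on try_pop spins a core, while wait_pop sleeps on a condition variable until a chunk arrives or the queue is closed.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical queue of audio chunks, for illustration only.
class ChunkQueue {
public:
    void push(std::vector<unsigned char> chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            chunks_.push(std::move(chunk));
        }
        ready_.notify_one();
    }

    // Signal that no more chunks will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Non-blocking: returns immediately. A caller that loops on this with
    // no sleep keeps one core fully busy even when the queue is empty.
    bool try_pop(std::vector<unsigned char>& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (chunks_.empty()) return false;
        out = std::move(chunks_.front());
        chunks_.pop();
        return true;
    }

    // Blocking: the thread sleeps until a chunk is available or the queue
    // is closed. Returns false once the queue is closed and drained.
    bool wait_pop(std::vector<unsigned char>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !chunks_.empty(); });
        if (chunks_.empty()) return false;
        out = std::move(chunks_.front());
        chunks_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::vector<unsigned char>> chunks_;
    bool closed_ = false;
};
```

If profiling shows a thread spending its time in a loop like the try_pop one, that would point at busy waiting rather than a memory leak.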

Additional Information/Context

  • This is just one example; I have also seen it in my own implementation with different file types
  • Issue affects scalability of applications requiring multiple concurrent streams

AWS CPP SDK version used

Latest

Compiler and Version used

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Operating System and version

Ubuntu 22.04 LTS (running in Docker container)

blundercode (Author) commented

It seems I may have to submit a separate, unrelated bug: bundling with CRT-HTTP breaks the sample code.

It hits this error: Transcribe streaming error Request Timeout Has Expired

This is unrelated to the current issue, though; just noting it for later.
