Description:
I have been testing Faster-Whisper with NVIDIA Triton Inference Server and noticed a significant performance discrepancy compared to running the model directly in Python.
Direct Python Inference:
Running the model using the following code:
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
Processing a single file takes approximately 0.1 seconds.
Inference via NVIDIA Triton (localhost):
Serving the same model on Triton and sending audio files via HTTP.
Processing the same file takes approximately 0.2 seconds.
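For reference, a single client-side request looks roughly like the sketch below. The model name faster-whisper and the tensor names AUDIO/TEXT are assumptions here; the actual names come from config.pbtxt in the Triton model repository shown further down.

import time
import numpy as np
import tritonclient.http as httpclient

# Assumes the server started by the docker run command below (HTTP on port 9000).
client = httpclient.InferenceServerClient(url="localhost:9000")

with open("after.wav", "rb") as f:
    audio_bytes = f.read()

# BYTES input: a one-element object array holding the raw WAV file.
inp = httpclient.InferInput("AUDIO", [1], "BYTES")
inp.set_data_from_numpy(np.array([audio_bytes], dtype=np.object_))
out = httpclient.InferRequestedOutput("TEXT")

start = time.time()
result = client.infer(model_name="faster-whisper", inputs=[inp], outputs=[out])
print(f"client-observed round trip: {time.time() - start:.3f} s")
print(result.as_numpy("TEXT"))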
Observations:
Since Triton receives files over HTTP, I suspected that there might be idle periods where the GPU is not fully utilized.
However, monitoring GPU usage with nvidia-smi and gpustat shows a consistent GPU core utilization of ~97%, without noticeable idle gaps.
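The utilization figure reported by nvidia-smi and gpustat is averaged over a short sampling window, so very brief idle gaps between requests may not show up. Polling NVML at a higher rate during the benchmark gives a finer-grained view; a minimal sketch, assuming the pynvml (nvidia-ml-py) package is installed:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Poll GPU utilization every 50 ms for ~10 s while the benchmark is running.
samples = []
for _ in range(200):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)
    time.sleep(0.05)

pynvml.nvmlShutdown()
print(f"min={min(samples)}%  max={max(samples)}%  avg={sum(samples) / len(samples):.1f}%")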
Question:
Why does inference take twice as long when using NVIDIA Triton compared to direct inference in Python? Is there an inherent overhead in Triton that causes this delay, even though GPU utilization appears to be consistently high?
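One way to narrow this down is to compare the client-observed latency against Triton's own per-model statistics, which break the server-side time into queue and compute components. A minimal sketch, assuming the server from the docker run command below and a model named faster-whisper:

import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

# Cumulative per-model statistics: request counts and nanoseconds spent in
# queue, compute_input, compute_infer and compute_output.
stats = client.get_inference_statistics(model_name="faster-whisper")
print(json.dumps(stats, indent=2))

If the compute durations already account for most of the ~0.2 s, the extra time is being spent inside the backend itself rather than in HTTP transport; if not, the overhead is on the client/network side.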
Direct Python Inference (full code):

import asyncio
import time
import os
from concurrent.futures import ThreadPoolExecutor
from faster_whisper import WhisperModel

# Initialize the model
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# List of audio files to test (100 copies of the same file)
audio_files = ["~/Downloads/audio/after.wav"] * 100

# Transcription function (runs in a single worker thread)
def transcribe_audio(audio_file):
    audio_file = os.path.expanduser(audio_file)  # expand "~" so the existence check works
    if not os.path.exists(audio_file):
        return {"file": audio_file, "error": "File not found"}
    start_time = time.time()
    try:
        # NOTE: transcribe() returns a lazy generator; the bulk of the decoding
        # only runs once the segments are iterated.
        segments, info = model.transcribe(audio_file, without_timestamps=True)
        language = info.language if info.language else "Unknown"
    except Exception as e:
        return {"file": audio_file, "error": str(e)}
    elapsed_time = time.time() - start_time
    return {"file": audio_file, "time": elapsed_time, "language": language}

# Async driver (submits all 100 files)
async def transcribe_all():
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=10)  # run 10 at a time
    start_time = time.time()

    # Submit all 100 audio files concurrently
    tasks = [loop.run_in_executor(executor, transcribe_audio, audio_file) for audio_file in audio_files]
    results = await asyncio.gather(*tasks)

    end_time = time.time()
    total_time = end_time - start_time  # total wall-clock time
    timed = [r for r in results if "time" in r]
    avg_time = sum(r["time"] for r in timed) / len(timed) if timed else 0.0

    # Print results
    print("\n=== 🕒 Whisper Batch Inference Results ===")
    print(f"Total execution time: {total_time:.2f} seconds")
    print(f"Average transcription time per file: {avg_time:.4f} seconds")
    print(f"Total files processed: {len(results)}")
    print("\nSample results:")
    for result in results[:5]:  # print the first 5 results as a sample
        print(result)

# Entry point
if __name__ == "__main__":
    asyncio.run(transcribe_all())
Setup:
NVIDIA RTX 4080 Ti
NVIDIA-SMI 550.120, Driver Version: 550.120, CUDA Version: 12.4
The full code for both setups is included: the Direct Python Inference script is shown above, and the Triton setup is below.

Using Triton, the model repository tree structure is as follows:
faster-whisper
--1
----model.py
--config.pbtxt
--Dockerfile
--client_async.py
---model.py---
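Purely for orientation, a minimal Triton Python-backend model.py wrapping faster-whisper might look like the sketch below; the tensor names (AUDIO/TEXT), dtypes, and decoding logic are assumptions rather than the actual file used here.

import io
import numpy as np
import triton_python_backend_utils as pb_utils
from faster_whisper import WhisperModel

class TritonPythonModel:
    def initialize(self, args):
        # Load the model once per model instance.
        self.model = WhisperModel("large-v3", device="cuda", compute_type="int8")

    def execute(self, requests):
        responses = []
        for request in requests:
            # BYTES input tensor holding a raw WAV file.
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()[0]
            segments, info = self.model.transcribe(io.BytesIO(audio), without_timestamps=True)
            text = " ".join(seg.text for seg in segments)  # consuming the generator runs the decode
            out = pb_utils.Tensor("TEXT", np.array([text.encode("utf-8")], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses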
---config.pbtxt---
---client_async.py ---
----Dockerfile----
----docker build----
docker build -t tritonserver-with-faster-whisper:v.2.0 .
----docker run----
docker run --gpus=all --ipc=host --rm --net=host -v ~/.cache/huggingface:/root/.cache/huggingface tritonserver-with-faster-whisper:v.2.0 tritonserver --backend-config=python,execution-thread-count=1 --model-repository=/workspace/triton --log-verbose=2 --http-port=9000 --grpc-port=9001 --metrics-port=9002
----test----
activate conda env
python client_async.py
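Before running the client, it can help to confirm that the server and model are ready on the HTTP port from the docker run command above; a minimal check (the model name faster-whisper is an assumption):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("faster-whisper"))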