
Triton ensemble model configuration for transducer models #371

Closed · pavankumar-ds opened this issue Apr 27, 2023 · 13 comments

@pavankumar-ds commented Apr 27, 2023

Hello, could you please provide a reference configuration for the ensemble transducer model in the example repository for pure Triton-based inference? Specifically, how do we feed the variable y from the scorer back into the decoder input? Also, the template is missing the joiner_encoder_proj and joiner_decoder_proj parts.

@uni-saurabh-vyas (Contributor)

I am working on it now and managed to follow the docs to get the Triton server up and running (all components, including the ensemble transducer, are running), but when I try to start the client script, it fails.

python3 decode_manifest_triton.py --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts_1.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming  --model-name transducer --chunk_size 16 --context 2

tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] in ensemble 'transducer', inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence

One weird thing I noticed was that when I start the server, I see these warnings/errors

Cleaning up...
free(): invalid pointer
free(): invalid pointer

Is this possibly related to the memory leak issue?
triton-inference-server/server#3777

@uni-saurabh-vyas (Contributor)

Also, I am trying the pretrained model from the section "Deploy onnx with arbitrary pruned_transducer_stateless_X(2,3,4,5) model for Chinese or English recipes"
at https://github.com/k2-fsa/sherpa/tree/master/triton

After downloading the model files, I am getting the following error:

./pruned_transducer_stateless3/export_onnx.py \
    --exp-dir ./icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp \
    --tokenizer-file ./icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/data/lang_bpe_500/bpe.model \
    --epoch 999 \
    --avg 1 \
    --streaming-model 1 \
    --causal-convolution 1 \
    --onnx 1 \
    --left-context 64 \
    --right-context 4 \
    --fp16

    sp.load(params.tokenizer_file)
File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
  return self.LoadFromFile(model_file)
File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
  return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

@uni-saurabh-vyas (Contributor)

@csukuangfj

I am getting this error when I try to run the default streaming example provided in the sherpa/triton folder (https://github.com/k2-fsa/sherpa/tree/master/triton/model_repo_streaming):

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) for model instance 'feature_extractor_0_0', message: Exception: ('Invalid first chunk size', 9360, 14880)

https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L46

Did you encounter this issue as well? What is the current status of your transducer setup for Triton, and is it stable for you?
I would appreciate any pointers to address this issue. I can spend some time fixing it if these are known issues, or I can try something else to make it work.

Also, as mentioned in my previous comment (#371 (comment)), I suspect it is related to the kaldifeat library memory leak issue. If this is a known issue, do you suggest trying a different library for feature extraction?

@csukuangfj (Collaborator)

@yuekaizhang

Could you help take a look at this issue?

@yuekaizhang (Collaborator) commented May 4, 2023

(Quoting the earlier comment from @uni-saurabh-vyas: the server and all components, including the ensemble transducer, are up and running, but the client fails with "inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence", and the server prints free(): invalid pointer warnings at startup; possibly related to the memory leak issue triton-inference-server/server#3777.)

Hi, thanks for trying this triton recipe.

  1. "inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence"
    This error may be caused by an outdated request. At the very beginning of service startup, before the service is sufficiently warmed up, a request may be cleared due to timeout; the chunks that arrive later then lose their START flag. You may first try to warm up the service with a small batch size and concurrency (see the sketch after this list).
  2. free(): invalid pointer: This warning (which I cannot explain yet) should be fine.
  3. tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) for model instance 'feature_extractor_0_0', message: Exception: ('Invalid first chunk size', 9360, 14880): This is caused by --context 2. You should use --encoder_right_context, which is for icefall models; --context (https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L161) is for wenet models.
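
For point 1, here is a minimal warm-up sketch of what I mean. It is not part of the sherpa client; the tensor names "WAV"/"WAV_LENS", their dtypes, and the 14880-sample first-chunk size are assumptions based on this thread and may need adjusting for your model_repo:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("0.0.0.0:8001")
first_chunk_samples = 14880  # expected first-chunk size from the errors above

for seq_id in range(1, 5):  # a few low-concurrency dummy sequences
    wav = np.zeros((1, first_chunk_samples), dtype=np.float32)
    wav_lens = np.array([[first_chunk_samples]], dtype=np.int32)

    inputs = [
        grpcclient.InferInput("WAV", list(wav.shape), "FP32"),
        grpcclient.InferInput("WAV_LENS", list(wav_lens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(wav)
    inputs[1].set_data_from_numpy(wav_lens)

    # Single-chunk sequence: START and END flags on the same request, so the
    # sequence batcher never sees a chunk without its START flag.
    client.infer(
        "transducer",
        inputs,
        request_id=str(seq_id),
        sequence_id=seq_id,
        sequence_start=True,
        sequence_end=True,
    )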

@uni-saurabh-vyas (Contributor) commented May 4, 2023

Hi @yuekaizhang, thanks for your response.

I ensured that the config parameters in $model_repo_path/*/config.pbtxt match the properties in the ONNX export log file
icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp/onnx_export.log

For reference:

ENCODER_LEFT_CONTEXT: 64
ENCODER_RIGHT_CONTEXT: 4
ENCODER_DIM: 512
DECODER_DIM: 512
VOCAB_SIZE: 500
DECODER_CONTEXT_SIZE: 2
CNN_MODULE_KERNEL: 31
ENCODER_LAYERS: 12
All params:{'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '62e404dd3f3a811d73e424199b3408e309c06e1a', 'k2-git-date': 'Mon Jan 30 02:26:16 2023', 'lhotse-version': '1.12.0', 'torch-version': '1.13.0', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.1', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/mnt/efs/dspavankumar/tools/icefall', 'k2-path': '/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'ip-10-40-5-20', 'IP address': '127.0.0.1'}, 'epoch': 1111, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/exp'), 'tokenizer_file': './icefall_librispeech_streaming_pruned_transducer_stateless3_giga_0.9_20220625/data/lang_bpe_500/bpe.model', 'onnx': True, 'context_size': 2, 'left_context': 64, 'right_context': 4, 'streaming_model': True, 'fp16': True, 'dynamic_chunk_training': False, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'blank_id': 0, 'vocab_size': 500}

Then I ran the client again: python3 decode_manifest_triton.py --encoder_right_context 4 --chunk_size 16 --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming --model-name transducer

I am still getting the same error:

task-48: 0/221
task-49: 0/221
Traceback (most recent call last):
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 485, in <module>
    asyncio.run(main())
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/base_events.py", line 649, in run_until_c
omplete
    return future.result()
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 433, in main
    ans_list = await asyncio.gather(*tasks)
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 316, in send_streaming
    response = await triton_client.infer(model_name,
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/aio/__init__.py",
 line 727, in infer
    raise_error_grpc(rpc_error)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/__init__.py", lin
e 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] in ensemble 'transducer', Failed to process the request(s) fo
r model instance 'feature_extractor_0_1', message: Exception: ('Invalid first chunk size', 12640, 14880)

At:
  /mnt/efs/dspavankumar/tools/sherpa/triton/model_repo_streaming_pretrained/feature_extractor/1/model.py(47): add_wavs
  /mnt/efs/dspavankumar/tools/sherpa/triton/model_repo_streaming_pretrained/feature_extractor/1/model.py(221): execute

" This error may be caused by outdated request. At the beginning of the service startup, due to insufficient warming up, if a request is cleared due to timeout, it will cause later arriving chunks to lose their start flag. You may first try to warmup service with small batch size and concurrency."

I also tried with the --num-tasks 1 argument, but it still fails.

/mnt/efs/dspavankumar/tools/sherpa/triton/client$ python3 decode_manifest_triton.py --num-tasks 1 --encoder_right_context 4 --chunk_size 16 --manifest-filename /mnt/efs/dspavankumar/e/tamil_icefall/data/test_re/icefall_manifests/cuts.jsonl.gz --server-addr 0.0.0.0 --server-port 8001 --streaming --model-name transducer
task-0: 0/11077
/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/lhotse/audio.py:164: UserWarning: You requested a subset of a recording that is read from disk via a bash command. Expect large I/O overhead if you are going to read many chunks like these, since every time we will read the whole file rather than its subset.
  warnings.warn(
Traceback (most recent call last):
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 485, in <module>
    asyncio.run(main())
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 433, in main
    ans_list = await asyncio.gather(*tasks)
  File "/mnt/efs/dspavankumar/tools/sherpa/triton/client/decode_manifest_triton.py", line 316, in send_streaming
    response = await triton_client.infer(model_name,
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/aio/__init__.py", line 727, in infer
    raise_error_grpc(rpc_error)
  File "/mnt/efs/dspavankumar/tools/miniconda3/envs/icefall_env/lib/python3.10/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] in ensemble 'transducer', inference request for sequence 10086 to model 'feature_extractor' must specify the START flag on the first request of the sequence

@yuekaizhang (Collaborator)

(Quoting the previous comment from @uni-saurabh-vyas in full: the config.pbtxt parameters match the ONNX export log, but the client still fails with Exception: ('Invalid first chunk size', 12640, 14880), and with --num-tasks 1 it fails with the START-flag error for sequence 10086.)

https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L381-L383

Here, please check your first_chunk_ms and decode_window_length:
decode_window_length = (args.chunk_size + 2 + args.encoder_right_context) * args.subsampling + 3, so decode_window_length should be (16 + 2 + 4) * 4 + 3 = 91.
first_chunk_ms = (decode_window_length + add_frames) * frame_shift_ms, where add_frames should be 2.
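
As a quick sanity check, the same arithmetic in a few lines of Python (a minimal sketch; the 16 kHz sample rate is an assumption on my side, but it reproduces the 14880-sample figure from the error messages above):

chunk_size = 16
encoder_right_context = 4
subsampling = 4
add_frames = 2
frame_shift_ms = 10
sample_rate = 16000  # assumed 16 kHz audio

decode_window_length = (chunk_size + 2 + encoder_right_context) * subsampling + 3
first_chunk_ms = (decode_window_length + add_frames) * frame_shift_ms
first_chunk_samples = first_chunk_ms * sample_rate // 1000

print(decode_window_length)   # 91 frames
print(first_chunk_ms)         # 930 ms
print(first_chunk_samples)    # 14880, the expected size in 'Invalid first chunk size'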

@uni-saurabh-vyas (Contributor)

I have checked; these values are correct:

ipdb> decode_window_length
91
ipdb> print(args.chunk_size )
16
ipdb> print(args.encoder_right_context)
4
ipdb> print(args.subsampling)
4
ipdb> print(add_frames)
2
ipdb> print(frame_shift_ms)
10

@yuekaizhang (Collaborator)

Failed to process the request(s) for model instance 'feature_extractor_0_1', message: Exception: ('Invalid first chunk size', 12640, 14880)

If the values are correct, could you trace back to figure out where this 12640 number comes from?

@uni-saurabh-vyas (Contributor)

Hi @yuekaizhang

I noticed that in wav_segs (https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L269), the last segment has a different number of samples (length) from all the other segments, which causes the issue.

So after adding

del wav_segs[-1]

at https://github.com/k2-fsa/sherpa/blob/master/triton/client/decode_manifest_triton.py#L282, the problem is fixed.

Do you think this is a bug?

@yuekaizhang (Collaborator) commented May 9, 2023

I am not sure. If it is a bug, it would be in feature_extractor/1/model.py rather than in the client. Could you make sure that the assertion assert len(self.wav) > 0 at https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L52 always holds? Otherwise, there is a problem somewhere.

If you keep that last segment, I don't understand why len(self.wav) would become 0 for anything other than the first chunk (https://github.com/k2-fsa/sherpa/blob/master/triton/model_repo_streaming/feature_extractor/1/model.py#L45).

Also, how did you fix the previous issue, "inference request for sequence 10107 to model 'feature_extractor' must specify the START flag on the first request of the sequence"? I think it may be related to outdated requests.

@uni-saurabh-vyas (Contributor)

Good observation. That error was caused by a few very short cuts present in the jsonl file. I used a different cuts file (one without very short cuts, i.e. shorter than 0.3 seconds), and I think that fixed that particular issue.
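
For anyone hitting the same thing, a minimal sketch of filtering such cuts with lhotse instead of editing the manifest by hand (the 0.93 s threshold is my assumption, matching the 930 ms first chunk discussed above; adjust it to your configuration):

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")
min_duration = 0.93  # seconds; roughly first_chunk_ms / 1000
cuts = cuts.filter(lambda c: c.duration >= min_duration)  # drop cuts shorter than the first chunk
cuts.to_file("cuts_filtered.jsonl.gz")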

@yuekaizhang (Collaborator)

Okay, closing the issue since it is fixed.
