IdentifyLanguage on TranscribeStreamingClient throws ERR_STREAM_PREMATURE_CLOSE #6422

pbudzon · 2024-08-30T10:34:02Z

Checkboxes for prior research

I've gone through Developer Guide and API reference
I've checked AWS Forums and StackOverflow.
I've searched for previous similar issues and didn't find any solution.

Describe the bug

I have a very simple JS application that uses TranscribeStreamingClient to transcribe a live stream. It works perfectly fine when LanguageCode is provided for TranscribeStreamingClient. However, when attempting to use IdentifyLanguage instead, it throws the following error:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at PassThrough.<anonymous> (node:internal/streams/pipeline:417:14)
    at PassThrough.emit (node:events:531:35)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}

SDK version number

@aws-sdk/[email protected]
[email protected]

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

v20.16.0

Reproduction Steps

Command to run:

/bin/ffmpeg -re -loglevel quiet  -i <some_video_file> -vn -ac 1 -c:a pcm_s16le -ar 16000 -f s16le - | node simple.js

simple.js :

const REGION = 'eu-west-1';
const sampleRate = 16000;
const { TranscribeStreamingClient, StartStreamTranscriptionCommand } = require('@aws-sdk/client-transcribe-streaming');
const stream = require('stream');
const passthroughStream = new stream.PassThrough({ highWaterMark: 128 });
process.stdin.on('data', (data) => {
  passthroughStream.write(data);
});

const audioStream = async function* audioStream() {
  try {
    for await (const payloadChunk of passthroughStream) {
      yield { AudioEvent: { AudioChunk: payloadChunk } };
    }
  } catch (error) {
    console.log('Error reading passthrough stream or yielding audio chunk.');
  }
};

const startTranscribe = async function startTranscribe() {
    console.log('starting transcribe');
    const tsClient = new TranscribeStreamingClient({ region: REGION });
    const tsParams = {
//        LanguageCode: 'en-US', // THIS WORKS
        IdentifyLanguage: true, // THIS DOES NOT WORK
        MediaEncoding: 'pcm',
        MediaSampleRateHertz: sampleRate,
        AudioStream: audioStream(),
        ShowSpeakerLabel: true,
    };

    const tsCmd = new StartStreamTranscriptionCommand(tsParams);
    const tsResponse = await tsClient.send(tsCmd);
    const tsStream = stream.Readable.from(tsResponse.TranscriptResultStream);

    for await (const chunk of tsStream) {
        if (chunk.TranscriptEvent.Transcript.Results.length > 0) {
           console.log(JSON.stringify(chunk.TranscriptEvent.Transcript.Results))
        }
    }
}

startTranscribe();

Observed Behavior

When running the above, the following output is given:

starting transcribe
An error was encountered in a non-retryable streaming request.
<FULL_PATH_REDACTED>/node_modules/@aws-sdk/eventstream-handler-node/dist-cjs/index.js:119
        throw err;
        ^

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at PassThrough.<anonymous> (node:internal/streams/pipeline:417:14)
    at PassThrough.emit (node:events:531:35)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}

Node.js v20.16.0

Expected Behavior

Output similar to what LanguageCode produces is expected. When running with LanguageCode: ... the output looks like this:

starting transcribe
[{"Alternatives":[{"Items":[{"Content":"So","EndTime":0.16,"StartTime":0.13,"Type":"pronunciation","VocabularyFilterMatch":false},{"Content":"3","EndTime":0.48,"StartTime":0.16,"Type":"pronunciation","VocabularyFilterMatch":false}],"Transcript":"So 3"}],"ChannelId":"ch_0","EndTime":0.975,"IsPartial":true,"ResultId":"e4ad4f1b-1fac-4ad1-8b3b-af3630307694","StartTime":0}]
...

Possible Solution

Is IdentifyLanguage not supported with streaming transcriptions? The documentation suggests it should work, and there don't appear to be any additional requirements to use it (compared with using LanguageCode).

Additional Information/Context

The code is based on this example: https://github.com/aws-samples/amazon-transcribe-streaming-live-closed-captions

zshzbh · 2024-08-30T17:16:58Z

Hey @pbudzon ,

Thanks for the feedback!

It seems that you need to add LanguageOptions to tsParams. If you include LanguageOptions in your request, you must also include IdentifyLanguage. (Ref)

For example :

  IdentifyLanguage: true,
  LanguageOptions: "en-US,de-DE",

For a list of languages supported with Amazon Transcribe streaming, refer to the Supported languages table.
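
A minimal sketch of how the request parameters might be assembled when language identification is used with a candidate list. `buildIdentifyLanguageParams` and `emptyAudio` are hypothetical names for illustration, not part of the AWS SDK; the parameter names are the ones used in this thread, and LanguageOptions is passed as a single comma-separated string, as in the example above.

```javascript
// Hypothetical helper (not part of the AWS SDK): assembles parameters for
// StartStreamTranscriptionCommand with language identification enabled.
function buildIdentifyLanguageParams(audioStream, languageCodes, sampleRate = 16000) {
  return {
    IdentifyLanguage: true,
    // LanguageOptions is one comma-separated string, e.g. "en-US,de-DE".
    LanguageOptions: languageCodes.join(','),
    MediaEncoding: 'pcm',
    MediaSampleRateHertz: sampleRate,
    AudioStream: audioStream,
  };
}

// Placeholder async generator standing in for the real audio source.
async function* emptyAudio() {}

const params = buildIdentifyLanguageParams(emptyAudio(), ['en-US', 'de-DE']);
console.log(params.LanguageOptions); // prints en-US,de-DE
```

The returned object can be passed to `new StartStreamTranscriptionCommand(params)` and sent with the client in the same way as `tsParams` in simple.js.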

I just added sample code to my working sample repo:
https://github.com/zshzbh/MM---AWS-JS-SDK-V3-Sample-/blob/main/client-transcribe-streaming/chunkSample/index.js#L58

Please let me know if you still have this issue!

Thanks!
Maggie

pbudzon · 2024-08-30T17:29:24Z

Hi @zshzbh

Nowhere in the documentation is that requirement listed. In various places it's actually pointed out that including suggested languages (LanguageOptions) can have a negative impact on the transcription if the languages you select do not match the actual language spoken (see here, and scroll down a little to the big red box). In another project, we use Transcribe with language detection in batch (non-streaming) mode (using the Python SDK), and do not have to provide suggested languages. All the documentation suggests this is also possible in streaming mode.

The SDK documentation you've linked actually says specifically that the suggested languages are optional:

IdentifyLanguage: If you include IdentifyLanguage, you can **optionally** include a list of language codes, 
using LanguageOptions, that you think may be present in your audio stream. 
Including language options can improve transcription accuracy.

And

LanguageOptions:  Specify two or more language codes that represent the languages you think may 
be present in your media; including more than five is not recommended. If you're unsure what
languages are present, do not include this parameter.

We very much do not want to include a list of suggested languages in the transcription request, as in our experience with batch transcriptions it can produce poor-quality transcriptions for our use case (we have various speakers with varying dialects, so auto-detection without suggestions works best for us). We'd ideally like to do the same in streaming.

pbudzon · 2024-09-04T08:43:25Z

Bump on this - is this a documentation issue and LanguageOptions is indeed required, or is this an SDK bug that requires the option even though it shouldn't?
@zshzbh

zshzbh · 2024-09-10T15:51:31Z

Hey @pbudzon,

Sorry for the late response, I was OOO last week. Yes, I think there's a gap in our docs. Nice find!

The service doc indicates that:

If you include IdentifyLanguage, you must include a list of language codes, using LanguageOptions, that you think may be present in your audio stream.

But the AWS JS SDK doc indicates that:

If you include IdentifyLanguage, you can optionally include a list of language codes, using LanguageOptions, that you think may be present in your audio stream. Including language options can improve transcription accuracy.

This is a doc gap. I tried it on my end, and it works fine with LanguageOptions. If I delete LanguageOptions, I get an error. It seems the SDK doc needs to be updated.

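// MediaEncoding is exported by the same package as the command.
const { StartStreamTranscriptionCommand, MediaEncoding } = require('@aws-sdk/client-transcribe-streaming');

const command = new StartStreamTranscriptionCommand({
  MediaEncoding: MediaEncoding.PCM,
  MediaSampleRateHertz: sampleRate,
  IdentifyLanguage: true,
  // LanguageOptions: "en-US,de-DE", // with this line removed, the request fails
  AudioStream: audioStream(),
});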

We will update the doc. Could you please try adding LanguageOptions to see if it unblocks your work?
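
Until the documentation is corrected, a small pre-flight check could turn this failure into a readable message before the request is sent, instead of surfacing as ERR_STREAM_PREMATURE_CLOSE. The sketch below is only an illustration: `checkLanguageParams` is a hypothetical function, not an SDK feature, and it encodes only the two rules quoted in this thread (for streaming, LanguageOptions must accompany IdentifyLanguage, and it should list two or more language codes).

```javascript
// Hypothetical pre-flight check (not an SDK feature). Returns a list of
// human-readable problems; an empty list means the checked rules pass.
function checkLanguageParams(requestParams) {
  const problems = [];
  if (requestParams.IdentifyLanguage === true) {
    // LanguageOptions is a comma-separated string; split it into codes.
    const codes = typeof requestParams.LanguageOptions === 'string'
      ? requestParams.LanguageOptions.split(',').map((c) => c.trim()).filter(Boolean)
      : [];
    if (codes.length === 0) {
      problems.push('IdentifyLanguage is set but LanguageOptions is missing.');
    } else if (codes.length < 2) {
      problems.push('LanguageOptions should list two or more language codes.');
    }
  }
  return problems;
}

const failing = checkLanguageParams({ IdentifyLanguage: true });
const passing = checkLanguageParams({ IdentifyLanguage: true, LanguageOptions: 'en-US,de-DE' });
console.log(failing.length, passing.length); // prints: 1 0
```

Calling the check just before `tsClient.send(...)` and throwing when the list is non-empty would report the misconfiguration at the call site rather than inside the event stream handler.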

Sorry for this inconvenience! Again, thanks for the finding! I will post the ticket number for internal reference once I have one.

Thanks!
Maggie

zshzbh · 2024-09-11T22:39:53Z

Hey @pbudzon,

We are aiming to release the fix soon. I will update you once it has been released.

Thanks!
Maggie

pbudzon added the labels bug (This issue is a bug.) and needs-triage (This issue or PR still needs to be triaged.) on Aug 30, 2024

zshzbh self-assigned this on Aug 30, 2024

zshzbh added the label documentation (This is a problem with documentation.) and removed the label guidance (General information and guidance, answers to FAQs, or recommended best practices/resources.) on Sep 10, 2024

kuhe added the label service-api (This issue is due to a problem in a service API, not the SDK implementation.) on Sep 11, 2024

zshzbh added the labels queued (This issue is on the AWS team's backlog) and p2 (This is a standard priority issue) and removed the labels response-requested (Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.) and p3 (This is a minor priority issue) on Sep 11, 2024
