Releases · argmaxinc/WhisperKit

05 Nov 18:01

v0.9.3

80070fb

v0.9.3

This release adds a number of useful callbacks that you can receive updates from while the transcription is processing:

/// A callback that provides transcription segments as they are discovered.
/// - Parameters:
///   - segments: An array of `TranscriptionSegment` objects representing the transcribed segments
public typealias SegmentDiscoveryCallback = (_ segments: [TranscriptionSegment]) -> Void

/// A callback that reports changes in the model's state.
/// - Parameters:
///   - oldState: The previous state of the model, if any
///   - newState: The current state of the model
public typealias ModelStateCallback = (_ oldState: ModelState?, _ newState: ModelState) -> Void

/// A callback that reports changes in the transcription process.
/// - Parameter state: The current `TranscriptionState` of the transcription process
public typealias TranscriptionStateCallback = (_ state: TranscriptionState) -> Void
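
As a rough illustration of what closures with these shapes might look like, here is a small sketch. It uses local stand-in types (`Segment` and `State` are hypothetical and only mimic the shape of `TranscriptionSegment` and `ModelState`), and it simulates the invocations by hand. How the callbacks are actually registered on a `WhisperKit` instance is not shown here; see #240 for that.

```swift
// Stand-in types for illustration only; the real TranscriptionSegment and
// ModelState live in WhisperKit and carry more information.
struct Segment { let text: String }
enum State { case unloaded, loading, loaded }

// Same shapes as the typealiases above, but using the stand-in types.
typealias SegmentDiscoveryCallback = (_ segments: [Segment]) -> Void
typealias ModelStateCallback = (_ oldState: State?, _ newState: State) -> Void

var discovered: [String] = []
var transitions: [String] = []

let onSegments: SegmentDiscoveryCallback = { segments in
    // Collect text as segments arrive, e.g. to update a UI incrementally.
    discovered.append(contentsOf: segments.map { $0.text })
}

let onModelState: ModelStateCallback = { oldState, newState in
    // oldState may be nil when there is no previous state.
    let from = oldState.map { "\($0)" } ?? "none"
    transitions.append("\(from) -> \(newState)")
}

// Simulate the library invoking the callbacks.
onModelState(nil, .loading)
onModelState(.loading, .loaded)
onSegments([Segment(text: "Hello"), Segment(text: "world")])
```

Keeping the closures small and side-effect focused like this makes it easy to forward updates to a UI or logger.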

Thank you @iandundas for the excellent contribution! ✨

What's Changed

Add public callbacks to help expose internal state a little more by @iandundas in #240

Full Changelog: v0.9.2...v0.9.3

Contributors

iandundas


02 Nov 20:20

ZachNagengast

v0.9.2

b100f22

v0.9.2

Highlights

With this release we are launching a comprehensive suite of benchmarks that you can run yourself on your own devices - or view the results that we've run on a wide variety of devices via our WhisperKit Benchmarks HuggingFace space! This was a huge effort kicked off by @Abhinay1997 so we're very excited to bring it to main. Read more in the discussion here and let us know what you think!

This release also includes several bug fixes and improvements based on recently reported issues; see below for the relevant PRs.

What's Changed

Fix expo release script by @ZachNagengast in #220
Fix progress for vad by @ZachNagengast in #223
Regression Test Pipeline by @Abhinay1997 in #120
Update xcconfig tracking and provisioning by @ZachNagengast in #234
Fix audio processing edge case by @ZachNagengast in #237

Full Changelog: v0.9.0...v0.9.2

Contributors

ZachNagengast and Abhinay1997


09 Oct 02:10

ZachNagengast

v0.9.0

a8a40c9

v0.9.0

Highlights

Package Updates

With #216, the default check for whether a model is supported on the device uses the model repo's config.json as the source of truth. The need for this came about with the release of the new large-v3 turbo model, listed in the model repo as openai_whisper-large-v3-v20240930, which was being recommended for devices that would crash when attempting to load it. This situation can now be mitigated by updating config.json without the need for a new release, and the remote recommendation can be fetched directly with the new static method recommendedRemoteModels:

    let recommendedModel = await WhisperKit.recommendedRemoteModels().default
    let pipe = try? await WhisperKit(model: recommendedModel)

The existing interface for WhisperKit.recommendedModels() remains the same, but now returns a ModelSupport object with a list of supported models for the current device.

public struct ModelSupport: Codable, Equatable {
    public let `default`: String
    public let supported: [String]
    public var disabled: [String] = []
}
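
Since `ModelSupport` is `Codable`, one way to picture how it could be used is decoding it from JSON and falling back to the default when a requested model is not in the supported list. The sketch below is only an illustration: the JSON fragment and the `pickModel` helper are hypothetical, and the real config.json layout may differ.

```swift
import Foundation

// Local copy of the struct shown above so the sketch runs without WhisperKit.
struct ModelSupport: Codable, Equatable {
    let `default`: String
    let supported: [String]
    var disabled: [String] = []
}

// Hypothetical JSON fragment; the real config.json layout may differ.
let json = Data("""
{
  "default": "openai_whisper-base",
  "supported": ["openai_whisper-tiny", "openai_whisper-base"],
  "disabled": ["openai_whisper-large-v3-v20240930"]
}
""".utf8)

// Force-try is acceptable here because the input is a fixed literal.
let support = try! JSONDecoder().decode(ModelSupport.self, from: json)

// Hypothetical helper: use the requested model only if it is supported,
// otherwise fall back to the device default.
func pickModel(requested: String, from support: ModelSupport) -> String {
    support.supported.contains(requested) ? requested : support.default
}

let chosen = pickModel(requested: "openai_whisper-large-v3-v20240930", from: support)
print(chosen)
```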

Also, in an ongoing effort to improve modularity, extensibility, and code structure, there is a new way to initialize WhisperKit: using the new WhisperKitConfig class. The parameters are exactly the same and the previous init method is still in place, but this can assist in defining WhisperKit settings and protocol objects ahead of time and initializing WhisperKit more cleanly:

Previous:

let pipe = try? await WhisperKit(model: "your-custom-model", modelRepo: "username/your-model-repo")

New:

let config = WhisperKitConfig(model: "your-custom-model", modelRepo: "username/your-model-repo") // Initialize config
config.model = "your-custom-model" // Alternatively set parameters directly
let pipe = try? await WhisperKit(config) // Pass into WhisperKit initializer

WhisperAX example app and CLI

Thanks to some memory and audio processing optimizations in #195, #216, and #217 (shout out to @keleftheriou for finding a big improvement there), we've updated the example implementations to use VAD by default with a concurrentWorkerCount of 4. This will significantly improve default inference speed on long files for devices that support async prediction, as well as real-time streaming for device/model combinations with a real-time factor greater than 1.

⚠️ Deprecations and changed interfaces

The extension on Process.processor is now ProcessInfo.processor and includes a new property ProcessInfo.hwModel, which returns a string similar to uname(&utsname) for non-Mac devices.
public func modelSupport(for deviceName: String) -> (default: String, disabled: [String]) is now a disfavored overload; prefer public func modelSupport(for deviceName: String, from config: ModelSupportConfig? = nil) -> ModelSupport.

What's Changed

Make additional initializers, functions, members public for extensibility by @bpkeene in #192
Fix start time logic for file loading by @ZachNagengast in #195
Change static var stored properties to static let by @fumoboy007 in #190
Add VoiceActivityDetector base class by @a2they in #199
Set default concurrentWorkerCount by @atiorh in #205
Improving modularity and code structure by @a2they in #212
Add model support config fetching from model repo by @ZachNagengast in #216
Example app VAD default + memory reduction by @ZachNagengast in #217

New Contributors

@bpkeene made their first contribution in #192
@fumoboy007 made their first contribution in #190
@a2they made their first contribution in #199
@atiorh made their first contribution in #205
@1amageek made their first contribution in #216
@keleftheriou made their first contribution in #217

Full Changelog: v0.8.0...v0.9.0

Contributors

ZachNagengast, fumoboy007, and 5 other contributors


12 Jul 18:50

ZachNagengast

v0.8.0

02763ca

v0.8.0

With this release, we had a huge focus on reliability in terms of memory usage (especially for large files), common crashes, and various correctness errors that the community has reported in issues.

Highlights

Memory-efficient Handling of Large Files: WhisperKit is much more memory-efficient for large files with some improvements to #158 by @finnvoor. This change speeds up the audio resampling significantly and removes a few other unnecessary data copies. It also fixes a buffer misalignment issue that caused #183. For more aggressive memory savings, the default audio file chunking size can be configured through maxReadFrameSize. Here is the memory chart for a ~200 MB compressed audio file from #174, showing up to 3x faster resampling with 50% less memory. Note that WhisperKit requires uncompressed Float values for the MLModel input, so the compressed file becomes roughly 1 GB at minimum after it is read and resampled to 16 kHz mono.

[Figure: memory usage charts, before and after]

Progress Bar: @finnvoor also contributed a fix to the progress when in VAD chunking mode. WhisperAX now shows an indicator while the file is being resampled and the overall progress of the decoding. Note that this is not an exactly linear progress bar because it is based on how many windows have completed decoding, so it will speed up toward the end of the process as more windows complete.
Various other improvements: We also did a pass on our current issues and resolved many of them. If you have one pending, please test out this version to verify it is fixed. Thanks again to everyone who contributes to these issues; it helps immensely to make WhisperKit better for everyone 🚀.
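
To make the memory note in the first highlight more concrete, here is a back-of-the-envelope sketch of how large uncompressed 16 kHz mono Float audio gets. The 4-hour duration is an assumption picked for illustration; the actual length of the file in #174 is not stated here.

```swift
/// Bytes needed to hold mono Float32 audio at the given sample rate.
func uncompressedBytes(durationSeconds: Double, sampleRate: Double = 16_000) -> Int {
    let bytesPerSample = MemoryLayout<Float>.size // 4 bytes per sample
    return Int(durationSeconds * sampleRate) * bytesPerSample
}

// Assumed duration for illustration: a 4-hour recording.
let bytes = uncompressedBytes(durationSeconds: 4 * 60 * 60)
let gigabytes = Double(bytes) / 1_000_000_000
print(bytes, gigabytes)
```

This counts only the samples themselves, which is why it is a lower bound; any intermediate copies made during reading or resampling add to it.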

What's Changed

Remove purported OGG support from CLI by @iandundas in #153
Resample audio files in 10mb chunks by @finnvoor in #158
feat: add version output by @chenrui333 in #148
Fix TEST_HOST name mismatch by @CongLeSolutionX in #177
feat: copy text with eager decoding, add keyboard shortcut by @iGerman00 in #178
Fix progress when using VAD chunking by @finnvoor in #179
Fix indeterminate tests by @ZachNagengast in #180
Fix resampling large files by @ZachNagengast in #183

New Contributors

@iandundas made their first contribution in #153
@chenrui333 made their first contribution in #148
@CongLeSolutionX made their first contribution in #177
@iGerman00 made their first contribution in #178

Full Changelog: v0.7.2...v0.8.0

Contributors

iandundas, chenrui333, and 4 other contributors


30 May 13:11

ZachNagengast

v0.7.2

aa4bb90

v0.7.2

Early stopping now keeps track of the chunked window internally when running async transcription via the VAD chunking method. This will give further control for stopping specific windows based on your custom criteria in the TranscriptionCallback.
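
As a sketch of what a per-window stopping criterion could look like, the example below uses a stand-in progress type. `WindowProgress` and its fields are hypothetical, and it assumes that returning `false` from the callback asks the decoder to stop while `nil` or `true` lets it continue; check the `TranscriptionCallback` documentation for the exact semantics.

```swift
// Stand-in for WhisperKit's TranscriptionProgress; the field names are assumptions.
struct WindowProgress {
    let windowId: Int
    let text: String
}

// Callback shape: (progress) -> Bool?
// Assumption: false requests a stop, nil or true lets decoding continue.
typealias Callback = (WindowProgress) -> Bool?

let maxCharactersPerWindow = 20

let stopLongWindows: Callback = { progress in
    // Custom criterion: stop any window whose text has grown too long.
    if progress.text.count > maxCharactersPerWindow {
        return false
    }
    return nil
}

// Simulate two windows reporting progress.
let shortWindow = stopLongWindows(WindowProgress(windowId: 0, text: "Hello"))
let longWindow = stopLongWindows(WindowProgress(windowId: 1, text: "la la la la la la la la la"))
```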

What's Changed

Fix early stopping for VAD by @ZachNagengast in #155

Full Changelog: v0.7.1...v0.7.2

Contributors

ZachNagengast


25 May 22:55

ZachNagengast

v0.7.1

3aa94e8

v0.7.1

Hotfix for shouldEarlyStop logic

What's Changed

Ensures early stopping flag on TextDecoder is always reset at the beginning of a new loop

Full Changelog: v0.7.0...v0.7.1


24 May 10:02

ZachNagengast

v0.7.0

c829f9a

v0.7.0

This is a very exciting release because we're seeing yet another massive speedup in offline throughput thanks to VAD based chunking 🚀

Highlights

Energy VAD based chunking 🗣️ @jkrukowski
- There is a new decoding option called chunkingStrategy which can significantly speed up your single file transcriptions with minimal WER downsides.
- It works by finding a clip point in the middle of the longest silence (lowest audio energy) in the last 15s of a 30s window and uses that to split up all the audio ahead of time so it can be asynchronously decoded in parallel.
- Here's a video of it in action, comparing the .none chunking strategy with .vad:

vad.chunking.mp4
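
The clip-point search described in the chunking highlight can be sketched roughly as follows. This is a simplified illustration and not WhisperKit's implementation: the `splitPoint` function, its fixed frame length, and the mean-squared-amplitude energy measure are all assumptions, and the toy signal is 40 samples long rather than a real 30s window.

```swift
/// Returns the sample index in the middle of the longest low-energy run found in
/// the second half of `window`, or nil if no frame falls below the threshold.
func splitPoint(in window: [Float], frameLength: Int, silenceThreshold: Float) -> Int? {
    var longest: (start: Int, length: Int)? = nil
    var runStart: Int? = nil
    var frameStart = window.count / 2 // search only the second half of the window

    while frameStart + frameLength <= window.count {
        let frame = window[frameStart..<(frameStart + frameLength)]
        // Mean squared amplitude as a simple energy measure.
        let energy = frame.reduce(0) { $0 + $1 * $1 } / Float(frameLength)

        if energy < silenceThreshold {
            // Extend the current silent run, or start a new one at this frame.
            let start = runStart ?? frameStart
            runStart = start
            let length = frameStart + frameLength - start
            if length > (longest?.length ?? 0) {
                longest = (start: start, length: length)
            }
        } else {
            runStart = nil
        }
        frameStart += frameLength
    }

    guard let run = longest else { return nil }
    return run.start + run.length / 2
}

// Toy signal: 40 samples, loud except for a quiet stretch at indices 24..<32.
var samples = [Float](repeating: 1.0, count: 40)
for i in 24..<32 { samples[i] = 0.0 }
for i in 0..<12 { samples[i] = 0.0 } // silence in the first half is ignored

let clip = splitPoint(in: samples, frameLength: 4, silenceThreshold: 0.01)
```

Once clip points like this are known for the whole file, the resulting chunks are independent of each other, which is what allows them to be decoded in parallel.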

Detect language helper:
- You can now call detectLanguage with just an audio path as input from the main whisperKit object. This will return a simple language code and probability back as a tuple, and has minimal logging/timing.
- Example:

let whisperKit = try await WhisperKit()
let (language, probs) = try await whisperKit.detectLanguage(audioPath: "your/audio/path/spanish.wav")
print(language) // "es"

WhisperKit via Expo @seb-sep
- For anyone who's been wanting to use WhisperKit in React Native, @seb-sep is maintaining a repo that makes it easy, and has also set up an automation that updates it with each new WhisperKit release. Check it out here: https://github.com/seb-sep/whisper-kit-expo
Bug fixes and enhancements:
- @jiangdi0924 and @fengcunhan contributed some nice fixes in this release with #136 and #138 (see below)
- Also moved the decoding progress callback to be fully async so that it doesn't block the decoder thread

What's Changed

Fix language detection by @jkrukowski in #133
Fix the reset operation exception in transcribeFile in the Demo. by @jiangdi0924 in #136
gh action for making pr to whisper-kit-expo on whisperkit release by @seb-sep in #137
add reStartRecordingLive function by @fengcunhan in #138
Added @_disfavoredOverload for deprecated methods by @jkrukowski in #143
VAD audio chunking by @jkrukowski in #135
Async Progress Callback by @ZachNagengast in #145
Detect language helper by @ZachNagengast in #146

New Contributors

@jiangdi0924 made their first contribution in #136
@seb-sep made their first contribution in #137
@fengcunhan made their first contribution in #138

Full Changelog: v0.6.1...v0.7.0

Contributors

jkrukowski, fengcunhan, and 3 other contributors


01 May 08:38

ZachNagengast

v0.6.1

c20943d

v0.6.1

Smaller patch release with some nice improvements and two new contributors 🙌

Highlights

Tokenizer no longer requires a HubApi request to succeed if the files are already downloaded
- This was a big request from the community and should enable offline transcription as long as everything is downloaded already
- Also made the function public so you can bundle the tokenizer with the app along with the model files
@smpanaro found a really nice speedup across the board by using IOSurface backed MLMultiArrays
- Especially noticeable on older devices
General cleanup, including a nice bug fix from @couche1 when streaming via the CLI

What's Changed

Memory and Latency Regression Tests by @Abhinay1997 in #99
- @Abhinay1997 is building out this regression test suite so we can be sure we're always shipping code that has the same or better speed, accuracy, memory, etc
Fix audio file requirement for streaming mode by @couche1 in #121
Use IOSurface-backed MLMultiArrays for float16 by @smpanaro in #130
Cleanup by @ZachNagengast in #132

New Contributors

@couche1 made their first contribution in #121
@smpanaro made their first contribution in #130

Full Changelog: v0.6.0...v0.6.1

Contributors

ZachNagengast, smpanaro, and 2 other contributors


18 Apr 06:22

ZachNagengast

v0.6.0

076b670

v0.6.0

Highlights

Async batch transcription is here 🎉 contributed by @jkrukowski
- With this release, you can now transcribe multiple audio files simultaneously, fully utilizing the new async prediction APIs released with iOS 17/macOS 14 (see the WWDC video here).
- New interface with audioPaths input:
  ```
  let audioPaths = [
      "/path/to/file1.wav",
      "/path/to/file2.wav"
  ]
  let whisperKit = try await WhisperKit()
  let transcriptionResults: [[TranscriptionResult]?] = await whisperKit.transcribe(audioPaths: audioPaths)
  ```
- You can also use it via the CLI using the new argument --audio-folder "path/to/folder/"
- Future work will be chunking up single files to significantly speed up long-form transcription
- Note that this entails breaking changes and deprecations, see below for the full upgrade guide.
Several bug fixes, accuracy improvements, and quality of life upgrades by @hewigovens @shawiz and @jkrukowski
- Every issue raised and PR merged from the community helps make WhisperKit better every release, thank you and keep them coming! 🙏
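
For handling the `[[TranscriptionResult]?]` value returned by the batch interface above, a sketch along these lines may help. `FileResult` is a stand-in for `TranscriptionResult` with only a `text` field, and the code assumes that results come back in the same order as the input paths and that a `nil` entry means that file could not be transcribed; both are assumptions worth verifying.

```swift
// Stand-in for WhisperKit's TranscriptionResult; only `text` is modeled here.
struct FileResult { let text: String }

let audioPaths = ["/path/to/file1.wav", "/path/to/file2.wav"]

// Same shape as the batch API's return value: one optional array per input path.
// Assumption: a nil entry means that file could not be transcribed.
let transcriptionResults: [[FileResult]?] = [
    [FileResult(text: "Hello"), FileResult(text: "world")],
    nil
]

var transcripts: [String: String] = [:]
var failed: [String] = []

// Assumption: results are returned in the same order as the input paths.
for (path, results) in zip(audioPaths, transcriptionResults) {
    if let results = results {
        transcripts[path] = results.map { $0.text }.joined(separator: " ")
    } else {
        failed.append(path)
    }
}
```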

⚠️ Upgrade Guide

We aim to minimize breaking changes, so with this update we added a few deprecation flags for changed interfaces. These will be removed later, but for now they are still usable and will not cause build errors. There are some breaking changes for lower-level and newer methods, so if you do notice build errors, see the full guide below.

Full Upgrade Guide

API changes

Deprecations

`WhisperKit`

Deprecated

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]

Deprecated

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]

`TextDecoding`

Deprecated

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> [DecodingResult]

use instead

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> DecodingResult

Deprecated

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> [DecodingResult]

use instead

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> DecodingResult

Breaking changes

removed Transcriber protocol

`AudioProcessing`

static func loadAudio(fromPath audioFilePath: String) -> AVAudioPCMBuffer?

becomes

static func loadAudio(fromPath audioFilePath: String) throws -> AVAudioPCMBuffer
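
For call sites affected by this optional-to-throws change, the migration might look something like the sketch below. The `loadAudio` function here is a hypothetical stand-in that returns placeholder samples (the real one returns an `AVAudioPCMBuffer`), and `AudioLoadError` is an invented error type used only for illustration.

```swift
// Hypothetical error type for this sketch only.
struct AudioLoadError: Error { let path: String }

// Stand-in with the new throwing shape; the real AudioProcessing.loadAudio
// returns an AVAudioPCMBuffer rather than an array of samples.
func loadAudio(fromPath path: String) throws -> [Float] {
    guard path.hasSuffix(".wav") else { throw AudioLoadError(path: path) }
    return [0.0, 0.1, 0.2] // placeholder samples
}

// Call sites that relied on an optional can keep that behavior with try?.
let maybeSamples = try? loadAudio(fromPath: "notes.txt")

// New call sites can inspect the failure instead of getting a bare nil.
var failureDescription = ""
do {
    _ = try loadAudio(fromPath: "notes.txt")
} catch let error as AudioLoadError {
    failureDescription = "could not load \(error.path)"
} catch {
    failureDescription = "unexpected error: \(error)"
}
```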

`AudioStreamTranscriber`

public init(
    audioProcessor: any AudioProcessing, 
    transcriber: any Transcriber, 
    decodingOptions: DecodingOptions, 
    requiredSegmentsForConfirmation: Int = 2, 
    silenceThreshold: Float = 0.3, 
    compressionCheckWindow: Int = 20, 
    useVAD: Bool = true, 
    stateChangeCallback: AudioStreamTranscriberCallback?
)

becomes

public init(
    audioEncoder: any AudioEncoding,
    featureExtractor: any FeatureExtracting,
    segmentSeeker: any SegmentSeeking,
    textDecoder: any TextDecoding,
    tokenizer: any WhisperTokenizer,
    audioProcessor: any AudioProcessing,
    decodingOptions: DecodingOptions,
    requiredSegmentsForConfirmation: Int = 2,
    silenceThreshold: Float = 0.3,
    compressionCheckWindow: Int = 20,
    useVAD: Bool = true,
    stateChangeCallback: AudioStreamTranscriberCallback?
)

`TextDecoding`

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) -> DecodingInputs?

becomes

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) throws -> DecodingInputs

What's Changed

Add microphoneUnavailable error by @hewigovens in #113
Improve token timestamps and language detection by @ZachNagengast in #114
Respect skipSpecialTokens option in the decodingCallback function by @shawiz in #115
Disallow invalid --language values by @jkrukowski in #116
Run tests in parallel on CI by @jkrukowski in #117
Async batch predictions by @jkrukowski in #107

New Contributors

@hewigovens made their first contribution in #113
@shawiz made their first contribution in #115

Full Changelog: v0.5.0...v0.6.0

Contributors

shawiz, hewigovens, and 2 other contributors


30 Mar 08:38

ZachNagengast

v0.5.0

61df12a

v0.5.0

This is a HUGE release with some great new features and fixes 🙌

Highlights

Timestamp logits filter by @jkrukowski
- Significantly improves the amount of timestamp tokens in a particular window, which helps a lot with segmentation
- This is on by default but can be disabled using the decoding option withoutTimestamps: true
Language detection by @Abhinay1997
- New function on the TextDecoding protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
- Enabled by default for decoding options where usePrefillPrompt: false and language: nil, and the model is not English-only.
First token log prob thresholds fallback check by @jkrukowski
- This feature is not in the original OpenAI implementation but helps reduce hallucinations quite a bit.
- Often, fallbacks due to the log prob threshold are immediately identifiable by the first token, so this reduces the number of forward passes needed to move to a higher temperature
Distil whisper support
- Recently distil-large-v3 was released, which massively speeds up predictions at minimal quality loss. We've converted and optimized 4 distil models for use in WhisperKit on CoreML, and they're really fast:
  - distil-large-v3
  - distil-large-v3_594MB
  - distil-large-v3_turbo
  - distil-large-v3_turbo_600MB
- Note that these do not yet have word timestamp alignment heads, so can't be used with wordTimestamps: true
- It can be run via CLI as well:
  - swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
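
The first-token log prob fallback check from the highlights above can be pictured with a small sketch like the one below. It is not WhisperKit's decoder loop: the `Attempt` type, the `shouldFallback` helper, the threshold value, and the log probs are all made up for illustration.

```swift
/// Decides whether to abandon the current decoding attempt after the first token.
/// `threshold` is nil when the check is disabled.
func shouldFallback(firstTokenLogProb: Float, threshold: Float?) -> Bool {
    guard let threshold = threshold else { return false }
    return firstTokenLogProb < threshold
}

// Hypothetical record of one decoding attempt at a given temperature.
struct Attempt {
    let temperature: Float
    let firstTokenLogProb: Float
}

// Made-up first-token log probs, one per temperature tried in order.
let attempts = [
    Attempt(temperature: 0.0, firstTokenLogProb: -2.3),
    Attempt(temperature: 0.2, firstTokenLogProb: -1.9),
    Attempt(temperature: 0.4, firstTokenLogProb: -0.4),
]

// Assumed threshold value for illustration only.
let threshold: Float = -1.5

// Take the first attempt whose first token clears the threshold. Rejected
// attempts are dropped after one token instead of a full decoding window.
let accepted = attempts.first {
    !shouldFallback(firstTokenLogProb: $0.firstTokenLogProb, threshold: threshold)
}
```

The saving comes from rejecting an attempt after a single forward pass rather than waiting for the average log prob of a whole window.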

⚠️ Experimental new stream mode

We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.

Recommended settings for the best performance for this iteration are:

Max tokens per loop < 100
Max fallback count < 2
Prompt and cache prefill true

Looking for feedback on:

Token confirmation numbers that work well
Model, device, and settings combinations that work well

RPReplay_Final1711775397.MP4

What's Changed

CLI Task Handling in #85
Added TimestampRulesFilter implementation by @jkrukowski in #45
Support distil whisper models in #88
Language Detection by @Abhinay1997 in #78
Tokenizer refactor, tests cleanup by @jkrukowski in #87
First token logProb thresholding by @jkrukowski in #90
[#93] Add missing settings to decoding options by @cgfarmer4 in #94
"Eager" streaming mode via word timestamps in #95

New Contributors

@Abhinay1997 made their first contribution in #78

Full Changelog: v0.4.1...v0.5.0

Contributors

cgfarmer4, jkrukowski, and Abhinay1997
