# v0.8: Low Code Framework to Efficiently Build Custom LLMs on Your Data
Full release blog post: https://predibase.com/blog/ludwig-v0-8-open-source-toolkit-to-build-and-fine-tune-custom-llms-on-your-data
## What's Changed
- Make fill_value a medium impact parameter in preprocessing by @arnavgarg1 in #3155
- Allow auto cherry-pick into release-0.7 by @tgaddair in #3157
- Fixed confidence_penalty for newer versions of pytorch by @tgaddair in #3156
- Fixed set explanations by @tgaddair in #3160
- Bump to hummingbird 0.4.8 by @tgaddair in #3164
- Update SequenceGeneratorDecoder to output predictions and probabilities by @jeffkinnison in #3152
- Disable sampling in preprocessing or when it results in too few rows by @ShreyaR in #3117
- Unpin pyarrow by @tgaddair in #3167
- Make Horovod an optional dependency when using Ray by @tgaddair in #3166
- Skip sample_ratio validation when using Dask to prevent materialization of DF by @tgaddair in #3174
- Fix TorchVision channel preprocessing by @geoffreyangus in #3173
- Bump Ludwig to v0.7.1 by @tgaddair in #3179
- Add fallback mirrors to dataset API by @abidwael in #3168
- Log cached dataset write paths during cache miss by @arnavgarg1 in #3181
- Re-enable benchmark tests on Sarcos dataset by @abidwael in #3169
- Disable passthrough decoder for all feature types by @arnavgarg1 in #3151
- Log when cached dataset can't be found by @arnavgarg1 in #3192
- Remove hard dependency on ludwig[tree]. Check `model.type()` instead of `instanceof(model)`. by @justinxzhao in #3184
- Add sequence decoder integration tests by @jeffkinnison in #3175
- int: [REBASE] Remove unnecessary JSON schema code by @ksbrar in #3196
- fix: [REBASE] Hoist uniqueItemProperties to top of feature JSON schema by @ksbrar in #3183
- Guarantee determinism when sampling (either overall via `sample_ratio`, or while balancing data) by @arnavgarg1 in #3191
- Use tmpdir for more files generated during tests by @tgaddair in #3197
- Removed .vscode and added to .gitignore by @tgaddair in #3201
- Revert vscode by @tgaddair in #3202
- Fixes dict_hash discrepancy by @w4nderlust in #3195
- Reset Ray address by @arnavgarg1 in #3200
- Unpin scikit-learn. by @justinxzhao in #3185
- Fixed learning_rate_scheduler params in automl by @tgaddair in #3203
- Fix Docker image dependencies and add tests for minimal install by @tgaddair in #3186
- Bump to v0.7.2 by @tgaddair in #3208
- fix: [REBASE] Misc. JSON schema fixes by @ksbrar in #3187
- fix: [REBASE] Streamline GBM defaults schema by @ksbrar in #3188
- Update sequence/text feature `max_sequence_length` default to `None` by @geoffreyangus in #3205
- Add category onehot encoder for both ECD and GBM by @tgaddair in #3057
- Enable `transformer` encoder and disable `embed` encoder from `SequenceCombiner` by @abidwael in #3154
- Add auxiliary validation for all features to be present in comparator combiner entities by @abidwael in #3216
- Disable number decoder in the decoder config by @abidwael in #3217
- Add `non_zero` to `common_fields.NumFCLayersField` by @abidwael in #3215
- Add the ability to specify a local or S3 HF mirror for more guaranteed loading of pre-trained HF models. by @justinxzhao in #3211
- Schemafy merge_fixed_preprocessing_params by @tgaddair in #3223
- Tagger decoder config override and auxiliary validation checks by @arnavgarg1 in #3222
- Refactor loss implementation to use the schema config for all parameters by @tgaddair in #3227
- follow up: unregister passthrough decoder for all features by @abidwael in #3225
- top_k should be a positive integer by @connor-mccorm in #3230
- Refactored combiner registry and broke circular dep with schema by @tgaddair in #3228
- Implements `sequence_length` param by @geoffreyangus in #3221
- Add columns and data types to ludwig datasets by @connor-mccorm in #3231
- Fixes Explain step for tied weights by @geoffreyangus in #3214
- Fix @slow hf tests by @justinxzhao in #3233
- Remove partial RayTune checkpoints for trials that have not completed because of forceful termination by @arnavgarg1 in #3232
- Replace NaN in timeseries rows with `padding_value` by @tgaddair in #3238
- Adds support for TEXT features when using GBM with tf-idf encoder by @tgaddair in #3235
- Persist Dask Dataframe after binary image/audio reads by @arnavgarg1 in #3241
- Add Timeseries forecasting for column-major data, and introduce Timeseries output feature by @tgaddair in #3212
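The new Timeseries output feature above enables forecasting from column-major data. As a rough illustration, a forecasting config pairs a timeseries input (the history window) with a timeseries output (the horizon). The column names below are invented and the exact schema should be checked against the Ludwig docs:

```python
# Hypothetical sketch of a timeseries forecasting config (#3212).
# Column names are made up; preprocessing keys may differ from the released schema.
forecasting_config = {
    "input_features": [
        # historical window of observations, stored column-major
        {"name": "sales_history", "type": "timeseries"},
    ],
    "output_features": [
        # horizon to forecast, enabled by the new Timeseries output feature
        {"name": "sales_future", "type": "timeseries"},
    ],
}
```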
- fix: transform `onehot` encoder outputs to float32 tensor by @abidwael in #3242
- Pin `torchaudio` by @geoffreyangus in #3244
- Pin `torchvision` and `torchtext` by @geoffreyangus in #3248
- Filter entities from comparator combiner when not listed in input_features by @tgaddair in #3251
- Reset os.environ var in hf_utils. by @justinxzhao in #3253
- fix typo in `pretrained_model_name_or_path` by @abidwael in #3257
- Fixes torch DDP distributed metric computation for AUROC by @geoffreyangus in #3234
- Unpin `torchvision`, `torchtext`, and `torchaudio`. by @justinxzhao in #3255
- Added compute tiers to parameter metadata by @tgaddair in #3254
- Allow encoders for GBMs by @arnavgarg1 in #3258
- feat: Add env var `LUDWIG_SCHEMA_VALIDATION_POLICY` to change marshmallow validation strictness by @tgaddair in #3226
- Removes invalid keys from GBM defaults in the schema by @arnavgarg1 in #3252
- Add support for `model.compile` in PyTorch 2.0 by @tgaddair in #3246
- Fix ludwig docker by @tgaddair in #3264
- Bump to v0.7.3 by @tgaddair in #3267
- Update gpt text encoder `afn` parameter default to what's listed in HF docs. by @justinxzhao in #3261
- Add test for falling back to HF model that's not in the ludwig pretrained dir by @justinxzhao in #3256
- Fixed non-scalar (text, vector, set) output feature explanations by @tgaddair in #3269
- Use `LudwigFeatureDict` to permit module keys that are rejected by torch ModuleDict. by @justinxzhao in #3270
- Fixed handling of datetime types in input parquet files by @tgaddair in #3274
- Fix date explanations by @tgaddair in #3276
- Handle `np.bool` to JSON `NumpyEncoder` by @abidwael in #3280
- Update all combiners to use `.get` to access `LudwigFeatureDict` contents by @abidwael in #3279
- Check that concat combiner doesn't receive a mixture of non-reduced sequence and non-sequence features. by @justinxzhao in #3271
- Better handling for missing dataset columns by @connor-mccorm in #3285
- feat: Add `kwargs` option to all file readers and feed `nrows` where possible by @ksbrar in #3266
- Update version to 0.8.dev by @justinxzhao in #3286
- Added DeBERTa (v2 / v3) text encoder by @tgaddair in #3289
- Check `fill_with_const` has `fill_value` for binary features by @abidwael in #3278
- Fixed reduce_option=concat for auto_transformer and deberta by @tgaddair in #3291
- Adds category_distribution output feature for providing labels as a probability distribution by @tgaddair in #3288
- fix: [REBASE] Add config transformation that removes extra `type` param if it exists in the defaults config by @ksbrar in #3296
- Pass in `kwargs` to `read_parquet` by @abidwael in #3293
- Unpin `transformers` by @geoffreyangus in #3290
- fix: Fix `read_parquet` for remote filesystems by @ksbrar in #3294
- fix: Remove duplicate enums in `deberta` parameters by @ksbrar in #3299
- fix: Add separate output config registries for ECD and GBM by @ksbrar in #3306
- Use `frac` arg in `df.sample` instead of `n` by @abidwael in #3307
- Added `prompt_template` preprocessing param for text features by @tgaddair in #3298
- Allow different splits for hyperopt metric reporting by @abidwael in #3282
- fix: Raise error if list provided to `StringOptions` has duplicates and change validation errors to assertion errors for this field as well. by @ksbrar in #3302
- Add `PROBABILITIES` and `PREDICTIONS` to the prediction set for the sequence tagger decoder. by @justinxzhao in #3300
- Fixed URI loading by @tgaddair in #3310
- Added `trainer.gradient_accumulation_steps` for increasing effective batch size by @tgaddair in #3305
- Fixed cache_encoder_embeddings with tied weights by @tgaddair in #3308
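With gradient accumulation, gradients from several micro-batches are summed before each optimizer step, so the effective batch size grows without increasing per-step memory. A back-of-the-envelope sketch (plain arithmetic with illustrative values, not Ludwig API calls):

```python
# Effective batch size under trainer.gradient_accumulation_steps (#3305).
# All values are illustrative.
batch_size = 16                  # micro-batch per forward/backward pass
gradient_accumulation_steps = 4  # micro-batches accumulated per optimizer step
num_workers = 2                  # data-parallel workers, if training is distributed

effective_batch_size = batch_size * gradient_accumulation_steps * num_workers
print(effective_batch_size)  # 128
```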
- test: Parameterize dataset read chunking test by @ksbrar in #3311
- Allow providing encoders / decoder params as strings by @tgaddair in #3309
- Fixes for CI failures by @jeffkinnison in #3315
- Pin pandas < 2.0 by @jeffkinnison in #3319
- Add combinatorial tests to the CI by @abidwael in #2991
- int: Prevent encoder enums from being reshuffled by @ksbrar in #3313
- Ensure that `iq` and all number feature preprocessing normalizations work on dask backends. by @justinxzhao in #3297
- Back out dataset profile protos in ludwig. by @justinxzhao in #3318
- Docker image upgrades to Python 3.9, Ray 2.3.1, Torch 2.0.0, CUDA 11.8. Deprecating Horovod by @abidwael in #3320
- Consolidate hyperopt schema in `ludwig.schema.hyperopt` by @jeffkinnison in #3190
- Downgrade docker images to use Python 3.8 by @abidwael in #3325
- Prevent GBM memory leaks by forcing actor cleanup after each epoch by @arnavgarg1 in #3323
- Pass encoding dimensions to SequenceCombiner by @RXminuS in #3321
- Filter auto_transformer kwargs based on forward signature by @tgaddair in #3329
- Strip dataset path to figure its format correctly by @abidwael in #3331
- Use Python 3.8 in pytest by @abidwael in #3330
- Added union on ModelConfig and dict in create_model by @aseemk98 in #3333
- Speed up average number of tokens computation by @geoffreyangus in #3337
- Fix sequence decoder attribute error by @jeffkinnison in #3336
- Decode bytes to strings for DatasetInfo avg_words by @hungcs in #3338
- Add compute tier for DeBERTa by @connor-mccorm in #3340
- Pandas 2.0 update by @jeffkinnison in #3322
- fix image values inference for FieldInfo by @hungcs in #3341
- Add input and output shapes for attention reducer (combiner) by @abidwael in #3339
- Added distinct decoder registries by model type by @tgaddair in #3342
- Skip ames tests by @tgaddair in #3345
- [LLM] Support zero-shot learning and text generation through LLMs by @arnavgarg1 in #3335
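For the zero-shot LLM support above, a config roughly follows the v0.8 conventions described elsewhere in these notes: `model_type: llm`, a required `base_model`, and a top-level `prompt`. The sketch below is an assumption-laden illustration (the template text and feature names are made up); verify field names against the Ludwig docs:

```python
# Hedged sketch of a zero-shot text-generation config for the LLM model type
# (#3335, #3399). base_model follows #3423; feature names and the template
# text are hypothetical.
zero_shot_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "prompt": {
        "template": "Classify the sentiment of this review as positive or negative.\n"
                    "Review: {review}\nSentiment:",
    },
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "text"}],
}
```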
- Pin dask<2023.4.0 by @tgaddair in #3347
- Fix dependency install in CI by @abidwael in #3354
- fix slow tests by @abidwael in #3327
- Bump Transformers to 4.28.1 by @arnavgarg1 in #3357
- CI install dependency refactor by @abidwael in #3355
- Use `dist.barrier` instead of horovod legacy code by @abidwael in #3358
- Fixed distributed strategy registration to be explicit by @tgaddair in #3361
- Tune batch size using distributed training to catch edge case CUDA OOMs by @tgaddair in #2934
- Allow checkpoint download on non-coordinator process by @abidwael in #3363
- Description Fix by @connor-mccorm in #3364
- Add accuracy with micro averaging for category features. by @justinxzhao in #3367
- temporary `pip install expecttest` for torch nightly tests by @abidwael in #3366
- Clean up distributed batch size tuning logging by @abidwael in #3370
- Ensure worker synchronization during resume by @abidwael in #3369
- Remove `expecttest` temporary install by @abidwael in #3371
- Added DeepSpeed distributed strategy and backend by @tgaddair in #3362
- Default to LocalStrategy for GBM metrics by @abidwael in #3373
- Implemented CORN loss for ordinal classification by @tgaddair in #3375
- Fix MNIST datasets img path column by @shawnccx in #3376
- [LLM] Few-shot learning via Retrieval-augmented ICL by @geoffreyangus in #3351
- Correct TorchScript Typo in Readme by @samster25 in #3380
- Temporarily switch to `eval` mode when using `batch_size: 1` by @abidwael in #3378
- Fixed DDP checkpointing to save to local disk on each node by @tgaddair in #3381
- Add special handling for empty ray datasets. by @justinxzhao in #3384
- Ensure captum's halved batch size retry fn wrapper accepts the minimum batch size of 1. by @justinxzhao in #3382
- [LLM] Fine-Tune LLMs via Prompt tuning by @arnavgarg1 in #3359
- Added LoRA tuner for HuggingFace text encoders by @tgaddair in #3385
- fix: Typo in `zscore` metadata by @ksbrar in #3388
- Skip broken image torchvision test with efficientnet. by @justinxzhao in #3389
- fix: Fix dataset reading for parquet directories by @ksbrar in #3356
- Cut CI runtime down to 1 hour by splitting integration tests into 3 separate runs. by @justinxzhao in #3391
- Remove the flatten/unflatten ray backend tensor communication workaround by @justinxzhao in #3301
- revert: Revert "fix: Fix dataset reading for parquet directories (#3356)" by @ksbrar in #3396
- Sanitize GBM feature names to remove JSON special characters by @jeffkinnison in #3326
- Add back in support for Dictionary Class Weights by @connor-mccorm in #3392
- fix: sequence feature explain error by @jeffkinnison in #3397
- [LLM] Add Prefix Tuning, PTuning, LoRA, AdaLoRA and Adaption Prompt for LLM fine-tuning by @tgaddair in #3386
- [llm] Move prompt into top-level of config by @tgaddair in #3399
- [LLM] Set additional properties to False for Adapter field by @arnavgarg1 in #3401
- [llm] Added support for bfloat16 with deepspeed by @tgaddair in #3403
- [llm] Parameter metadata by @tgaddair in #3404
- Introduce a fourth test grouping to further speed up integration tests. by @justinxzhao in #3405
- Ray nightly compatibility by @arnavgarg1 in #3275
- [LLM] Fix Loss Computation for LLM Fine-tuning using shifted tensors/new loss function by @arnavgarg1 in #3408
- Fix tokenizer pad and eos token assumption for HF tokenizers by @arnavgarg1 in #3411
- Revert "Ray nightly compatibility (#3275)" by @tgaddair in #3413
- [llm] Create separate Predictor for LLMs and enable flash attention on CUDA by @tgaddair in #3409
- Fix Missing Param Metadata by @connor-mccorm in #3415
- [llm] Fixed adapter initialization and OOM on checkpoint loading by @tgaddair in #3416
- Remove Mutable Defaults by @connor-mccorm in #3418
- Fix: Device placement for metrics during evaluation by @arnavgarg1 in #3420
- Order prompt schema and add title by @connor-mccorm in #3419
- [llm] Fixed loading when performing full fine-tuning by @tgaddair in #3421
- [llm] Use lower default learning rate and rename llm decoders to extractors by @arnavgarg1 in #3422
- Remove the other None Param Metadata imputation by @connor-mccorm in #3424
- Add model name param metadata by @connor-mccorm in #3425
- Support combining multiple columns into a single text feature by @tgaddair in #3426
- fix: Install html5lib and test html reads by @jeffkinnison in #3428
- Added vector index interface and fixed import errors by @tgaddair in #3429
- [LLM] Skip left padding removal when there is no left padding by @arnavgarg1 in #3432
- Update README.md, fixed some quotes by @neuhausler in #3436
- Improve image/audio read throughput by 50% for image/audio features using Daft by @arnavgarg1 in #3249
- [LLM] Various fixes for LLM Fine-Tuning issues that caused loss disparity between train and val sets by @arnavgarg1 in #3437
- int: Add back `required` for input and output features to the Ludwig JSON schema by @ksbrar in #3442
- int: Pin `getdaft` to 0.1.6 by @ksbrar in #3443
- upgrade ludwig-gpu docker image to pytorch 2.0.0 by @jppgks in #3445
- [llm] Replace `model_name` with required `base_model`, add preset LLM registry, update internal adapter modules by @tgaddair in #3423
- [llm] fix device placement issues when using CPUs and GPUs during LLM fine tuning by @arnavgarg1 in #3447
- Unpin Daft by @arnavgarg1 in #3451
- Add param metadata to column parameter by @connor-mccorm in #3452
- fix: Add titles for augmentation schema options by @ksbrar in #3454
- pin torchmetrics to 0.11.4 by @arnavgarg1 in #3456
- Adding --no-cache-dir to dockerfile pip install by @noyoshi in #3455
- Replace all non-word characters in feature names to ensure no downstream issues with external libraries. by @justinxzhao in #3438
- [Example] Fix llm_finetune example json part by @chongxiaoc in #3461
- refactor: add type hints for encoders by @Dennis-Rall in #3449
- Unpin deepspeed by @arnavgarg1 in #3466
- Unpin torch nightly CI. by @justinxzhao in #3459
- typo in parameter of README.md by @Skizzy-create in #3471
- Zero copy initialization of models onto training workers for LLMs by @arnavgarg1 in #3469
- Remove `tables` requirement as it causes issues installing ludwig in linux env. by @justinxzhao in #3473
- Add QLoRA for 4-bit fine-tuning by @tgaddair in #3476
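Per these notes, QLoRA combines 4-bit quantization with a LoRA adapter, and a later change (#3492) requires the adapter when fine-tuning quantized models. A rough config sketch, with field names inferred from this changelog rather than verified against the final schema:

```python
# Assumed QLoRA fine-tuning config (#3476): 4-bit quantization plus LoRA.
# Field names are inferred from these release notes; treat them as assumptions
# and confirm against the Ludwig docs before use.
qlora_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "adapter": {"type": "lora"},  # required for quantized fine-tuning (#3492)
    "quantization": {"bits": 4},  # the "Q" in QLoRA
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "response", "type": "text"}],  # LLM fine-tuning requires a text output (#3493)
    "trainer": {"type": "finetune"},
}
```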
- int: Fix quantization schema by @ksbrar in #3479
- Fixed divide by zero when tuning batch size by @tgaddair in #3481
- Exclude frozen weights from checkpoints and fix evaluation using quantized LLMs by @tgaddair in #3483
- Removed extra clone during prediction and added quantization error handling by @tgaddair in #3484
- Updated Ludwig CI for torch nightly by @Infernaught in #3467
- Fixed QLoRA with multi-GPU and reduced CUDA memory pressure during eval by @tgaddair in #3486
- Enable evaluation with LLMs <7B by @geoffreyangus in #3478
- Add `ludwig upload` to push artifacts to HuggingFace Hub by @arnavgarg1 in #3480
- Add RoPE scaling to increase context length up to 8K for training or inference. by @arnavgarg1 in #3477
- Added quantization parameter metadata by @tgaddair in #3487
- Added CLI llama2 example by @tgaddair in #3491
- Updated presets with Llama-2 by @tgaddair in #3490
- [fix] Multiple test fixes by @jeffkinnison in #3489
- fix: Read partitioned parquet files from relative paths by @jeffkinnison in #3470
- Require using lora adapter when performing quantized fine-tuning by @tgaddair in #3492
- Require text output feature for LLM finetuning by @arnavgarg1 in #3493
- Pin bitsandbytes<0.41.0 by @tgaddair in #3494
- int: Pin `sqlalchemy` to `1.x.x` versions by @ksbrar in #3496
- Add parameter metadata for global_max_sequence_length by @arnavgarg1 in #3497
- Remove falcon from presets as it requires running untrusted code by @tgaddair in #3495
- Updates for ludwig-docs by @tgaddair in #3499
- Add use_pretrained attribute for AutoTransformers by @arnavgarg1 in #3498
- Readme updates for 0.8 by @tgaddair in #3500
- Improve description for generation config parameters by @arnavgarg1 in #3501
- Readme TYPO by @arnavgarg1 in #3502
- Make Ludwig logo smaller in the README by @abidwael in #3505
- Check that LLMs have exactly one text input feature by @geoffreyangus in #3508
- Fix temperature description by @arnavgarg1 in #3509
- Add Cosine Annealing LR scheduler as a decay method by @arnavgarg1 in #3507
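Cosine annealing decays the learning rate from its peak to a floor along half a cosine wave over the schedule length. The helper below illustrates the standard formula as a standalone sketch; it is not Ludwig's implementation:

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Standard cosine annealing: lr_max at step 0, lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Starts at the peak, hits the midpoint value halfway, ends at the floor.
print(cosine_annealing_lr(0, 100, 1e-3))    # 0.001
print(cosine_annealing_lr(50, 100, 1e-3))   # 0.0005
print(cosine_annealing_lr(100, 100, 1e-3))  # 0.0
```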
- Fix typo in function name for LR schedulers by @arnavgarg1 in #3511
- Bump to v0.8 by @tgaddair in #3512
## New Contributors
- @RXminuS made their first contribution in #3321
- @aseemk98 made their first contribution in #3333
- @shawnccx made their first contribution in #3376
- @samster25 made their first contribution in #3380
- @neuhausler made their first contribution in #3436
- @chongxiaoc made their first contribution in #3461
- @Skizzy-create made their first contribution in #3471
- @Infernaught made their first contribution in #3467
**Full Changelog**: v0.7...v0.8