# v0.8: Low Code Framework to Efficiently Build Custom LLMs on Your Data
Full release blog post: https://predibase.com/blog/ludwig-v0-8-open-source-toolkit-to-build-and-fine-tune-custom-llms-on-your-data
## What's Changed
- Make fill_value a medium impact parameter in preprocessing by @arnavgarg1 in #3155
- Allow auto cherry-pick into release-0.7 by @tgaddair in #3157
- Fixed confidence_penalty for newer versions of pytorch by @tgaddair in #3156
- Fixed set explanations by @tgaddair in #3160
- Bump to hummingbird 0.4.8 by @tgaddair in #3164
- Update SequenceGeneratorDecoder to output predictions and probabilities by @jeffkinnison in #3152
- Disable sampling in preprocessing or when it results in too few rows by @ShreyaR in #3117
- Unpin pyarrow by @tgaddair in #3167
- Make Horovod an optional dependency when using Ray by @tgaddair in #3166
- Skip sample_ratio validation when using Dask to prevent materialization of DF by @tgaddair in #3174
- Fix TorchVision channel preprocessing by @geoffreyangus in #3173
- Bump Ludwig to v0.7.1 by @tgaddair in #3179
- Add fallback mirrors to dataset API by @abidwael in #3168
- Log cached dataset write paths during cache miss by @arnavgarg1 in #3181
- Re-enable benchmark tests on Sarcos dataset by @abidwael in #3169
- Disable passthrough decoder for all feature types by @arnavgarg1 in #3151
- Log when cached dataset can't be found by @arnavgarg1 in #3192
- Remove hard dependency on ludwig[tree]. Check `model.type()` instead of `instanceof(model)`. by @justinxzhao in #3184
- Add sequence decoder integration tests by @jeffkinnison in #3175
- int: [REBASE] Remove unnecessary JSON schema code by @ksbrar in #3196
- fix: [REBASE] Hoist uniqueItemProperties to top of feature JSON schema by @ksbrar in #3183
- Guarantee determinism when sampling (either overall via `sample_ratio`, or while balancing data) by @arnavgarg1 in #3191
- Use tmpdir for more files generated during tests by @tgaddair in #3197
- Removed .vscode and added to .gitignore by @tgaddair in #3201
- Revert vscode by @tgaddair in #3202
- Fixes dict_hash discrepancy by @w4nderlust in #3195
- Reset Ray address by @arnavgarg1 in #3200
- Unpin scikit-learn. by @justinxzhao in #3185
- Fixed learning_rate_scheduler params in automl by @tgaddair in #3203
- Fix Docker image dependencies and add tests for minimal install by @tgaddair in #3186
- Bump to v0.7.2 by @tgaddair in #3208
- fix: [REBASE] Misc. JSON schema fixes by @ksbrar in #3187
- fix: [REBASE] Streamline GBM defaults schema by @ksbrar in #3188
- Update sequence/text feature `max_sequence_length` default to `None` by @geoffreyangus in #3205
- Add category onehot encoder for both ECD and GBM by @tgaddair in #3057
- Enable `transformer` encoder and disable `embed` encoder from `SequenceCombiner` by @abidwael in #3154
- Add auxiliary validation for all features to be present in comparator combiner entities by @abidwael in #3216
- Disable number decoder in the decoder config by @abidwael in #3217
- Add `non_zero` to `common_fields.NumFCLayersField` by @abidwael in #3215
- Add the ability to specify a local or S3 HF mirror for more guaranteed loading of pre-trained HF models. by @justinxzhao in #3211
- Schemafy merge_fixed_preprocessing_params by @tgaddair in #3223
- Tagger decoder config override and auxiliary validation checks by @arnavgarg1 in #3222
- Refactor loss implementation to use the schema config for all parameters by @tgaddair in #3227
- follow up: unregister passthrough decoder for all features by @abidwael in #3225
- top_k should be a positive integer by @connor-mccorm in #3230
- Refactored combiner registry and broke circular dep with schema by @tgaddair in #3228
- Implements `sequence_length` param by @geoffreyangus in #3221
- Add columns and data types to ludwig datasets by @connor-mccorm in #3231
- Fixes Explain step for tied weights by @geoffreyangus in #3214
- Fix @slow hf tests by @justinxzhao in #3233
- Remove partial RayTune checkpoints for trials that have not completed because of forceful termination by @arnavgarg1 in #3232
- Replace NaN in timeseries rows with `padding_value` by @tgaddair in #3238
- Adds support for TEXT features when using GBM with tf-idf encoder by @tgaddair in #3235
- Persist Dask Dataframe after binary image/audio reads by @arnavgarg1 in #3241
- Add Timeseries forecasting for column-major data, and introduce Timeseries output feature by @tgaddair in #3212
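The new Timeseries output feature above enables forecasting from column-major data. As a rough illustration, a forecasting config pairs a timeseries input (the history window) with a timeseries output (the horizon). The column names below are invented and the exact schema should be checked against the Ludwig docs:

```python
# Hypothetical sketch of a timeseries forecasting config (#3212).
# Column names are made up; preprocessing keys may differ from the released schema.
forecasting_config = {
    "input_features": [
        # historical window of observations, stored column-major
        {"name": "sales_history", "type": "timeseries"},
    ],
    "output_features": [
        # horizon to forecast, enabled by the new Timeseries output feature
        {"name": "sales_future", "type": "timeseries"},
    ],
}
```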
- fix: transform `onehot` encoder outputs to float32 tensor by @abidwael in #3242
- Pin `torchaudio` by @geoffreyangus in #3244
- Pin `torchvision` and `torchtext` by @geoffreyangus in #3248
- Filter entities from comparator combiner when not listed in input_features by @tgaddair in #3251
- Reset os.environ var in hf_utils. by @justinxzhao in #3253
- fix typo in `pretrained_model_name_or_path` by @abidwael in #3257
- Fixes torch DDP distributed metric computation for AUROC by @geoffreyangus in #3234
- Unpin `torchvision`, `torchtext`, and `torchaudio`. by @justinxzhao in #3255
- Added compute tiers to parameter metadata by @tgaddair in #3254
- Allow encoders for GBMs by @arnavgarg1 in #3258
- feat: Add env var `LUDWIG_SCHEMA_VALIDATION_POLICY` to change marshmallow validation strictness by @tgaddair in #3226
- Removes invalid keys from GBM defaults in the schema by @arnavgarg1 in #3252
- Add support for `model.compile` in PyTorch 2.0 by @tgaddair in #3246
- Fix ludwig docker by @tgaddair in #3264
- Bump to v0.7.3 by @tgaddair in #3267
- Update gpt text encoder `afn` parameter default to what's listed in HF docs. by @justinxzhao in #3261
- Add test for falling back to HF model that's not in the ludwig pretrained dir by @justinxzhao in #3256
- Fixed non-scalar (text, vector, set) output feature explanations by @tgaddair in #3269
- Use `LudwigFeatureDict` to permit module keys that are rejected by torch ModuleDict. by @justinxzhao in #3270
- Fixed handling of datetime types in input parquet files by @tgaddair in #3274
- Fix date explanations by @tgaddair in #3276
- Handle `np.bool` to JSON `NumpyEncoder` by @abidwael in #3280
- Update all combiners to use `.get` to access `LudwigFeatureDict` contents by @abidwael in #3279
- Check that concat combiner doesn't receive a mixture of non-reduced sequence and non-sequence features. by @justinxzhao in #3271
- Better handling for missing dataset columns by @connor-mccorm in #3285
- feat: Add `kwargs` option to all file readers and feed `nrows` where possible by @ksbrar in #3266
- Update version to 0.8.dev by @justinxzhao in #3286
- Added DeBERTa (v2 / v3) text encoder by @tgaddair in #3289
- Check `fill_with_const` has `fill_value` for binary features by @abidwael in #3278
- Fixed reduce_option=concat for auto_transformer and deberta by @tgaddair in #3291
- Adds category_distribution output feature for providing labels as a probability distribution by @tgaddair in #3288
- fix: [REBASE] Add config transformation that removes extra `type` param if it exists in the defaults config by @ksbrar in #3296
- Pass in `kwargs` to `read_parquet` by @abidwael in #3293
- Unpin `transformers` by @geoffreyangus in #3290
- fix: Fix `read_parquet` for remote filesystems by @ksbrar in #3294
- fix: Remove duplicate enums in `deberta` parameters by @ksbrar in #3299
- fix: Add separate output config registries for ECD and GBM by @ksbrar in #3306
- Use `frac` arg in `df.sample` instead of `n` by @abidwael in #3307
- Added `prompt_template` preprocessing param for text features by @tgaddair in #3298
- Allow different splits for hyperopt metric reporting by @abidwael in #3282
- fix: Raise error if list provided to `StringOptions` has duplicates and change validation errors to assertion errors for this field as well. by @ksbrar in #3302
- Add `PROBABILITIES` and `PREDICTIONS` to the prediction set for the sequence tagger decoder. by @justinxzhao in #3300
- Fixed URI loading by @tgaddair in #3310
- Added `trainer.gradient_accumulation_steps` for increasing effective batch size by @tgaddair in #3305
- Fixed cache_encoder_embeddings with tied weights by @tgaddair in #3308
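With gradient accumulation, gradients from several micro-batches are summed before each optimizer step, so the effective batch size grows without increasing per-step memory. A back-of-the-envelope sketch (plain arithmetic with illustrative values, not Ludwig API calls):

```python
# Effective batch size under trainer.gradient_accumulation_steps (#3305).
# All values are illustrative.
batch_size = 16                  # micro-batch per forward/backward pass
gradient_accumulation_steps = 4  # micro-batches accumulated per optimizer step
num_workers = 2                  # data-parallel workers, if training is distributed

effective_batch_size = batch_size * gradient_accumulation_steps * num_workers
print(effective_batch_size)  # 128
```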
- test: Parameterize dataset read chunking test by @ksbrar in #3311
- Allow providing encoders / decoder params as strings by @tgaddair in #3309
- Fixes for CI failures by @jeffkinnison in #3315
- Pin pandas < 2.0 by @jeffkinnison in #3319
- Add combinatorial tests to the CI by @abidwael in #2991
- int: Prevent encoder enums from being reshuffled by @ksbrar in #3313
- Ensure that `iq` and all number feature preprocessing normalizations work on dask backends. by @justinxzhao in #3297
- Back out dataset profile protos in ludwig. by @justinxzhao in #3318
- Docker image upgrades to Python 3.9, Ray 2.3.1, Torch 2.0.0, CUDA 11.8. Deprecating Horovod by @abidwael in #3320
- Consolidate hyperopt schema in `ludwig.schema.hyperopt` by @jeffkinnison in #3190
- Downgrade docker images to use Python 3.8 by @abidwael in #3325
- Prevent GBM memory leaks by forcing actor cleanup after each epoch by @arnavgarg1 in #3323
- Pass encoding dimensions to SequenceCombiner by @RXminuS in #3321
- Filter auto_transformer kwargs based on forward signature by @tgaddair in #3329
- Strip dataset path to figure its format correctly by @abidwael in #3331
- Use Python 3.8 in pytest by @abidwael in #3330
- Added union on ModelConfig and dict in create_model by @aseemk98 in #3333
- Speed up average number of tokens computation by @geoffreyangus in #3337
- Fix sequence decoder attribute error by @jeffkinnison in #3336
- Decode bytes to strings for DatasetInfo avg_words by @hungcs in #3338
- Add compute tier for DeBERTa by @connor-mccorm in #3340
- Pandas 2.0 update by @jeffkinnison in #3322
- fix image values inference for FieldInfo by @hungcs in #3341
- Add input and output shapes for attention reducer (combiner) by @abidwael in #3339
- Added distinct decoder registries by model type by @tgaddair in #3342
- Skip ames tests by @tgaddair in #3345
- [LLM] Support zero-shot learning and text generation through LLMs by @arnavgarg1 in #3335
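For the zero-shot LLM support above, a config roughly follows the v0.8 conventions described elsewhere in these notes: `model_type: llm`, a required `base_model`, and a top-level `prompt`. The sketch below is an assumption-laden illustration (the template text and feature names are made up); verify field names against the Ludwig docs:

```python
# Hedged sketch of a zero-shot text-generation config for the LLM model type
# (#3335, #3399). base_model follows #3423; feature names and the template
# text are hypothetical.
zero_shot_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "prompt": {
        "template": "Classify the sentiment of this review as positive or negative.\n"
                    "Review: {review}\nSentiment:",
    },
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "text"}],
}
```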
- Pin dask<2023.4.0 by @tgaddair in #3347
- Fix dependency install in CI by @abidwael in #3354
- fix slow tests by @abidwael in #3327
- Bump Transformers to 4.28.1 by @arnavgarg1 in #3357
- CI install dependency refactor by @abidwael in #3355
- Use `dist.barrier` instead of horovod legacy code by @abidwael in #3358
- Fixed distributed strategy registration to be explicit by @tgaddair in #3361
- Tune batch size using distributed training to catch edge case CUDA OOMs by @tgaddair in #2934
- Allow checkpoint download on non-coordinator process by @abidwael in #3363
- Description Fix by @connor-mccorm in #3364
- Add accuracy with micro averaging for category features. by @justinxzhao in #3367
- temporary `pip install expecttest` for torch nightly tests by @abidwael in #3366
- Clean up distributed batch size tuning logging by @abidwael in #3370
- Ensure worker synchronization during resume by @abidwael in #3369
- Remove `expecttest` temporary install by @abidwael in #3371
- Added DeepSpeed distributed strategy and backend by @tgaddair in #3362
- Default to LocalStrategy for GBM metrics by @abidwael in #3373
- Implemented CORN loss for ordinal classification by @tgaddair in #3375
- Fix MNIST datasets img path column by @shawnccx in #3376
- [LLM] Few-shot learning via Retrieval-augmented ICL by @geoffreyangus in #3351
- Correct TorchScript Typo in Readme by @samster25 in #3380
- Temporarily switch to `eval` mode when using `batch_size: 1` by @abidwael in #3378
- Fixed DDP checkpointing to save to local disk on each node by @tgaddair in #3381
- Add special handling for empty ray datasets. by @justinxzhao in #3384
- Ensure captum's halved batch size retry fn wrapper accepts the minimum batch size of 1. by @justinxzhao in #3382
- [LLM] Fine-Tune LLMs via Prompt tuning by @arnavgarg1 in #3359
- Added LoRA tuner for HuggingFace text encoders by @tgaddair in #3385
- fix: Typo in `zscore` metadata by @ksbrar in #3388
- Skip broken image torchvision test with efficientnet. by @justinxzhao in #3389
- fix: Fix dataset reading for parquet directories by @ksbrar in #3356
- Cut CI runtime down to 1 hour by splitting integration tests into 3 separate runs. by @justinxzhao in #3391
- Remove the flatten/unflatten ray backend tensor communication workaround by @justinxzhao in #3301
- revert: Revert "fix: Fix dataset reading for parquet directories (#3356)" by @ksbrar in #3396
- Sanitize GBM feature names to remove JSON special characters by @jeffkinnison in #3326
- Add back in support for Dictionary Class Weights by @connor-mccorm in #3392
- fix: sequence feature explain error by @jeffkinnison in #3397
- [LLM] Add Prefix Tuning, PTuning, LoRA, AdaLoRA and Adaption Prompt for LLM fine-tuning by @tgaddair in #3386
- [llm] Move prompt into top-level of config by @tgaddair in #3399
- [LLM] Set additional properties to False for Adapter field by @arnavgarg1 in #3401
- [llm] Added support for bfloat16 with deepspeed by @tgaddair in #3403
- [llm] Parameter metadata by @tgaddair in #3404
- Introduce a fourth test grouping to further speed up integration tests. by @justinxzhao in #3405
- Ray nightly compatibility by @arnavgarg1 in #3275
- [LLM] Fix Loss Computation for LLM Fine-tuning using shifted tensors/new loss function by @arnavgarg1 in #3408
- Fix tokenizer pad and eos token assumption for HF tokenizers by @arnavgarg1 in #3411
- Revert "Ray nightly compatibility (#3275)" by @tgaddair in #3413
- [llm] Create separate Predictor for LLMs and enable flash attention on CUDA by @tgaddair in #3409
- Fix Missing Param Metadata by @connor-mccorm in #3415
- [llm] Fixed adapter initialization and OOM on checkpoint loading by @tgaddair in #3416
- Remove Mutable Defaults by @connor-mccorm in #3418
- Fix: Device placement for metrics during evaluation by @arnavgarg1 in #3420
- Order prompt schema and add title by @connor-mccorm in #3419
- [llm] Fixed loading when performing full fine-tuning by @tgaddair in #3421
- [llm] Use lower default learning rate and rename llm decoders to extractors by @arnavgarg1 in #3422
- Remove the other None Param Metadata imputation by @connor-mccorm in #3424
- Add model name param metadata by @connor-mccorm in #3425
- Support combining multiple columns into a single text feature by @tgaddair in #3426
- fix: Install html5lib and test html reads by @jeffkinnison in #3428
- Added vector index interface and fixed import errors by @tgaddair in #3429
- [LLM] Skip left padding removal when there is no left padding by @arnavgarg1 in #3432
- Update README.md, fixed some quotes by @neuhausler in #3436
- Improve image/audio read throughput by 50% for image/audio features using Daft by @arnavgarg1 in #3249
- [LLM] Various fixes for LLM Fine-Tuning issues that caused loss disparity between train and val sets by @arnavgarg1 in #3437
- int: Add back `required` for input and output features to the Ludwig JSON schema by @ksbrar in #3442
- int: Pin `getdaft` to 0.1.6 by @ksbrar in #3443
- upgrade ludwig-gpu docker image to pytorch 2.0.0 by @jppgks in #3445
- [llm] Replace `model_name` with required `base_model`, add preset LLM registry, update internal adapter modules by @tgaddair in #3423
- [llm] fix device placement issues when using CPUs and GPUs during LLM fine tuning by @arnavgarg1 in #3447
- Unpin Daft by @arnavgarg1 in #3451
- Add param metadata to column parameter by @connor-mccorm in #3452
- fix: Add titles for augmentation schema options by @ksbrar in #3454
- pin torchmetrics to 0.11.4 by @arnavgarg1 in #3456
- Adding --no-cache-dir to dockerfile pip install by @noyoshi in #3455
- Replace all non-word characters in feature names to ensure no downstream issues with external libraries. by @justinxzhao in #3438
- [Example] Fix llm_finetune example json part by @chongxiaoc in #3461
- refactor: add type hints for encoders by @Dennis-Rall in #3449
- Unpin deepspeed by @arnavgarg1 in #3466
- Unpin torch nightly CI. by @justinxzhao in #3459
- typo in parameter of README.md by @Skizzy-create in #3471
- Zero copy initialization of models onto training workers for LLMs by @arnavgarg1 in #3469
- Remove `tables` requirement as it causes issues installing ludwig in linux env. by @justinxzhao in #3473
- Add QLoRA for 4-bit fine-tuning by @tgaddair in #3476
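Per these notes, QLoRA combines 4-bit quantization with a LoRA adapter, and a later change (#3492) requires the adapter when fine-tuning quantized models. A rough config sketch, with field names inferred from this changelog rather than verified against the final schema:

```python
# Assumed QLoRA fine-tuning config (#3476): 4-bit quantization plus LoRA.
# Field names are inferred from these release notes; treat them as assumptions
# and confirm against the Ludwig docs before use.
qlora_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "adapter": {"type": "lora"},  # required for quantized fine-tuning (#3492)
    "quantization": {"bits": 4},  # the "Q" in QLoRA
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "response", "type": "text"}],  # LLM fine-tuning requires a text output (#3493)
    "trainer": {"type": "finetune"},
}
```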
- int: Fix quantization schema by @ksbrar in #3479
- Fixed divide by zero when tuning batch size by @tgaddair in #3481
- Exclude frozen weights from checkpoints and fix evaluation using quantized LLMs by @tgaddair in #3483
- Removed extra clone during prediction and added quantization error handling by @tgaddair in #3484
- Updated Ludwig CI for torch nightly by @Infernaught in #3467
- Fixed QLoRA with multi-GPU and reduced CUDA memory pressure during eval by @tgaddair in #3486
- Enable evaluation with LLMs <7B by @geoffreyangus in #3478
- Add `ludwig upload` to push artifacts to HuggingFace Hub by @arnavgarg1 in #3480
- Add RoPE scaling to increase context length up to 8K for training or inference. by @arnavgarg1 in #3477
- Added quantization parameter metadata by @tgaddair in #3487
- Added CLI llama2 example by @tgaddair in #3491
- Updated presets with Llama-2 by @tgaddair in #3490
- [fix] Multiple test fixes by @jeffkinnison in #3489
- fix: Read partitioned parquet files from relative paths by @jeffkinnison in #3470
- Require using lora adapter when performing quantized fine-tuning by @tgaddair in #3492
- Require text output feature for LLM finetuning by @arnavgarg1 in #3493
- Pin bitsandbytes<0.41.0 by @tgaddair in #3494
- int: Pin `sqlalchemy` to `1.x.x` versions by @ksbrar in #3496
- Add parameter metadata for global_max_sequence_length by @arnavgarg1 in #3497
- Remove falcon from presets as it requires running untrusted code by @tgaddair in #3495
- Updates for ludwig-docs by @tgaddair in #3499
- Add use_pretrained attribute for AutoTransformers by @arnavgarg1 in #3498
- Readme updates for 0.8 by @tgaddair in #3500
- Improve description for generation config parameters by @arnavgarg1 in #3501
- Readme TYPO by @arnavgarg1 in #3502
- Make Ludwig logo smaller in the README by @abidwael in #3505
- Check that LLMs have exactly one text input feature by @geoffreyangus in #3508
- Fix temperature description by @arnavgarg1 in #3509
- Add Cosine Annealing LR scheduler as a decay method by @arnavgarg1 in #3507
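Cosine annealing decays the learning rate from its peak to a floor along half a cosine wave over the schedule length. The helper below illustrates the standard formula as a standalone sketch; it is not Ludwig's implementation:

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float, lr_min: float = 0.0) -> float:
    """Standard cosine annealing: lr_max at step 0, lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Starts at the peak, hits the midpoint value halfway, ends at the floor.
print(cosine_annealing_lr(0, 100, 1e-3))    # 0.001
print(cosine_annealing_lr(50, 100, 1e-3))   # 0.0005
print(cosine_annealing_lr(100, 100, 1e-3))  # 0.0
```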
- Fix typo in function name for LR schedulers by @arnavgarg1 in #3511
- Bump to v0.8 by @tgaddair in #3512
## New Contributors
- @RXminuS made their first contribution in #3321
- @aseemk98 made their first contribution in #3333
- @shawnccx made their first contribution in #3376
- @samster25 made their first contribution in #3380
- @neuhausler made their first contribution in #3436
- @chongxiaoc made their first contribution in #3461
- @Skizzy-create made their first contribution in #3471
- @Infernaught made their first contribution in #3467
**Full Changelog**: v0.7...v0.8