set_transform seems to process all the samples on the fly, not in batch_size chunks #6050
Replies: 1 comment 1 reply
- I logged the process: at the line `train_result = trainer.train(resume_from_checkpoint=checkpoint)`, `prepare_dataset_transform` is called, but the input batch size is 1, so it does not appear to be batched, and it seems to process all the samples.
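This matches how a format transform is applied: it runs lazily, at access time, on exactly the rows being fetched. When a PyTorch `DataLoader` pulls examples one index at a time, the transform therefore sees a "batch" of size 1 per call. A minimal pure-Python sketch (a stand-in for the behavior, not the actual `datasets` library) illustrates this:

```python
# Sketch of a lazily-applied format transform (mimicking set_transform).
# Nothing is processed when the transform is registered; it runs only
# when a row is accessed, and it receives just the accessed rows.

class LazyDataset:
    def __init__(self, rows):
        self.rows = rows          # list of dicts, one per example
        self.transform = None

    def set_transform(self, fn):
        # The transform is only stored here; no data is touched yet.
        self.transform = fn

    def __getitem__(self, idx):
        # Columns are materialized as a dict of lists (batch of size 1)
        # and the transform runs at access time.
        batch = {k: [self.rows[idx][k]] for k in self.rows[idx]}
        return self.transform(batch) if self.transform else batch

calls = []

def transform(batch):
    calls.append(len(batch["x"]))  # record the batch size seen per call
    return {"x2": [v * 2 for v in batch["x"]]}

ds = LazyDataset([{"x": 1}, {"x": 2}, {"x": 3}])
ds.set_transform(transform)
print(ds[0])   # {'x2': [2]} -- the transform ran only now, on one row
print(calls)   # [1] -- one call, batch size 1
```

So the transform is not processing the whole dataset up front; it is invoked once per fetched example, which is why the observed batch size is 1.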
- Two questions about set_transform, asking for help!
Basic info:
- datasets 2.13.1
- python 3.10
- torch 2.0.1
```python
def prepare_dataset_transform(batch):
    # process audio: batch[audio_column_name] is a list of audio dicts
    sample = batch[audio_column_name]
    array_input = [audio["array"] for audio in sample]
    inputs = feature_extractor(
        array_input,
        sampling_rate=sample[0]["sampling_rate"],
        return_attention_mask=forward_attention_mask,
    )
    return inputs
```
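For context on the data shapes this transform handles: `audio_column_name`, `feature_extractor`, and `forward_attention_mask` come from the surrounding training script and are not defined in the snippet, so the sketch below uses stand-ins for them. It shows the dict-of-lists batch format the transform receives, where each audio entry is a dict with `"array"` and `"sampling_rate"` keys:

```python
# Self-contained sketch with stand-ins for names defined elsewhere in
# the original script (audio_column_name, feature_extractor,
# forward_attention_mask are assumptions here).
audio_column_name = "audio"
forward_attention_mask = True

def feature_extractor(arrays, sampling_rate, return_attention_mask):
    # Stand-in: a real extractor would pad/normalize the arrays;
    # this one just echoes its inputs.
    return {"input_values": arrays, "sampling_rate": sampling_rate}

def prepare_dataset_transform(batch):
    sample = batch[audio_column_name]            # list of audio dicts
    array_input = [audio["array"] for audio in sample]
    return feature_extractor(
        array_input,
        sampling_rate=sample[0]["sampling_rate"],
        return_attention_mask=forward_attention_mask,
    )

# A batch is a dict of lists; with on-access transforms the lists often
# hold a single example.
batch = {"audio": [{"array": [0.1, 0.2], "sampling_rate": 16000},
                   {"array": [0.3], "sampling_rate": 16000}]}
out = prepare_dataset_transform(batch)
print(out["sampling_rate"])   # 16000
```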