OneTrainer on clouds #599

Open · wants to merge 30 commits into master
Conversation

dxqbYD (Contributor) commented Dec 7, 2024

Train on Linux-based remote hosts seamlessly, with some extra functionality for RunPod clouds. Details are best described in the UI tooltips below.

(screenshot: cloud training settings UI with tooltips)

It is stable and fairly feature-complete now. More features and improvements have been requested by testers; you can find them listed here: https://github.com/dxqbYD/OneTrainer/issues
Let me know if you find any of them essential for the first merge, but in my view these are improvements, not requirements.

A good starting point to read the code is modules/trainer/CloudTrainer.py. Everything else follows from there.

| Setting | Tooltip |
| --- | --- |
| Enabled | Enable cloud training. |
| Type | Choose LINUX to connect to a Linux machine via SSH. Choose RUNPOD for additional functionality, such as automatically creating and deleting pods. |
| API key | Cloud service API key for RUNPOD. Leave empty for LINUX. |
| Huggingface token | If you provide your Huggingface token, a Huggingface repository path can then be set as the base model in OneTrainer. The model is then downloaded from Huggingface directly to the cloud, avoiding large uploads from your computer. |
| Hostname | SSH server hostname or IP. Leave empty if you have a Cloud ID or want to automatically create a new cloud. |
| Port | SSH server port. Leave empty if you have a Cloud ID or want to automatically create a new cloud. |
| Cloud ID | RUNPOD Cloud ID. The cloud service must have a public IP and an SSH service. Leave empty if you want to automatically create a new RUNPOD cloud, or if you're connecting to another cloud provider via SSH Hostname and Port. |
| Make tensorboard TCP tunnel | Instead of starting TensorBoard locally, make a TCP tunnel to a TensorBoard instance on the cloud. |
| Remote Directory | The directory on the cloud where files will be uploaded and downloaded. |
| OneTrainer Directory | The directory for OneTrainer on the cloud. |
| Huggingface cache Directory | Huggingface models are downloaded to this remote directory. |
| Install OneTrainer | Automatically install OneTrainer from GitHub if the directory doesn't already exist. |
| Install command | The command for installing OneTrainer. Leave the default unless you want to use a development branch of OneTrainer. |
| Detach remote trainer | Allows the trainer to keep running even if your connection to the cloud is lost. |
| Reattach run id | An id identifying the remotely running trainer. If you have lost the connection or closed OneTrainer, it will try to reattach to this id instead of starting a new remote trainer. |
| Download samples | Download samples from the remote workspace directory to your local machine. |
| Download output model | Download the final model after training. You can disable this if you plan to use an automatically saved checkpoint instead. |
| Download saved checkpoints | Download the automatically saved training checkpoints from the remote workspace directory to your local machine. |
| Download backups | Download backups from the remote workspace directory to your local machine. It's usually not necessary to download them, because as long as the backups are still available on the cloud, training can be restarted using one of the cloud's backups. |
| Download tensorboard logs | Download TensorBoard event logs from the remote workspace directory to your local machine. They can then be viewed locally in TensorBoard. It is recommended to disable […] |
| Delete remote workspace | Delete the workspace directory on the cloud after training has finished successfully and the data has been downloaded. |
| Create cloud | Automatically creates a new cloud instance if both Host:Port and Cloud ID are empty. Currently supported for RUNPOD. |
| Cloud name | The name of the new cloud instance. |
| Type | Select the RunPod cloud type. See RunPod's website for details. |
| GPU | Select the GPU type. Enter an API key before pressing the button. |
| Volume size | Set the storage volume size in GB. This volume persists only until the cloud is deleted; it is not a RunPod network volume. |
| Min download | Set the minimum download speed of the cloud in Mbps. |
| Action on finish | What to do when training finishes and the data has been fully downloaded. |
| Action on error | What to do if training stops due to an error. |
| Action on detached finish | What to do when training finishes, but the client has been detached and cannot download data. Data may be lost. |
| Action on detached error | What to do if training stops due to an error, but the client has been detached and cannot download data. Data may be lost. |
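For illustration, a saved cloud section reflecting the settings above might look like the following JSON fragment. The key names here are hypothetical and may not match the actual TrainConfig fields:

```json
{
    "cloud": {
        "enabled": true,
        "type": "RUNPOD",
        "host": "",
        "port": "",
        "remote_dir": "/workspace/remote",
        "onetrainer_dir": "/workspace/OneTrainer",
        "install_onetrainer": true,
        "detach_trainer": true,
        "download_samples": true,
        "download_output_model": true,
        "delete_workspace": false,
        "on_finish": "NONE",
        "on_error": "NONE"
    }
}
```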

dxqbYD and others added 24 commits October 13, 2024 17:23
 - close in/out/err streams on paramiko exec_command, to hopefully avoid random hangs
 - check for empty path on uploading concept caption files
 - OneTrainer install command added to UI
 - implemented min download speed parameter
runpod-python PR Nerogar#360 was merged and published. Remove the warning and code path for the previous version
add requirements
UI tooltip update
 - use config.cloud.run_id only in detached case, otherwise timestamped files: callbacks were mixed up when 2 training runs were running on 2 pods sharing a storage Volume
 - set_keepalive for command connection: command connection isn't used unless you press buttons in the UI. remote sometimes closes unused connections without keepalive
 - new option config.cloud.delete_workspace
 - rename config.cloud.workspace_dir to remote_dir, to avoid confusion between OT workspace dir and remote dir (which contains the workspace dir)
 - start Jupyter Lab by default
 - propagate exceptions from the sync thread to the main thread, to avoid taking actions (such as deleting the pod or deleting the workspace dir) in case training was successful but downloads failed
 - bugfix: don't reset stop command, to avoid executing "on finish" actions after a manual stop
 - UI label change
 - use TrainConfig.to_pack_dict instead of deepcopy
 - remove tensorboard dependencies from CloudTrainer - has been moved to BaseTrainer
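The stream-closing fix mentioned in the first commit bullet above can be sketched as follows. `run_remote` is a hypothetical helper written for illustration, not the actual OneTrainer code; it assumes an object exposing paramiko's `SSHClient.exec_command` interface:

```python
def run_remote(client, command):
    """Run a command over SSH and always close the three streams.

    exec_command returns (stdin, stdout, stderr) file-like objects.
    Leaving those channels open can cause occasional hangs, so they
    are closed in a finally block regardless of how the call ends.
    """
    stdin, stdout, stderr = client.exec_command(command)
    try:
        output = stdout.read().decode()
        errors = stderr.read().decode()
        status = stdout.channel.recv_exit_status()
        return status, output, errors
    finally:
        for stream in (stdin, stdout, stderr):
            try:
                stream.close()
            except Exception:
                pass  # closing is best-effort; the command already ran
```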
```python
host: str
port: int
user: str
huggingface_token: str
```
Owner:

Adding the huggingface_token to the config is a problem. Config files are regularly shared. If the token is included in that file, users can unknowingly expose it. That's also the reason I haven't added a setting for this yet. It needs to be saved in some different place. It would also be pretty useful to have this setting outside the cloud training environment.

I could imagine having a different user specific config file that's never shared or exported. But I haven't really put a lot of thought into it yet.

Contributor (author):

Same for the RunPod API key, then.

Contributor (author):

Asking AI for best practices for storing secrets in Python applications, it lists environment variables, configuration files, and this:
(screenshot of the third suggestion)

what do you think?
would therefore be shared among all configs, not linked to a specific config
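Of the options listed, the environment-variable approach could look roughly like this. The variable names below are made up for illustration and are not used by OneTrainer:

```python
import os

def get_secret(name, default=""):
    """Read a secret from the environment instead of the shared config.

    Secrets kept in environment variables never end up in exported
    config files, so sharing a config cannot leak them.
    """
    return os.environ.get(name, default)

# Hypothetical usage; these variable names are illustrative only.
runpod_api_key = get_secret("RUNPOD_API_KEY")
hf_token = get_secret("HF_TOKEN")
```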

Owner:

I think it makes sense to use the same structure in the Python code (derive a new class from BaseConfig, save that as a JSON file, use a UIState to display the values in the UI). The JSON file could be saved either in a new subdirectory or directly in the root directory. I don't really have a preference there.
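A minimal sketch of that suggestion, using a plain dataclass in place of OneTrainer's BaseConfig (the class name, field names, and file location here are hypothetical):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class SecretsConfig:
    """User-specific secrets, saved in a separate file that is never
    shared or exported alongside regular training configs."""
    api_key: str = ""
    huggingface_token: str = ""

    def save(self, path):
        # Write the secrets as pretty-printed JSON.
        Path(path).write_text(json.dumps(asdict(self), indent=4))

    @classmethod
    def load(cls, path):
        path = Path(path)
        if not path.exists():
            return cls()  # fresh defaults on first run
        return cls(**json.loads(path.read_text()))
```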

dxqbYD (Contributor, author) commented Dec 7, 2024:

slightly different proposal here:
dxqbYD@08b5f64

By keeping the secrets within the config hierarchy, you avoid having to pass around two different config objects wherever you might need a secret.

secrets.json:

```json
{
    "__version": 6,
    "cloud": {
        "__version": 0,
        "api_key": "",
        "id": "",
        "huggingface_token": "",
        "host": "",
        "port": "0",
        "user": "root"
    }
}
```

#12
Use docker image with pre-installed OneTrainer
new option "Update OneTrainer"
new option "Jupyter password"
better exception handling if one of the initial SSH connections fails
better exception handling if the pod wasn't created successfully, but then is supposed to be stopped
#2
native SCP support
parallel uploads for datasets (SCP and Fabric SFTP)
simplify handling of SSH connection failures
show status link during pod startup
update UI with pod id, IP, etc. even in case of failures