OneTrainer on clouds #599

Open · wants to merge 30 commits into master
Conversation

dxqbYD (Contributor) commented Dec 7, 2024

Train on Linux-based remote hosts seamlessly, with some extra functionality for RunPod clouds. Details are best described in the UI tooltips below.

(screenshot: cloud training settings UI with tooltips)

It is stable and fairly feature-complete now. More features and improvements have been requested by testers; you can find them listed here: https://github.com/dxqbYD/OneTrainer/issues
Let me know if you find any of them essential for the first merge, but in my view these are improvements, not requirements.

A good starting point to read the code is modules/trainer/CloudTrainer.py. Everything else follows from there.

| Setting | Tooltip |
| --- | --- |
| Enabled | Enable cloud training. |
| Type | Choose LINUX to connect to a Linux machine via SSH. Choose RUNPOD for additional functionality, such as automatically creating and deleting pods. |
| API key | Cloud service API key for RUNPOD. Leave empty for LINUX. |
| Huggingface token | If you provide your Huggingface token, a Huggingface repository path can then be set as the base model in OneTrainer. The model is then downloaded from Huggingface directly to the cloud, avoiding large uploads from your computer. |
| Hostname | SSH server hostname or IP. Leave empty if you have a Cloud ID or want to automatically create a new cloud. |
| Port | SSH server port. Leave empty if you have a Cloud ID or want to automatically create a new cloud. |
| Cloud ID | RUNPOD Cloud ID. The cloud service must have a public IP and an SSH service. Leave empty if you want to automatically create a new RUNPOD cloud, or if you're connecting to another cloud provider via SSH Hostname and Port. |
| Make tensorboard TCP tunnel | Instead of starting TensorBoard locally, make a TCP tunnel to a TensorBoard instance on the cloud. |
| Remote Directory | The directory on the cloud where files will be uploaded and downloaded. |
| OneTrainer Directory | The directory for OneTrainer on the cloud. |
| Huggingface cache Directory | Huggingface models are downloaded to this remote directory. |
| Install OneTrainer | Automatically install OneTrainer from GitHub if the directory doesn't already exist. |
| Install command | The command for installing OneTrainer. Leave the default unless you want to use a development branch of OneTrainer. |
| Detach remote trainer | Allows the trainer to keep running even if your connection to the cloud is lost. |
| Reattach run id | An id identifying the remotely running trainer. If you have lost the connection or closed OneTrainer, it will try to reattach to this id instead of starting a new remote trainer. |
| Download samples | Download samples from the remote workspace directory to your local machine. |
| Download output model | Download the final model after training. You can disable this if you plan to use an automatically saved checkpoint instead. |
| Download saved checkpoints | Download the automatically saved training checkpoints from the remote workspace directory to your local machine. |
| Download backups | Download backups from the remote workspace directory to your local machine. It's usually not necessary to download them, because as long as the backups are still available on the cloud, training can be restarted using one of the cloud's backups. |
| Download tensorboard logs | Download TensorBoard event logs from the remote workspace directory to your local machine. They can then be viewed locally in TensorBoard. It is recommended to disable […] |
| Delete remote workspace | Delete the workspace directory on the cloud after training has finished successfully and the data has been downloaded. |
| Create cloud | Automatically creates a new cloud instance if both Host:Port and Cloud ID are empty. Currently supported for RUNPOD. |
| Cloud name | The name of the new cloud instance. |
| Type | Select the RunPod cloud type. See RunPod's website for details. |
| GPU | Select the GPU type. Enter an API key before pressing the button. |
| Volume size | Set the storage volume size in GB. This volume persists only until the cloud is deleted; it is not a RunPod network volume. |
| Min download | Set the minimum download speed of the cloud in Mbps. |
| Action on finish | What to do when training finishes and the data has been fully downloaded. |
| Action on error | What to do if training stops due to an error. |
| Action on detached finish | What to do when training finishes, but the client has been detached and cannot download data. Data may be lost. |
| Action on detached error | What to do if training stops due to an error, but the client has been detached and cannot download data. Data may be lost. |
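For illustration, a saved cloud section reflecting the settings above might look like the following JSON fragment. The key names here are hypothetical and may not match the actual TrainConfig fields:

```json
{
    "cloud": {
        "enabled": true,
        "type": "RUNPOD",
        "host": "",
        "port": "",
        "remote_dir": "/workspace/remote",
        "onetrainer_dir": "/workspace/OneTrainer",
        "install_onetrainer": true,
        "detach_trainer": true,
        "download_samples": true,
        "download_output_model": true,
        "delete_workspace": false,
        "on_finish": "NONE",
        "on_error": "NONE"
    }
}
```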

dxqbYD and others added 24 commits October 13, 2024 17:23
 - close in/out/err streams on paramiko exec_command, to hopefully avoid random hangs
 - check for empty path on uploading concept caption files
 - OneTrainer install command added to UI
 - implemented min download speed parameter
runpod-python PR Nerogar#360 was merged and published. Remove the warning and code path for the previous version
add requirements
UI tooltip update
 - use config.cloud.run_id only in detached case, otherwise timestamped files: callbacks were mixed up when 2 training runs were running on 2 pods sharing a storage Volume
 - set_keepalive for command connection: command connection isn't used unless you press buttons in the UI. remote sometimes closes unused connections without keepalive
 - new option config.cloud.delete_workspace
 - rename config.cloud.workspace_dir to remote_dir, to avoid confusion between OT workspace dir and remote dir (which contains the workspace dir)
 - start Jupyter Lab by default
 - propagate exceptions from the sync thread to the main thread, to avoid taking actions (such as deleting the pod or deleting the workspace dir) in case training was successful but downloads failed
 - bugfix: don't reset stop command, to avoid executing "on finish" actions after a manual stop
 - UI label change
 - use TrainConfig.to_pack_dict instead of deepcopy
 - remove tensorboard dependencies from CloudTrainer - has been moved to BaseTrainer
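The stream-closing fix mentioned in the first commit bullet above can be sketched as follows. `run_remote` is a hypothetical helper written for illustration, not the actual OneTrainer code; it assumes an object exposing paramiko's `SSHClient.exec_command` interface:

```python
def run_remote(client, command):
    """Run a command over SSH and always close the three streams.

    exec_command returns (stdin, stdout, stderr) file-like objects.
    Leaving those channels open can cause occasional hangs, so they
    are closed in a finally block regardless of how the call ends.
    """
    stdin, stdout, stderr = client.exec_command(command)
    try:
        output = stdout.read().decode()
        errors = stderr.read().decode()
        status = stdout.channel.recv_exit_status()
        return status, output, errors
    finally:
        for stream in (stdin, stdout, stderr):
            try:
                stream.close()
            except Exception:
                pass  # closing is best-effort; the command already ran
```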
```python
host: str
port: int
user: str
huggingface_token: str
```
Owner:

Adding the huggingface_token to the config is a problem. Config files are regularly shared. If the token is included in that file, users can unknowingly expose it. That's also the reason I haven't added a setting for this yet. It needs to be saved in some different place. It would also be pretty useful to have this setting outside the cloud training environment.

I could imagine having a different user specific config file that's never shared or exported. But I haven't really put a lot of thought into it yet.

Contributor (author):

Same for the RunPod API key, then.

Contributor (author):

Asking AI for best practices for storing secrets in Python applications, it lists environment variables, configuration files, and this:
(screenshot of the third suggestion)

what do you think?
would therefore be shared among all configs, not linked to a specific config
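Of the options listed, the environment-variable approach could look roughly like this. The variable names below are made up for illustration and are not used by OneTrainer:

```python
import os

def get_secret(name, default=""):
    """Read a secret from the environment instead of the shared config.

    Secrets kept in environment variables never end up in exported
    config files, so sharing a config cannot leak them.
    """
    return os.environ.get(name, default)

# Hypothetical usage; these variable names are illustrative only.
runpod_api_key = get_secret("RUNPOD_API_KEY")
hf_token = get_secret("HF_TOKEN")
```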

Owner:

I think it makes sense to use the same structure in the Python code (derive a new class from BaseConfig, save that as a JSON file, use a UIState to display the values in the UI). The JSON file could be saved either in a new subdirectory or directly in the root directory. I don't really have a preference there.
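A minimal sketch of that suggestion, using a plain dataclass in place of OneTrainer's BaseConfig (the class name, field names, and file location here are hypothetical):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class SecretsConfig:
    """User-specific secrets, saved in a separate file that is never
    shared or exported alongside regular training configs."""
    api_key: str = ""
    huggingface_token: str = ""

    def save(self, path):
        # Write the secrets as pretty-printed JSON.
        Path(path).write_text(json.dumps(asdict(self), indent=4))

    @classmethod
    def load(cls, path):
        path = Path(path)
        if not path.exists():
            return cls()  # fresh defaults on first run
        return cls(**json.loads(path.read_text()))
```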

dxqbYD (Contributor, author) commented Dec 7, 2024:

slightly different proposal here:
dxqbYD@08b5f64

By keeping the secrets within the config hierarchy, you avoid having to pass around two different config objects wherever you might need a secret.

secrets.json:

```json
{
    "__version": 6,
    "cloud": {
        "__version": 0,
        "api_key": "",
        "id": "",
        "huggingface_token": "",
        "host": "",
        "port": "0",
        "user": "root"
    }
}
```

#12
Use docker image with pre-installed OneTrainer
new option "Update OneTrainer"
new option "Jupyter password"
better exception handling if one of the initial SSH connections fails
better exception handling if the pod wasn't created successfully, but then is supposed to be stopped
#2
native SCP support
parallel uploads for datasets (SCP and Fabric SFTP)
simplify handling of SSH connection failures
show status link during pod startup
update UI with pod id, IP, etc. even in case of failures