Update cloud_quickstart.ipynb
simplifine-llm authored Aug 10, 2024
1 parent e63a094 commit 28362ea
Showing 1 changed file with 221 additions and 25 deletions.
246 changes: 221 additions & 25 deletions examples/cloud_quickstart.ipynb
@@ -4,9 +4,7 @@
"metadata": {
"colab": {
"provenance": [],
"machine_shape": "hm",
"authorship_tag": "ABX9TyMbgzp4WjwBkEWl8qaCpYHi",
"include_colab_link": true
"machine_shape": "hm"
},
"kernelspec": {
"name": "python3",
@@ -5495,17 +5493,23 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/simplifine-llm/Simplifine/blob/main/examples/fake_url_detection.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
"### 📦 Installing Required Libraries\n",
"\n",
"Before we begin fine-tuning our fake news detector, we need to install the necessary libraries. In this step, we’re installing the `Simplifine` library, which provides tools to streamline the fine-tuning process for large language models. We’re also installing the `datasets` library, which allows us to easily access and manage datasets from Hugging Face.\n",
"\n",
"- The `Simplifine` library helps in making the fine-tuning process more efficient, whether you're working locally or in the cloud.\n",
"- The `datasets` library is essential for loading and processing the dataset we'll be using for this project.\n",
"\n",
"Running this cell will install both libraries quietly in the background.\n"
],
"metadata": {
"id": "0SClYIzAQrpD"
}
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -5560,6 +5564,41 @@
"!pip install datasets -q"
]
},
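The diff shows only the last line of this install cell. A minimal sketch of the complete cell, assuming Simplifine is published on PyPI as `simplifine-alpha` (matching the import name used later in this notebook):

```python
# Install Simplifine (assumed PyPI name) and the Hugging Face datasets
# library; the -q flag keeps pip's output quiet.
!pip install simplifine-alpha -q
!pip install datasets -q
```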
{
"cell_type": "markdown",
"source": [
"### 🛠️ Setting Up for Local Training\n",
"\n",
"In this section, we’re preparing to fine-tune our fake news detector model using Google Colab’s resources. The steps below outline how to configure and initiate the training process.\n",
"\n",
"1. **Importing Libraries:**\n",
" - We import `train_engine` from the `Simplifine` library, which provides the necessary functions to handle the fine-tuning process.\n",
" - We also import `SFTConfig` from the `trl` library, which allows us to configure the supervised fine-tuning parameters.\n",
"\n",
"2. **Dataset Selection:**\n",
" - We define the dataset name as `'community-datasets/fake_news_english'`. This dataset contains examples of fake news articles that we will use to fine-tune our model.\n",
"\n",
"3. **Prompt Configuration:**\n",
" - We create a `sftPromptConfig` object to specify how the training data is formatted.\n",
" - The `template` parameter defines the input format, and the `response_template` specifies how the model should generate outputs.\n",
" - The `use_chat_template` flag is set to `True` to format the inputs in a conversational style, which can be effective for chat-based models.\n",
"\n",
"4. **Training Configuration:**\n",
" - We define the training settings using `SFTConfig`. This includes parameters like batch size, learning rate, and the number of epochs.\n",
" - We also enable `fp16` (16-bit floating-point) training for faster computation and set `gradient_checkpointing` to save memory during training.\n",
"\n",
"5. **Model Selection:**\n",
" - The model we’re fine-tuning is `'TinyLlama/TinyLlama-1.1B-Chat-v1.0'`. This is a smaller, efficient model suitable for demonstration purposes on Colab.\n",
"\n",
"6. **Training the Model:**\n",
" - Finally, we call `sft_train` to start the fine-tuning process. This step will take a while to complete, as we’re training the model from scratch without any optimizations like quantization or LoRA.\n",
"\n",
"Running this cell will fine-tune the model locally on Colab, using the configurations we’ve set up. This is ideal for quick experiments or when cloud resources are not available."
],
"metadata": {
"id": "C0dDwmg4Rb3N"
}
},
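The source of the training cell is largely elided by the diff. A hedged sketch of what it contains, based on the steps above — the module that hosts `sftPromptConfig`, the template strings, and the exact `sft_train` signature are assumptions, while the `SFTConfig` fields are standard TRL/TrainingArguments parameters:

```python
# A hedged sketch of the (elided) local-training cell.
from simplifine_alpha import train_engine
from trl import SFTConfig

dataset_name = 'community-datasets/fake_news_english'

# How each training example is rendered into a prompt/response pair.
prompt_config = train_engine.sftPromptConfig(      # assumed module path
    template='### Article:\n{text}\n',             # placeholder input format
    response_template='### Verdict:',              # placeholder response marker
    use_chat_template=True,                        # chat-style formatting
)

# Supervised fine-tuning settings described in the steps above.
sft_config = SFTConfig(
    output_dir='sf_local_output',
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,                     # 16-bit floats for faster compute
    gradient_checkpointing=True,   # recompute activations to save memory
)

model_name = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'

# Full fine-tune on the Colab GPU; no quantization or LoRA.
train_engine.sft_train(
    model_name=model_name,
    dataset_name=dataset_name,
    prompt_config=prompt_config,
    sft_config=sft_config,
)
```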
{
"cell_type": "code",
"source": [
@@ -5738,7 +5777,7 @@
"id": "uKH1cxpkxFAr",
"outputId": "bae79adb-9ed2-49f6-c618-efdb66923cc3"
},
"execution_count": 2,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -5987,14 +6026,49 @@
}
]
},
{
"cell_type": "markdown",
"source": [
"### ☁️ Training the Model on Cloud Servers\n",
"\n",
"In this section, we’re moving from local training to cloud-based training using Simplifine’s cloud infrastructure. This allows you to leverage powerful GPUs like the A100 for more intensive tasks, making it easier to handle larger models and datasets.\n",
"\n",
"1. **Importing the `train_utils` Module:**\n",
" - We start by importing the `train_utils` module from the `Simplifine` library. This module provides utilities to interact with Simplifine's cloud servers.\n",
"\n",
"2. **Model and API Configuration:**\n",
" - We select a different model for this cloud training: `'microsoft/Phi-3-mini-4k-instruct'`. This model is more powerful and well-suited for deployment on cloud GPUs.\n",
" - The `simplifine_api_key` is your unique key to access Simplifine’s cloud services. Ensure you have it ready.\n",
" - The `gpu_type` is set to `'a100'`, which specifies the type of GPU to be used in the cloud. The A100 is a high-performance GPU ideal for deep learning tasks.\n",
"\n",
" ### 🔑 Need an API Key?\n",
" If you don't have an API key yet, you can [**request one here for free**](https://www.simplifine.com/api-key-interest). The turnaround time is just 24 hours, so you'll be up and running in no time!\n",
"\n",
"3. **Client Initialization:**\n",
" - We create a `Client` object using the API key and GPU type. This client will handle the communication with Simplifine’s cloud infrastructure, managing the training job on your behalf.\n",
"\n",
"4. **Defining the Training Job:**\n",
" - The `job_name` is set to `'fake_news_english_phi3'`, which uniquely identifies this training task.\n",
" - We then call the `sft_train_cloud` method on our `client` object. This method sends the training job to the cloud, using the model and configurations we’ve defined earlier.\n",
"\n",
"5. **Cloud Training Setup:**\n",
" - We enable `use_zero=True` to utilize DeepSpeed's ZeRO optimization, allowing the model to scale effectively across multiple GPUs.\n",
" - We disable Distributed Data Parallel (DDP) for this job, which is appropriate when ZeRO is handling the distribution of data.\n",
"\n",
"Running this cell will initiate the training process on Simplifine’s cloud servers, allowing you to offload the heavy lifting to a powerful cloud infrastructure. This is ideal when working with larger models or when your local resources are insufficient.\n"
],
"metadata": {
"id": "oehMA7hwRky5"
}
},
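Only the first half of the submission cell survives in the diff. A hedged sketch of the full cell — `Client(...)` and `sft_train_cloud` are named in the notebook, but the keyword arguments of `sft_train_cloud` (including the DDP flag) are assumptions:

```python
# A hedged sketch of the full cloud-submission cell.
from simplifine_alpha import train_utils

model_name = 'microsoft/Phi-3-mini-4k-instruct'
simplifine_api_key = 'YOUR_SIMPLIFINE_API_KEY'   # request one for free (link above)
gpu_type = 'a100'

# The client handles all communication with Simplifine's cloud servers.
client = train_utils.Client(simplifine_api_key, gpu_type)

job_name = 'fake_news_english_phi3'

# Submit the job: ZeRO shards state across GPUs, so DDP is off.
client.sft_train_cloud(
    job_name=job_name,
    model_name=model_name,
    dataset_name=dataset_name,     # same dataset as the local run
    prompt_config=prompt_config,   # reusing the earlier configurations
    sft_config=sft_config,
    use_zero=True,
    use_ddp=False,                 # assumed flag name
)
```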
{
"cell_type": "code",
"source": [
"from simplifine_alpha import train_utils\n",
"\n",
"# change name to phi 3\n",
"model_name = 'microsoft/Phi-3-mini-4k-instruct'\n",
"simplifine_api_key = ''\n",
"simplifine_api_key = 'PUT YOUR OWN API KEY PROVIDED BY SIMPLIFINE'\n",
"gpu_type = 'a100'\n",
"client = train_utils.Client(simplifine_api_key, gpu_type)\n",
"\n",
@@ -6013,7 +6087,7 @@
},
"outputId": "d2510f4d-5246-4631-df37-8a741cf92240"
},
"execution_count": 2,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -6028,13 +6102,24 @@
{
"cell_type": "markdown",
"source": [
"You can check the status of your job. The status can be any of the following:\n",
"### 📝 Checking the Status of Your Training Jobs\n",
"\n",
"After submitting your training job to Simplifine’s cloud servers, it’s important to monitor its status to ensure everything is running smoothly. In this section, we’ll check the status of your most recent job.\n",
"\n",
"1. **Retrieving Job Status:**\n",
" - We call the `get_all_jobs` method on our `client` object. This method returns a list of all jobs associated with your API key, including their current statuses.\n",
"\n",
"2. **Displaying the Latest Job:**\n",
" - We loop through the latest job in the list and print its status. This gives you a quick overview of how your most recent training job is progressing.\n",
"\n",
"3. **Understanding Job Statuses:**\n",
" - Your job can have one of the following statuses:\n",
" - `pending`: The job has been submitted and is waiting to start.\n",
" - `in progress`: The job is currently running.\n",
" - `stopped`: The job was stopped before completion, either manually or due to an error.\n",
" - `completed`: The job has successfully finished.\n",
"\n",
"```\n",
"pending | in progress | stopped | completed\n",
"```\n",
"\n"
"Running this cell will display the status of your most recent job, helping you keep track of your training tasks on Simplifine’s cloud servers.\n"
],
"metadata": {
"id": "W88J_Ef7yaYG"
@@ -6054,7 +6139,7 @@
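The status-check cell itself is elided here. A hedged sketch of it — `get_all_jobs` is named in the text above, but the shape of its return value (a list of dicts with a `'status'` key) is an assumption:

```python
# Hedged sketch of the status-check cell.
status = client.get_all_jobs()

# Show only the most recent job's status:
# pending | in progress | stopped | completed
for job in status[-1:]:
    print(job['status'])
```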
"id": "l70vZyPV6_AC",
"outputId": "b32db3fe-e353-4105-e8b7-63a772d7ccde"
},
"execution_count": 3,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -6068,7 +6153,20 @@
{
"cell_type": "markdown",
"source": [
"To see how things are going, you can take a look at the logs, to see if there were any errors or what not."
"### 📊 Retrieving and Viewing Training Logs\n",
"\n",
"After checking the status of your training job, you might want to dive deeper into the details by viewing the training logs. These logs provide insights into the training process, including any issues or updates on the progress.\n",
"\n",
"1. **Getting the `job_id`:**\n",
" - We start by extracting the `job_id` of the last job from the status list. The `job_id` is a unique identifier for each training job, which we’ll use to retrieve its logs.\n",
"\n",
"2. **Retrieving Logs:**\n",
" - We call the `get_train_logs` method on our `client` object, passing in the `job_id`. This method fetches the detailed logs for the specified job, giving you access to the complete training history.\n",
"\n",
"3. **Viewing the Logs:**\n",
" - Finally, we print the `response` from the logs, which contains detailed information about the training process. This includes updates, errors, and any other relevant messages from the training run.\n",
"\n",
"Running this cell will display the logs for your most recent job, allowing you to monitor and troubleshoot the training process effectively.\n"
],
"metadata": {
"id": "BDe93gbayl_n"
@@ -6090,7 +6188,7 @@
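The log-retrieval cell is elided as well. A hedged sketch following the steps above — `get_train_logs` is named in the text, while the `'job_id'` key on the status entries is an assumption:

```python
# Hedged sketch of the log-retrieval cell.
job_id = status[-1]['job_id']            # unique id of the most recent job
response = client.get_train_logs(job_id)
print(response)                          # full training log for the job
```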
"id": "jt35FPNn8ADK",
"outputId": "1de668ed-718e-452d-eb85-0632d7652008"
},
"execution_count": 4,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -6552,6 +6650,27 @@
}
]
},
{
"cell_type": "markdown",
"source": [
"### 📂 Downloading and Saving the Trained Model\n",
"\n",
"Once your training job is completed, the next step is to download the trained model so you can use it locally or for further fine-tuning.\n",
"\n",
"1. **Creating a Directory for the Model:**\n",
" - We begin by creating a new folder called `sf_trained_model_zero_phi`. This folder will serve as the destination for the downloaded model files.\n",
"\n",
"2. **Downloading the Model:**\n",
" - We use the `download_model` method on our `client` object to download the trained model from the cloud. The `job_id` is passed to specify which model to download, and we extract the files to the newly created directory.\n",
" \n",
" - **Tip:** This process might take some time depending on the size of the model, so feel free to take a break or grab a coffee while you wait! ☕\n",
"\n",
"Running this cell will download your trained model and save it in the specified directory, making it ready for use in your next project or analysis.\n"
],
"metadata": {
"id": "koKpp2XNU-y1"
}
},
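A hedged sketch of the download cell described above — `download_model` is named in the text, but its exact signature (here the `extract_to` keyword) is an assumption:

```python
import os

# Hedged sketch of the model-download cell.
save_dir = 'sf_trained_model_zero_phi'
os.makedirs(save_dir, exist_ok=True)

# Pull the trained model files from the cloud and unpack them locally.
client.download_model(job_id, extract_to=save_dir)  # assumed keyword
```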
{
"cell_type": "code",
"source": [
@@ -6571,7 +6690,7 @@
},
"outputId": "b88812a9-8f64-464e-d4c5-d2ace8814f08"
},
"execution_count": 5,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -6593,6 +6712,31 @@
}
]
},
{
"cell_type": "markdown",
"source": [
"### 🔄 Loading the Trained Model and Tokenizer\n",
"\n",
"Now that we've successfully downloaded the trained model, the next step is to load it into our environment so we can use it for inference or further fine-tuning.\n",
"\n",
"1. **Importing Required Libraries:**\n",
" - We import `AutoModelForCausalLM` and `AutoTokenizer` from the `transformers` library. These classes are used to load the model and tokenizer from the saved files.\n",
"\n",
"2. **Setting the Path:**\n",
" - We set the `path` variable to point to the directory where we saved the trained model (`'/content/sf_trained_model_zero_phi'`).\n",
"\n",
"3. **Loading the Model:**\n",
" - We use `AutoModelForCausalLM.from_pretrained(path)` to load the trained model from the specified path. This initializes the model so it’s ready for use.\n",
"\n",
"4. **Loading the Tokenizer:**\n",
" - Similarly, we load the tokenizer using `AutoTokenizer.from_pretrained(path)`. The tokenizer is essential for processing text input into a format that the model can understand.\n",
"\n",
"Running this cell will load both the trained model and tokenizer into your environment, allowing you to start generating text or continue fine-tuning with your freshly trained model."
],
"metadata": {
"id": "mQ1fk9tJVJKy"
}
},
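The loading cell follows the standard `transformers` API exactly as described above; a minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory where the trained model was saved in the previous step.
path = '/content/sf_trained_model_zero_phi'

# Load the fine-tuned weights and the matching tokenizer.
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
```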
{
"cell_type": "code",
"source": [
@@ -6624,7 +6768,7 @@
},
"outputId": "9b14f28f-3376-45bc-82cb-c6b09a31aa6c"
},
"execution_count": 6,
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
@@ -6649,6 +6793,25 @@
}
]
},
{
"cell_type": "markdown",
"source": [
"### 📚 Loading the Dataset\n",
"\n",
"Before we can use our trained model for inference or further fine-tuning, we need to load the dataset that we’ve been working with.\n",
"\n",
"1. **Importing the Datasets Library:**\n",
" - We start by importing the `datasets` library, which provides easy access to a wide range of datasets, including the one we've been using for training.\n",
"\n",
"2. **Loading the Dataset:**\n",
" - We load the dataset using the `load_dataset` function from the `datasets` library. The `dataset_name` variable contains the name of the dataset we specified earlier in our code.\n",
"\n",
"Running this cell will load the dataset into your environment, making it ready for evaluation, inference,"
],
"metadata": {
"id": "UZ-1si0bVOMC"
}
},
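A minimal sketch of the dataset-loading cell, using the standard `datasets` API and the dataset name given earlier:

```python
from datasets import load_dataset

# The same dataset used for fine-tuning earlier in the notebook.
dataset_name = 'community-datasets/fake_news_english'
dataset = load_dataset(dataset_name)
```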
{
"cell_type": "code",
"source": [
@@ -6698,7 +6861,7 @@
"id": "Orm2RTPh1s-s",
"outputId": "34794037-e2bb-4e64-cf52-445e61a7aaf6"
},
"execution_count": 8,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -6756,6 +6919,39 @@
}
]
},
{
"cell_type": "markdown",
"source": [
"### 🧠 Generating Text with the Trained Model\n",
"\n",
"Now that we've loaded both the model and the dataset, it’s time to generate some text using our trained model. In this section, we’ll configure the generation settings and produce some sample outputs.\n",
"\n",
"1. **Importing Inference Tools:**\n",
" - We import `inference_tools` from the `simplifine_alpha` library. This module provides the necessary tools to generate text using the model we’ve fine-tuned.\n",
"\n",
"2. **Configuring Text Generation:**\n",
" - We create a `GenerationConfig` object to define how the model should generate text. This configuration includes:\n",
" - `prompt_template` and `response_template`: Templates for how the inputs and outputs are formatted.\n",
" - `keys`: Specifies the data keys used in the templates.\n",
" - `train_type`: Indicates that we're using supervised fine-tuning (`sft`).\n",
" - `max_length`: The maximum length of the generated sequences.\n",
" - `num_return_sequences`: How many sequences to generate.\n",
" - `do_sample`, `top_k`, `top_p`, `temperature`: Parameters that control the randomness and diversity of the generated text.\n",
"\n",
"3. **Generating Text:**\n",
" - We call `generate_from_pretrained` using our fine-tuned model, tokenizer, and the generation configuration. We also pass in a small sample of the dataset to generate text based on the training data.\n",
" \n",
" - **Note:** We’re using only the first three examples from the training dataset (`dataset['train'][:3]`) for quick testing.\n",
"\n",
"4. **Displaying the Generated Text:**\n",
" - Finally, we print the generated text, which provides a glimpse into how well the model has learned to detect fake news.\n",
"\n",
"Running this cell will generate text using your trained model, showcasing its ability to produce outputs based on the fine-tuned dataset. This is where you can see the real impact of your training efforts!"
],
"metadata": {
"id": "tHGpRwU6VVav"
}
},
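The generation cell is elided by the diff. A hedged sketch following the description above — `inference_tools`, `GenerationConfig`, and `generate_from_pretrained` are named in the text, but the exact field names, their values, and the call signature are assumptions:

```python
from simplifine_alpha import inference_tools

# Hedged sketch of the generation cell.
gen_config = inference_tools.GenerationConfig(
    prompt_template='...',       # input format used during training (placeholder)
    response_template='...',     # marks where the completion starts (placeholder)
    keys=['text'],               # data keys referenced by the templates (assumed)
    train_type='sft',            # supervised fine-tuning
    max_length=256,              # cap on generated sequence length
    num_return_sequences=1,
    do_sample=True,              # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)

# Generate from the first three training examples for a quick sanity check.
outputs = inference_tools.generate_from_pretrained(
    model, tokenizer, gen_config, dataset['train'][:3],
)
print(outputs)
```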
{
"cell_type": "code",
"source": [
@@ -6784,7 +6980,7 @@
"id": "8KWnTV9w1OMQ",
"outputId": "e78d14ca-9b91-4412-8d16-ece24b3ffe7d"
},
"execution_count": 11,
"execution_count": null,
"outputs": [
{
"output_type": "stream",