MATLAB Parallel Server on Azure (Linux VM)

Step 1. Deploy the Template

Click the Deploy to Azure button below to deploy the cloud resources on Azure®. This opens the Azure Portal in your web browser.

Create Virtual Network: Use this option to deploy the resources in a new virtual network.

Use Existing Virtual Network: Use this option to deploy the resources in an existing virtual network.

Cluster Platform: Ubuntu Server 22.04 LTS

MATLAB® Release: R2024a

Step 2. Configure the Cloud Resources

Clicking the Deploy to Azure button opens the "Custom deployment" page in your browser. You can configure the parameters on this page. It is easier to complete the steps if you position these instructions and the Azure Portal window side by side. Create a new resource group by clicking Create New. Alternatively, you can select an existing resource group, but this can cause conflicts if resources are already deployed in it.

  1. Specify and check the defaults for these resource parameters:
Parameter label | Description
Cluster Name | Name to use for this cluster. This name is shown in MATLAB as the cluster profile name.
Num Worker Nodes | Number of Azure instances to start for the workers to run on.
Min Worker Nodes | Minimum number of Azure instances running at all times.
Max Worker Nodes | Maximum number of Azure instances running at any time.
Num Workers Per Node | Number of MATLAB workers to start on each instance. Specify 1 worker for every 2 vCPUs, because this results in 1 worker per physical core. For example, a Standard_D64s_v3 virtual machine has 64 vCPUs, so it can support 32 MATLAB workers. See https://learn.microsoft.com/azure/virtual-machines/sizes for details on vCPUs for each VM size.
Headnode VM Size | Azure VM size to use for the headnode, which runs the job manager. No workers are started on this node, so this can be a smaller instance type than the worker nodes. By default, the heap memory for the job manager is set between 1024 MiB and a maximum of half of the instance memory, depending on the total number of MATLAB workers. See https://learn.microsoft.com/azure/virtual-machines/sizes for a list of instance types.
Worker VM Size | Azure VM size to use for the workers. By default, the heap memory for all worker processes is set between 1024 MiB and a maximum of a quarter of the instance memory, depending on the number of MATLAB workers on the VM. See https://learn.microsoft.com/azure/virtual-machines/sizes for a list of instance types.
Use Spot Instances For Worker Nodes | Option indicating whether to enable Azure Spot Virtual Machines for worker nodes. For more information, refer to the FAQ section in the deployment README.
Database Volume Size | Size in GB of the volume to use for the database file. All job and task information, including input and output data, is stored on this volume, so it should have enough capacity to store the expected amount of data. If set to 0, no volume is created and the root volume of the instance is used for the database.
Shared Persisted Storage | Size in GB of the persisted storage to mount across the cluster. The storage has a default and minimum capacity of 100 GB and can expand to 100 TB. If set to 0, no persisted storage is created.
Client IP Address | IP address range that can be used to access the cluster from MATLAB. This must be a valid IP CIDR range of the form x.x.x.x/x. Use the value <your_public_ip_address>/32 to restrict access to only your computer.
Admin Username | Admin username for the cluster. To avoid deployment errors, check the list of disallowed values for adminUsername.
Admin Password | Password for the admin user of the cluster. This password and the chosen admin username are required to log in to any instance in the cluster using SSH. For the deployment to succeed, your password must meet Azure's password requirements. See Password requirements when creating a VM for details.
Virtual Network Resource ID | Resource ID of an existing virtual network to deploy your cluster into. You can find this under the Properties of your virtual network. Specify this parameter only when deploying with the Existing Virtual Network option.
Subnet Name | Name of an existing subnet within your virtual network to deploy your cluster into. Specify this parameter only when deploying with the Existing Virtual Network option.
License Server | License Manager for MATLAB in the form port@hostname. If not specified, online licensing is used. Otherwise, the license manager must be accessible from the specified virtual network and subnet(s).
MJS Security Level | Security level for the cluster. Level 0: Any user can access any jobs and tasks. Level 1: Accessing other users' jobs and tasks issues a warning. However, all users can still perform all actions. Level 2: Users must enter a password to access their jobs and tasks. The job owner can grant access to other users.
Enable Autoscaling | Flag indicating whether instance autoscaling is enabled. For more information about autoscaling, see https://github.com/mathworks-ref-arch/matlab-parallel-server-on-azure-lin#use-autoscaling.
Automatically Terminate Cluster | Option to auto-terminate the cluster after a few hours or when idle. When the cluster is terminated, all worker nodes are deleted and the headnode is deallocated. Select 'Never' to disable auto-termination for now; you can enable it later. Select 'Disable auto-termination' to fully disable this feature or if you do not have the permissions to create and assign roles in your subscription. For more information, see Automatically terminate the MATLAB Parallel Server cluster.
MJS Scheduling Algorithm | Scheduling algorithm for the job manager. 'standard' spreads communicating jobs across as few worker machines as possible to reduce communication overheads and fills in unused spaces on worker machines with independent jobs. This gives good behaviour for a wide range of uses, including autoscaling. 'loadBalancing' distributes load evenly across the cluster to give as many resources as possible to running jobs and tasks when the cluster is underutilized.
Optional User Command | Optional inline shell command to run on machine launch. For example, to set an environment variable CLOUD=AZURE, use this command excluding the angle brackets: <echo -e "export CLOUD=AZURE" | sudo tee -a /etc/profile.d/setenvvar.sh && source /etc/profile>. To run an external script, use this command excluding the angle brackets: <wget -O /tmp/my-script.sh "https://example.com/script.sh" && bash /tmp/my-script.sh>. Find the logs at '/var/log/mathworks.log'.
Cluster Log Level | Log level that controls the amount of detail in the generated logs, ranging from 'Off' (no logging aside from essential system messages) to 'Highest' (full debug mode). To diagnose any cluster issues with support engineers, increase the log level. Log levels above 'Medium' can reduce performance.
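The sizing rule for Num Workers Per Node (1 worker for every 2 vCPUs) can be sketched as a small calculation. This is an illustrative helper only; the function names are hypothetical and are not part of the template, and rounding up to at least one worker on very small VM sizes is an assumption.

```python
def workers_per_node(vcpus: int) -> int:
    """Suggested Num Workers Per Node for a VM with `vcpus` vCPUs.

    Applies the rule of thumb from the parameter table:
    1 MATLAB worker for every 2 vCPUs.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be a positive integer")
    # Assumption: never suggest fewer than one worker per node.
    return max(1, vcpus // 2)


def total_workers(num_worker_nodes: int, vcpus_per_node: int) -> int:
    """Total MATLAB workers across `num_worker_nodes` identical VMs."""
    if num_worker_nodes < 0:
        raise ValueError("num_worker_nodes must not be negative")
    return num_worker_nodes * workers_per_node(vcpus_per_node)


# Standard_D64s_v3 has 64 vCPUs, as in the table's example.
print(workers_per_node(64))  # 32
```

Look up the vCPU count for your chosen VM size in the Azure sizes documentation before applying the calculation.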

NOTE: If you are using a network license manager, its port and hostname must be reachable from the MATLAB Parallel Server™ Virtual Machines (VMs). It is therefore recommended that you deploy into a subnet within the same virtual network as the network license manager.
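Before deploying, you might want to confirm that the License Server value is well formed and that the port is reachable. The following is a minimal sketch that assumes you run it from a machine in the same virtual network as the cluster; the function names and the example hostname are hypothetical.

```python
import socket
from typing import Tuple


def parse_license_server(value: str) -> Tuple[int, str]:
    """Split a License Server value of the form port@hostname."""
    port_text, sep, host = value.partition("@")
    if not sep or not host:
        raise ValueError("expected the form port@hostname")
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("port must be a number")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return port, host


def is_reachable(port: int, host: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical example value; replace with your own license manager.
port, host = parse_license_server("27000@licenses.example.com")
print(port, host)
```

A successful TCP connection only shows that the port is open from where the script runs. It does not confirm that the license manager serves a valid MATLAB Parallel Server license.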

  2. Click the Review + create button to review the Azure Marketplace terms and conditions.

  3. Click the Create button.

When you click the Create button, the resources are created using Azure template deployments. Template deployment can take several minutes.

Step 3. Connect to Your Cluster From MATLAB

  1. After clicking Create, you are taken to the Deployment Details page, where you can monitor the progress of your deployment. Wait for the message Your deployment is complete.

  2. Go to your resource group, and select the Storage Account ending with storage. The screen should look like the one in Figure 1.

    Figure 1: Resource Group On Completion

  3. Under Data Storage on the left panel, click File shares and select the file share named "shared".

  4. Click Browse on the left panel, then open the "cluster" folder.

  5. Download the file, <NAME>.mlsettings, where NAME is the name of your MATLAB Job Scheduler.

  6. Open MATLAB.

  7. In the Parallel drop-down menu in the MATLAB toolstrip, select Create and Manage Clusters.

  8. Click Import.

  9. Select the downloaded profile and click Open.

  10. Click Set as Default.

  11. (Optional) Validate your cluster by clicking the Validate button.

After setting the cloud cluster as default, the next time you run a parallel language command (such as parfor, spmd, parfeval or batch), MATLAB connects to the cluster. The first time you connect, you are prompted for your MathWorks® account login. The first time you run a task on a worker, it takes several minutes for the worker MATLAB to start. This delay is due to provisioning the instance disk. This is a one-time operation, and subsequent tasks begin much faster.

Your cluster is now ready to use.

NOTE: Use the profile and client IP address range to control access to your cloud resources. Anyone who has the profile file can connect to your resources from a machine within the specified IP address range and run jobs on them.
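To see how the Client IP Address range limits access, the check below uses Python's standard ipaddress module. It is a sketch for reasoning about CIDR ranges, not the mechanism Azure uses to enforce them; the function names are hypothetical and the addresses come from documentation-reserved ranges.

```python
import ipaddress


def client_cidr_for(ip: str) -> str:
    """Return the /32 CIDR range that admits only the given IPv4 address."""
    addr = ipaddress.IPv4Address(ip)
    return f"{addr}/32"


def is_allowed(ip: str, cidr: str) -> bool:
    """Return True if `ip` falls inside the Client IP Address range `cidr`."""
    # strict=False accepts ranges whose host bits are set, e.g. 203.0.113.7/24.
    network = ipaddress.IPv4Network(cidr, strict=False)
    return ipaddress.IPv4Address(ip) in network


print(client_cidr_for("203.0.113.7"))               # 203.0.113.7/32
print(is_allowed("203.0.113.7", "203.0.113.0/24"))  # True
print(is_allowed("198.51.100.1", "203.0.113.7/32")) # False
```

A narrower range such as a /32 means fewer machines can use a leaked profile file.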

Your cluster remains running after you close MATLAB. To delete your cluster, follow the instructions in Delete Your Cloud Resources.

Delete Your Cloud Resources

You can remove the resource group and all associated resources when you are done with them. Note that you cannot recover resources once they are deleted. After you delete the cloud resources, you can no longer use the downloaded profile or access any data stored in the persisted file share.

  1. Log in to the Azure Portal.

  2. Select the resource group containing your resources.

  3. Select the "Delete resource group" icon to destroy all resources deployed in this group.

  4. You are prompted to enter the name of the resource group to confirm the deletion.

Additional Information

Requirements to access MATLAB Parallel Server

To access a MATLAB Parallel Server cluster from your client MATLAB, your client machine must be able to communicate on specific ports. Make sure that the network firewall allows the following outgoing connections:

Required ports | Description
TCP 27350 to 27358 + 4*N | Ports 27350 to 27358 + 4*N, where N is the maximum number of workers on a single node
TCP 443 | HTTPS access to (at least) *.mathworks.com and *.microsoft.com
TCP 22 | SSH access to the cluster nodes

Table 1: Outgoing port requirements
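The first row of Table 1 depends on N, so the upper bound changes with the Num Workers Per Node setting. The sketch below works out the range for a given N; the function name is hypothetical.

```python
def mjs_port_range(max_workers_per_node: int) -> range:
    """TCP ports from Table 1: 27350 to 27358 + 4*N, inclusive."""
    if max_workers_per_node < 0:
        raise ValueError("max_workers_per_node must not be negative")
    first = 27350
    last = 27358 + 4 * max_workers_per_node
    return range(first, last + 1)


ports = mjs_port_range(32)
print(ports[0], ports[-1])  # 27350 27486
```

For 32 workers per node, the firewall therefore needs to allow outgoing TCP connections on ports 27350 to 27486.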

By default, MATLAB Parallel Server is configured with the public hostname of each machine to allow the MATLAB client to access both the scheduler and workers. If you modify the Azure Resource Manager (ARM) template provided, ensure that a public hostname is provided for the headnode and the worker nodes.

Use Autoscaling

To optimize the number of Virtual Machines running MATLAB workers, enable autoscaling by setting Enable Autoscaling to Yes when you deploy the template. Autoscaling is optional and is disabled by default.

When autoscaling is disabled, the Virtual Machine Scale Set (VMSS) deploys Num Worker Nodes instances. To change the number of worker nodes, use the Azure Portal.

If you enable autoscaling, the capacity of the VMSS is regulated by the number of workers needed by the cluster. The number of Virtual Machines is initially set at Num Worker Nodes. It then fluctuates between Min and Max Worker Nodes. To change these limits after the deployment, use the Azure Portal and modify the tags on the VMSS. To change the amount of time for which idle nodes are preserved, adjust the value of the tag mwWorkerIdleTimeoutMinutes. Do not use the "manual scale" option to change the instance count, as this can lead to the unsafe termination of MATLAB workers.

Ensure that the Max Worker Nodes parameter is within your Azure subscription quotas for the specific instance type. To learn about setting quotas, see Azure subscription and service limits, quotas, and constraints.
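As a rough check against your quota, you can estimate the peak number of worker vCPUs the scale set could request. This sketch assumes that the relevant limit is a vCPU quota for the worker VM family and that all workers use the same VM size; the function names and the quota figure are hypothetical, so read the actual values from your subscription.

```python
def peak_worker_vcpus(max_worker_nodes: int, vcpus_per_worker_vm: int) -> int:
    """vCPUs in use when the scale set reaches Max Worker Nodes."""
    if max_worker_nodes < 0 or vcpus_per_worker_vm < 1:
        raise ValueError("invalid node count or vCPU count")
    return max_worker_nodes * vcpus_per_worker_vm


def fits_quota(max_worker_nodes: int, vcpus_per_worker_vm: int,
               available_vcpus: int) -> bool:
    """Return True if the peak worker vCPU count fits in the available quota."""
    return peak_worker_vcpus(max_worker_nodes, vcpus_per_worker_vm) <= available_vcpus


# Example: 8 worker nodes of a 64-vCPU size against a hypothetical 350-vCPU quota.
print(peak_worker_vcpus(8, 64))  # 512
print(fits_quota(8, 64, 350))    # False
```

The headnode and any other VMs in the subscription also count toward quotas, so leave headroom beyond the worker total.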

To disable autoscaling after the deployment, redeploy the template with autoscaling disabled.

Automatically terminate the MATLAB Parallel Server cluster

Use the Automatically Terminate Cluster parameter while deploying the resource group to manage costs efficiently. You can choose one of these options:

  1. Never (default): No auto-termination but can be enabled after deployment.
  2. When cluster is idle: Terminates the cluster when it is idle for about 10 minutes (30 minutes at startup).
  3. After x hours: Terminates the cluster after 'x' hours (where x is between 1 and 24).
  4. Disable auto-termination: No auto-termination. Use this option to fully disable this feature or if you do not have the permissions to create and assign roles in your subscription.

When the cluster is auto-terminated, the headnode virtual machine is deallocated and all worker virtual machines are deleted. To use the cluster again, restart the headnode.

To modify the termination policy after deploying the resource group, edit the value of the tag mw-autoshutdown that is attached to the headnode. Set the value of the tag to either never, on_idle or After x hours, where x must be an integer between 1 and 24.
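If you script changes to the mw-autoshutdown tag, it can help to validate the value first. The check below is a sketch based only on the three forms listed above; it assumes the value is case-sensitive and that the hours form is always written as "After x hours", neither of which is confirmed here. The function name is hypothetical.

```python
import re

# Matches the "After x hours" form and captures x.
_AFTER_HOURS = re.compile(r"After (\d+) hours")


def is_valid_autoshutdown(value: str) -> bool:
    """Return True if `value` is one of: never, on_idle, After x hours (1 <= x <= 24)."""
    if value in ("never", "on_idle"):
        return True
    match = _AFTER_HOURS.fullmatch(value)
    if match is None:
        return False
    return 1 <= int(match.group(1)) <= 24


print(is_valid_autoshutdown("After 4 hours"))   # True
print(is_valid_autoshutdown("After 25 hours"))  # False
```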

Cluster File System and Storage

Cluster Shared Storage:

  • Persisted Storage: The pre-mounted shared storage available at /shared/persisted provides a robust and scalable data storage solution. This disk space is shared among worker machines and is retained between cluster runs. Use this persisted storage to save data you want to retrieve on the next start of the cluster. The storage has a default capacity of 100 GB and can expand to 100 TB. To learn about throughput-related performance features, see https://learn.microsoft.com/en-us/azure/storage/files/understanding-billing#provisioning-method.

  • Temporary Storage: The cluster also has temporary storage, mounted at /shared/tmp. This folder is shared among worker machines and is not retained between cluster runs. Use it to store temporary and intermediate data that must be visible or accessible from multiple worker machines.

Local Machine Storage:

  • Ephemeral Storage: This type of storage is available by default on all instances of the cluster. On each instance, it is mounted at /mnt. The available space depends on the ephemeral storage provided by the selected machine instance type.

MATLAB Job Scheduler Configuration

By default, MATLAB Job Scheduler (MJS) is configured to manage a wide range of cluster uses.

To change the MJS configuration for advanced use cases, replace the default mjs_def with your own file using the template parameter OptionalUserCommand. This overwrites all MJS startup parameters, except for DEFAULT_JOB_MANAGER_NAME, HOSTNAME, and SHARED_SECRET_FILE. To learn more about the MJS startup parameters and to edit them, see Define MATLAB Job Scheduler Startup Parameters. For example, to retrieve and use your edited mjs_def from a storage service (e.g. Azure Blob Storage), set the OptionalUserCommand to the following:

wget --output-document=${MJS_DEF_FILE} https://<your_storage_account>.blob.core.windows.net/<container_name>/mjs_def.sh

Troubleshooting

If your resource group fails to deploy, check the Deployments section of the resource group. This section indicates which resource deployments failed and allows you to navigate to the error message.

If the resource group deployed successfully but you are unable to validate the cluster, check the logs on the instances to diagnose the error. The logs are output to /var/log on the instance nodes; the files of interest are cloud-init.log, cloud-init-output.log, mathworks.log and all the logs under /var/log/mjs.
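When logged in to a node over SSH, a short script can gather the log files named above in one list. This is a sketch only: the function name is hypothetical, it assumes Python 3 is available on the node, and reading some files under /var/log may require elevated permissions.

```python
from pathlib import Path
from typing import List

# Top-level log files named in the troubleshooting notes.
LOG_FILES = ("cloud-init.log", "cloud-init-output.log", "mathworks.log")


def logs_of_interest(log_root: str = "/var/log") -> List[Path]:
    """List the named log files that exist, plus every file under <log_root>/mjs."""
    root = Path(log_root)
    found = [root / name for name in LOG_FILES if (root / name).is_file()]
    mjs_dir = root / "mjs"
    if mjs_dir.is_dir():
        # Include files in subfolders of the mjs directory as well.
        found.extend(sorted(p for p in mjs_dir.rglob("*") if p.is_file()))
    return found
```

Calling logs_of_interest() with no arguments inspects /var/log; pass a different folder if you have copied the logs elsewhere for analysis.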


Copyright 2021 - 2024 The MathWorks, Inc.