Run a distributed model training workload using Ray and CodeFlare
This content is no longer being updated or maintained. The content is provided "as is." Given the rapid evolution of technology, some content, steps, or illustrations may have changed.
In today's rapid ML development lifecycle, the ability to process and analyze vast amounts of data quickly and efficiently is paramount. Developers often prototype lightweight workloads on a laptop and then struggle when those workloads need to scale out. Tackling large-scale computation with distributed systems usually comes with complexity.
Ray is an open source and unified framework for running distributed Python workloads. You can start by developing your Python code locally on your laptop, taking full advantage of its resources. But when your workload demands more computational power, Ray seamlessly extends your capabilities by leveraging a remote Ray cluster. Explore more about using Ray in the Getting Started doc.
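To get a feel for Ray's programming model, the sketch below turns an ordinary Python function into parallel tasks. It assumes only that Ray is installed locally; the function and values are illustrative, and the same code can later target a remote cluster.

# A minimal sketch of Ray's task model (illustrative function and values).
import ray

ray.init()  # starts a local Ray instance; point this at a remote cluster to scale out

@ray.remote
def square(x):
    # An ordinary Python function, run as a distributed task.
    return x * x

# Launch the tasks in parallel and gather the results.
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))  # [0, 1, 4, 9, ...]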
Built on top of Ray, CodeFlare is designed to simplify the integration, scaling, and acceleration of large-scale workloads on Red Hat OpenShift in the cloud. CodeFlare provides a user-friendly interface that abstracts away the complexities of infrastructure management, making cluster resources accessible to both novice and experienced data scientists.
In this tutorial, you'll learn how to create a Ray cluster using the CodeFlare SDK and interact with Ray within the CodeFlare Stack.
Prerequisites
A Python runtime environment managed with pyenv. See the instructions in the Pyenv repo.
Steps
Step 1: Run the training example locally.
Step 2: Create a Ray cluster using the CodeFlare SDK.
Step 3: Interact with the cluster.
Run the training example locally
Let's use the training script in the Ray documentation. The script is an example of distributed training using the Ray Train Python library.
# save as ray_torch_quickstart.py
import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# If using GPUs, set this to True.
use_gpu = False

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
Next, we need to verify that we can run this example locally. First, install the Ray packages using the following commands in a terminal.
mkdir example; cd example
pyenv install 3.8.13
pyenv virtualenv 3.8.13 localinter
pyenv local localinter
pip install "ray[default]" torch pandas pyarrow
# Assume the above script is saved as ray_torch_quickstart.py
python ray_torch_quickstart.py
The example automatically starts a local instance of Ray and runs the training job. When the job finishes, it should return a status message similar to the following:
Trial TorchTrainer_ec4d0_00000 completed.
== Status ==
Current time: 2023-06-20 17:02:45 (running for 00:00:11.98)
Using FIFO scheduling algorithm.
Logical resource usage: 4.0/8 CPUs, 0/0 GPUs
Result logdir: /Users/tedchang/ray_results/TorchTrainer_2023-06-20_17-02-33
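If you want to look at what the run produced, you can append a few lines to the end of the script to read the Result object returned by trainer.fit(). This is a sketch assuming the Ray 2.x AIR API used above (Checkpoint.from_dict / to_dict); the printed keys match what train_loop_per_worker saved.

# Optional: append to ray_torch_quickstart.py to inspect the run's output.
print(result.metrics)                    # metrics reported via session.report()
checkpoint_dict = result.checkpoint.to_dict()
print(checkpoint_dict["epoch"])          # last epoch saved in the checkpoint
model_state = checkpoint_dict["model"]   # the model state_dict saved above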
Create a Ray cluster using the CodeFlare SDK
If you do not already have a Ray cluster running in your CodeFlare environment, you can create one using the CodeFlare SDK (codeflare-sdk).
You must already be logged in to your OpenShift cluster with oc login.
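If you prefer to authenticate from Python rather than through the oc CLI, the codeflare-sdk also provides a token-based login. The sketch below uses placeholder token and server values; substitute your own.

# Sketch: authenticate to the OpenShift cluster from Python (placeholder values).
from codeflare_sdk.cluster.auth import TokenAuthentication

auth = TokenAuthentication(
    token="sha256~XXXX",                    # your OpenShift API token
    server="https://api.example.com:6443",  # your cluster's API server
    skip_tls=False,
)
auth.login()

With authentication in place, create the cluster: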
# This creates a Ray cluster with 1 head node and up to 2 worker nodes through CodeFlare.
pip install codeflare-sdk
python - << EOF
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
cluster = Cluster(ClusterConfiguration(namespace="default", name="torch", min_worker=1, max_worker=2, min_cpus=2, max_cpus=4, min_memory=4, max_memory=8, gpu=0, instascale=False))
cluster.up()
cluster.wait_ready()
cluster.details()
EOF
The output should include the remote Ray dashboard URL.
Alternatively, you can get the remote Ray dashboard URL from the CodeFlare cluster this way:
oc get route -n default | grep ray-dashboard-torch | awk '{print $2}'
# Take note of the Ray dashboard URL. It should look something like:
# ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
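Later, when you no longer need the cluster, you can tear it down from Python with the SDK's down() method. A minimal sketch, assuming the same name and namespace used above:

# Sketch: delete the Ray cluster created above when you are done with it.
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(namespace="default", name="torch"))
cluster.down()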
Interact with the remote cluster
The Ray Jobs CLI lets you interact with the remote Ray cluster. For example, you can submit the training script as a job to the Ray cluster via the Ray Dashboard URL.
export RAY_ADDRESS=http://`oc get route -n default|grep ray-dashboard-torch|awk '{print $2}'`
# The ray job command-line tool comes with the Ray Python package, which we installed earlier.
ray job submit --working-dir . --no-wait -- python ray_torch_quickstart.py
You can expect output similar to the following:
-------------------------------------------------------
Job 'raysubmit_ebUGqdNDVAcvFZ8U' submitted successfully
-------------------------------------------------------
Next steps
  Query the logs of the job:
    ray job logs raysubmit_3u1wFAuV2x4Lx2eT
  Query the status of the job:
    ray job status raysubmit_3u1wFAuV2x4Lx2eT
  Request the job to be stopped:
    ray job stop raysubmit_3u1wFAuV2x4Lx2eT
You can interact with Ray using the commands presented in the output. For example, ray job status raysubmit_3u1wFAuV2x4Lx2eT should eventually return something similar to the following:
Job submission server address: http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
------------------------------------------
Job 'raysubmit_3u1wFAuV2x4Lx2eT' succeeded
------------------------------------------
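If you would rather drive jobs from Python instead of the Ray Jobs CLI, Ray also ships a JobSubmissionClient in ray.job_submission. The sketch below reuses the example dashboard URL from above; the submission mirrors the ray job submit command.

# Sketch: submit and monitor the same job from Python instead of the Ray Jobs CLI.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com")
job_id = client.submit_job(
    entrypoint="python ray_torch_quickstart.py",
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))  # e.g. PENDING, RUNNING, SUCCEEDED
print(client.get_job_logs(job_id))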
This tutorial showed you how to run a distributed model training workload with the open source solutions Ray and CodeFlare. Model training is a key part of the machine learning lifecycle.
If you require a robust platform to train your ML workflows, you should check out watsonx.ai. Watsonx.ai brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune, and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data.