
AI workloads are growing more demanding in both compute and data, and technologies such as Kubernetes and PyTorch are instrumental in building production-ready AI systems to serve them. Anyscale's Robert Nishihara recently presented at KubeCon + CloudNativeCon North America 2025 on how an AI compute stack composed of Kubernetes, PyTorch, vLLM, and Ray can support these advanced AI workloads.
Ray is an open-source framework for building and scaling machine learning and Python applications. It orchestrates the infrastructure for distributed workloads and originated at UC Berkeley as a research project focused on reinforcement learning. Ray recently joined the PyTorch Foundation to contribute to the wider open-source AI community.
Nishihara highlighted three primary areas driving the transformation of AI workloads: data processing, model training, and model deployment. Data processing must evolve to handle the new data formats that AI applications require, moving beyond conventional tabular data to multimodal datasets such as images, videos, audio, text, and sensor data. This shift is crucial for supporting the inference steps that sit at the core of AI-driven applications. Moreover, the hardware used for data storage and computation needs to support GPUs alongside regular CPUs. He pointed out that data processing has shifted from "SQL operations on CPUs" to "inferences on GPUs."
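To make that shift concrete, the sketch below shows GPU-based batch inference over an image dataset with Ray Data. It is a minimal illustration, not code from the talk: the bucket path, output path, and `ImageClassifier` class are hypothetical placeholders, and it assumes Ray Data's `read_images` and `map_batches` APIs.

```python
import numpy as np
import ray

class ImageClassifier:
    """Placeholder inference worker; in practice this would load a PyTorch model onto the GPU."""
    def __init__(self):
        self.model = lambda images: np.zeros(len(images))  # stand-in for a real model

    def __call__(self, batch: dict) -> dict:
        # Ray Data passes batches as dicts of NumPy arrays.
        batch["label"] = self.model(batch["image"])
        return batch

# Multimodal, non-tabular input instead of a SQL table.
ds = ray.data.read_images("s3://example-bucket/images/")  # hypothetical path

# Each worker in the actor pool reserves a GPU, so the "inference on GPUs"
# stage scales out much like a SQL-on-CPUs job would.
predictions = ds.map_batches(
    ImageClassifier,
    batch_size=64,
    num_gpus=1,      # one GPU per inference worker
    concurrency=2,   # size of the actor pool
)
predictions.write_parquet("/tmp/predictions/")
```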
Model training encompasses reinforcement learning (RL) and post-training activities such as generating new data through model inference. Ray's Actor API can be used to implement Trainer and Generator modules. An actor is a stateful worker: instantiating one starts a new worker process, and the actor's methods are scheduled on that specific worker. Additionally, Ray's native Remote Direct Memory Access (RDMA) support allows GPU objects to be transferred directly over RDMA, improving performance.
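The actor pattern can be illustrated with a short Ray sketch. The `Generator` and `Trainer` classes below are simplified placeholders for the kind of RL post-training modules described in the talk, not code from it.

```python
import ray

ray.init()

@ray.remote(num_gpus=1)  # requires a GPU node; drop num_gpus for CPU-only testing
class Generator:
    """Stateful worker that produces new data via model inference."""
    def __init__(self):
        self.samples_produced = 0  # state persists on this worker between calls

    def generate(self, prompt: str) -> str:
        self.samples_produced += 1
        return f"completion for: {prompt}"  # stand-in for real model inference

@ray.remote(num_gpus=1)
class Trainer:
    """Stateful worker that consumes generated data for post-training."""
    def __init__(self):
        self.steps = 0

    def train_on(self, sample: str) -> int:
        self.steps += 1  # stand-in for a real gradient update
        return self.steps

# Instantiating an actor starts a dedicated worker process; subsequent method
# calls are scheduled on that same worker, so its state carries across calls.
generator = Generator.remote()
trainer = Trainer.remote()

sample = generator.generate.remote("Explain RDMA in one sentence.")
steps = trainer.train_on.remote(sample)  # the generated object is passed by reference
print(ray.get(steps))
```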
Numerous open-source reinforcement learning frameworks have been built on top of Ray. For example, Composer, from the AI-driven code editor tool Cursor, is developed using Ray. Nishihara also mentioned other notable frameworks such as Verl (ByteDance), OpenRLHF, ROLL (Alibaba), NeMo-RL (Nvidia), and SkyRL (UC Berkeley), which leverage training engines like Hugging Face, FSDP, DeepSpeed, and Megatron, along with serving engines such as Hugging Face, vLLM, SGLang, and OpenAI, all orchestrated by Ray.
He detailed the application architecture around Ray, highlighting the growing complexity in both the upper and lower layers and the rising demand for software stacks that bridge applications at the top with hardware at the bottom. The upper layers include AI workloads and model training and inference frameworks such as PyTorch, vLLM, Megatron, and SGLang, while the lower layers comprise compute substrates (GPUs and CPUs) and container orchestrators such as Kubernetes and Slurm. Distributed computing frameworks like Ray and Spark connect the top-tier applications to these lower-level components by managing data ingestion and movement.
Kubernetes and Ray complement each other when hosting AI applications: Ray extends Kubernetes' container-level isolation with process-level isolation, and together they provide both vertical and horizontal autoscaling. Nishihara emphasized that using Ray and Kubernetes together makes it possible to shift GPUs efficiently between inference and training stages.
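As a rough illustration of that GPU shifting, the sketch below declares training and inference work as Ray tasks that each request one GPU; on Kubernetes, a Ray autoscaler (for example via KubeRay) can grow or shrink the GPU pod pool to match this demand. The task bodies are placeholders, not code from the talk.

```python
import ray

ray.init()

@ray.remote(num_gpus=1)
def train_step(batch_id: int) -> str:
    # Placeholder for a real training step on one GPU.
    return f"trained on batch {batch_id}"

@ray.remote(num_gpus=1)
def serve_request(prompt: str) -> str:
    # Placeholder for a real inference call on one GPU.
    return f"response to: {prompt}"

# Both workloads draw from the same logical GPU pool. When training tasks
# finish and release their GPUs, pending inference tasks are scheduled onto
# the freed GPUs (and vice versa) without repartitioning the cluster.
training_refs = [train_step.remote(i) for i in range(4)]
ray.get(training_refs)

inference_refs = [serve_request.remote(p) for p in ["hello", "world"]]
print(ray.get(inference_refs))
```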
In summary, Nishihara stressed the fundamental requirements of AI platforms: a seamless multi-cloud experience, workload prioritization across GPU allocations, observability tooling, comprehensive model and data lineage tracking, and overall governance. Observability plays a critical role at multiple levels, from monitoring metrics such as object transfer speeds at the container level to closely tracking workload processes.