February 21, 2025 by Yotta Labs

Decentralized Inference with Ray and vLLM

1. Background

Large AI models often demand significant computation power for inference. Such computation power is traditionally supplied by physical clusters in a centralized data center, which creates barriers for users to access it in a cost-effective manner. To address this problem, we introduce YottaFusion, a decentralized model-inference engine built upon Ray and vLLM. By aggregating scattered GPU resources across geo-distributed regions, we provide a unified virtual cluster spanning private networks, enabling users to access sufficient GPU resources in a transparent way.

Our major technical strengths are highlighted as follows:

  • Accessing GPU resources located in different private networks: This involves establishing secure and efficient communication channels between isolated network environments.
  • Aggregating scattered GPU resources into a unified cluster: Achieving this requires sophisticated orchestration to manage and schedule workloads across diverse and geo-distributed resources.

We build upon vLLM, a high-performance and portable library for AI model inference. vLLM can run on multiple nodes, making it suitable for distributed inference tasks. Building upon VPN techniques, GPU machines distributed across isolated private networks can be connected to the same virtual local area network (VLAN). After connecting the GPU machines across data centers to the same VLAN, we use Ray to build a virtualized cluster for vLLM. The overall architecture is depicted in Figure 1. Once the virtual Ray cluster is established, the user can use it for AI model inference without needing to know where the GPU servers are located.

Figure 1. Overview of Decentralized Inference Engine: YottaFusion
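
To make this transparency concrete, the minimal client sketch below (our illustration; the gateway address and model name are placeholders) sends a request to the OpenAI-compatible endpoint that vLLM exposes on the virtual cluster. The user interacts with a single URL regardless of which data centers actually host the GPUs.

```python
# Illustrative client only: the gateway URL and model name are placeholders.
# vLLM's OpenAI-compatible server is what the virtual cluster exposes, so the
# client never needs to know which data center serves the request.
from openai import OpenAI

client = OpenAI(
    base_url="http://yottafusion-gateway.example:8000/v1",  # placeholder endpoint
    api_key="EMPTY",  # vLLM's server accepts any key unless one is configured
)

response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Decentralized inference lets us",
    max_tokens=64,
)
print(response.choices[0].text)
```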

2. Design

We aim to establish a secure and efficient distributed inference framework for large AI models by leveraging a virtual Ray cluster, vLLM, and decentralized scheduling. Our goal is to integrate GPU resources from different private networks into a unified cluster, enabling seamless utilization of global computational resources. We add a management and scheduling layer on top of vLLM and Ray. Given an AI inference request, the management layer provides two functionalities: (1) finding a data center with sufficient GPU resources to serve the request, in which case the request in the decentralized environment achieves performance similar to a centralized infrastructure, thanks to the lightweight design; (2) gathering geo-distributed GPU resources to run the inference workload, in which case YottaFusion enables large AI model inference that would otherwise be impossible with the free GPU resources of any individual data center.
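
The placement decision can be sketched as follows. This is an illustrative outline under assumed names (Region, free_gpus, and place_request are not YottaFusion's actual API): try the centralized path first, and fall back to gathering GPUs from several regions.

```python
# Hypothetical sketch of the placement decision (Region, free_gpus, and
# place_request are our illustration, not YottaFusion's actual API).
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    free_gpus: int

def place_request(regions: list[Region], gpus_required: int) -> list[Region]:
    """Prefer a single region (centralized path); otherwise aggregate regions."""
    by_capacity = sorted(regions, key=lambda r: r.free_gpus, reverse=True)
    # (1) Centralized path: any single region with enough free GPUs.
    for region in by_capacity:
        if region.free_gpus >= gpus_required:
            return [region]
    # (2) Decentralized path: gather GPUs from several regions.
    chosen, gathered = [], 0
    for region in by_capacity:
        chosen.append(region)
        gathered += region.free_gpus
        if gathered >= gpus_required:
            return chosen
    raise RuntimeError("not enough free GPUs across all regions")

# A request needing 2 GPUs stays in Oregon; one needing 3 GPUs spans both regions.
print([r.name for r in place_request([Region("oregon", 2), Region("ohio", 1)], 2)])
print([r.name for r in place_request([Region("oregon", 2), Region("ohio", 1)], 3)])
```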

2.1 System architecture

2.1.1 Secure Networking Layer

  • Establishing a Virtual LAN: By establishing peer-to-peer (P2P) connections or utilizing relay servers, the GPU machines residing in distinct private networks can be integrated into a VLAN, even though those machines lack public IP addresses.
  • Optimizing Latency: We implement latency monitoring tools within the VLANs to detect and avoid high-latency paths, dynamically adjusting routes based on real-time network conditions; a minimal probing sketch follows this list.
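
As a rough illustration of such latency monitoring (a sketch with placeholder VLAN addresses, not the actual YottaFusion tooling), the snippet below measures the TCP connect time to each peer so that high-latency paths can be detected and avoided:

```python
# Illustrative latency probe: peer addresses are placeholders on the VLAN.
import socket
import time

PEERS = {"node2": ("10.0.0.2", 6379), "node3": ("10.0.0.3", 6379)}

def probe(addr: tuple[str, int], timeout: float = 2.0) -> float:
    """Return round-trip connect time in milliseconds (inf on failure)."""
    start = time.monotonic()
    try:
        with socket.create_connection(addr, timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return float("inf")

latencies = {name: probe(addr) for name, addr in PEERS.items()}
print(latencies)  # e.g. {'node2': 0.4, 'node3': 48.7}; feed into route selection
```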

2.1.2 Cluster Orchestration Tool

There are a few steps to build the virtual Ray cluster:

  • Deploy Ray on all nodes within the VLAN.
  • Configure Ray to manage and schedule workloads across the distributed GPU resources.
  • Use Ray’s built-in libraries and APIs to handle task distribution and load balancing; a short verification sketch follows this list.
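
Assuming the head node was started with ray start --head and the remaining machines joined with ray start --address=<head_ip>:6379 over the VLAN, the short sketch below verifies that the distributed GPUs appear as a single virtual cluster (addresses and ports are deployment-specific):

```python
# Minimal verification sketch. Assumes the head node ran "ray start --head"
# and the other machines joined with "ray start --address=<head_ip>:6379"
# over the VLAN; addresses and ports are deployment-specific.
import ray

ray.init(address="auto")  # attach to the existing virtual Ray cluster

# The GPUs of all VLAN nodes should appear as one resource pool.
print(ray.cluster_resources())  # e.g. {'GPU': 3.0, 'CPU': ..., ...}
print([node["NodeManagerAddress"] for node in ray.nodes() if node["Alive"]])
```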

2.1.3 Distributed Inference Framework

  • Setting Up vLLM: Deploy vLLM on multiple nodes within the established virtual cluster and allocate GPU resources for inference tasks.
  • Multi-Node Configuration: Configure vLLM’s tensor and pipeline parallelism so that a single model can span multiple nodes in the virtual cluster; see the sketch after this list.
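
A minimal sketch of this configuration is shown below. The model name and parallelism degree are placeholders, and the offline LLM API is used here for brevity; the evaluation in Section 3 instead runs the online OpenAI-compatible server with pipeline parallelism.

```python
# Sketch only: model name and parallelism degree are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,              # shard the model across two GPUs
    gpu_memory_utilization=0.9,
    distributed_executor_backend="ray",  # place workers via the Ray cluster
)

outputs = llm.generate(
    ["Decentralized inference with Ray and vLLM"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```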

2.1.4 Decentralized And Centralized Scheduling Mechanism

Dynamic Load Balancer: Develop a dynamic load balancer that distributes incoming requests to different regions based on current workload, network latency, and availability of GPU resources.

Latency-Aware Routing: Use latency monitoring data to route requests to the nearest region with available capacity.

Load Distribution Algorithms: Implement algorithms such as round-robin, least connections, or weighted round-robin to distribute requests efficiently.

Region-Specific Task Queues: Each region maintains its own task queue, managed by a local scheduler. The local scheduler prioritizes tasks based on user QoS and resource availability.

Global Coordinator: A lightweight global coordinator monitors the health and status of all regions, ensuring balanced load distribution and failure handling. It also handles inter-region communication and coordination.
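
The following sketch illustrates one way the balancer could combine these signals; the class, weights, and numbers are hypothetical and only show the scoring idea, not YottaFusion's actual implementation:

```python
# Hypothetical scoring-based balancer combining latency and load signals.
from dataclasses import dataclass

@dataclass
class RegionState:
    name: str
    latency_ms: float       # reported by the VLAN latency monitor
    active_requests: int    # current load
    free_gpus: int

def pick_region(regions: list[RegionState],
                latency_weight: float = 1.0,
                load_weight: float = 10.0) -> RegionState:
    candidates = [r for r in regions if r.free_gpus > 0]
    if not candidates:
        raise RuntimeError("no region has free capacity")
    # Lower score is better: prefer nearby, lightly loaded regions.
    return min(candidates,
               key=lambda r: latency_weight * r.latency_ms
                             + load_weight * r.active_requests)

regions = [RegionState("oregon", 5.0, 3, 1), RegionState("ohio", 48.0, 2, 1)]
print(pick_region(regions).name)  # "oregon": nearby and only lightly loaded
```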

2.2 Challenges in decentralized environment

  • Challenge #1: The geo-distributed data centers are located in isolated LANs.
  • Challenge #2: Limited GPU resources in individual regions with low inter-region network bandwidth.

2.3 Solutions

To address the above challenges in the decentralized environment, we introduce the following solutions:

2.3.1 Secure Multi-LAN Interconnection based on VPN

  • Establish Virtual LANs: create virtual LANs (VLANs) that connect all participating machines across different private networks securely.
  • Access Control Policies: build access control policies to ensure only authorized devices can join the VLANs; a minimal allowlist sketch follows this list.
  • Latency Optimization: implement latency monitoring tools within the VLANs to detect and avoid high-latency paths. Dynamically adjust routes based on real-time network conditions.
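
As a simplified illustration of such an access control policy (hypothetical fingerprints; a real deployment would delegate this to the VPN layer's key management), a joining machine is admitted only if its key fingerprint is on an allowlist:

```python
# Hypothetical allowlist check; fingerprints below are placeholders.
AUTHORIZED_FINGERPRINTS = {
    "SHA256:node1-fingerprint-placeholder",
    "SHA256:node2-fingerprint-placeholder",
}

def may_join_vlan(fingerprint: str) -> bool:
    """Admit a machine to the VLAN only if its key fingerprint is allowlisted."""
    return fingerprint in AUTHORIZED_FINGERPRINTS

print(may_join_vlan("SHA256:node1-fingerprint-placeholder"))  # True
print(may_join_vlan("SHA256:unknown-device"))                 # False
```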

2.3.2 Centralized and Decentralized Scheduling Based on Model Size and Resource Availability

2.3.2.1 Centralized Inference for Lightweight Models

For lightweight models whose resource requirements can be met within a single data center, we prioritize centralized inference within that region. This ensures lower latency and better resource utilization.

  • Dynamic Load Balancer: Develop a dynamic load balancer that distributes incoming requests within the region based on current load, latency, and resource availability.
  • Latency-Aware Routing: Route requests to the nearest node with available capacity to minimize latency.
  • Load Distribution Algorithms: Implement algorithms such as round-robin, least connections, or weighted round-robin to distribute requests efficiently.
  • Region-Specific Task Queues: Each region maintains its own task queue, managed by a local scheduler. The local scheduler prioritizes tasks based on user QoS and resource availability; see the sketch after this list.
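
The sketch below shows one possible shape of such a region-local queue, with hypothetical class and field names: requests carry a QoS priority, and the local scheduler always pops the highest-priority request first.

```python
# Hypothetical region-local queue: lower priority values mean higher QoS,
# and a counter preserves arrival order among equal priorities.
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedRequest:
    priority: int                      # lower value = higher QoS priority
    seq: int                           # tie-breaker: arrival order
    payload: dict = field(compare=False)

class RegionTaskQueue:
    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def submit(self, payload: dict, qos_priority: int) -> None:
        heapq.heappush(self._heap,
                       QueuedRequest(qos_priority, next(self._counter), payload))

    def next_task(self) -> dict:
        return heapq.heappop(self._heap).payload

queue = RegionTaskQueue()
queue.submit({"prompt": "low-priority batch job"}, qos_priority=2)
queue.submit({"prompt": "interactive user request"}, qos_priority=0)
print(queue.next_task())  # the interactive (higher-QoS) request is served first
```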

2.3.2.2 Decentralized Inference for Larger Models

When GPU resources in an individual data center cannot meet the inference request, we use a decentralized approach to partition the inference workload across multiple geo-distributed regions.

  • Global Coordinator: A lightweight global coordinator monitors the health and status of all regions, ensuring balanced load distribution and failure handling. It also handles inter-region communication and coordination, allowing the system to utilize resources from multiple regions to support larger models.

By employing both centralized and decentralized scheduling strategies, we optimize resource usage while providing high performance, even with low inter-region network bandwidth.
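
To make the global coordinator's role more concrete, the following sketch (hypothetical names and thresholds) tracks heartbeats from each region's local scheduler and treats regions that stop reporting as failed, so traffic can be redirected elsewhere:

```python
# Hypothetical health-monitoring sketch; a real coordinator would also track
# GPU load and queue depth reported by each region's local scheduler.
import time

HEARTBEAT_TIMEOUT_S = 15.0

class GlobalCoordinator:
    def __init__(self, regions: list[str]) -> None:
        self.last_seen = {region: float("-inf") for region in regions}

    def heartbeat(self, region: str) -> None:
        """Called periodically by each region's local scheduler."""
        self.last_seen[region] = time.monotonic()

    def healthy_regions(self) -> list[str]:
        """Regions that reported recently; others are treated as failed."""
        now = time.monotonic()
        return [r for r, t in self.last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT_S]

coordinator = GlobalCoordinator(["oregon", "ohio"])
coordinator.heartbeat("oregon")
print(coordinator.healthy_regions())  # ['oregon']; ohio has not reported yet
```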

3. Evaluation

3.1 Experimental settings

3.1.1 Basic settings

We evaluate YottaFusion in accordance with the guidelines provided in the vLLM documentation under the section “Details for Distributed Inference and Serving”. All models were downloaded from Hugging Face, and the Docker image used in the experiments was vllm/vllm-openai:v0.6.3.post1.

We run online serving using the pipeline parallelism method provided by vLLM and then conduct a performance evaluation using the benchmark_serving.py script provided in vLLM’s benchmarks directory. The parameter gpu_memory_utilization is set to 0.9.
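
For reference, the sketch below approximates a launch of the OpenAI-compatible server with pipeline parallelism and a run of benchmark_serving.py against the sonnet dataset; the flags, dataset path, and request counts are indicative only and should be checked against the vLLM v0.6.3 documentation.

```python
# Indicative commands only; flags and paths may differ across vLLM versions.
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Launch the OpenAI-compatible server on the Ray head node with pipeline
# parallelism across two nodes (assumed flag values).
server = subprocess.Popen([
    "vllm", "serve", MODEL,
    "--pipeline-parallel-size", "2",
    "--gpu-memory-utilization", "0.9",
    "--distributed-executor-backend", "ray",
])

# After the server is up, run vLLM's serving benchmark on the sonnet dataset
# (the dataset path is a placeholder).
subprocess.run([
    "python", "benchmarks/benchmark_serving.py",
    "--backend", "vllm",
    "--model", MODEL,
    "--dataset-name", "sonnet",
    "--dataset-path", "benchmarks/sonnet.txt",
    "--num-prompts", "100",
])
```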

3.1.2 Decentralized Environment

We build a decentralized environment using three nodes from two distinct AWS data centers located in Ohio and Oregon. Each node is equipped with an NVIDIA A10 GPU. The following summarizes the basic configuration of the three nodes in two regions:

  • Node 1, located in Oregon: one NVIDIA A10 GPU.
  • Node 2, located in Oregon: one NVIDIA A10 GPU.
  • Node 3, located in Ohio: one NVIDIA A10 GPU.

Nodes 1 and 2 are located in Oregon, and the network bandwidth between them is approximately 5000 Mb/s. A P2P connection is established between any two nodes. Within the VLAN across the three nodes, the network bandwidth between Ohio and Oregon is approximately 500 Mb/s.

3.2 AI Models for Evaluation

We use the following models for evaluation.

  • meta-llama/Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.2
  • facebook/opt-6.7b

3.3 Main Results

Using the sonnet dataset, we study the impact of the number of inference requests and the request length on inference performance. We report output token throughput, Time to First Token (TTFT), and Time per Output Token (TPOT).

3.3.1 Centralized Inference in a Decentralized Cluster

Figure 2. Performance of Centralized Inference in a Decentralized Cluster.

In this evaluation, YottaFusion is able to find a data center with sufficient GPU resources to meet the inference request. Figure 2 shows the results. YottaFusion chooses nodes 1 and 2, located in the same data center, for inference, instead of nodes 1 and 3 or nodes 2 and 3, which span different data centers. In this setting, YottaFusion achieves the same inference performance as in a centralized environment.

Figure 2 shows the inference performance of the three models in this setting: when the number of input requests is 1, the output token throughput is about 30 tokens/s, the average TTFT is about 50 milliseconds, and the average TPOT is less than 40 milliseconds. As the number of input requests increases to dozens or hundreds, the output token throughput reaches thousands of tokens per second, the average TTFT stays below 1 second, and the average TPOT remains below 70 milliseconds. As for the impact of request length on inference performance, the length mainly affects TTFT but has little effect on output token throughput and TPOT.

3.3.2 Decentralized Inference for Larger Models

Figure 3. Performance of Decentralized Inference for Larger Models.

We modified the decentralized environment above to keep only node 1 and node 3 (the two nodes located in different data centers). As a result, each node has only one GPU and cannot independently run the larger models we want. When YottaFusion connects the two nodes into a virtual cluster, the user can run the two remote GPUs together for larger-model inference. Figure 3 shows the inference performance of the three models.

In this evaluation, as the number of requests increases from 1 to 100, the throughput of the three models rises from a few tokens per second to several hundred tokens per second. When the number of requests is small, the average TTFT can be kept within 1 second. This level of performance is acceptable in some application scenarios where output speed is not critically demanding. The request length again mainly affects TTFT and has no significant impact on output token throughput or TPOT.

This evaluation demonstrates that YottaFusion enables larger AI model inference across geo-distributed regions.

4. Conclusions

In conclusion, YottaFusion creates a robust and efficient decentralized and distributed inference system. The dynamic load balancer and region-specific task queues ensure optimal resource utilization within a single region, while the global coordinator enables the user to leverage resources from multiple regions, enhancing scalability and fault tolerance. YottaFusion provides a structured approach to building a secure, efficient, and scalable distributed inference system for large AI models, ensuring optimal performance and resource utilization on top of a global virtual infrastructure.

YottaFusion also lays a strong foundation for future decentralized serving research and development. By integrating advanced load balancing, priority-based task scheduling, and cross-region resource coordination mechanisms, YottaFusion not only enhances the efficiency and stability of the current system but also provides valuable experience and technical insights for more complex and distributed computing architectures. YottaFusion will drive further innovation and development in decentralized services, enabling their application in larger-scale and more sophisticated scenarios.

5. Acknowledgement

This work was contributed by Zihan Chang, Sheng Xiao, and Shuibing He from Zhejiang University, along with Dong Li from UC Merced and Yotta Labs. We also appreciate the support and valuable suggestions from Shomil Jain at Anyscale.