Research
Your GPU dashboard is lying to you
The standard GPU utilization metric reported by nvidia-smi, nvtop, rocm-smi, Weights & Biases, Amazon CloudWatch, Google Cloud Monitoring, and Azure Monitor does not measure how hard your GPU is actually working. It only tells you whether the GPU is doing anything at all: the counter reports the fraction of time at least one kernel was executing, so even a tiny, thread-starved kernel keeps it pinned at 100%. Real compute throughput can be as low as 1% of peak while dashboards read 100%. That single misleading number drives enormous amounts of wasted spend, wasted energy, and unnecessary hardware purchases across the AI industry.
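To make the gap concrete, here is a minimal sketch (not Utilyze itself) that drives a GPU with deliberately tiny kernels: the NVML utilization counter, which is what nvidia-smi reports, reads near 100% while the achieved arithmetic throughput is a small fraction of peak. It assumes PyTorch with CUDA and the pynvml package; PEAK_FP32_TFLOPS is a placeholder you must set for your own card.

```python
import time

import pynvml
import torch

PEAK_FP32_TFLOPS = 19.5  # placeholder peak for your GPU (e.g. ~19.5 for an A100)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

n = 256  # deliberately tiny matmuls: the GPU is always "busy" but far from peak
a = torch.randn(n, n, device="cuda")
b = torch.randn(n, n, device="cuda")

iters = 20000
torch.cuda.synchronize()
start = time.perf_counter()
for i in range(iters):
    torch.matmul(a, b)
    if i == iters // 2:
        # NVML "utilization" = % of time any kernel was running, not how hard
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

achieved_tflops = 2 * n**3 * iters / elapsed / 1e12  # a matmul is ~2*n^3 FLOPs
print(f"dashboard utilization: {util}%")
print(f"achieved throughput:   {achieved_tflops:.2f} TFLOP/s "
      f"({100 * achieved_tflops / PEAK_FP32_TFLOPS:.1f}% of peak)")
```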
Systalyze is open-sourcing Utilyze, a free, production-ready monitoring and debugging tool that accurately shows how much useful work your GPUs are actually doing, and how close you are to the realistic maximum for your specific workload. Utilyze runs alongside any AI workload in real time with negligible overhead. In production deployments, Utilyze revealed orders-of-magnitude performance headroom in settings that standard tools declared fully saturated.
Read the article
MLTCP: Congestion Control for DNN Training
Abstract: We present MLTCP, a technique to augment today’s congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a very simple principle based on a key conceptual insight: DNN training flows should scale their congestion window size based on the number of bytes sent at each training iteration. We show that integrating this principle into today’s congestion control protocols is straightforward: by adding 30–60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that enabling MLTCP accelerates the average and 99th percentile training iteration time by up to 2× and 4×, respectively.
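A rough sketch of that principle as stated in the abstract (the paper's actual integration lives inside kernel implementations of Reno, CUBIC, and DCQCN; the constant K and the function names below are our own illustration):

```python
def reno_on_ack(cwnd: float, mss: float = 1.0) -> float:
    """Standard Reno congestion avoidance: grow by ~1 MSS per RTT."""
    return cwnd + mss / cwnd

K = 0.5  # assumed scaling constant; the paper tunes this per protocol

def mltcp_on_ack(cwnd: float, bytes_sent_this_iter: int,
                 bytes_per_iter: int, mss: float = 1.0) -> float:
    """Reno growth biased by per-iteration progress: a flow that has sent
    more of its iteration's bytes grows faster. Competing jobs then settle
    into staggered (interleaved) communication phases instead of splitting
    bandwidth evenly at every instant."""
    progress = bytes_sent_this_iter / bytes_per_iter  # in [0, 1]
    return cwnd + (1.0 + K * progress) * mss / cwnd
```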
Read the paper
Congestion Control in Machine Learning Clusters
Abstract: This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all competing jobs. We call this specific combination of jobs compatible and define the compatibility criterion using a novel geometric abstraction. Our abstraction rolls time around a circle and rotates the communication phases of jobs to identify fully compatible jobs. Using this abstraction, we demonstrate up to 1.3× improvement in the average training iteration time of popular ML models. We advocate that resource management algorithms should take job compatibility on network links into account. We then propose three directions to ameliorate the impact of network congestion in ML training clusters: (i) an adaptively unfair congestion control scheme, (ii) priority queues on switches, and (iii) precise flow scheduling.
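As a toy rendering of that circular abstraction, under strong simplifying assumptions we are adding here (equal iteration periods and one contiguous communication burst per job), two jobs are fully compatible on a shared link exactly when their bursts can be rotated so they never overlap:

```python
def compatible(period: float, burst1: float, burst2: float) -> bool:
    """With equal periods and one burst per job, some rotation separates
    the bursts iff their combined duration fits within one period."""
    return burst1 + burst2 <= period

def separating_shift(period: float, start1: float, burst1: float) -> float:
    """A rotation for job 2 that avoids job 1's burst on the circle:
    start job 2's burst the moment job 1's burst ends."""
    return (start1 + burst1) % period

# Example: 4 ms and 5 ms bursts per 10 ms iteration are compatible;
# fair-sharing the link would instead stretch both jobs' phases.
assert compatible(10.0, 4.0, 5.0)
print(separating_shift(10.0, start1=0.0, burst1=4.0))  # -> 4.0
```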
Read the paper
Cassini: Network-Aware Job Scheduling in Machine Learning Clusters
Abstract: We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an Affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6× and 2.5×, respectively. Moreover, we show that CASSINI reduces the number of ECN marked packets in the cluster by up to 33×.
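The flavor of the time-shift idea can be shown with a toy greedy placement. This is not CASSINI's affinity-graph algorithm, and it assumes equal iteration periods with one communication burst per job: pack each job's burst into the earliest free slot on the circle.

```python
def assign_shifts(period: float, bursts: list[float]) -> list[float]:
    """Greedily interleave one communication burst per job on a circle of
    the given period; returns a start time (time-shift value) per job."""
    if sum(bursts) > period:
        raise ValueError("bursts exceed one period: jobs cannot fully interleave")
    shifts, cursor = [], 0.0
    for burst in bursts:
        shifts.append(cursor)   # this job's communication phase starts here
        cursor += burst         # the next job starts when this one ends
    return shifts

# Three jobs with 2/3/4 ms bursts per 10 ms iteration interleave cleanly:
print(assign_shifts(10.0, [2.0, 3.0, 4.0]))  # -> [0.0, 2.0, 5.0]
```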
Read the paper
Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
Abstract: This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLM’s unique communication pattern. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Expert (MoE) models with all-to-all communication through forwarding, with only 8.2% to 11.2% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.
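A toy illustration of why the spine becomes removable, using an assumed placement (tensor parallelism inside each server's high-bandwidth domain, data parallelism across servers; the sizes are ours, not the paper's): every network flow then stays on a single rail.

```python
HB = 8        # GPUs per server (the NVLink-connected high-bandwidth domain)
SERVERS = 4   # illustrative cluster size

# Tensor-parallel groups: all GPUs inside one server -> traffic stays on NVLink.
tp_groups = [[(s, g) for g in range(HB)] for s in range(SERVERS)]

# Data-parallel groups: GPU position g of every server -> traffic stays on rail g.
dp_groups = [[(s, g) for s in range(SERVERS)] for g in range(HB)]

for g, group in enumerate(dp_groups):
    assert {pos for (_, pos) in group} == {g}  # each all-reduce uses one rail
print("no flow crosses rails, so no spine layer is needed to carry it")
```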
Read the paper
Nona: A Stochastic Congestion-Aware Job Scheduler for Real-Time Inference Queries
Abstract: This paper proposes a novel queueing-theoretic approach to enable stochastic congestion-aware scheduling for distributed machine learning inference queries. Our proposed framework, called Nona, combines a stochastic scheduler with an offline optimization formulation rooted in queueing-theoretic principles to minimize the average completion time of heterogeneous inference queries. At its core, Nona incorporates the fundamental tradeoffs between compute and network resources to make efficient scheduling decisions. Nona’s formulation uses the Pollaczek–Khinchine formula to estimate queueing latency and to predict system congestion. Building upon conventional Jackson networks, it captures the dependency between the computation and communication operations of interfering jobs. From this formulation, we derive an optimization problem and use its results as inputs for the scheduler. We introduce a novel graph contraction procedure to enable cloud providers to solve Nona’s optimization formulation in practical settings. We evaluate Nona with real-world machine learning models (AlexNet, ResNet, DenseNet, VGG, and GPT2) and demonstrate that Nona outperforms state-of-the-art schedulers by up to 350×.
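For reference, the Pollaczek–Khinchine formula the abstract leans on gives the mean waiting time of an M/G/1 queue. How Nona embeds it in its optimization is detailed in the paper; the formula itself is standard:

```python
def pk_waiting_time(arrival_rate: float, mean_service: float,
                    service_second_moment: float) -> float:
    """Mean M/G/1 queueing delay: W = lambda * E[S^2] / (2 * (1 - rho)),
    with utilization rho = lambda * E[S]."""
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization >= 1")
    return arrival_rate * service_second_moment / (2.0 * (1.0 - rho))

# Example (made-up numbers): 50 queries/s, 10 ms mean service, E[S^2] = 2e-4 s^2
print(f"{pk_waiting_time(50.0, 0.010, 2e-4) * 1e3:.1f} ms")  # -> 10.0 ms
```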
Read the paper
Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
Abstract: This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate location, allowing the state to be restored in the event of failure. This approach suffers a fundamental tradeoff between the frequency of checkpoints and the cost of a failure. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to create a checkpoint already exists in the network as gradients. Our core contribution is a new multicast abstraction that simultaneously delivers gradients to a separate CPU-based shadow cluster. The shadow maintains a checkpoint by applying those gradients to a copy of the model. Our evaluation shows that Checkmate performs per-iteration checkpointing with training throughput comparable to an ideal no-checkpoint baseline. Checkmate achieves 5× to 34.5× more frequent checkpointing compared to state-of-the-art checkpointing systems, resulting in an 80% to 97.1% reduction in repeated work per failure. At the same checkpointing frequency, Checkmate delivers 1.3× to 6.5× higher throughput compared to other systems.
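A minimal sketch of the shadow-side bookkeeping the abstract describes; the multicast delivery path and the actual optimizer step are Checkmate's, so the plain-SGD update and the class name here are illustrative assumptions:

```python
import torch

class ShadowCheckpoint:
    """A CPU-resident model copy that stays current by applying the same
    gradients the training cluster already exchanges every iteration."""

    def __init__(self, model: torch.nn.Module, lr: float):
        self.model = model.to("cpu")  # the checkpoint *is* this live copy
        self.lr = lr

    def on_gradients(self, grads: dict[str, torch.Tensor]) -> None:
        """Apply one iteration's (multicast) gradients; illustrative SGD."""
        with torch.no_grad():
            for name, param in self.model.named_parameters():
                param -= self.lr * grads[name].to("cpu")

    def restore(self) -> dict[str, torch.Tensor]:
        """On failure, recovery reads the latest state directly; only the
        iteration in flight is ever lost."""
        return self.model.state_dict()
```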
Read the paper
© 2026 Systalyze. All rights reserved.