1. Introduction

The flexibility, portability, and isolation characteristics have made containers a popular environment for large model training in recent years. Unfortunately, these advantages render the network support for containerized large model training extremely challenging, due to the high dynamics of containers, the complex interplay between underlay and overlay networks, and the stringent requirements on failure detection and localization. Existing data center network debugging tools, which rely on comprehensive or opportunistic monitoring, are either inefficient or inaccurate in this setting.

This paper presents SkeletonHunter, a container network monitoring and diagnosis system that leverages the intrinsic and regular sparsity of the network traffic incurred by large model training. Its key idea is to reason about the traffic skeleton, which comprises a crucial set of network paths consistently traversed by the training traffic, so as to reliably detect and localize network failures in a short time. We deployed it in production for six months, uncovering 4,816 network failures with 98.2% precision and 99.3% recall, and localizing them with an accuracy of 95.7%. After fixing 98% of the problematic network components, the monthly network failure rate dropped by 99.1%.

2. Code

The code for release is currently undergoing internal review. Once the review is complete, we will make it publicly available.

The code is organized as follows:

| Directory | Description | Source Code |
| --- | --- | --- |
| common | common dependencies such as host utilities | host.go, pci.go |
| config | configurations that store task/pod info | global.go, task.go |
| coretask | probing task implementations | rdmapingcore.go, taskgenerator.go, taskscheduler.go |
| daemon | daemon services that keep the process alive | agent_ctl.sh, agent_demon.sh |
| main | entry point of the agent | main.go |
| taskmessage | helper functions for communicating with the controller | messager.go, msgkey.go |

3. Data

We provide data samples here. Note that some fields are hashed for privacy and security reasons.

Below are descriptions of some of the data fields:

| Field Name | Description |
| --- | --- |
| dial | a flag that indicates whether the connection succeeded |
| dst_ip | IP address of the pinged container |
| dst_node_id | ID of the pinged container in a training cluster |
| dst_node_sn | SN of the host that runs the pinged container |
| dst_port | port number for processing ping packets |
| dst_real_ip | IP of the host that runs the pinged container |
| tenant_id | tenant ID of the training task |
| latency | end-to-end latency (RTT) in microseconds |
| loss_rate | end-to-end packet loss rate |
| src_ip | IP address of the container that initiates the probing |
| src_node_id | ID of the container that initiates the probing |
| src_node_sn | SN of the host running the container that initiates the probing |
| src_port | port number for processing ping packets |
| src_real_ip | IP of the host running the container that initiates the probing |
| ping_time | timestamp of the ping data |
| task_type | enum indicating the probe type (RDMA, TCP, or Traceroute) |
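
To make the field list easier to work with, here is a minimal Go sketch of how one probing record might be loaded. It assumes the samples are JSON objects keyed by the field names above. The struct name `ProbeRecord`, the helpers `ParseProbeRecord` and `IsSuspect`, the Go types chosen for each field, and the sample values are all hypothetical, so adjust them to the actual sample files. The `IsSuspect` rule only illustrates how the fields can be read; it is not the detection logic of SkeletonHunter.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProbeRecord mirrors the documented field names.
// ASSUMPTION: the Go types and the JSON encoding are guesses; hashed
// fields are treated as opaque strings.
type ProbeRecord struct {
	Dial      bool    `json:"dial"`        // whether the connection succeeded
	DstIP     string  `json:"dst_ip"`      // pinged container IP
	DstNodeID string  `json:"dst_node_id"` // pinged container ID
	DstNodeSN string  `json:"dst_node_sn"` // SN of the destination host
	DstPort   int     `json:"dst_port"`
	DstRealIP string  `json:"dst_real_ip"` // destination host IP
	TenantID  string  `json:"tenant_id"`
	Latency   float64 `json:"latency"`   // RTT in microseconds
	LossRate  float64 `json:"loss_rate"` // end-to-end packet loss rate
	SrcIP     string  `json:"src_ip"`      // probing container IP
	SrcNodeID string  `json:"src_node_id"` // probing container ID
	SrcNodeSN string  `json:"src_node_sn"` // SN of the source host
	SrcPort   int     `json:"src_port"`
	SrcRealIP string  `json:"src_real_ip"` // source host IP
	PingTime  string  `json:"ping_time"` // timestamp; format not specified
	TaskType  string  `json:"task_type"` // RDMA/TCP/Traceroute; encoding assumed
}

// ParseProbeRecord decodes one JSON-encoded record.
func ParseProbeRecord(data []byte) (ProbeRecord, error) {
	var r ProbeRecord
	err := json.Unmarshal(data, &r)
	return r, err
}

// IsSuspect flags a record whose connection failed or that lost packets.
// Illustrative rule only.
func IsSuspect(r ProbeRecord) bool {
	return !r.Dial || r.LossRate > 0
}

func main() {
	// Hypothetical sample; real records contain hashed values.
	sample := []byte(`{"dial": true, "src_ip": "hashed-src", "dst_ip": "hashed-dst",
		"latency": 12.5, "loss_rate": 0.0, "task_type": "rdma"}`)

	r, err := ParseProbeRecord(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s -> %s latency=%.1fus suspect=%v\n",
		r.SrcIP, r.DstIP, r.Latency, IsSuspect(r))
}
```

If the samples turn out to be CSV or use numeric flags (for example `0`/`1` for `dial`), only the decoding step needs to change; the struct layout can stay the same.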