0. Table of Contents
1. Introduction
2. Code
3. Data
1. Introduction
The flexibility, portability, and isolation of containers have made them a popular environment for large model training in recent years. Unfortunately, these very advantages make network support for containerized large model training extremely challenging, due to the high dynamics of containers, the complex interplay between underlay and overlay networks, and the stringent requirements on failure detection and localization. Existing data center network debugging tools, which rely on comprehensive or opportunistic monitoring, are either inefficient or inaccurate in this setting.
This paper presents SkeletonHunter, a container network monitoring and diagnosis system that leverages the intrinsic and regular sparsity of the network traffic incurred by large model training. Its key idea is to reason about the traffic skeleton, a crucial set of network paths consistently traversed by the training traffic, so as to reliably detect and localize network failures in a short time. We deployed it in production for six months, uncovering 4,816 network failures with 98.2% precision and 99.3% recall, and localizing them with a high accuracy of 95.7%. After fixing 98% of the problematic network components, the monthly network failure rate dropped significantly, by 99.1%.
2. Code
The code for release is currently undergoing internal review. Once the review process is completed, we will make the corresponding content publicly available.
The code is organized as follows:
| Directory | Description | Source Code |
| --- | --- | --- |
| common | common dependencies, e.g., host utilities | host.go, pci.go |
| config | configurations that store task/pod info | global.go, task.go |
| coretask | probing task implementations | rdmapingcore.go, taskgenerator.go, taskscheduler.go |
| daemon | daemon services that keep the agent process alive | agent_ctl.sh, agent_demon.sh |
| main | entry point of the agent | main.go |
| taskmessage | helper functions for communicating with the controller | messager.go, msgkey.go |
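To illustrate how the pieces in `coretask` might fit together, here is a minimal, hypothetical sketch of a probing task abstraction and a sequential scheduler loop. All names below (`ProbeTask`, `ProbeResult`, `tcpPingTask`, `runAll`) are illustrative assumptions and do not correspond to the actual identifiers in the repository.

```go
package main

import "fmt"

// ProbeResult is a hypothetical record of one connectivity probe
// (illustrative only; not the repository's real type).
type ProbeResult struct {
	SrcIP    string
	DstIP    string
	Dial     bool    // whether the connection attempt succeeded
	Latency  float64 // RTT in microseconds
	LossRate float64
}

// ProbeTask abstracts a single probing task, in the spirit of the
// tasks that taskgenerator.go would generate per container pair.
type ProbeTask interface {
	Run() ProbeResult
}

// tcpPingTask is a stand-in task that returns a canned result,
// where a real task would send TCP/RDMA ping packets.
type tcpPingTask struct {
	src, dst string
}

func (t tcpPingTask) Run() ProbeResult {
	return ProbeResult{SrcIP: t.src, DstIP: t.dst, Dial: true, Latency: 120.0}
}

// runAll executes a batch of tasks sequentially, as a simple
// scheduler might; the real scheduler likely adds concurrency
// and pacing.
func runAll(tasks []ProbeTask) []ProbeResult {
	results := make([]ProbeResult, 0, len(tasks))
	for _, t := range tasks {
		results = append(results, t.Run())
	}
	return results
}

func main() {
	tasks := []ProbeTask{
		tcpPingTask{src: "10.0.0.1", dst: "10.0.0.2"},
		tcpPingTask{src: "10.0.0.1", dst: "10.0.0.3"},
	}
	for _, r := range runAll(tasks) {
		fmt.Printf("%s -> %s dial=%v latency=%.1fus\n",
			r.SrcIP, r.DstIP, r.Dial, r.Latency)
	}
}
```

The interface-based design lets RDMA, TCP, and traceroute probes share one scheduling path while differing only in their `Run` implementations.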
3. Data
We provide data samples here. Note that some fields are hashed for privacy and security reasons.
Below are descriptions of some of the data fields:
| Field Name | Description |
| --- | --- |
| dial | flag indicating whether the connectivity probe succeeded |
| dst_ip | IP address of the pinged container |
| dst_node_id | ID of the pinged container in a training cluster |
| dst_node_sn | SN of the host that runs the pinged container |
| dst_port | port number for processing ping packets |
| dst_real_ip | IP of the host that runs the pinged container |
| tenant_id | tenant ID of the training task |
| latency | end-to-end latency (RTT) in microseconds |
| loss_rate | end-to-end packet loss rate |
| src_ip | IP address of the container that initiates the probing |
| src_node_id | ID of the container that initiates the probing |
| src_node_sn | SN of the host running the container that initiates the probing |
| src_port | port number for processing ping packets |
| src_real_ip | IP of the host running the container that initiates the probing |
| ping_time | timestamp of the ping data |
| task_type | enum field distinguishing RDMA/TCP/Traceroute pings |