0. Table of Contents
1. Introduction
2. Code
3. Data
1. Introduction
The flexibility, portability, and isolation of containers have made them a popular environment for large model training in recent years. Unfortunately, these very advantages make network support for containerized large model training extremely challenging, due to the high dynamics of containers, the complex interplay between underlay and overlay networks, and the stringent requirements on failure detection and localization. Existing data center network debugging tools, which rely on comprehensive or opportunistic monitoring, are either inefficient or inaccurate in this setting.
This paper presents SkeletonHunter, a container network monitoring and diagnosis system that leverages the intrinsic and regular sparsity of the network traffic incurred by large model training. Its key idea is to reason about the traffic skeleton, a crucial set of network paths consistently traversed by the training traffic, so as to reliably detect and localize network failures in a short time. We deployed it in production for six months, uncovering 4,816 network failures with 98.2% precision and 99.3% recall, and localizing them with a high accuracy of 95.7%. After fixing 98% of the problematic network components, the monthly network failure rate dropped significantly, by 99.1%.
2. Code
The code for release is currently undergoing internal review. Once the review process is completed, we will make the corresponding content publicly available.
The code is organized as follows:
| Directory | Description | Source Code |
| --- | --- | --- |
| common | common dependencies, e.g., host utilities | host.go, pci.go |
| config | configurations that store task/pod info | global.go, task.go |
| coretask | probing task implementations | rdmapingcore.go, taskgenerator.go, taskscheduler.go |
| daemon | daemon services that keep the agent process alive | agent_ctl.sh, agent_demon.sh |
| main | entry point of the agent | main.go |
| taskmessage | helper functions for communicating with the controller | messager.go, msgkey.go |
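To illustrate how the pieces in `coretask` might fit together, here is a minimal, hypothetical sketch of a probing task abstraction and a sequential scheduler loop. All names below (`ProbeTask`, `ProbeResult`, `tcpPingTask`, `runAll`) are illustrative assumptions and do not correspond to the actual identifiers in the repository.

```go
package main

import "fmt"

// ProbeResult is a hypothetical record of one connectivity probe
// (illustrative only; not the repository's real type).
type ProbeResult struct {
	SrcIP    string
	DstIP    string
	Dial     bool    // whether the connection attempt succeeded
	Latency  float64 // RTT in microseconds
	LossRate float64
}

// ProbeTask abstracts a single probing task, in the spirit of the
// tasks that taskgenerator.go would generate per container pair.
type ProbeTask interface {
	Run() ProbeResult
}

// tcpPingTask is a stand-in task that returns a canned result,
// where a real task would send TCP/RDMA ping packets.
type tcpPingTask struct {
	src, dst string
}

func (t tcpPingTask) Run() ProbeResult {
	return ProbeResult{SrcIP: t.src, DstIP: t.dst, Dial: true, Latency: 120.0}
}

// runAll executes a batch of tasks sequentially, as a simple
// scheduler might; the real scheduler likely adds concurrency
// and pacing.
func runAll(tasks []ProbeTask) []ProbeResult {
	results := make([]ProbeResult, 0, len(tasks))
	for _, t := range tasks {
		results = append(results, t.Run())
	}
	return results
}

func main() {
	tasks := []ProbeTask{
		tcpPingTask{src: "10.0.0.1", dst: "10.0.0.2"},
		tcpPingTask{src: "10.0.0.1", dst: "10.0.0.3"},
	}
	for _, r := range runAll(tasks) {
		fmt.Printf("%s -> %s dial=%v latency=%.1fus\n",
			r.SrcIP, r.DstIP, r.Dial, r.Latency)
	}
}
```

The interface-based design lets RDMA, TCP, and traceroute probes share one scheduling path while differing only in their `Run` implementations.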
3. Data
We provide data samples here. Note that some fields are hashed for privacy and security reasons.
Below are descriptions of some of the data fields:
| Field Name | Description |
| --- | --- |
| dial | flag indicating whether the connectivity probe succeeded |
| dst_ip | IP address of the pinged container |
| dst_node_id | ID of the pinged container in a training cluster |
| dst_node_sn | SN of the host that runs the pinged container |
| dst_port | port number for processing ping packets |
| dst_real_ip | IP of the host that runs the pinged container |
| tenant_id | tenant ID of the training task |
| latency | end-to-end latency (RTT) in microseconds |
| loss_rate | end-to-end packet loss rate |
| src_ip | IP address of the container that initiates the probing |
| src_node_id | ID of the container that initiates the probing |
| src_node_sn | SN of the host running the container that initiates the probing |
| src_port | port number for processing ping packets |
| src_real_ip | IP of the host running the container that initiates the probing |
| ping_time | timestamp of the ping data |
| task_type | enum field distinguishing RDMA/TCP/Traceroute pings |