JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
윤경찬, 강준수, 정현준, 엄상현, 장민성, 최영리
Conference/Journal
EuroSys
Year
2024
Research Area
Computing
Abstract
Adaptive batching is a promising technique for reducing the communication and synchronization overhead of training Deep Neural Network (DNN) models. In this paper, we study how to speed up the training of a DNN model using adaptive batching in a heterogeneous GPU cluster, without degrading convergence performance. We propose a novel DNN training system called JABAS (Joint Adaptive Batching and Automatic Scaling). In JABAS, a DNN training job is executed on a DNN training framework called IIDP, which provides the same theoretical convergence rate as distributed SGD in a heterogeneous GPU cluster. To maximize the performance of a job with adaptive batching, JABAS employs adaptive batching and automatic resource scaling jointly.
JABAS changes the global batch size every 𝑝 iterations in a fine-grained manner within an epoch, while auto-scaling to the best GPU allocation for the next epoch in a coarse-grained manner. Using three heterogeneous GPU clusters, we evaluate JABAS on seven DNN models, including large language models. Our experimental results demonstrate that JABAS provides 33.3% shorter training time and 54.2% lower training cost than state-of-the-art adaptive training techniques, on average, without any accuracy loss.
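To make the two-level control structure described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a training loop that adapts the global batch size every 𝑝 iterations within an epoch and reconsiders the GPU allocation only at epoch boundaries. All names here (`train_iteration`, `adapt_global_batch_size`, `choose_gpu_allocation`) and the placeholder policies are hypothetical and stand in for JABAS's actual adaptive-batching and auto-scaling logic.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Current heterogeneous GPU allocation, e.g. {"V100": 4, "P100": 2}."""
    gpus: dict


def train_iteration(global_batch_size: int, cluster: ClusterState) -> float:
    """Placeholder for one distributed SGD step; returns a mock throughput (samples/s)."""
    return float(global_batch_size) * sum(cluster.gpus.values()) / 10.0


def adapt_global_batch_size(current: int, throughput: float) -> int:
    """Hypothetical fine-grained policy: grow the global batch while throughput allows."""
    return min(current * 2, 8192) if throughput > 100.0 else current


def choose_gpu_allocation(cluster: ClusterState, batch_size: int) -> ClusterState:
    """Hypothetical coarse-grained auto-scaler choosing the allocation for the next epoch."""
    return cluster  # a real system would search over the available heterogeneous GPUs


def run_training(num_epochs: int, iters_per_epoch: int, p: int) -> None:
    cluster = ClusterState(gpus={"V100": 4, "P100": 2})
    global_batch_size = 256

    for epoch in range(num_epochs):
        for it in range(iters_per_epoch):
            throughput = train_iteration(global_batch_size, cluster)
            # Fine-grained: revisit the global batch size every p iterations within the epoch.
            if (it + 1) % p == 0:
                global_batch_size = adapt_global_batch_size(global_batch_size, throughput)
        # Coarse-grained: decide the GPU allocation for the next epoch.
        cluster = choose_gpu_allocation(cluster, global_batch_size)
        print(f"epoch {epoch}: global batch = {global_batch_size}, gpus = {cluster.gpus}")


if __name__ == "__main__":
    run_training(num_epochs=3, iters_per_epoch=100, p=20)
```

The separation of timescales is the key design point the abstract highlights: batch-size changes are cheap and can happen many times per epoch, whereas changing the GPU allocation involves resource reconfiguration and is therefore deferred to epoch boundaries.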