TAPP: DNN Training for Task Allocation through Pipeline Parallelism Based on Distributed Deep Reinforcement Learning

by: Yingchi Mao, Zijian Tu, Fagang Xi, Qingyong Wang, Shufang Xu

Format:	Article
Published:	MDPI AG 2021-05-01

Description

The rapid development of artificial intelligence technology has led to the wide use of deep neural networks (DNNs) in many fields. DNNs have grown steadily larger in order to improve model accuracy and quality. Traditional data and model parallelism, however, scale poorly because of communication bottlenecks and low hardware efficiency. Pipeline parallelism, by contrast, trains multiple batches concurrently, which reduces training overhead and yields better acceleration. Because the pipeline-parallel task allocation problem is complex to solve on heterogeneous computing resources, this paper proposes task allocation in pipeline parallelism (TAPP), a method based on deep reinforcement learning. In TAPP, a predictive network is trained with a policy gradient until it obtains the optimal pipeline-parallel task allocation scheme, which speeds up model training. Experimental results show that, compared with data parallelism under bulk synchronous parallel (BSP), TAPP reduces the single-step training time by a factor of 1.37 and the proportion of communication time by 48.92% on average.
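To make the allocation problem concrete, the following minimal sketch assigns contiguous groups of layers to heterogeneous devices and scores each assignment by its slowest pipeline stage. It is not the TAPP method: TAPP trains a predictive network with a policy gradient, whereas this sketch uses exhaustive search, which is only feasible for very small models. The function names, the per-layer costs, and the device speeds are hypothetical, and communication time between stages is ignored.

```python
from itertools import combinations


def stage_times(layer_costs, device_speeds, cuts):
    """Return the compute time of each pipeline stage.

    cuts holds the layer indices at which a new stage begins;
    stage i runs on device i (hypothetical cost model: cost / speed).
    """
    bounds = [0, *cuts, len(layer_costs)]
    return [
        sum(layer_costs[bounds[i]:bounds[i + 1]]) / device_speeds[i]
        for i in range(len(device_speeds))
    ]


def best_allocation(layer_costs, device_speeds):
    """Exhaustively search contiguous splits for the smallest bottleneck."""
    n_layers, n_devices = len(layer_costs), len(device_speeds)
    best_cuts, best_time = None, float("inf")
    # Every choice of n_devices - 1 cut points gives one contiguous split.
    for cuts in combinations(range(1, n_layers), n_devices - 1):
        bottleneck = max(stage_times(layer_costs, device_speeds, cuts))
        if bottleneck < best_time:
            best_cuts, best_time = cuts, bottleneck
    return best_cuts, best_time


if __name__ == "__main__":
    costs = [4, 2, 2, 4, 6, 2]   # assumed per-layer compute cost
    speeds = [1.0, 2.0, 1.0]     # assumed relative device speeds
    print(best_allocation(costs, speeds))
```

With these assumed inputs the search places layers 0-1 on the first device, layers 2-4 on the faster second device, and layer 5 on the third, for a bottleneck stage time of 6.0. The number of candidate splits grows combinatorially with the number of layers and devices, which is one reason a learned allocation policy is attractive for realistic models.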
