Researchers Propose a Novel Scheduler for Managing Data Center Resources Efficiently


Recently, a research team led by Prof. Zhang Wenbo at the Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, has made new progress in scheduling and managing data center resources. The research provides a novel way to maximize resource efficiency.

Long-lived applications (LLAs), such as machine learning services and microservices, increasingly run on production clusters. To manage LLAs well, cluster schedulers must handle more complex placement constraints and larger degrees of parallelism (e.g., scaling application capacity by 100× for the 11.11 e-commerce holiday or Black Friday).

Recent cluster traces from Microsoft, Google and Alibaba show that LLAs’ constraints mainly include anti-affinity and priority. However, previous schedulers often violate these constraints and incur high scheduling latency. When the LLAs are crucial and latency-sensitive, this means enterprises could lose billions of dollars in annual advertising revenue. Support for LLAs in existing schedulers remains rudimentary.

Dr. WU Heng, an assistant researcher at the Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, presents a novel scheduler named Aladdin for scheduling and managing data center resources. Dr. WU and his co-authors move beyond the limitations of traditional scheduling models such as queue-based and Integer Linear Programming (ILP) approaches. They reduce the scheduling of LLAs to a flow network problem and use a multidimensional, nonlinear capacity function on the flow network to express anti-affinity and priority constraints. They also design an optimized maximum-flow algorithm to achieve high-quality placements and global objectives, especially when massive numbers of LLAs arrive simultaneously. In evaluations on Alibaba workload traces, Aladdin reduces constraint violations by as much as 20% and improves resource efficiency by 50%.
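
To make the flow-network idea concrete, the following Python sketch shows how a placement problem can be written as a maximum-flow problem. It is only an illustration under simplifying assumptions and is not Aladdin's algorithm: it uses a textbook Edmonds-Karp maximum flow with plain integer capacities, models anti-affinity as "at most one replica of an application per machine" through unit-capacity edges, and ignores priority and multidimensional resources. The function names max_flow and place, the node labels, and the sample applications and machines are all hypothetical.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Textbook Edmonds-Karp: augment along shortest paths found by BFS."""
    # Residual graph: copy forward edges, add reverse edges with capacity 0.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push flow along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def place(apps, machines):
    """apps: {app name: replicas wanted}; machines: {machine name: free slots}."""
    capacity = {"source": {}}
    for app, replicas in apps.items():
        capacity["source"][("app", app)] = replicas
        # Assumed anti-affinity rule: capacity 1 means at most one
        # replica of this application on any single machine.
        capacity[("app", app)] = {("machine", m): 1 for m in machines}
    for m, slots in machines.items():
        capacity[("machine", m)] = {"sink": slots}

    placed, residual = max_flow(capacity, "source", "sink")
    # Flow on an app -> machine edge shows up as reverse residual capacity.
    placement = {
        app: sorted(m for m in machines
                    if residual[("machine", m)][("app", app)] > 0)
        for app in apps
    }
    return placed, placement


if __name__ == "__main__":
    placed, placement = place({"web": 3, "cache": 2},
                              {"m1": 2, "m2": 2, "m3": 1})
    print(placed, placement)
```

In this toy instance the five requested replicas fit into the five free slots without putting two replicas of the same application on one machine. If an application asks for more replicas than there are machines, the unit-capacity edges cap the flow and the surplus replicas stay unplaced, which is how the constraint is enforced by the network structure rather than by a separate check.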

This work was supported by the National Key Research and Development Program of China, the National Natural Science Foundation of China, and Alibaba Group through the Alibaba Innovative Research (AIR) Program.

The study, entitled “Aladdin: Optimized Maximum Flow Management for Shared Production Clusters”, has been published at the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019).