cfaed Publications

Supporting Fine-grained Dataflow Parallelism in Big Data Systems

Reference

Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Supporting Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), ACM, pp. 41–50, New York, NY, USA, Feb 2018. doi: 10.1145/3178442.3178447

Abstract

Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in a data-parallel fashion. It has recently been reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into higher throughput. This is particularly the case for applications that have inherently sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause of these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicitly parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.
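The abstract points to pipeline and task-level parallelism hidden in the per-record processing loops of systems such as Hadoop MapReduce; the paper expresses it with an implicitly parallel language rather than explicit threads. The following is a minimal, hypothetical Java sketch (not taken from the paper) of the underlying idea: the read, map, and emit steps of a mapper's loop become concurrent stages connected by bounded queues, so I/O and the map computation can overlap instead of running strictly one after another. The class name, queue sizes, and word-count map function are invented for illustration.

// A minimal, hypothetical sketch (not from the paper): the sequential
// per-record loop of a MapReduce-style mapper split into read / map / emit
// pipeline stages that run concurrently and exchange data via bounded queues.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedMapperSketch {

    private static final String EOS = "<eos>"; // end-of-stream marker (assumed convention)

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for an input split; a real mapper would read from HDFS.
        List<String> split = List.of("to be or not", "to be that is", "the question");

        BlockingQueue<String> records = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> pairs   = new ArrayBlockingQueue<>(16);

        // Stage 1: record reader.
        Thread reader = new Thread(() -> {
            try {
                for (String rec : split) records.put(rec);
                records.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 2: the (potentially CPU-intensive) map function, here word count.
        Thread mapper = new Thread(() -> {
            try {
                while (true) {
                    String rec = records.take();
                    if (rec.equals(EOS)) { pairs.put(EOS); break; }
                    for (String word : rec.split(" ")) pairs.put(word + "\t1");
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Stage 3: serialize/emit intermediate key-value pairs.
        Thread emitter = new Thread(() -> {
            try {
                while (true) {
                    String kv = pairs.take();
                    if (kv.equals(EOS)) break;
                    System.out.println(kv);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); mapper.start(); emitter.start();
        reader.join();  mapper.join();  emitter.join();
    }
}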

Bibtex

@InProceedings{ertel_pmam18,
author = {Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {Supporting Fine-grained Dataflow Parallelism in Big Data Systems},
booktitle = {Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM)},
year = {2018},
series = {PMAM'18},
address = {New York, NY, USA},
month = feb,
publisher = {ACM},
doi = {10.1145/3178442.3178447},
isbn = {978-1-4503-5645-9},
location = {Vienna, Austria},
pages = {41--50},
numpages = {10},
acmid = {3178447},
url = {http://doi.acm.org/10.1145/3178442.3178447},
abstract = {Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in a data-parallel fashion. It has recently been reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into higher throughput. This is particularly the case for applications that have inherently sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause of these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicitly parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.},
}

Downloads

pmam-2018-slides [PDF]

1802_Ertel_PMAM [PDF]

Related Paths

HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1794

