SQL Query Execution Optimization on Spark SQL

Gleb Mozhaiskii, Vladimir Korkhov, Ivan Gankevich

The Spark and Hadoop ecosystem includes a wide variety of components and can be integrated with any tool required for Big Data processing today. From release to release, the developers of these frameworks optimize the inner workings of the components and make their usage more flexible and elaborate. Nevertheless, ever since MapReduce was introduced as a programming model and the first Hadoop releases appeared, data skew has been the main problem of distributed data processing. Data skew leads to performance degradation, i.e., a slowdown of application execution caused by idling while waiting for resources to become available. The newest versions of the Spark framework handle this situation easily out of the box. However, in corporate environments with multiple large-scale projects whose development started years ago, there is no opportunity to upgrade tool versions and the corresponding logic. In this article we consider approaches to optimizing SQL query execution in the case of data skew, using a concrete example with HDFS and Spark SQL version 2.3.2.

BibTeX
@inproceedings{mozhaiskii2021sql,
  title={SQL Query Execution Optimization on Spark SQL},
  author={Gleb Mozhaiskii and Vladimir Korkhov and Ivan Gankevich},
  publisher={RWTH Aachen University},
  booktitle={Proceedings of GRID'21},
  url={http://ceur-ws.org/Vol-3041/445-449-paper-82.pdf},
  year={2021},
  month={01},
  doi={10.54546/MLIT.2021.37.73.001},
  language={english},
  volume={3041},
  series={CEUR Workshop Proceedings},
  issn={1613-0073},
  editor={Vladimir Korenkov and Andrey Nechaevskiy and Tatiana Zaikina},
  type={inproceedings}
}

Publication: Proceedings of GRID'21
Publisher: RWTH Aachen University