Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries

0
91

Authors: Edson Ramiro Lucas Filho, Eduardo Cunha de Almeida, Stefanie Scherzinger

Tags: 2019, conceptual modeling

SQL-on-Hadoop processing engines have become state-of-the-art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors which profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs time and money in IaaS environments. In this paper, we focus on reducing the costs for up-front tuning. At the heart of our approach is the observation that a SQL query is compiled into a query plan of MapReduce jobs. While the plans differ from query to query, single jobs tend to be similar between queries. We introduce the notion of the code signature of a MapReduce job and, based on this, our concept of job similarity. We show that we can effectively recycle tuning setups from similar MapReduce jobs already profiled. In doing so, we can leverage any third-party tuning adviser for MapReduce engines. We are able to show that by recycling tuning setups, we can reduce the time spent on profiling by 50% in the TPC-H benchmark.

Read the full paper here: https://link-springer-com.proxy2.hec.ca/chapter/10.1007/978-3-030-33223-5_9