Authors: Fadila Bentayeb, Nadia Kabachi, Omar Boussaid, Yassine Ramdane
Tags: 2019, conceptual modeling
Hadoop uses horizontal partitioning to improve the performance of a big data warehouse. A major challenge when horizontally partitioning the tables of a big data warehouse is to reduce network traffic for a given workload. A common technique for addressing this challenge when performing a join operation is to co-partition the tables of the data warehouse on their join key. However, with existing partitioning schemes, executing a star join operation in Hadoop still requires many MapReduce cycles. In this paper, we combine a data-driven and a workload-driven model to create a new scheme for distributed big data warehouses over Hadoop, called “SkipSJoin”. Our approach performs the star join operation in only one Spark stage and skips the loading of some unnecessary HDFS blocks. Our experiments show that our proposal outperforms some existing approaches in terms of query execution time.
Read the full paper here: https://link-springer-com.proxy2.hec.ca/chapter/10.1007/978-3-030-33223-5_21
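The co-partitioning idea mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the SkipSJoin scheme itself, only the general technique of hash-partitioning two tables on their join key so that matching rows land in the same partition and can be joined locally, without moving rows between partitions. The function names, the row layout (lists of dicts), and the partition count are assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_key(rows, key, num_partitions):
    """Hash-partition rows (dicts) on `key` into `num_partitions` buckets."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def local_join(fact_part, dim_part, key):
    """Hash join of two partitions that share the same partition index."""
    index = defaultdict(list)
    for dim_row in dim_part:
        index[dim_row[key]].append(dim_row)
    return [
        {**dim_row, **fact_row}
        for fact_row in fact_part
        for dim_row in index.get(fact_row[key], [])
    ]


def co_partitioned_join(fact, dim, key, num_partitions=4):
    """Join two tables partition by partition (hypothetical helper).

    Both tables use the same hash function and partition count, so rows
    with equal keys always fall in the same partition index and no row
    has to be compared against another partition.
    """
    fact_parts = partition_by_key(fact, key, num_partitions)
    dim_parts = partition_by_key(dim, key, num_partitions)
    joined = []
    for fact_part, dim_part in zip(fact_parts, dim_parts):
        joined.extend(local_join(fact_part, dim_part, key))
    return joined
```

In a distributed setting each partition pair would live on the same node, which is what removes the network traffic for the join; here the partitions are simply lists in one process.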