Design and implementation of ETL processes using BPMN and relational algebra

Authors: Alejandro A. Vaisman, Esteban Zimányi, Judith Awiti

Tags: 2020, conceptual modeling

Extraction, transformation, and loading (ETL) processes are used to extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse. The Business Process Modeling and Notation (BPMN) has been proposed for expressing ETL processes at a conceptual level. A different approach is studied in this paper, where relational algebra (RA), extended with update operations, is used for specifying ETL processes. In this approach, data tasks in an ETL workflow can be automatically translated into SQL queries to be executed over a DBMS. To illustrate this study, the paper addresses the problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case when updating a SCD table impacts on associated SCD tables. Tackling this problem requires extending the classic RA with update operations. The paper also shows the implementation of a portion of the TPC-DI benchmark that results from both approaches. Thus, the paper presents three implementations: (a) An SQL implementation based on the extended RA-based specification of an ETL process expressed in BPMN4ETL; and (b) Two implementations of workflows that follow from BPMN4ETL, one that uses the Pentaho DI tool, and another one that uses Talend Open Studio for DI. Experiments over these implementations of the TPC-DI benchmark for different scale factors were carried out, and are described and discussed in the paper, showing that the extended RA approach results in more efficient processes than the ones produced by implementing the BPMN4ETL specification over the mentioned ETL tools. The reasons for this result are also discussed.

Read the full paper here: https://www.sciencedirect.com/science/article/pii/S0169023X19306111