A Semi-clustering Scheme for High Performance PageRank on Hadoop

0
72

Authors: Dong Hoon Choi, Jaewoo Chang, Jeonghoon Lee, Seungtae Hong

Tags: 2014, conceptual modeling

As global Internet business has been evolving, large-scale graphs are becoming popular. PageRank computation on the large-scale graphs using Hadoop with default data partitioning method suffers from poor performance because Hadoop scatters even a set of directly connected vertices to arbitrary multiple nodes. In this paper we propose a semi-clustering scheme to address this problem and improve the performance of PageRank on Hadoop. Our scheme divides a graph into a set of semi-clusters, each of which consists of connected vertices, and assigns a semi-cluster to a single data partition in order to reduce the cost of data exchange between nodes during the computation of PageRank. The semi-clusters are merged and split before the PageRank computation, in order to evenly distribute a large-scale graph into a number of data partitions. Our semi-clustering scheme drastically improves the performance: total elapsed time including the cost of the semi-clustering computation reduced by up to 36%. Furthermore, the effectiveness of our scheme increases as the size of the graph increases.

Read the full paper here: https://link-springer-com.proxy2.hec.ca/chapter/10.1007/978-3-319-12256-4_4