Comparative Study of Record Linkage Approaches for Big Data

Date

2020-06

Type

Article

Publisher

Walailak University

Series Info

Walailak Journal of Science and Technology

Abstract

Record linkage is a challenging task for Big Data. This paper presents a comparative study of record linkage approaches for Big Data. We compare the approaches along three dimensions: record linkage phases, dataset properties, and the parallel processing approach for Big Data. As far as we know, the current state of the art only compares individual record linkage approaches with one another. We found only one comparative study that covers the whole record linkage framework, and it addresses relational databases. Our focus on the dimensions of parallel processing approaches for Big Data and dataset properties is novel. Our research revealed five findings. First, data exploration is an almost nonexistent phase in the covered studies, despite its importance for understanding the dataset being examined. Second, techniques for the data standardization and preparation phase of the first dimension are not extensively covered in the literature, even though this phase can directly affect the quality of the results. Third, record linkage in unstructured data is not yet explored in the literature. Fourth, MapReduce has been used in about 50% of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches have been used. These approaches include Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates because it supports in-memory computation, which makes the whole linkage process more efficient. Although this comparative study includes many recent studies that use Apache Spark, it is not yet well explored in the literature, and more research needs to be conducted. In addition, Apache Flink is still rarely used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which are used to eliminate unnecessary comparisons, are not adequately applied in the covered studies, despite their effect on reducing the search space, which results in a more effective record linkage process.
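
The fifth finding concerns pruning, that is, avoiding comparisons between records that cannot plausibly match. As a minimal illustration, not taken from the paper, the sketch below uses simple key-based blocking: records are grouped by a key, and only records that share a block are paired. The record fields, the `blocking_key` function, and the choice of a three-letter surname prefix are all hypothetical assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    # Only records inside the same block are paired for comparison;
    # every cross-block pair is pruned from the search space.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


# Made-up sample records, for illustration only.
records = [
    {"surname": "Smith", "city": "Cairo"},
    {"surname": "Smithe", "city": "Cairo"},
    {"surname": "Jones", "city": "Giza"},
    {"surname": "Jonas", "city": "Giza"},
    {"surname": "Brown", "city": "Luxor"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

A coarse key like this one prunes aggressively but can miss true matches whose keys differ (for example, a typo in the first letters of a surname), so the choice of key trades comparison cost against recall.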

Keywords

October University for Modern Sciences and Arts, Flink, Spark, MapReduce, Hadoop, Record linkage, Big Data

Citation