Abstract:
Record linkage is a challenging task for Big Data. This paper presents a comparative study of record
linkage approaches for Big Data. We compare them along three dimensions: record linkage phases, dataset
properties, and parallel processing approaches for Big Data. As far as we know, the current state of the
art offers only comparative studies between record linkage approaches. We found only one comparative study
that covers the whole record linkage framework for relational databases. Our focus on the dimensions of
parallel processing approaches for Big Data and dataset properties is novel. Our research revealed
five findings. First, data exploration is almost nonexistent as a phase, despite its importance for
understanding the dataset being examined. Second, techniques for the data standardization and preparation
phase of the first dimension are not extensively covered in the literature, which can directly affect the
quality of the results. Third, record linkage on unstructured data is not yet explored in the literature. Fourth, MapReduce
has been used in about 50% of the selected studies to handle the parallel processing of Big Data, but due
to its limitations, more recent and efficient approaches have been used. These approaches include
Apache Spark and Apache Flink. Apache Spark has only recently been adopted for resolving duplicates because
it supports in-memory computation, which makes the whole linkage process more efficient. Although
the comparative study includes many recent studies that use Apache Spark, it is not yet well
explored in the literature, and more research is needed. In addition, Apache Flink is still rarely
used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which are used to eliminate
unnecessary comparisons, are not adequately applied in the covered studies despite their effect on
reducing the search space, which results in a more effective record linkage process.
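
To illustrate how eliminating unnecessary comparisons shrinks the search space, the following is a minimal sketch, not a method taken from the surveyed studies. It uses simple key-based blocking as a stand-in for the pruning idea; the toy records, the field names, and the helper functions `all_pairs` and `blocked_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; names and fields are illustrative only.
RECORDS = [
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "John Smith", "zip": "10001"},
    {"id": 3, "name": "Mary Jones", "zip": "94105"},
    {"id": 4, "name": "Marie Jones", "zip": "94105"},
    {"id": 5, "name": "Alan Turing", "zip": "60601"},
]


def all_pairs(records):
    """Naive search space: every record is compared with every other record."""
    return list(combinations((r["id"] for r in records), 2))


def blocked_pairs(records, key):
    """Reduced search space: only records sharing a blocking key are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r["id"])
    pairs = []
    for ids in blocks.values():
        # Blocks with a single record yield no candidate pairs.
        pairs.extend(combinations(ids, 2))
    return pairs


naive = all_pairs(RECORDS)
# Assumption: records that refer to the same entity share a zip code.
pruned = blocked_pairs(RECORDS, key=lambda r: r["zip"])

print(len(naive), "candidate pairs without blocking")
print(len(pruned), "candidate pairs with blocking:", pruned)
```

The blocking key is a design choice: a key that is too strict may separate true matches into different blocks, while a key that is too loose removes few comparisons.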