Comparative Study of Record Linkage Approaches for Big Data
Date
2020-06
Type
Article
Publisher
Walailak University
Series Info
Walailak Journal of Science and Technology
Abstract
Record linkage is a challenging task for Big Data. This paper presents a comparative study of record
linkage approaches for Big Data. We compare them along three dimensions: record linkage phases, dataset
properties, and the parallel processing approach for Big Data. As far as we know, the current state of the art only
conducts comparative studies between record linkage approaches. We found only one comparative study
that covers the whole record linkage framework for relational databases. Our focus on the dimensions of
parallel processing approaches for Big Data and dataset properties is novel. Our research revealed
five findings. First, data exploration is an almost non-existent phase despite its importance for understanding
the dataset being examined. Second, techniques that handle the data standardization and preparation phase
of the first dimension are not extensively covered in the literature, even though this phase can directly affect the
quality of the results. Third, record linkage in unstructured data is not yet explored in the literature. Fourth, MapReduce
has been used in about 50% of the selected studies to handle the parallel processing of Big Data, but due
to its limitations, more recent and efficient approaches have been used. These approaches include
Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates because it
supports in-memory computation, which makes the whole linkage process more efficient. Although
the comparative study includes many recent studies that use Apache Spark, it is not yet well
explored in the literature, and more research needs to be conducted. In addition, Apache Flink is still rarely
used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which are used to eliminate
unnecessary comparisons, are not adequately applied in the covered studies despite their effect on
reducing the search space, which results in a more effective record linkage process.
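To illustrate the fifth finding, the following is a minimal sketch of how pruning can shrink the search space. It uses blocking, one common pruning technique that the abstract does not name explicitly, so treat the choice as an assumption. The toy records, the `blocking_key` function, and the use of the zip code as the key are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; the fields and values are illustrative only.
records = [
    {"id": 1, "name": "Anna Smith", "zip": "10001"},
    {"id": 2, "name": "Ana Smith", "zip": "10001"},
    {"id": 3, "name": "Bob Jones", "zip": "20002"},
    {"id": 4, "name": "Robert Jones", "zip": "20002"},
    {"id": 5, "name": "Carla Diaz", "zip": "30003"},
]


def blocking_key(record):
    # Assumed key: the zip code. A real system would pick keys per dataset.
    return record["zip"]


def all_pairs(recs):
    # Naive approach: compare every record with every other record.
    return list(combinations((r["id"] for r in recs), 2))


def blocked_pairs(recs):
    # Group records by blocking key, then pair records only within a block.
    blocks = defaultdict(list)
    for r in recs:
        blocks[blocking_key(r)].append(r["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


naive = all_pairs(records)
pruned = blocked_pairs(records)
print(f"naive comparisons: {len(naive)}, after blocking: {len(pruned)}")
```

The candidate pairs that remain after blocking would then go through the usual similarity comparison and classification steps. The trade-off is that true matches whose blocking keys differ are never compared, so the key has to be chosen with care.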
Keywords
October University for Modern Sciences and Arts, Flink, Spark, MapReduce, Hadoop, Record linkage, Big Data