Comparative Study of Record Linkage Approaches for Big Data

Authors: Abd El-Ghafar, Randa M; El-Bastawissy, Ali; Nasr, Eman; Gheith, Mervat H.
Publisher: Walailak University
Year: 2020
Keywords: modern science and arts university; October University for; Flink; Spark; MapReduce; Hadoop; Record linkage; Big Data
Dates: 2020-08-21; 2020-06
Language: en-US
Type: Article
ISSN: 2228-835X
DOI: https://doi.org/10.13140/RG.2.2.10094.23368
URI: http://repository.msa.edu.eg/xmlui/handle/123456789/3718

Abstract: Record linkage is a challenging task for Big Data. This paper presents a comparative study of record linkage approaches for Big Data. We compare the approaches along three dimensions: record linkage phases, dataset properties, and the parallel processing approach for Big Data. As far as we know, the current state of the art only conducts comparative studies between record linkage approaches. We found only one comparative study that covers the whole record linkage framework for relational databases. Our focus on the dimensions of parallel processing approaches for Big Data and dataset properties is novel. Our research revealed five findings. First, data exploration is an almost non-existent phase, despite its importance for understanding the dataset being examined. Second, techniques that handle the data standardization and preparation phase of the first dimension are not extensively covered in the literature, even though this phase can directly affect the quality of the results. Third, record linkage in unstructured data is not yet explored in the literature. Fourth, MapReduce has been used in about 50% of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches have been used. These approaches include Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates because it supports in-memory computation, which makes the whole linkage process more efficient.
Although this comparative study includes many recent studies that use Apache Spark, Spark is not yet well explored in the literature, and more research needs to be conducted. In addition, Apache Flink is still rarely used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which are used to eliminate unnecessary comparisons, are not adequately applied in the covered studies, despite their effect on reducing the search space, which results in a more effective record linkage process.
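
The effect of pruning on the search space, mentioned in the fifth finding, can be illustrated with a minimal sketch of blocking, one common way to avoid unnecessary comparisons. This is not a method taken from the surveyed studies: the record fields (`surname`, `year`), the blocking key, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the birth year.
    # Records with different keys are never compared.
    return (record["surname"][:1].lower(), record["year"])


def candidate_pairs(records):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    # Only pairs that share a block remain as comparison candidates.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative data, invented for this sketch.
records = [
    {"surname": "Smith", "year": 1980},
    {"surname": "Smyth", "year": 1980},
    {"surname": "Jones", "year": 1975},
    {"surname": "Smith", "year": 1991},
]

all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison count
pruned = candidate_pairs(records)                   # comparisons left after blocking
print(all_pairs, len(pruned))
```

In this toy dataset the exhaustive approach compares every pair of records, while blocking keeps only the pairs that share a key. The trade-off is that a poorly chosen key can separate true matches, so the key design matters as much as the reduction in comparisons.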