Comparative Study of Record Linkage Approaches for Big Data

dc.Affiliation: October University for Modern Sciences and Arts (MSA)
dc.contributor.author: Abd El-Ghafar, Randa M
dc.contributor.author: El-Bastawissy, Ali
dc.contributor.author: Nasr, Eman
dc.contributor.author: Gheith, Mervat H.
dc.date.accessioned: 2020-08-21T13:24:42Z
dc.date.available: 2020-08-21T13:24:42Z
dc.date.issued: 2020-06
dc.description.abstract: Record linkage is a challenging task for Big Data. This paper presents a comparative study of record linkage approaches for Big Data. We compare the approaches along three dimensions: record linkage phases, dataset properties, and the parallel processing approach for Big Data. As far as we know, the current state of the art only compares record linkage approaches with one another. We found only one comparative study that covers the whole record linkage framework for relational databases. Our focus on the dimensions of parallel processing approaches for Big Data and dataset properties is novel. Our research revealed five findings. First, data exploration is an almost non-existent phase, despite its importance for understanding the dataset being examined. Second, techniques that handle the data standardization and preparation phase of the first dimension are not extensively covered in the literature, even though this phase can directly affect the quality of the results. Third, record linkage in unstructured data has not yet been explored in the literature. Fourth, MapReduce has been used in about 50% of the selected studies to handle the parallel processing of Big Data, but due to its limitations, more recent and efficient approaches have been used. These approaches include Apache Spark and Apache Flink. Apache Spark has only recently been adopted to resolve duplicates because it supports in-memory computation, which makes the whole linkage process more efficient. Although the comparative study includes many recent studies that use Apache Spark, it is not yet well explored in the literature, and more research needs to be conducted. In addition, Apache Flink is still rarely used to solve the record linkage problem for Big Data. Fifth, pruning techniques, which are used to eliminate unnecessary comparisons, are not adequately applied in the covered studies, despite their effect on reducing the search space, which results in a more effective record linkage process. [en_US]
dc.description.uri: https://www.scimagojr.com/journalsearch.php?q=21100258402&tip=sid&clean=0
dc.identifier.doi: https://doi.org/10.13140/RG.2.2.10094.23368
dc.identifier.issn: 2228-835X
dc.identifier.other: https://doi.org/10.13140/RG.2.2.10094.23368
dc.identifier.uri: http://repository.msa.edu.eg/xmlui/handle/123456789/3718
dc.language.iso: en_US [en_US]
dc.publisher: Walailak University [en_US]
dc.relation.ispartofseries: Walailak Journal of Science and Technology;
dc.subject: October University for Modern Sciences and Arts [en_US]
dc.subject: Flink [en_US]
dc.subject: Spark [en_US]
dc.subject: MapReduce [en_US]
dc.subject: Hadoop [en_US]
dc.subject: Record linkage [en_US]
dc.subject: Big Data [en_US]
dc.title: Comparative Study of Record Linkage Approaches for Big Data [en_US]
dc.type: Article [en_US]
