Record linkage approaches in big data: A state of art study

dc.Affiliation: October University for Modern Sciences and Arts (MSA)
dc.contributor.author: El-Ghafar R.M.A.
dc.contributor.author: Gheith M.H.
dc.contributor.author: El-Bastawissy A.H.
dc.contributor.author: Nasr E.S.
dc.contributor.other: Computer Science Department, Institute of Statistical Studies and Research, Cairo University, Cairo, Egypt
dc.contributor.other: Faculty of Computer Science, Modern Sciences and Arts University, Cairo, Egypt
dc.contributor.other: Independent Researcher, Cairo, Egypt
dc.date.accessioned: 2020-01-09T20:40:59Z
dc.date.available: 2020-01-09T20:40:59Z
dc.date.issued: 2018
dc.description: Scopus
dc.description.abstract: Record Linkage aims to find records that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties have emerged, mainly related to the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to handle the Volume property of Big Data mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning with a blocking key may produce unbalanced blocks, a problem known as data skew, in which data is not evenly distributed among blocks. An uneven distribution of data degrades the overall performance of a MapReduce execution. Fourth, to the best of our knowledge, only a small number of studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence, more work should be dedicated to balancing the load between the distributed blocks. © 2017 IEEE. [en_US]
dc.description.uri: https://www.scimagojr.com/journalsearch.php?q=21100803201&tip=sid&clean=0
dc.identifier.doi: https://doi.org/10.1109/ICENCO.2017.8289792
dc.identifier.isbn: 9.78E+12
dc.identifier.other: https://doi.org/10.1109/ICENCO.2017.8289792
dc.identifier.uri: https://t.ly/AXbWG
dc.language.iso: English [en_US]
dc.publisher: Institute of Electrical and Electronics Engineers Inc. [en_US]
dc.relation.ispartofseries: ICENCO 2017 - 13th International Computer Engineering Conference: Boundless Smart Societies
dc.relation.ispartofseries: 2018-January
dc.subject: Big Data [en_US]
dc.subject: Big Data Integration [en_US]
dc.subject: blocking [en_US]
dc.subject: entity matching [en_US]
dc.subject: entity resolution [en_US]
dc.subject: Hadoop [en_US]
dc.subject: machine learning [en_US]
dc.subject: MapReduce [en_US]
dc.subject: Record Linkage [en_US]
dc.subject: Data integration [en_US]
dc.subject: Learning systems [en_US]
dc.subject: blocking [en_US]
dc.subject: Entity matching [en_US]
dc.subject: Entity resolutions [en_US]
dc.subject: Hadoop [en_US]
dc.subject: Map-reduce [en_US]
dc.subject: Record linkage [en_US]
dc.subject: Big data [en_US]
dc.title: Record linkage approaches in big data: A state of art study [en_US]
dc.type: Conference Paper [en_US]
dcterms.isReferencedBy: Dhavapriya, M., Yasodha, N., Big Data Analytics Challenges and Solutions Using Hadoop, Map Reduce and Big Table (2016) International Journal of Computer Science Trends and Technology (IJCST), 4 (1); Dong, X.L., Srivastava, D., (2015) Big Data Integration, Morgan & Claypool Publishers; Dong, X.L., Srivastava, D., Big data integration (2013) ICDE; Castanedo, F., (2015) Data Preparation in the Big Data Era, USA, O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA; Getoor, L., Machanavajjhala, A., Entity Resolution: Theory, Practice & Open Challenges (2012) VLDB Endowment, 5 (12); Kong, C., Gao, M., Xu, C., Qian, W., Zhou, A., Entity Matching Across Multiple Heterogeneous Data Sources (2016) International Conference on Database Systems for Advanced Applications, Cham; Köpcke, H., Thor, A., Rahm, E., Evaluation of entity resolution approaches on real-world match problems (2010) Proceedings of the VLDB Endowment; Kannan, A., Kannan, A., Agrawal, R., Fuxman, A., Matching unstructured product offers to structured product specifications (2011) 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Cao, Y., Chen, Z., Zhu, J., Yue, P., Lin, C.-Y., Yu, Y., Leveraging Unlabeled Data to Scale Blocking for Record Linkage (2011) Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI); Kolb, L., Thor, A., Rahm, E., Parallel Sorted Neighborhood Blocking with MapReduce (2011) Proc. Conf. Datenbanksysteme in Büro, Technik und Wissenschaft; Baxter, R., Christen, P., Churches, T., Comparison of fast blocking methods for record linkage (2003) ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation; Christen, P., A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication (2011) IEEE Transactions on Knowledge and Data Engineering X(Y); Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C., Nejdl, W., A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces (2013) IEEE Transactions on Knowledge and Data Engineering; Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T.N., Eliminating the Redundancy in Blocking-based Entity Resolution Methods (2011) Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries; Kolb, L., Thor, A., Rahm, E., Multi-pass Sorted Neighborhood Blocking with MapReduce (2012) Computer Science-Research and Development; Mestre, D.G., Pires, C.E., An Adaptive Blocking Approach for Entity Matching with MapReduce (2013) SBBD; Kolb, L., Köpcke, H., Thor, A., Rahm, E., Learning-based Entity Resolution with MapReduce (2011) CloudDB; Dean, J., Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters (2004) The 6th Conference on Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA; Hsueh, S.-C., Lin, M.-Y., Chiu, Y.-C., A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys (2014) Parallel and Distributed Computing; Huang, Y., Record linkage in an Hadoop environment (2011) School of Computing, National University of Singapore; Kolb, L., Thor, A., Rahm, E., Block-based Load Balancing for Entity Resolution with MapReduce (2011) Proceedings of the 20th ACM International Conference on Information and Knowledge Management; Kolb, L., Thor, A., Rahm, E., Load Balancing for MapReduce-based Entity Resolution (2012) International Conference on Data Engineering (ICDE), IEEE, Leipzig, Germany; Chen, C., Pullen, D., Petty, R.H., Talburt, J.R., Methodology for Large-Scale Entity Resolution Without Pairwise Matching (2015) IEEE International Conference on Data Mining Workshop (ICDMW); Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W., To Compare or Not to Compare: Making Entity Resolution more Efficient (2011) Proceedings of the International Workshop on Semantic Web Information Management; Jin, C., Patwary, M.M.A., Agrawal, A., Hendrix, W., Liao, W.-K., Choudhary, A., DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce (2013) 4th International SC Workshop on Data Intensive Computing in the Clouds (DataCloud); Kolb, L., Thor, A., Rahm, E., Dedoop: Efficient Deduplication with Hadoop (2012) VLDB Endow, 12 (5), pp. 1878-2188; Moir, C., Dean, J., A Machine Learning approach to Generic Entity Resolution in support of Cyber Situation Awareness (2015) Proceedings of the 38th Australasian Computer Science Conference (ACSC 2015)
dcterms.source: Scopus
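
The blocking step summarized in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the attribute names (`surname`, `city`), the prefix length of 3, the sample records, and the helper names `blocking_key`, `build_blocks`, and `pair_count` are all assumptions chosen for illustration. The sketch builds a blocking key by concatenating attribute prefixes, groups records into blocks, and counts the pairwise comparisons each block would require, which shows how one oversized block (data skew) can dominate the workload of a single reduce task.

```python
from collections import defaultdict


def blocking_key(record, attributes=("surname", "city"), prefix_len=3):
    """Concatenate lower-cased prefixes of the chosen attributes (assumed scheme)."""
    return "".join(record[a].strip().lower()[:prefix_len] for a in attributes)


def build_blocks(records):
    """Group records by blocking key; each block would go to one reduce task."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


def pair_count(n):
    """Number of pairwise comparisons needed among n records."""
    return n * (n - 1) // 2


# Hypothetical sample data, made up for this illustration.
records = [
    {"surname": "Smith", "city": "London"},
    {"surname": "Smyth", "city": "London"},
    {"surname": "Smithe", "city": "Londonderry"},
    {"surname": "Smith", "city": "Lonsdale"},
    {"surname": "Jones", "city": "Leeds"},
    {"surname": "Brown", "city": "Bristol"},
]

blocks = build_blocks(records)
# Comparisons per block: a rough proxy for the load on each reduce task.
workload = {key: pair_count(len(members)) for key, members in blocks.items()}

for key, pairs in sorted(workload.items()):
    print(f"block {key!r}: {len(blocks[key])} records, {pairs} comparisons")
print("comparisons without blocking:", pair_count(len(records)))
print("comparisons with blocking:", sum(workload.values()))
```

In this toy data, all comparisons fall into a single block while the other blocks do no work, which is the imbalance the abstract calls data skew. The sketch also shows a cost of prefix-based keys: "Smyth" lands in a different block from "Smith" and is never compared with it.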

Files

Original bundle

Name: avatar_scholar_128.png
Size: 2.73 KB
Format: Portable Network Graphics

Name: Record-linkage-techniques-in-Big-Data_3_1__2018_CameraReady.pdf
Size: 387.28 KB
Format: Adobe Portable Document Format