Record linkage approaches in big data: A state of art study
dc.Affiliation | October University for Modern Sciences and Arts (MSA) | |
dc.contributor.author | El-Ghafar R.M.A. | |
dc.contributor.author | Gheith M.H. | |
dc.contributor.author | El-Bastawissy A.H. | |
dc.contributor.author | Nasr E.S. | |
dc.contributor.other | Computer Science Department | |
dc.contributor.other | Institute of Statistical Studies and Research | |
dc.contributor.other | Cairo University | |
dc.contributor.other | Cairo | |
dc.contributor.other | Egypt; Faculty of Computer Science | |
dc.contributor.other | Modern Sciences and Arts University | |
dc.contributor.other | Cairo | |
dc.contributor.other | Egypt; Independent Researcher | |
dc.contributor.other | Cairo | |
dc.contributor.other | Egypt | |
dc.date.accessioned | 2020-01-09T20:40:59Z | |
dc.date.available | 2020-01-09T20:40:59Z | |
dc.date.issued | 2018 | |
dc.description | Scopus | |
dc.description.abstract | Record Linkage aims to find records that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties have emerged, mainly in dealing with the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to address the Volume property of Big Data mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition a big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning with a blocking key may produce unbalanced blocks, a problem known as data skew, in which data is not evenly distributed among blocks. An uneven distribution of data degrades the overall performance of a MapReduce execution. Fourth, to the best of our knowledge, only a small number of studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence, more work should be dedicated to balancing the load between the distributed blocks. © 2017 IEEE. | en_US |
dc.description.uri | https://www.scimagojr.com/journalsearch.php?q=21100803201&tip=sid&clean=0 | |
dc.identifier.doi | https://doi.org/10.1109/ICENCO.2017.8289792 | |
dc.identifier.isbn | 9.78E+12 | |
dc.identifier.other | https://doi.org/10.1109/ICENCO.2017.8289792 | |
dc.identifier.uri | https://t.ly/AXbWG | |
dc.language.iso | English | en_US |
dc.publisher | Institute of Electrical and Electronics Engineers Inc. | en_US |
dc.relation.ispartofseries | ICENCO 2017 - 13th International Computer Engineering Conference: Boundless Smart Societies | |
dc.relation.ispartofseries | 2018-January | |
dc.subject | Big Data | en_US |
dc.subject | Big Data Integration | en_US |
dc.subject | blocking | en_US |
dc.subject | entity matching | en_US |
dc.subject | entity resolution | en_US |
dc.subject | Hadoop | en_US |
dc.subject | machine learning | en_US |
dc.subject | MapReduce | en_US |
dc.subject | Record Linkage | en_US |
dc.subject | Data integration | en_US |
dc.subject | Learning systems | en_US |
dc.subject | Entity matching | en_US |
dc.subject | Entity resolutions | en_US |
dc.subject | Map-reduce | en_US |
dc.subject | Record linkage | en_US |
dc.subject | Big data | en_US |
dc.title | Record linkage approaches in big data: A state of art study | en_US |
dc.type | Conference Paper | en_US |
dcterms.isReferencedBy | Dhavapriya, M., Yasodha, N., Big Data Analytics Challenges and Solutions Using Hadoop, Map Reduce and Big Table (2016) International Journal of Computer Science Trends and Technology (IJCST), 4 (1); Dong, X.L., Srivastava, D., (2015) Big Data Integration, Morgan & Claypool Publishers; Dong, X.L., Srivastava, D., Big data integration (2013) ICDE; Castanedo, F., (2015) Data Preparation in the Big Data Era, USA, O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA; Getoor, L., Machanavajjhala, A., Entity Resolution: Theory, Practice & Open Challenges (2012) VLDB Endowment, 5 (12); Kong, C., Gao, M., Xu, C., Qian, W., Zhou, A., Entity Matching Across Multiple Heterogeneous Data Sources (2016) International Conference on Database Systems for Advanced Applications, Cham; Köpcke, H., Thor, A., Rahm, E., Evaluation of entity resolution approaches on real-world match problems (2010) Proceedings of the VLDB Endowment; Kannan, A., Kannan, A., Agrawal, R., Fuxman, A., Matching unstructured product offers to structured product specifications (2011) 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Cao, Y., Chen, Z., Zhu, J., Yue, P., Lin, C.-Y., Yu, Y., Leveraging Unlabeled Data to Scale Blocking for Record Linkage (2011) Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI); Kolb, L., Thor, A., Rahm, E., Parallel Sorted Neighborhood Blocking with MapReduce (2011) Proc. Conf. Datenbanksysteme in Büro, Technik und Wissenschaft; Baxter, R., Christen, P., Churches, T., Comparison of fast blocking methods for record linkage (2003) ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation; Christen, P., A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication (2011) IEEE Transactions on Knowledge and Data Engineering X(Y); Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C., Nejdl, W., A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces (2013) IEEE Transactions on Knowledge and Data Engineering; Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T.N., Eliminating the Redundancy in Blocking-based Entity Resolution Methods (2011) Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries; Kolb, L., Thor, A., Rahm, E., Multi-pass Sorted Neighborhood Blocking with MapReduce (2012) Computer Science - Research and Development; Mestre, D.G., Pires, C.E., An Adaptive Blocking Approach for Entity Matching with MapReduce (2013) SBBD; Kolb, L., Köpcke, H., Thor, A., Rahm, E., Learning-based Entity Resolution with MapReduce (2011) CloudDB; Dean, J., Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters (2004) The 6th Conference on Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA; Hsueh, S.-C., Lin, M.-Y., Chiu, Y.-C., A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys (2014) Parallel and Distributed Computing; Huang, Y., Record linkage in an Hadoop environment (2011) School of Computing, National University of Singapore; Kolb, L., Thor, A., Rahm, E., Block-based Load Balancing for Entity Resolution with MapReduce (2011) Proceedings of the 20th ACM International Conference on Information and Knowledge Management; Kolb, L., Thor, A., Rahm, E., Load Balancing for MapReduce-based Entity Resolution (2012) International Conference on Data Engineering (ICDE), IEEE, Leipzig, Germany; Chen, C., Pullen, D., Petty, R.H., Talburt, J.R., Methodology for Large-Scale Entity Resolution Without Pairwise Matching (2015) IEEE International Conference on Data Mining Workshop (ICDMW); Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W., To Compare or Not to Compare: Making Entity Resolution more Efficient (2011) Proceedings of the International Workshop on Semantic Web Information Management; Jin, C., Patwary, M.M.A., Agrawal, A., Hendrix, W., Liao, W.-K., Choudhary, A., DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce (2013) 4th International SC Workshop on Data Intensive Computing in the Clouds (DataCloud); Kolb, L., Thor, A., Rahm, E., Dedoop: Efficient Deduplication with Hadoop (2012) VLDB Endow, 12 (5), pp. 1878-2188; Moir, C., Dean, J., A Machine Learning approach to Generic Entity Resolution in support of Cyber Situation Awareness (2015) Proceedings of the 38th Australasian Computer Science Conference (ACSC 2015) | |
dcterms.source | Scopus |