Record linkage approaches in big data: A state of art study
Date
2018
Type
Conference Paper
Publisher
Institute of Electrical and Electronics Engineers Inc.
Series Info
ICENCO 2017 - 13th International Computer Engineering Conference: Boundless Smart Societies
2018-January
Abstract
Record Linkage aims to find records that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties have appeared, mainly in dealing with the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to address the Volume property of Big Data mainly depend on partitioning the data into a number of blocks. The processing of those blocks is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning using a blocking key may lead to unbalanced blocks, a problem known as data skew, where data is not evenly distributed among blocks. An uneven distribution of data degrades the overall execution performance of the MapReduce model. Fourth, to the best of our knowledge, only a small number of studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence, more work should be dedicated to balancing the load between the distributed blocks. © 2017 IEEE.
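The blocking and skew ideas described in the abstract can be illustrated with a minimal single-process sketch. It is not taken from the paper: the attribute names (`surname`, `city`), the prefix length of 3, the sample records, and all function names are assumptions chosen for illustration, and the map, shuffle, and reduce steps are plain Python functions that only imitate the MapReduce flow.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, attributes=("surname", "city"), prefix_len=3):
    """Concatenate the lowercased prefixes of the chosen attributes.

    The attribute names and prefix length are illustrative assumptions.
    """
    return "".join(
        str(record.get(attr, "")).strip().lower()[:prefix_len]
        for attr in attributes
    )


def map_phase(records):
    """Map step: emit (blocking key, record) pairs."""
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    """Group records by key, imitating the MapReduce shuffle step."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks):
    """Reduce step: enumerate candidate pairs inside each block only."""
    for key, block in blocks.items():
        for left, right in combinations(block, 2):
            yield key, left["id"], right["id"]


def comparisons_per_block(blocks):
    """Pairwise comparisons per block: n * (n - 1) / 2 for a block of size n."""
    return {key: len(b) * (len(b) - 1) // 2 for key, b in blocks.items()}


# Hypothetical sample data, made up for this sketch.
records = [
    {"id": 1, "surname": "Smith", "city": "London"},
    {"id": 2, "surname": "Smyth", "city": "London"},
    {"id": 3, "surname": "Smith", "city": "Londn"},
    {"id": 4, "surname": "Smithson", "city": "London"},
    {"id": 5, "surname": "Jones", "city": "Cairo"},
    {"id": 6, "surname": "Jonas", "city": "Cairo"},
]

blocks = shuffle(map_phase(records))
workload = comparisons_per_block(blocks)

for key, count in sorted(workload.items(), key=lambda kv: -kv[1]):
    print(f"block {key!r}: {len(blocks[key])} records, {count} comparisons")

candidate_pairs = [(a, b) for _, a, b in reduce_phase(blocks)]
print("candidate pairs:", candidate_pairs)
```

With this sample, blocking reduces the 15 possible record pairs to 4 candidate pairs, but 3 of those 4 fall into the single block `smilon`. In a real MapReduce job, the reducer that receives such an oversized block would finish last, which is the data skew effect the abstract describes. The comparison count grows quadratically with block size, so even moderate differences in block size translate into large differences in reducer workload.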
Description
Scopus
Keywords
Big Data, Big Data Integration, blocking, entity matching, entity resolution, Hadoop, machine learning, MapReduce, Record Linkage, Data integration, Learning systems