Record linkage approaches in big data: A state of art study

Date

2018

Type

Conference Paper

Publisher

Institute of Electrical and Electronics Engineers Inc.

Series Info

ICENCO 2017 - 13th International Computer Engineering Conference: Boundless Smart Societies
2018-January

Abstract

Record Linkage aims to find records in a dataset that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties have emerged, mainly related to the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to address the Volume property mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the most widely known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning with a blocking key may lead to unbalanced blocks, a problem known as data skew, in which data is not evenly distributed among blocks. An uneven distribution of data degrades the overall performance of a MapReduce execution. Fourth, to the best of our knowledge, only a small number of studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence, more work should be dedicated to balancing the load between the distributed blocks. © 2017 IEEE.
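
The blocking-key idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the attribute names (surname, city), the prefix length of 3, the sample records, and the helper names blocking_key, build_blocks, and comparisons_per_block are all assumptions chosen for illustration. The sketch builds a key from attribute prefixes, groups records into blocks, and counts the pairwise comparisons each block would need, which is one simple way to see data skew.

```python
from collections import defaultdict


def blocking_key(record, attributes=("surname", "city"), prefix_len=3):
    """Concatenate lowercase prefixes of the chosen attributes.

    The attribute names and prefix length are illustrative assumptions.
    """
    return "".join(
        str(record.get(attr, "")).strip().lower()[:prefix_len]
        for attr in attributes
    )


def build_blocks(records, **key_options):
    """Group records by blocking key; only records in one block get compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record, **key_options)].append(record)
    return dict(blocks)


def comparisons_per_block(blocks):
    """Pairwise comparisons per block: n * (n - 1) / 2 for a block of size n."""
    return {key: len(rows) * (len(rows) - 1) // 2 for key, rows in blocks.items()}


# Hypothetical sample data, invented for this sketch.
records = [
    {"surname": "Smith", "city": "London"},
    {"surname": "Smyth", "city": "London"},
    {"surname": "Smith", "city": "Londn"},
    {"surname": "Smithe", "city": "London"},
    {"surname": "Jones", "city": "Cairo"},
]

blocks = build_blocks(records)
load = comparisons_per_block(blocks)

for key, rows in blocks.items():
    print(key, len(rows), load[key])
```

In this toy input, one key collects three of the five records and all of the comparison work, while the other blocks need none. If each block were handed to a separate worker, as in a MapReduce-style job, that worker would become the bottleneck, which is the load-balancing problem the abstract points to.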

Description

Scopus

Keywords

Big Data, Big Data Integration, blocking, entity matching, entity resolution, Hadoop, machine learning, MapReduce, Record Linkage, Data integration, Learning systems
