Item: Record linkage approaches in big data: A state of art study (Institute of Electrical and Electronics Engineers Inc., 2018)
Authors: El-Ghafar R.M.A.; Gheith M.H.; El-Bastawissy A.H.; Nasr E.S.
Affiliations: Computer Science Department, Institute of Statistical Studies and Research, Cairo University, Cairo, Egypt; Faculty of Computer Science, Modern Sciences and Arts University, Cairo, Egypt; Independent Researcher, Cairo, Egypt

Abstract: Record Linkage aims to find records that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties have appeared, mainly in dealing with the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major issues. First, the techniques used to address the Volume property mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating prefixes of chosen attributes. Partitioning by a blocking key may lead to unbalanced blocks, a problem known as data skew, where data is not evenly distributed among blocks. An uneven distribution degrades the performance of the overall execution of the MapReduce model. Fourth, to the best of our knowledge, only a small number of studies have so far addressed load balancing between data blocks in a MapReduce framework. Hence more work should be dedicated to balancing the load between the distributed blocks. © 2017 IEEE.
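The blocking approach the abstract describes — concatenating prefixes of chosen attributes into a blocking key, which can then produce skewed block sizes — can be illustrated with a minimal sketch. The attribute names (`surname`, `city`), prefix length, and toy records below are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def blocking_key(record, attrs=("surname", "city"), prefix_len=3):
    """Build a blocking key by concatenating the first few characters
    of the chosen attributes (attribute names here are illustrative)."""
    return "".join(str(record.get(a, ""))[:prefix_len].lower() for a in attrs)

# Toy records: a common surname prefix lands several records in one
# block while others stay small -- exactly the data-skew problem
# that degrades parallel (e.g. MapReduce) execution.
records = [
    {"surname": "Smith",    "city": "Cairo"},
    {"surname": "Smithson", "city": "Cairo"},
    {"surname": "Smyth",    "city": "Cairo"},
    {"surname": "Lee",      "city": "Giza"},
]

# Count how many records fall into each block.
blocks = Counter(blocking_key(r) for r in records)
print(blocks)  # "smicai" holds 2 records; the other blocks hold 1 each
```

Within each block, only record pairs sharing the same key are compared, which keeps the comparison volume manageable; the trade-off shown here is that frequent prefix values concentrate records into oversized blocks.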