Image Object Recognition Using Apache Hadoop and Python
Abstract
The amount of data that people generate on social media platforms each day is increasing at an alarming rate. Studies show that approximately 1.5 billion images are uploaded to the internet each day. Applications that can use and analyze this data are not available to all users because of the processing power and storage space required to analyze such large datasets. Apache Hadoop is an open-source framework that provides distributed, fault-tolerant processing of Big Data on commodity hardware by means of the Hadoop Distributed File System (HDFS) and MapReduce. With HDFS, data is stored in a distributed manner across different machines (data nodes). The MapReduce framework makes parallelized computing available and manageable, so that the image data created by users can be mined and analyzed. This article focuses on analyzing image data in large datasets: creating feature vectors and using the k-means algorithm to group images that contain similar objects, with Apache Hadoop, MapReduce, Apache Spark, computer vision, and the Python programming language.
Key Terms: K-Means Clustering, MapReduce, Sequence File, Scale-Invariant Feature Transform (SIFT)