Big data refers to data sets so large or complex that traditional algorithms and software cannot process them at once, making processing slow. Challenges in big data include capturing data, data storage, data analysis, data security, data sharing, data transfer, data updating and data sourcing.
The term often refers to the use of predictive analytics to extract value from large data sets. It has been in use since the 1990s, and credit for coining it is usually given to John Mashey. Big data is commonly characterised along three parameters: Volume, Variety and Velocity. Big data can be used in multiple industries such as healthcare, government, manufacturing, education and media; essentially, in any industry that has large sets of data to process.
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella. It provides a complete ecosystem of open source projects that gives us a distributed framework for dealing with big data.
Hadoop can handle structured as well as unstructured data, giving users flexibility for collecting, processing and analysing data in ways that traditional data warehouses do not provide.
Hadoop supports advanced analytics such as predictive analytics, data mining and machine learning. Commercial distribution of Hadoop has been handled by four major vendors: Cloudera, Hortonworks, Amazon Web Services (AWS) and MapR Technologies. Google and Microsoft also provide cloud-based managed services that can be used with Hadoop and related technologies.
Hadoop can process massive amounts of data at once. It runs on clusters of commodity servers and can scale to support thousands of hardware nodes.
The core components in the first iteration of Hadoop were MapReduce, the Hadoop Distributed File System (HDFS) and Hadoop Common, a set of shared utilities and libraries.
MapReduce splits a job into many small tasks: map functions process pieces of the input data in parallel, and reduce functions then aggregate the intermediate results.
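The map-then-reduce flow can be sketched in plain Python with the classic word-count example. This is an illustrative, single-process sketch of the programming model only; in real Hadoop the map and reduce phases are distributed across cluster nodes, and the function names here are our own, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit an intermediate (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, so each reducer
    # sees all values for one key together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big clusters", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each map call touches only its own slice of input and each reduce call touches only one key's values, both phases parallelise naturally, which is what lets Hadoop spread the work across many machines.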
Because Hadoop can analyse and process large amounts of data, it enables large organisations to create data lakes that act as reservoirs for massive amounts of raw data. Continuous cleansing of data complements the data lakes, producing refined sets of transaction data.
Hadoop is commonly used in customer behaviour analytics, predicting churn and the behaviour patterns of users. Enterprises use Hadoop to analyse pricing and to manage safe driver discount programs. Healthcare organisations use Hadoop to make treatments more effective for patients and to understand the patterns of particular diseases. Apache offers the open source framework for Hadoop, which can also analyse semi-structured data.