Apache Hadoop was designed to store and process large data sets. It has many advantages: it is open source, cost-effective, fault-tolerant, and so on. On the other hand, it has some disadvantages as well.
Apache Hadoop Advantages
The following are some of the Apache Hadoop advantages.
1. Open Source
Apache Hadoop is open-source software developed at the Apache Software Foundation. It is freely available; we can download it from the Apache website and start using it right away.
2. Data Sources
Apache Hadoop can store structured, semi-structured, and unstructured data generated from different sources such as emails and social media, in formats like log files, XML, and plain text.
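As a minimal sketch of what this looks like in practice (the NameNode address and file paths below are placeholders, not values from this article), a client can write data of any format into HDFS through the Java FileSystem API, since HDFS simply stores bytes:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the host and port are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // HDFS does not care about the format: log lines, XML, or plain text
            // are all stored as raw bytes inside fixed-size blocks.
            Path logFile = new Path("/data/raw/app-events.log");
            try (FSDataOutputStream out = fs.create(logFile, true)) {
                out.write("2024-01-01 12:00:00 INFO user=alice action=login\n"
                        .getBytes(StandardCharsets.UTF_8));
            }

            Path xmlFile = new Path("/data/raw/order.xml");
            try (FSDataOutputStream out = fs.create(xmlFile, true)) {
                out.write("<order><id>42</id><total>19.99</total></order>\n"
                        .getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}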
3. Performance
Apache Hadoop is a distributed storage and processing system that handles large datasets ranging from terabytes to petabytes. It achieves high performance by splitting data into blocks stored across several nodes; when a user submits a job, the job is divided into sub-tasks that run in parallel on those slave nodes, close to the data they process, which is how Hadoop achieves high performance.
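To make the division of work concrete, here is the classic word-count job written against the MapReduce Java API (the input and output paths are placeholders). Each map task processes one input split, roughly one HDFS block, and the reduce tasks combine the partial results:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // One Mapper instance runs per input split (roughly one per HDFS block),
    // so the map phase is spread across the nodes that hold the data.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducers aggregate the partial counts emitted by all the map tasks.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}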
4. Scalable
Apache Hadoop scales horizontally with the workload: nodes can be added to the cluster on the fly, which makes it highly scalable.
5. High Availability
Apache Hadoop supports multiple standby NameNodes, so if the active NameNode, or even one of the standbys, goes down, the cluster continues functioning. This is how Hadoop achieves high availability.
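A rough sketch of how a client is pointed at such an HA nameservice (the nameservice and host names below are assumptions; in a real deployment these properties normally live in hdfs-site.xml rather than in code):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical name for the cluster (placeholder).
        conf.set("dfs.nameservices", "mycluster");
        // One active and two standby NameNodes; Hadoop 3.x allows more than one standby.
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "namenode3:8020");
        // Proxy provider that retries against the other NameNodes on failover.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client addresses the nameservice, not a single NameNode host,
        // so it keeps working when the active NameNode changes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("Connected, home dir: " + fs.getHomeDirectory());
        fs.close();
    }
}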
6. Language Support
Apache Hadoop supports multiple languages such as Python, C, C++, and Perl, so programmers can write their code in whichever of these languages they prefer.
7. Compatibility
Apache Hadoop is compatible with other fast-growing technologies such as Spark. Spark has its own processing engine, so it can use Hadoop (HDFS) purely as a data storage platform.
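As an illustration, a Spark application can read its input from HDFS and write its output back to HDFS. This is a minimal sketch in Spark's Java API; the HDFS URI and paths are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkOnHdfs {
    public static void main(String[] args) {
        // Spark provides the processing engine; HDFS provides the storage layer.
        SparkSession spark = SparkSession.builder()
                .appName("spark-on-hdfs-sketch")
                .getOrCreate();

        // Read a text file that already lives in HDFS (placeholder URI and path).
        Dataset<String> lines = spark.read()
                .textFile("hdfs://namenode:8020/data/input/events.log");
        System.out.println("Lines stored in HDFS: " + lines.count());

        // Write the result back into HDFS (placeholder path).
        lines.write().text("hdfs://namenode:8020/data/output/events-copy");

        spark.stop();
    }
}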
Apache Hadoop Disadvantages
The following are some of the disadvantages of Apache Hadoop.
1. Batch Processing
Apache Hadoop is a batch-processing engine: it processes data that has already been stored on the system rather than real-time streams. Because of this, Hadoop is not efficient at processing real-time data.
2. Processing Overhead
When we deal with terabytes or petabytes of data, Hadoop incurs significant overhead because it must read all of that data from disk and write the results back to disk after processing; it cannot keep data in memory between processing steps.
3. Small File Overhead
Apache Hadoop is designed to store a small number of large files, but it struggles with a large number of small files (files much smaller than the block size). HDFS stores data in blocks of 128 MB or 256 MB by default, and every file, however small, adds metadata that the NameNode must keep in memory, so millions of tiny files create far more NameNode overhead than the same data stored in a few large files.
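For a rough sense of scale (these numbers are illustrative, not from this article): 1 TB stored in 128 MB blocks amounts to roughly 8,000 block entries, while the same terabyte stored as 10 KB files means on the order of 100 million file entries that the NameNode has to track in memory.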
4. Security Concern
Apache Hadoop uses Kerberos for authentication, but the lack of encryption at the storage and network layers remains a security concern.