Apache Spark vs Hadoop: What’s best for managing Big Data?

Apache Spark and Hadoop are both collections of platforms, systems and tools used for real-time Big Data and BI analytics, but which one is best for your data management?

According to Bernard Marr at Forbes, Spark has overtaken Hadoop as the most active open-source Big Data project. While Hadoop has dominated the field since the late 2000s, Spark has more recently come to prominence as a big hitter.

However, a quick look at Google Trends shows that while interest in Spark has been on the rise since around November 2013, it’s still completely dwarfed by interest in Hadoop.

Google suggests that in March 2016 interest in Hadoop equalled its all-time peak, but Spark has only ever achieved around 44% of Hadoop’s peak interest level. Incidentally, Spark’s own March 2016 peak is only up 3% from its previous high point in June 2015, so growth in interest does seem to have slowed.

So what is Spark, and how is it competing with the Hadoop elephant?

What is Apache Spark?

At its simplest, Apache Spark is a data processor. Like Hadoop, it is open source, and provides a range of connected tools for managing Big Data. It’s often considered a more advanced product than Hadoop, and is proving popular with companies that need to analyse and store vast quantities of data.

The Spark team are clear on who they view as their competition, suggesting that their engine can run programs “up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.” If that’s true, then why has interest in Hadoop continued to rise?

Comparing Spark and Hadoop

The answer is that both products have their strengths and weaknesses, and in many cases their use is not mutually exclusive.

1) Performance

By processing data in-memory, Spark greatly reduces latency, but it can be extremely demanding in terms of memory because it keeps its working data and processes cached in RAM. This means that if it is running on top of Hadoop YARN, or alongside other systems that also have high memory demands, it might be deprived of the resources it needs to perform efficiently.

By contrast, Hadoop MapReduce kills each process once a task is completed, which makes it leaner and more effective to run alongside other resource-demanding services. Spark is a classic only child: it works best in dedicated clusters, whilst Hadoop plays well with others.
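The trade-off described above can be illustrated with a toy model. This is only a sketch in plain Python, not Spark or Hadoop code: the ToyDataset class and both run_ functions are hypothetical names invented for the illustration, and the "disk" is simulated with a counter. It shows why keeping data in memory saves repeated reads across several passes, at the price of holding that memory for the whole job.

```python
# Toy model only: it counts simulated disk reads and does not use Spark or Hadoop.


class ToyDataset:
    """Pretend on-disk dataset that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0

    def read_from_disk(self):
        # Each call stands in for one full (slow) read from storage.
        self.disk_reads += 1
        return list(self._values)


def run_without_cache(dataset, passes):
    """Re-read the data on every pass and hold nothing between passes."""
    totals = []
    for _ in range(passes):
        data = dataset.read_from_disk()
        totals.append(sum(data))
        del data  # memory is released as soon as the pass finishes
    return totals


def run_with_cache(dataset, passes):
    """Read once and keep the data in memory for all later passes."""
    cached = dataset.read_from_disk()  # stays resident for the whole job
    return [sum(cached) for _ in range(passes)]


uncached = ToyDataset(range(1000))
cached = ToyDataset(range(1000))
uncached_totals = run_without_cache(uncached, passes=5)
cached_totals = run_with_cache(cached, passes=5)

# Same results either way; only the number of simulated disk reads differs.
print("reads without cache:", uncached.disk_reads)
print("reads with cache:", cached.disk_reads)
```

Both approaches produce the same totals. The cached version simply trades memory held for the length of the job against fewer reads, which is the reason a memory-hungry neighbour on the same cluster can hurt it.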

2) Costs

Although both software products are open-source and thus free to use, Spark requires a lot of RAM to run in-memory, and thus the individual systems required to run it cost more. However, this is balanced out by the fact that it requires far fewer machines to process large volumes of data, with one test successfully using it to sort 100 TB of data three times faster than Hadoop MapReduce on 10% of the machines.
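To put the benchmark figures above into perspective, here is a rough back-of-the-envelope calculation. It is a sketch under a simplifying assumption that is not in the original test: that cost scales with machine-time (number of machines multiplied by runtime). The function name relative_machine_time is made up for this example.

```python
def relative_machine_time(speedup, machine_fraction):
    """Machine-time of the faster system as a fraction of the slower one's.

    speedup: how many times faster the job finished (3 means three times faster).
    machine_fraction: share of the machines that were used (0.10 means 10%).
    """
    relative_runtime = 1 / speedup
    return relative_runtime * machine_fraction


# Figures quoted above: three times faster on 10% of the machines.
ratio = relative_machine_time(speedup=3, machine_fraction=0.10)

# Assumption: cost is proportional to machine-time. Under that assumption,
# this is how much more each machine could cost before the saving disappears.
break_even_price_multiple = 1 / ratio

print(f"machine-time used: {ratio:.1%} of the baseline")
print(f"break-even price per machine: about {break_even_price_multiple:.0f}x")
```

On those assumptions the job used roughly one thirtieth of the machine-time, so the higher price of RAM-heavy servers would have to be very large indeed to cancel out the saving. Real costs also depend on factors this sketch ignores, such as power, networking and administration.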

3) Ease of Use

Spark is generally regarded as easier to use than MapReduce, as it comes packaged with APIs for Scala, Java and Python, and includes Spark SQL for SQL-style queries. This helps users to code in their most familiar languages, and Spark’s interactive mode can help developers and users get immediate feedback for queries.

4) Scalability

Both systems are scalable using the Java-based file system HDFS. Hadoop’s age means that it has been used for high-profile, large infrastructures: Yahoo has over 100,000 CPUs in over 40,000 servers running Hadoop, with 4,500 nodes in its largest cluster. According to the Spark team, the largest known Spark cluster has 8,000 nodes.

5) Security

Hadoop’s Kerberos authentication support can make security difficult to manage. While Spark lacks secure authentication, it benefits from sharing Hadoop’s HDFS support for access control lists and file level permissions.

Overall, Hadoop comes out on top for security, but Spark benefits from Hadoop’s strengths when it runs on HDFS.

Conclusion

The good news is that the two systems are compatible. Spark benefits from a lot of Hadoop’s strengths via HDFS, while adding speed and ease of use that the older project lacks.

If you need to process huge quantities of data, and you can dedicate systems to process it, then Spark is likely to be better, easier to use and more cost-effective for your project. However, if you need scalability and for your solution to run alongside other resource-demanding services, Hadoop MapReduce will probably be a safer bet.
