To Hadoop or not to Hadoop

Hadoop has become synonymous with Big Data. More recently, it has also become an increasingly polarizing topic, much like the term Big Data itself.

Hadoop did not originate in finance. It is slower with real-time data and, compared with the relational databases that dominate finance, less suited to slicing and dicing data. Hadoop does, however, have its own advantages, not least its distributed nature, which allows for faster, larger-scale analytics. Another major advantage is its file system, the Hadoop Distributed File System, or HDFS. It is excellent at storing unstructured and semi-structured data that do not have a well-defined, predetermined schema. That flexibility suits new and novel data sources especially well.
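
To make the point about schema-less storage concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes records arrive as one JSON document per line, and the field names (ticker, price, headline, sentiment) are purely hypothetical. The idea it illustrates is that nothing is enforced when the data is written; a schema is applied only when the data is read for a particular analysis.

```python
import json

# Hypothetical raw records, one JSON document per line, as they might sit in a file store.
# Note that the records do not share a common set of fields.
RAW_LINES = [
    '{"ticker": "ABC", "price": 10.5, "venue": "X"}',
    '{"ticker": "XYZ", "headline": "Earnings beat", "sentiment": 0.7}',
    '{"ticker": "ABC", "price": 10.7}',
]


def read_records(lines):
    """Parse each line independently; no schema is enforced at write time."""
    return [json.loads(line) for line in lines]


def project(records, fields):
    """Apply a schema at read time: keep only the requested fields, filling gaps with None."""
    return [{field: record.get(field) for field in fields} for record in records]


records = read_records(RAW_LINES)

# One analysis only cares about prices, so it reads the same raw data with its own view.
prices = [row for row in project(records, ["ticker", "price"]) if row["price"] is not None]
```

A different analysis could read the same raw lines with a different set of fields, for example headline and sentiment, without any change to how the data was stored.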

It is important to stress that when people refer to Hadoop they are often mixing terms. Some use Hadoop to mean map-reduce (the distributed, batch-oriented processing model), others mean the file system, HDFS, and others are really talking about the Hadoop ecosystem, which today includes many other open-source Big Data projects that integrate with or leverage the Hadoop infrastructure.
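
For readers unfamiliar with the map-reduce model itself, the following is a rough single-machine sketch of its three phases in plain Python. It is not Hadoop's API, and the function names and the trade data are made up for illustration. In a real cluster the map and reduce steps would run in parallel across many machines, which is where the batch-oriented character comes from.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    ticker, volume = record
    yield ticker, volume


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in sequence over the whole input, as one batch."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical end-of-day trade records: (ticker, traded volume).
trades = [("ABC", 100), ("XYZ", 50), ("ABC", 25)]
totals = run_job(trades)
```

The job only produces an answer once every record has passed through all three phases, which is why this style fits end-of-day totals better than tick-by-tick decisions.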

Going forward, it would appear that map-reduce will lose traction. It is inherently slow for real-time analysis, but well suited to certain larger-scale calculations that are not time critical. Over time, batch-type analysis will become less important, especially in finance, but it will still be used. For overnight calculations and end-of-day analysis in particular, map-reduce will continue to dominate.

Many new projects are emerging to fill the gap between streaming Big Data analytics and Hadoop. Projects such as Storm and Spark allow for increasingly real-time coverage and resolve many of the issues around Hadoop's slower real-time analysis. Summingbird, a newer open-source project, focuses specifically on bridging the gap between Hadoop and real-time processing by creating a platform to dial back and forth between the two paradigms. In short, Hadoop's map-reduce will likely decrease in usage and importance, but it will not disappear: it will be used for less time-critical cases and will be buttressed by emerging open-source projects made to fill in where map-reduce leaves off.

Hadoop's file system, HDFS, in contrast, appears to have become fully ingrained in the Big Data world. Even though map-reduce will diminish in importance over time, the file system will likely remain. That is not to say that all data will necessarily be required to land there before analytics take place. In fact, there are considerable financial use cases where streaming data is first analyzed outside of Hadoop, only to end up in HDFS later for longer-term storage and batch processes.
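
The pattern of analyzing a stream first and landing it in storage afterwards can be sketched as follows. This is a toy illustration in plain Python: the in-memory list stands in for long-term storage such as HDFS, and the class name, window size, and prices are all assumptions made for the example.

```python
from collections import deque


class RollingMean:
    """Keeps the mean of the most recent `window` values seen on the stream."""

    def __init__(self, window):
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


def process_stream(ticks, window=3):
    """Analyze each tick as it arrives, then archive it for later batch work."""
    rolling = RollingMean(window)
    live = []     # results available immediately, tick by tick
    archive = []  # stand-in for long-term storage (for example, files in HDFS)
    for price in ticks:
        live.append(rolling.update(price))
        archive.append(price)
    return live, archive


# Hypothetical price ticks arriving during the day.
live, archive = process_stream([10.0, 11.0, 12.0, 13.0])

# A later batch step runs over everything that was archived.
end_of_day_mean = sum(archive) / len(archive)
```

The live results are produced without waiting for storage, while the archived copy remains available for overnight or end-of-day calculations.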

As for finance in general, most firms will likely continue to rely upon relational databases. This will especially be true for highly structured data with very well-defined use cases. If you want the data fast in order to run the same or similar types of straightforward analysis, relational databases are likely a good choice and will continue to be. If you want to leverage new data sources, especially those that contain unstructured or semi-structured data where the use cases and analytical paths are less well defined, you will likely end up using, or at least wanting your service provider to use, Hadoop and its related ecosystem.