How big is big? Is there a case for big data solutions right now in investment banking?

When you have a hammer, everything looks like a nail.

There is a significant amount of interest in Big Data solutions within Investment Banking technology departments, and there have been a number of proof-of-concept and partial implementations of Hadoop and similar technologies to demonstrate the value of these solutions in our space. I'm not sure, though, that we are in a position to make use of this technology at this point. The rationale hangs on one big question…

Does anyone actually know what is meant by “Big Data”?

One possible definition is:

“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (Autonomy).

IBM have a definition which adds colour and depth, focusing on four dimensions: Volume, Velocity, Variety and Veracity. It expands the idea that Big Data isn’t just lots of data; it’s lots of data that has meaning and value hidden deep within it. Big Data tools offer the means of uncovering this meaning.

The notion of uncovering meaning in data is not a new idea, and I am sure that we are all familiar with the data->information->knowledge hierarchy and the traditional statistical and relational analysis tools that have been brought to bear in uncovering nuggets of knowledge from mines of data. This metaphor even gave the discipline a name: Data Mining.

In the world of Big Data, where volume and variety create the need for new tools, Data Mining has given way to Data Science, and traditional approaches have been expanded and evolved to account for the huge scale of the problem. Big Data requires large-scale distributed processing, and combines a recent history in Grid and High Performance Computing with sets of distributed data analysis and consolidation tools to create an environment for data analysis, discovery and verisimilitude in the new world.

The drive and evolution of Big Data tools has come through search engine analytics. Here is a context in which there are huge (not just big) amounts of unstructured data, with meaning hidden deep in layers of seemingly random, unconnected data items. Tools such as Hadoop emerged, creating opportunities to build massively parallel, meaningful queries that identify relationships, semantics and correlations across such data sets. Now these tools are brought to bear in all sorts of applications, where data volumes are measured in petabytes, and data sources are all forms of text and media, both structured and unstructured.
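To make the style of job concrete, here is a minimal sketch along the lines of the canonical Hadoop MapReduce term-count example: the map phase emits (term, 1) pairs in parallel across the cluster and the reduce phase sums them per term. The class names are my own and a standard Hadoop 2.x MapReduce installation is assumed; this is illustrative rather than anything from a production system.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal term-count job over unstructured text: map emits (term, 1), reduce sums per term.
public class TermCount {

  public static class TermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          term.set(token);
          context.write(term, ONE);   // emit (term, 1)
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // total occurrences of this term
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "term count");
    job.setJarByClass(TermCount.class);
    job.setMapperClass(TermMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of text
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The interesting part is not the counting itself but the fact that the same pattern scales out across thousands of nodes and petabytes of input without the job code changing.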

Closely related to these tools, but on the other side of the coin, are the NoSQL offerings. Losing the shackles of consistency required within a relational frame (Ted Codd would be turning in his grave, and as for Boyce..), the NoSQL databases are built for horizontal scalability, and include redundancy and fault-tolerance capabilities. They typically support reads and amendments well, and are well suited to simple high-volume transactional data (Twitter feeds, time series). They are not yet, however, suitable tools for storing complex data structures or for serious Data Mining applications (although they may form part of a Data Mining related stack).
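As a hedged illustration of the “simple, high-volume transactional data” case, the sketch below writes and reads a tick-style time series in Cassandra via the DataStax Java driver (a 3.x driver and a locally running cluster are assumed; the keyspace, table and column names are my own inventions).

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Date;

// Tick store: one wide partition per (instrument, trade_date), clustered by tick time.
public class TickStore {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    try {
      Session session = cluster.connect();

      session.execute("CREATE KEYSPACE IF NOT EXISTS mkt "
          + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
      session.execute("CREATE TABLE IF NOT EXISTS mkt.ticks ("
          + " instrument text, trade_date text, ts timestamp, price double,"
          + " PRIMARY KEY ((instrument, trade_date), ts))");

      // High-volume writes are the strength of this model.
      PreparedStatement insert = session.prepare(
          "INSERT INTO mkt.ticks (instrument, trade_date, ts, price) VALUES (?, ?, ?, ?)");
      session.execute(insert.bind("VOD.L", "2013-06-01", new Date(), 187.25));

      // Reads within a single partition (one instrument, one day) are cheap and fast.
      for (Row row : session.execute(
          "SELECT ts, price FROM mkt.ticks "
          + "WHERE instrument = 'VOD.L' AND trade_date = '2013-06-01'")) {
        System.out.println(row.getTimestamp("ts") + " -> " + row.getDouble("price"));
      }
    } finally {
      cluster.close(); // also closes sessions created from this cluster
    }
  }
}
```

The partition key keeps each day’s ticks together, which plays to Cassandra’s strengths; anything needing ad-hoc joins across instruments quickly pushes you back towards the relational or Data Mining stacks discussed here.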

Challenged by the influx of new technologies, approaches and conceptual frames, traditional relational database vendors are not going to be left behind. Many of them are expanding the capability of their existing relational databases, or adding additional components, to address the industry’s need for more and more data processing. Distributed databases are not news, and partitioning of relational data across multiple database servers is implemented in all of the well-known database offerings (Oracle, DB2, Sybase, PostgreSQL etc.). But more than this, these database solutions are simply able to handle larger amounts of data. It is difficult to get true details, as there are a number of factors that contribute to scalability, but as an indication, according to dbfriend.net the largest SQL Server database in production now is 100TB. Oracle’s Exadata, a purpose-built hardware and software appliance for running existing and new database applications at high scale and high volumes, is reported to have one customer with over a petabyte of data under management.
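As a hedged sketch of what “partitioning of relational data” looks like in practice, the snippet below issues Oracle-style range-partitioning DDL over plain JDBC, so that each year of trades lives in its own partition and date-bounded queries only touch the relevant partitions. The table, partition and connection details are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates a range-partitioned trade table: one partition per year of trade_date.
public class PartitionedTrades {
  public static void main(String[] args) throws Exception {
    // Connection details are placeholders for illustration only.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@//dbhost:1521/TRADES", "app_user", "secret");
         Statement stmt = conn.createStatement()) {
      stmt.executeUpdate(
          "CREATE TABLE trades ("
        + "  trade_id     NUMBER PRIMARY KEY,"
        + "  counterparty VARCHAR2(64),"
        + "  trade_date   DATE,"
        + "  notional     NUMBER"
        + ") PARTITION BY RANGE (trade_date) ("
        + "  PARTITION trades_2011 VALUES LESS THAN (DATE '2012-01-01'),"
        + "  PARTITION trades_2012 VALUES LESS THAN (DATE '2013-01-01'),"
        + "  PARTITION trades_2013 VALUES LESS THAN (DATE '2014-01-01')"
        + ")");
    }
  }
}
```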

For those applications which really do need big data, or a combination of big data and relational data, Oracle Big Data Appliance and Sybase IQ offer large-scale data analysis solutions that address similar markets to the open source options, but at greater cost and with more support and development behind them.

The various solutions sit in regions of a two-dimensional space, running from Small Data to Big Data on one axis and from structured to unstructured data on the other.

[Figure: Size-Structure quadrant, mapping the solutions above from Small to Big Data and from structured to unstructured data]

There are some clear overlaps, as you would expect, in which the same problem can be solved, appropriately, by more than one tool. The question for us, when we consider the value of Big Data technology in the context of Investment Banking’s technical needs, is:

  • Is there a business problem in which the size and complexity of the data are such that a Big Data solution is the only solution?

I contend that the answer to this question is a firm NO. Not yet, at least.

Firstly, on the question of structure, I would suggest that the data we are dealing with in the context of investment banking is very well structured and understood. We have spent many years understanding ways to make meaning from our data. We have built Entity-Relationship models for all asset classes, for counterparties and for transactions. We have developed reporting data warehouses using standard DW schema definitions, and we have built standard ways of communicating data within and between organisations (FpML and FIX, for example). The one thing we have in abundance is structured data: data that we understand, that has a shared semantic and a common ontology. We know how to derive value and meaning from this data, and we know how to store it, transform it and distribute it. This is not the same as the endless stream of chatter, media and interactions that the Facebooks and Googles of this new world have to deal with and make meaning from on a daily basis.

Secondly, on the question of size, it is my experience that while we may think we have a lot of data, sometimes up to tens of terabytes, this is not actually a big enough amount of data to be a Big Data problem. It’s more a Slightly Large Data problem, and one that can typically be handled by existing database technology, perhaps with a few enhancements.

The advantages of using existing SQL-based E-R technology (distributed, partitioned, or just on a huge machine) are manifold. Most important is the fact that the existing semantic of that data is retained: no further processing is necessary to create the first level of meaning. Secondary meaning can be established using well-understood, traditional Data Mining and Business Intelligence tools and approaches, with the benefits that SQL brings in terms of making sense of set-oriented data. Then of course there are questions of supportability and of available skilled resources, and the benefits of stability that using a well-understood, existing technology brings.
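As a minimal sketch of that set-oriented benefit (reusing the hypothetical trades table from the partitioning example above; the column names are my own), a single SQL aggregation already yields a “secondary meaning” style report with no new tooling at all.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Set-oriented analysis: gross notional per counterparty per month, straight
// from the relational store the bank already understands and supports.
public class ExposureReport {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@//dbhost:1521/TRADES", "app_user", "secret");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT counterparty, TRUNC(trade_date, 'MM') AS month, SUM(notional) AS gross "
           + "FROM trades "
           + "GROUP BY counterparty, TRUNC(trade_date, 'MM') "
           + "ORDER BY gross DESC")) {
      while (rs.next()) {
        System.out.printf("%-20s %tF %,15.0f%n",
            rs.getString("counterparty"), rs.getDate("month"), rs.getDouble("gross"));
      }
    }
  }
}
```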

Big Data solutions are often cited as being essential as we move to new models of risk management that cut across portfolios, counterparties and asset classes. But these problems are predominantly compute-intensive, and the data constraints lie in initial load, intermediate-results processing and storage, and aggregation. At this point none of the data sizes are too large for a combination of database and file system, and none of the problems require Hadoop-style parallel queries, although Map-Reduce is a useful paradigm that could help with aggregation, and one that can be implemented in a variety of ways (it is supported now by the major grid vendors, for example).
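As a hedged sketch of Map-Reduce as a paradigm rather than a product (plain Java streams, no grid vendor API assumed; the class and field names are mine), aggregating per-counterparty exposures from a batch of grid-computed pricing results is a map and a reduce in miniature.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Map-reduce in miniature: "map" each pricing result to a (counterparty, exposure)
// pair, then "reduce" by summing exposures per counterparty.
public class ExposureAggregator {

  record PricingResult(String counterparty, String tradeId, double exposure) {}

  static Map<String, Double> aggregate(List<PricingResult> results) {
    return results.parallelStream()                          // map phase runs in parallel
        .collect(Collectors.groupingBy(
            PricingResult::counterparty,
            Collectors.summingDouble(PricingResult::exposure))); // reduce: sum per key
  }

  public static void main(String[] args) {
    List<PricingResult> batch = List.of(
        new PricingResult("CPTY_A", "T1", 1_250_000.0),
        new PricingResult("CPTY_B", "T2",   400_000.0),
        new PricingResult("CPTY_A", "T3",   730_000.0));
    aggregate(batch).forEach((cpty, total) ->
        System.out.printf("%s -> %,.0f%n", cpty, total));
  }
}
```

The same shape of computation can be run on a compute grid or a database, which is the point: the paradigm is useful long before Hadoop itself is needed.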

Reflecting on the size-structure quadrant above, it is clear that certain tools are suitable for certain jobs. In finance there are, without a doubt, contexts and frames dealing with large amounts of unstructured data in which Big Data tools are the right technology. However, at this time, and for the foreseeable future, while it may be an interesting diversion and a useful area in which to build skills and knowledge, there seems to be no standard business problem in the Investment Banking world that needs Big Data technology, or that cannot be addressed with standard database offerings and their enhancements.

That is not, of course, to say it won’t come. Only that right now, if you want to screw in a screw, you need a screwdriver… not a hammer.


Tagged: Big Data, cassandra, financial markets, grid computing, hadoop, technical, technology
