I have been reading lots of material on the web about Big Data, and a couple of weeks ago I attended EMC World, where the major theme was Big Data meets the Cloud. So I have been immersed in this area for a while now.
Unfortunately, I think folks don't really understand what Big Data is. At least, the definitions I read on the web (and see at tradeshow sessions) miss the essence of the problem.
I have always believed that if I am confused, other folks are confused as well. Thus, I think the issue of Big Data (which is undoubtedly one of the major IT issues of our time) needs to be clarified.
In this post, I will attempt to define what I think Big Data is, and in the process, hopefully spark some debate. Perhaps this will help clarify things a bit. In my view, most Big Data applications have several characteristics in common:
- Perhaps it goes without saying, but the data managed by a Big Data application is big. This is the one characteristic that most of the pundits seem to "get". Scaling to hundreds of terabytes is common, and petabyte scale is not unheard of. This creates unique challenges, as you can imagine.
- The first challenge is performance. Big Data applications are volatile: an enormous percentage of the data gets overwritten very quickly. When terabytes of data are being written that fast, traditional SAN storage and software like Oracle Database simply won't deliver the required throughput. A new way of thinking about the problem is required, and that new way of thinking is primarily about order-of-magnitude increases in throughput.
- The next challenge is cost. The hardware and software currently used for most IT applications are simply too expensive and over-engineered for Big Data; the cost per gigabyte is far too high. Most Big Data applications would be impossible if conventional, old-school hardware and software (like traditional SAN storage and Oracle Database) were used to manage them. So a second new way of thinking is required, one that revolves around dramatically reducing cost per gigabyte. Combined with the throughput issue, this demands a new type of hardware and software, hence the use of large in-memory arrays and software like Hadoop.
- Traditional transactional properties like ACID simply do not apply to most Big Data applications. Hardware like traditional SAN storage and software like Oracle Database are built around one concept: data is extremely precious and must be protected at all costs. In Big Data, the use of "fuzzy logic" is very common: no individual piece of data is precious; rather, it is the overall texture of the data that matters. Any individual piece of data is of trivial value. Many of my customers tell me that this change in mindset is where the cost savings come from. Once you divorce yourself from ACID properties and guaranteed transactional consistency, it becomes possible to build applications with orders of magnitude lower cost and higher performance than traditional transaction databases like Oracle.
The effect is that many Big Data applications are not even backed up. What matters is the current "state" of the data; preserving its content historically is not important. Thus, Big Data applications are regularly polled and summarized, and the summaries are saved into a traditional transaction database like Oracle and stored on a traditional storage array. Of course, those summaries represent a minute fraction of the data managed by the Big Data application.
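To make that poll-and-summarize pattern concrete, here is a minimal sketch in Python. It is purely my own illustration, not any particular product: an in-memory list stands in for the volatile Big Data store, and sqlite3 stands in for a traditional transaction database like Oracle.

```python
# A minimal sketch of the poll-and-summarize pattern described above.
# The "Big Data" side is faked with an in-memory list of records; in a real
# system it would be a scan or job against a distributed store. sqlite3
# stands in for a traditional transaction database such as Oracle.
import random
import sqlite3
import time

def poll_big_data_store():
    """Pretend to scan the volatile, non-ACID store and return raw records."""
    # Hypothetical records: (site, items_in_process)
    return [(site, random.randint(1_000, 50_000))
            for site in ("site-a", "site-b", "site-c")]

def summarize(records):
    """Collapse the raw records into a tiny summary -- the only part we keep."""
    total = sum(count for _, count in records)
    return {"timestamp": time.time(), "total_in_process": total}

# The summary, a minute fraction of the raw data, is what gets ACID treatment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (ts REAL, total INTEGER)")

raw = poll_big_data_store()          # huge and disposable in real life
s = summarize(raw)                   # small and precious
with conn:                           # transactional insert, backed up over time
    conn.execute("INSERT INTO summary VALUES (?, ?)",
                 (s["timestamp"], s["total_in_process"]))
```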
I will give you two examples of what I am talking about. The first concerns a major, multinational shipping corporation, which is a household name. I do not have permission to share the name of this customer, so I will not do so here. However, this is one of the largest and most well-established online shipping companies in the world.
One of the promises implicit in the business model of this company is promptness of delivery. If a customer ships a package overnight, the company guarantees that the package will be at its destination within that timeframe, and maintaining absolute compliance with that promise is one of the company's core values.
This creates a tension: in the case of a usage peak at a given location, the company can do one of two things:
- Maintain a staff, facility, and equipment infrastructure sufficient to handle the largest peak usage that could ever be imagined at that location, wasting large amounts of money in the process.
- Maintain a reasonable level of staff, facilities, and equipment consistent with typical usage levels, at the risk of failing to meet the delivery promise during periods of peak usage.
A good example of peak usage occurs when a major event like the Super Bowl happens. The 2011 Super Bowl drew over 100,000 people to Arlington, Texas, a city with a population of approximately 400,000, and our shipping company saw a major spike at its Arlington location. (A certain percentage of the population in any area uses the company's shipping facilities at any given time, so an increase in population produces an increase in shipping activity.) Those extra 100,000 people against a base of roughly 400,000 represent an increase of about 25% in shipping during Super Bowl weekend.
To avoid over-provisioning staff, facilities, and equipment while still maintaining its guarantee of prompt delivery, the company runs a Big Data application at its corporate headquarters. The application contains an object for every package in the company's system anywhere in the world, and each object carries a property showing the package's current physical location. The moment an airbill is scanned at a company facility, the corresponding object appears in the Big Data application at that location. Using this application, the company can see the number of packages in process at any location in the entire company.
As a result, within a matter of minutes, the company can detect a peak usage event. (There is a control center in the company headquarters where the state of the Big Data application is displayed on large screens, and personnel monitor this system continuously.) Once a peak usage event is detected, the control center personnel contact company facilities in the surrounding areas and have all of the spare drivers and sorters converge on the peak usage location. In our example, the folks at company headquarters would contact Irving, Carrollton, Dallas, Fort Worth, and so forth, and have all excess capacity shift to Arlington.
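The core idea behind that detection is simple enough to sketch in a few lines of Python. This is my own rough illustration, not the customer's actual system; the baseline counts and the 1.25 threshold (matching the roughly 25% Super Bowl spike) are assumptions for the example.

```python
# A rough sketch of peak detection: count in-process packages per location
# and flag any location whose live count runs well above its baseline.
from collections import Counter

# Hypothetical "normal" in-process package counts per facility.
BASELINE = {"Arlington": 4_000, "Dallas": 20_000, "Fort Worth": 9_000}

def detect_peaks(scanned_airbills, threshold=1.25):
    """scanned_airbills: iterable of (airbill_id, current_location) pairs.

    Returns locations whose live count exceeds baseline * threshold."""
    live_counts = Counter(location for _, location in scanned_airbills)
    return [loc for loc, count in live_counts.items()
            if count > BASELINE.get(loc, 0) * threshold]

# Example: Arlington suddenly holds 5,200 packages against a baseline of 4,000.
airbills = ([("AB%07d" % i, "Arlington") for i in range(5_200)] +
            [("DL%07d" % i, "Dallas") for i in range(19_000)])
print(detect_peaks(airbills))   # -> ['Arlington']
```

In the real system, the interesting part is of course the scale: keeping an object per package, worldwide, current within minutes, is exactly where the Big Data characteristics described above come into play.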
The effect of this Big Data application is as follows:
- A reasonable level of capacity can be maintained at any given location, in the process saving a lot of money.
- Excess capacity can quickly be shifted around the system so that peak usage can be handled without breaking the promise of prompt delivery.
As a result, the Big Data application at company headquarters has a huge ROI.
Another classic Big Data application is smart metering. One of my customers, a major energy utility on the East Coast of North America, is rolling out 30 million smart meters across its entire system. Again, I will not use the company's name.
Each smart meter takes a usage measurement once an hour. At 30 million smart meters, that's the following volume of data:
30,000,000 meters x 24 readings = 720,000,000 readings per day x 365 = 262,800,000,000 readings per year
The company uses this data to measure usage patterns over time across its system, which allows it to design its capacity more efficiently. The company can also help its customers conserve: peak users can be easily identified and contacted about reducing their peak usage.
The result is that the company is greener and more efficient. The savings from improved generation efficiency far exceed the cost of the smart meters.
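As a toy illustration of how peak users might be identified from the hourly stream (entirely my own sketch, not the utility's actual analytics, and with an arbitrary cutoff):

```python
# A toy sketch of pulling peak users out of the hourly readings: track each
# meter's single worst hour and flag anything above an (assumed) cutoff.
from collections import defaultdict

def find_peak_users(hourly_readings, cutoff_kwh=5.0):
    """hourly_readings: iterable of (meter_id, kwh) tuples.

    Returns meter ids whose highest hourly reading exceeds cutoff_kwh."""
    worst_hour = defaultdict(float)
    for meter_id, kwh in hourly_readings:
        worst_hour[meter_id] = max(worst_hour[meter_id], kwh)
    return sorted(m for m, peak in worst_hour.items() if peak > cutoff_kwh)

readings = [("meter-001", 2.0), ("meter-002", 7.5), ("meter-002", 3.0)]
print(find_peak_users(readings))   # -> ['meter-002']
```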
For billing purposes, the company only needs one measurement per month. That's a much more reasonable volume of data:
30,000,000 readings per month x 12 = 360,000,000 readings per year
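For the arithmetic-minded, the two volumes (and the gap between them) work out as follows; this is nothing more than the figures above restated in Python:

```python
# Back-of-envelope arithmetic for the two data streams described above.
METERS = 30_000_000

hourly_per_day  = METERS * 24             # 720,000,000 readings/day
hourly_per_year = hourly_per_day * 365    # 262,800,000,000 readings/year

billing_per_month = METERS                # one reading per meter per month
billing_per_year  = billing_per_month * 12  # 360,000,000 readings/year

print(f"hourly:  {hourly_per_year:,} readings/year")
print(f"billing: {billing_per_year:,} readings/year")
print(f"ratio:   {hourly_per_year // billing_per_year}x")   # 730x
```

The hourly stream is 730 times larger than the billing stream, which is precisely the gap the next two points describe.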
Note the difference between the data which is used for capacity design, conservation and so forth, and the data which is used for billing:
- The hourly data is a huge volume, far exceeding what is commonly managed by traditional database applications. The monthly data is easily within the range of a traditional database application.
- The hourly data is significant only in aggregate: any individual hourly measurement is of trivial importance. What matters is the overall texture or "shape" of the hourly data at any given time. ACID properties do not need to be applied to the hourly measurements; they can be summarized in a Big Data application, and the resulting summaries saved into a traditional database application (a sketch of this roll-up appears just after this list). The company would not even attempt to back up the enormous volume of hourly measurements, because doing so would be cost-prohibitive and of dubious value. The monthly measurements, on the other hand, are used for billing and are of core importance to the business. This is OLTP data, the lifeblood of the business: it must be transactionally consistent, comply with ACID, be backed up over time, and so forth.
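The roll-up itself is nothing exotic. Here is a minimal sketch of the idea, with field names and values that are purely illustrative; in practice the aggregation would run at Hadoop-style scale rather than in a single process:

```python
# A minimal sketch of the summarization step: hourly readings (no ACID, never
# backed up) are rolled up per meter, and only the roll-up would be written to
# the transactional billing/reporting database downstream.
from collections import defaultdict

def summarize_month(hourly_readings):
    """hourly_readings: iterable of (meter_id, kwh) tuples for one month.

    Returns total kWh per meter -- the small, precious data that actually
    needs transactional handling."""
    totals = defaultdict(float)
    for meter_id, kwh in hourly_readings:
        totals[meter_id] += kwh
    return dict(totals)

# Example: two meters, a handful of hourly readings.
readings = [("meter-001", 1.5), ("meter-001", 0.5), ("meter-002", 2.25)]
print(summarize_month(readings))   # -> {'meter-001': 2.0, 'meter-002': 2.25}
```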
These applications show the difference between traditional databases and Big Data, which I summarize as follows:
- Big Data involves data volumes and data flows that exceed the capacity of traditional database hardware and software.
- Big Data must be managed at a cost per gigabyte that is an order of magnitude or more lower than that of traditional database applications.
- Big Data is typically not backed up; rather, it is summarized into a traditional database, and those summaries represent a minute fraction of the data managed by the Big Data application.
- Big Data applications do not comply with traditional ACID properties. Any individual item of data is of minimal importance; what matters is the overall "state" of the data.
I think my employer, EMC, "gets it" with respect to the challenges represented by Big Data; certainly, the theme of EMC World this year reflects this. We face a huge challenge in coming up with products that address the technical issues posed by Big Data. Unfortunately, I see no sign that Oracle is similarly clued in with respect to Big Data. Oracle has few products focused on this area, and it remains to be seen whether Oracle will become a major factor here.