I read Jason Kotsaftis's recent blog post with interest, since I too am very biased. Rather than commenting on Jason's post (everyone knows comments very seldom get read), I thought I would wade in with my own take on this issue.

When I came to EMC, I was shocked at how Microsoft-centric we are. Not that Microsoft is an uninteresting technology space. Far from it. I simply maintain that Oracle is more interesting, and that Oracle has been woefully neglected by EMC in the past (although that is turning around rapidly due to the great work of folks in Jason's organization).

Looking at the two spaces: Microsoft and Oracle, which is more interesting for us? I maintain that Oracle is more interesting because of several factors:

  • First is the halo effect. I talk about this a lot, but for those who don't know me, here is the idea. Oracle is the most expensive piece of software ever written for general purpose use. (It is also the most complex.) To give you an idea of the cost of Oracle, I recently visited with a major Fortune 500 company and met with their DBA group. They shared the cost of their Oracle license with me. That cost for them is $22 per GB per month. They have petabytes of this stuff. Think about that for a minute. That is an astounding number. This company is launching an entire project to reduce the cost of the Oracle license, because it is such a huge component of their IT budget. The effect of that cost is simply this: The customer cares, and cares deeply, about the Oracle infrastructure. It is their crown jewel. They have paid dearly for it. In order to play in that space, you need to be the most enterprise-ready, robust, reliable, resilient technology in their entire environment. What does this mean for you? If you can store and manage the Oracle database data, you are by definition the best vendor in their datacenter. You are handling and helping manage the customer's most important, expensive, precious environment. Are you then good enough to store and manage the Microsoft data? I would say, virtually by definition, yes. The reverse does not hold: if you qualify yourself to store Microsoft data, you do not automatically qualify yourself to store Oracle data. On the other hand, if you have qualified yourself to store the Oracle database data, you are almost all of the way there in building credibility in the Microsoft space, especially with the higher level managers in the customer's organization who have visibility into both those technology spaces. This is the halo effect. It is very real. I have seen this work in exactly this manner many, many times in my career.
  • Of the two companies, Microsoft and Oracle, who is more aligned with EMC? Microsoft still makes the vast majority of their revenues off of the sale of consumer-oriented desktop software. Oracle is truly the enterprise software company. They dwarf Microsoft in that space. This is exactly the business we are in. If you look at the Symmetrix and compare it to Oracle, it is amazing how many similarities there are in terms of the market they address. We own this space and so do they.

 

Is Microsoft easier? Absolutely. Oracle is a tough company to work with. They have a killer technology and they know it. They want to own the market. They do not believe that they need us. They know we need them. Their technology is far more complex and difficult than Microsoft.

All of that is true. But people make money doing things that are hard. Doing a great job of addressing the Oracle market will yield huge rewards, far beyond the costs of doing so. I strongly support the effort of Jason, Jeff, and Vince in building the Oracle relationship, and I am in for the long haul in helping them to do so.

I have received an email comment concerning my previous post on this subject, which expresses a common misconception. This comment is as follows:

I wanted to clarify I understand your design in your latest blog post entitled "Blended FCP/NFS Oracle Solution".

In your design, you have the Celerra unit acting like a NetApp gFiler/V-series would formating a LUN on the CX-3 and then servicing NFS requests to the Oracle hosts in addition to the Oracle hosts connecting directly to the CX-3 for FCP operation. This design implementation required two products (Celerra and CLARiiON) to implement versus a NetApp filer that could serve both NFS and FCP LUNs from the same storage engine.

More to the core of the issue is the performance issues that you mention. The tradeoff is that in the NetApp environment you have a filesystem designed for NFS and then the tacked on LUNs implemented as a file instead of EMC LUN environment with a unit tacked on for NFS.  On the other side of the coin you have the disk space reservations needed for LUNs on the NetApp filers that you would not see with the LUNs on the CLARiiON, but you have increased filesystem usage on CLARiiON from the Celerra that you would not have with WAFL.

Let me know if I am on the right track.

On the first issue, the writer of this comment is both correct and incorrect. Yes, Celerra uses CLARiiON as the back end. No, you do not need to purchase an additional product. This is because a Celerra, in either the integrated or multi-protocol versions, includes the CLARiiON back end.

Take the Celerra NS40. This is a Celerra head, consisting of two data movers and a control station (think of these as being similar to the NetApp head), connected via a FCP network to a CLARiiON CX3-40 back end. The following graphic makes this clear.

Untitled2_2

What EMC NAS Engineering did was actually quite brilliant. They simply exposed the FCP ports of the back-end CLARiiON for connection to hosts. Again, this provides both NFS (and CIFS) access via the Celerra front-end and FCP access via the CLARiiON back-end. This is what we now call the Celerra NS Multi-Protocol series array.

The way that this is bundled is also nice. I do not do pricing, as that is not really my area within EMC. However, I have been present when pricing was presented to many customers. It turns out that the costs of a Celerra NS40 and a CLARiiON CX3-40 are basically the same. The effect of that is that the customer gets the additional functionality that the Celerra provides for free.

It's kind of like buying a Lexus for the price of a Toyota. The FCP access to the CLARiiON CX3-40 is identical, but the Celerra NS40 provides additional functionality at minimal additional cost. That's a good deal for the customer.

Hence the blended solution. You can take a Celerra NS40, which inherently includes the CLARiiON CX3-40. You use the Celerra for NFS access to Oracle objects that do not need high-performance, low-latency I/O. The rest you place directly onto the CLARiiON CX3-40.

In the process, by the way, you actually install and configure less software on the database host. This is because, as I pointed out in my last post, you must configure a shared storage layer for the CRS files, which cannot be managed by ASM. Typically, this would require OCFS2, or even raw devices. You get this for free with NFS, which is already installed for you.

The commenter is actually on the right track on the second point. I have covered this fairly thoroughly in other posts on this blog, but the issues with NetApp FCP access include:

  1. A LUN is actually a file in the WAFL file system. This can be easily proven by mounting the volume where a LUN resides via NFS or CIFS. You will find there a file of the same size and name as the LUN.
  2. WAFL has some very troubling aspects with respect to sequential read performance. See my previous post on this point.

With EMC CLARiiON, again, a LUN is a LUN, i.e. a storage object you configure directly onto a RAID group. No file system in between you and the disk, in other words.

Keep the comments coming!

The program that I manage just published a solution that I am pretty jazzed about. This is in conjunction with the EMC Celerra NS Multi-Protocol Series Array. This array allows for both traditional NAS (i.e. NFS and CIFS) access as well as FCP access to the CLARiiON CX-3 Series back end array.

Yes, I know. NetApp provides multi-protocol access already. However, there is a very big difference between the FCP access provided on a NetApp filer and on a CLARiiON CX-3 Series array. With NetApp, the FCP access is a band-aid solution: the LUN is really a special file running on top of the WAFL file system. With the CLARiiON CX-3, the LUN that you see over FCP is a real, live, good old-fashioned LUN sitting on a RAID group. No extra WAFL file system to muck up your read access, or otherwise complicate things.

In a word, it's simpler.

Having said that, NFS has its place. There are lots and lots of files which must be managed by an Oracle RAC database server which do not require the low-latency, high-performance access of FCP. Further, if you are using ASM (which we are), then many of these files cannot be stored in ASM. This means that you need a clustered file system alongside ASM.

Guess what? You already have one. Which is completely ubiquitous and automatically installed on every UNIX and UNIX-like operating system in the industry. It's called NFS.

And it works just fine for files like the CRS files, i.e. the voting disk and the OCR file. It also works beautifully for backups, flashback recovery files, and archived logs. These files absolutely do not need the high-performance, low-latency access of FCP. Why not free up your expensive SAN and use NFS over IP to manage these files?
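The split described above can be written down as a simple placement rule. The following is only a sketch: the file categories come from this post, but the names (`Transport`, `NFS_FILE_TYPES`, `place`) are made up for illustration, and "datafile" is my own example of something that falls into "the rest".

```python
from enum import Enum


class Transport(Enum):
    """The two access paths in the blended layout."""
    FCP = "FCP LUN on the CLARiiON back end"
    NFS = "NFS file system on the Celerra front end"


# File types the post names as not needing low-latency FCP access.
# The identifiers themselves are hypothetical labels, not Oracle names.
NFS_FILE_TYPES = {
    "voting_disk",
    "ocr",
    "backup",
    "flashback_recovery",
    "archived_log",
}


def place(file_type: str) -> Transport:
    """Return where a file type would live in the blended layout.

    Anything not listed above is assumed to need high-performance,
    low-latency I/O and therefore goes to FCP.
    """
    return Transport.NFS if file_type in NFS_FILE_TYPES else Transport.FCP


if __name__ == "__main__":
    for name in ("ocr", "archived_log", "datafile"):
        print(f"{name:>14} -> {place(name).value}")
```

The default-to-FCP choice mirrors the way the solution is described: you call out the files that can tolerate NFS, and everything else stays on the SAN.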

That's the idea behind the blended FCP / NFS solution. I have not done an exhaustive search, but as near as I can tell, no other storage vendor has done this yet. The blended solution looks like this:

Untitled_3

The blended solution can be found here. This will be the vehicle whereby our program showcases FCP solutions from EMC from now on. I hope you find it as innovative and interesting as I do.

In a recent post, I stated that I would circle back around and give you an update on Oracle 11g performance once we had tested the 64 bit version. I have done so at this point, and I must admit that I am rather disappointed.

To remind you of our approach to performance, we do not do any exotic tunings. We simply tune in a manner which a typical customer would find reasonable. No hidden leading underbar parameters, or that sort of thing. This is very, very different from the typical TPC-C you see posted on the TPC.org website.

On 10g, what we found was that tuning memory is easy. We set the parameters SGA_TARGET and SGA_MAX_SIZE and we are off to the races. Of course, we tune Huge Pages appropriately as well. For more information on that, see the excellent article on the Puschitz website.

We got a lot of improvement by tuning Huge Pages. Even as a storage company, which makes money by selling disks, we recognize the scalability improvements from adding memory. In general, memory is our number one scaling bottleneck. We never recommend that customers add disk before we have investigated memory first.
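To give a feel for the Huge Pages arithmetic, here is a minimal sketch. It assumes a 2 MiB huge page, which is the common default on x86-64 Linux (check `Hugepagesize` in `/proc/meminfo` on your own host), and a hypothetical 16 GiB SGA. Neither number comes from our test configuration.

```python
# Assumed huge page size: 2 MiB is typical on x86-64 Linux, but verify it
# against Hugepagesize in /proc/meminfo before relying on this.
HUGEPAGE_BYTES = 2 * 1024 * 1024


def hugepages_needed(sga_bytes: int, hugepage_bytes: int = HUGEPAGE_BYTES) -> int:
    """Smallest number of huge pages that fully covers the SGA."""
    # Integer ceiling division, so a partly used page still counts as a page.
    return -(-sga_bytes // hugepage_bytes)


if __name__ == "__main__":
    sga = 16 * 1024**3  # hypothetical 16 GiB SGA
    print(f"vm.nr_hugepages = {hugepages_needed(sga)}")
```

In practice you would leave some headroom above the computed value, since other shared memory segments on the host may also want huge pages.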

So, needless to say the issue of memory is very central to our methodology on tuning Oracle.

When the 32 bit version of Oracle 11g came out, as I stated previously, the memory model was the main reason that we were not able to scale as well as we did on the 64 bit version of 10g. Hence, I had high hopes that the 64 bit version of 11g would scale well. In fact, I hypothesized that we could transition our entire program to 11g.

Wrong. The memory model on the 64 bit version of 11g is no better than the 32 bit version. I find this very strange, and honestly consider it to be a bug. The bottom line is that we have not been able to scale the 64 bit version of 11g any better than we did the 32 bit version.

For this reason, again, we cannot really assess the performance impact of dNFS in a real-world performance workload. (Very contrived testing could be conducted to prove that dNFS saves CPU cost, but what would that prove? Others have done that testing and we have no interest in doing so.) Our focus is on a real-world workload running on a real-world configuration. In this situation, 11g over dNFS does not perform better than 10g over kernel NFS.

We have gone back to 10g Release 2 for now. I will come back to 11g later, perhaps when Release 2 is available (although I have been told I may wait a while for that). In the meantime, I will continue to update you on what we find out in our testing. Our current focus is on the use of VMware in the Oracle context. More on that later.

In one of my previous posts in this blog I covered the different types of snapshots and their effect on storage performance. Basically, snapshots fall into two categories:

  1. Metadata based snapshots which have no write penalty but extract a read penalty due to the sequential read after random write (SRaRW) issue.
  2. Reserve LUN Pool (RLP) based snapshots which have no read penalty but extract a write penalty due to the Copy on First Write (CoFW) issue.

It is noteworthy that NetApp is well known for its metadata based snapshots using the WAFL file system, and EMC is well known as being a leader in the RLP based snapshots space.
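The write penalty of the second category is easy to see in a toy model. This is a sketch of the general copy-on-first-write idea only, not of how any EMC array implements it. The class name and the I/O accounting (one read plus one write for the copy, one write for the new data) are my own simplifying assumptions.

```python
class CofwSnapshotLun:
    """Toy LUN with a single snapshot taken at construction time."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data
        self.reserve = {}            # block index -> original data (reserve pool)
        self.backend_ios = 0         # back-end I/Os caused by host writes

    def write(self, idx, data):
        if idx not in self.reserve:
            # First write to this block since the snapshot: read the old
            # block and save it to the reserve pool before overwriting it.
            self.reserve[idx] = self.blocks[idx]
            self.backend_ios += 2
        self.blocks[idx] = data
        self.backend_ios += 1

    def read_snapshot(self, idx):
        # Unchanged blocks are read from the live LUN, so the snapshot adds
        # no penalty to reads of the live data.
        return self.reserve.get(idx, self.blocks[idx])


if __name__ == "__main__":
    lun = CofwSnapshotLun(["a", "b"])
    lun.write(0, "x")   # first write: copy plus write
    lun.write(0, "y")   # second write to the same block: write only
    print(lun.backend_ios, lun.read_snapshot(0), lun.blocks[0])
```

Only the first write to each block pays the extra cost, which is why the penalty shows up most on workloads that touch many distinct blocks after a snapshot is taken.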

I have recently learned that EMC has now largely solved the CoFW performance penalty, using a new feature called Asynchronous Copy on First Write (ACoFW). This means that EMC snapshot technology is now ideal for the IT professional who wants great write performance combined with great read performance.

ACoFW works like this:

  1. The storage array marks a write track as "versioned write pending", and accepts it without a CoFW penalty to the host.
  2. When the storage array's processor proceeds to write this track to the disk, it notices the flag and performs the CoFW in the background before writing the new data to the disk.

This still creates extra disk I/O on the array, but for the vast majority of database applications, it also means that a snap or a clone does not enforce any write performance penalty.

My friends on the EMC SPEED site (for performance gurus only) have shown me the data: ACoFW dramatically reduces the write performance penalty for database applications using snapshots.

As usual, there is no free lunch. If the CoFW write activity on the array exceeds a threshold, the system will start using synchronous (i.e. old-fashioned) CoFW to better manage the pending writes. Thus, you need to manage the workload in such a way that there is excess capacity on the storage array to enable this performance benefit.
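Here is a toy model of the behavior described above, including the fallback. Treat it as a sketch under loud assumptions: I model the threshold as a simple limit on the number of pending writes, and all of the names (`AcofwArray`, `pending_limit`, `destage`) are hypothetical. The real array logic is certainly more involved.

```python
class AcofwArray:
    """Toy array with one snapshot and asynchronous copy on first write."""

    def __init__(self, blocks, pending_limit):
        self.disk = list(blocks)
        self.pending = {}        # idx -> new data flagged "versioned write pending"
        self.reserve = {}        # idx -> original data preserved for the snapshot
        self.pending_limit = pending_limit  # assumed form of the threshold
        self.sync_copies = 0     # copies the host had to wait for
        self.async_copies = 0    # copies done in the background

    def _copy_first_write(self, idx):
        """Preserve the original block once; return True if a copy was made."""
        if idx in self.reserve:
            return False
        self.reserve[idx] = self.disk[idx]
        return True

    def host_write(self, idx, data):
        if idx not in self.pending and len(self.pending) >= self.pending_limit:
            # Too much pending work: fall back to old-fashioned synchronous CoFW.
            if self._copy_first_write(idx):
                self.sync_copies += 1
            self.disk[idx] = data
            return "sync"
        # Accept the write immediately; the copy is deferred to destage time.
        self.pending[idx] = data
        return "async"

    def destage(self):
        """Background write to disk: do the deferred copy, then the write."""
        for idx, data in list(self.pending.items()):
            if self._copy_first_write(idx):
                self.async_copies += 1
            self.disk[idx] = data
            del self.pending[idx]


if __name__ == "__main__":
    arr = AcofwArray(["a", "b", "c"], pending_limit=2)
    print([arr.host_write(i, d) for i, d in enumerate("xyz")])
    arr.destage()
    print(arr.sync_copies, arr.async_copies, arr.disk)
```

The total number of copies is the same either way. What changes is whether the host waits for them, which is the whole point of the feature and also why the array needs spare capacity to absorb the deferred work.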

ACoFW is currently available on EMC Symmetrix arrays. I am investigating the status of this feature on other EMC arrays, and will report back on this blog as I learn more.

So I noticed that 11g shipped in 64 bit on the same day as I issued my last post. Rather embarrassing. Oh, well. Means I have lots of work to do. That's job security, though.

I will keep you posted on our progress in testing dNFS with EMC NAS. Should have something interesting by OOW, in a few weeks from now.

This post is an update on the 11g new feature called Direct NFS (dNFS). A lot of marketing buzz has been issued on this feature recently. I hope to add a bit of reality to the discussion.

Here is the theory behind dNFS. I/O on a database server occurs in a combination of user space and kernel space. Context switches between the two spaces are expensive in terms of CPU cost. If you can move a part of that activity from kernel space into user space, you can save CPU cost due to reduced context switching. One example is the difference between Linux kernel IP port bonding and the ability of dNFS to make use of multiple paths. Linux kernel bonding is expensive in terms of CPU cost and less efficient than dNFS, according to the dNFS white paper referenced below.

Further, dNFS avoids double-caching by skipping file system NFS client-side caching. The data is only cached in the database buffer cache, thus avoiding wasted memory and CPU cost in caching the data twice.

These benefits are thoroughly covered in the dNFS white paper on Oracle’s web site. This paper was co-authored by Kevin Closson, and it does a pretty good job of stating the potential performance benefits of dNFS.

Fundamentally, I agree with these potential benefits. Problem is, that is all they are right now: potential. This is true because in our testing Oracle RAC 10g Release 2 provides far better performance than Oracle RAC 11g Release 1 does, even considering that 10g is running over plain-vanilla kernel NFS (kNFS) and 11g is running over dNFS. The reason for this is simple: 11g is only available in 32 bit. 10g is running in full-blown 64 bit.

You can say that comparing 10g 64 bit to 11g 32 bit is apples-to-oranges, and I will not necessarily disagree with you. However, bear in mind that in all other respects, the configuration was identical.

Here are the numbers. On 10g using Release 2 for x86-64, we are getting 10,200 users with a bit over 500 TPS. (Again, bear in mind that these are TPC-C style transactions. Do not compare these to normal Oracle transactions.) On 11g using Release 1 for x86-32 on identical hardware, after an excruciating level of tuning, we ultimately maxed out at 7,300 users at around 380 TPS.
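To put those two results side by side, here is the arithmetic as a short sketch. The TPS figures are the rounded values quoted above ("a bit over 500" and "around 380"), so treat the ratios as approximate.

```python
# Figures quoted in the post; TPS values are rounded approximations.
RESULTS = {
    "10g R2 x86-64": {"users": 10_200, "tps": 500},
    "11g R1 x86-32": {"users": 7_300, "tps": 380},
}


def relative(metric: str) -> float:
    """11g result as a fraction of the 10g result for the given metric."""
    return RESULTS["11g R1 x86-32"][metric] / RESULTS["10g R2 x86-64"][metric]


if __name__ == "__main__":
    for metric in ("users", "tps"):
        print(f"{metric}: 11g reached {relative(metric):.1%} of the 10g result")
```

Either way you slice it, the 32 bit 11g configuration lands at roughly three quarters of what 10g delivered on the same hardware.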

Now could we get more on 11g R1 using dNFS than 11g R1 on kNFS? Maybe. I didn’t test it that thoroughly. Honestly, I don’t consider it particularly interesting. 10g is readily available and provides far more performance than 11g on dNFS.

So when does dNFS become interesting? Or stated in a larger context, when does 11g become interesting? From a performance standpoint (which is something near and dear to my heart), 11g becomes interesting when it ships in 64 bit. Until then, this is all simply theoretical. That will happen, from what I am told, sometime in Q4. Stay tuned on that.

Will we jump on 11g performance testing as soon as we have a 64 bit version with reasonable performance? Most definitely. Do I expect to see a performance benefit for dNFS at that point? Yes, I certainly do.

The other misconception which has arisen with respect to dNFS has to do with the press release issued by NetApp stating that the dNFS feature of 11g was developed collaboratively by NetApp and Oracle together. This press release is true but misleading.

It is certainly true that NetApp was directly involved in the development of dNFS. While I was not directly involved in this development, much of it occurred while I was at NetApp. I think I can even claim some minor amount of credit for the initial idea behind this feature. This occurred way back in the late 90s. There was a series of discussions among several members of the NetApp staff, myself included (most of whom are no longer there), around how you would optimize Oracle I/O to a NAS device. The outcome of those discussions was the idea of taking the NFS client and putting that inside the Oracle kernel. This is what dNFS effectively does.

So, yes, NetApp was certainly integrally involved in the development of the dNFS protocol. I do not question that at all.

The misleading aspect of the press release is the unspoken implication that dNFS will somehow work differently or better on NetApp NAS gear than the equipment manufactured by other vendors, including EMC. That implication is completely false.

Remember what we are talking about here: CPU cost on the database server. File system caching on the database server. All of this is about optimizing the use of resources on the database server. dNFS is an NFS version 3 implementation, as the Oracle white paper clearly states. As such dNFS will work in exactly the same manner, with identical performance benefits, on any NAS device from any vendor. Including Celerra by EMC.

I have just finished installing CRS (cluster ready services) for Oracle RAC 11g for the first time. If you would like to share my pain, take a look at this SR on metalink. As you can see, I meticulously followed the pre-installation instructions in the clusterware installation guide, including running the configuration checker runcluvfy. It ran perfectly. All tests passed. Green lights all the way.

Then I ran the Oracle Universal Installer, and it ran perfectly until it got to the point where you run the root.sh scripts.

At which point, those scripts hung, spewed out nasty little error messages and fell on the floor. Ouch!

Turns out this was related to a permissions mismatch. Last week was my week for this sort of thing. However, I have found in the past that issues installing CRS are depressingly common. Hence the need for a preinstallation checker. Which obviously does not check for everything you need to have a successful install.

Don’t get me wrong. I love the Oracle database product. It’s definitely the best database in the business as far as I am concerned. It is a mind boggling product. Absolutely the most reliable, recoverable, and high performing database on the planet when correctly configured, tuned and such.

But the cluster layer of RAC has many issues, not the least of which is installing it. One of those is cost. On the Enterprise Edition version of Oracle software, you get to pay a 50% up charge for the privilege of running this beast. Then there is the “RAC tax”. That’s the CPU cost of running CRS, which I am told by the folks at Hotsos is around ½ of a CPU for each node in the RAC. (Haven’t heard of Hotsos? You should. They are the best in the business as far as I am concerned in the area of Oracle database consulting, especially for performance tuning.)

Given all these issues, the question then is: Why run RAC? That is the question I am struggling with. The rest of Oracle, absent RAC, is fairly simple, and works extremely well. Yes, there are a few issues to installing and running Oracle Database 11g in single instance mode. But they are nowhere near as daunting as the issues of RAC.

RAC gives you two things:

  1. High availability. This is the heart and soul of RAC. You can encounter a failure of any component, and the overall database will stay up and accessible to users.
  2. Scalability. You can add nodes to the RAC, in order to scale up the solution. Actually, this is technically referred to as “scale out”, not “scale up”. Scale up is the concept of adding capacity to a single server. Frequently, this involves downtime or a forklift upgrade. The Sun Enterprise Series was an attempt to solve that problem while maintaining a single monolithic server solution. It was pretty much a failure. Loosely coupled clusters are the current conventional wisdom on the way to solve this problem. However, there are much more mature and easier to manage cluster solutions than RAC.

Looking at the first of these items, the first question you would ask is: Can you provide high availability to Oracle in another way? Given the economic and CPU costs, complexity, and difficulty of CRS, alternatives should be relatively attractive. Expect to see a series of posts from me in the next few weeks exploring alternatives to RAC/CRS in providing high availability. Not surprisingly, a few of these solutions may come from EMC.

On the scalability side, I am not sure that the level of scalability that RAC provides is really required for most customers. There are two data points here: Memory and CPU. On the memory side, there is no question you can scale very high on a RAC solution. The following table compares a four node RAC solution to a single node solution. The RAC solution uses our current powerhouse, the Dell PowerEdge 2900. The single instance solution uses the high-end Dell server, the PowerEdge 6950.

| Platform | Memory | CPU |
| --- | --- | --- |
| Dell PE2900 | 48 GB (192 GB for a 4 node RAC) | 2 × 2.66 GHz Quad-Core Intel Xeon processors (32 effective processors in a 4 node RAC) |
| Dell PE6950 | 64 GB | 4 × 3.0 GHz Quad-Core AMD Opteron processors (16 effective processors) |

As you can see, the four-node RAC does have three times the memory and twice the CPU of the single instance solution. However, the following table, taken directly from the Oracle price list, points out the price difference:

| Oracle Enterprise Edition | RAC Up Charge | Standard Edition |
| --- | --- | --- |
| $40,000 per CPU | $20,000 per CPU | $15,000 per CPU |

Given the price difference, a single instance solution which works even a third as well for a third the price is a fairly good deal. That would assume that RAC did not cost you more in terms of CPU, memory and such. Which it does.
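Here is where the "a third the price" figure comes from, as a rough sketch. I am assuming that every "effective processor" in the hardware table is licensed at the list price per CPU, and I am ignoring any discounts or licensing rules not mentioned in this post, so the totals are illustrative only.

```python
EE_PER_CPU = 40_000    # Oracle Enterprise Edition, list price per CPU
RAC_PER_CPU = 20_000   # RAC up charge, list price per CPU


def license_cost(cpus: int, rac: bool) -> int:
    """List-price license cost, assuming every effective processor is licensed."""
    per_cpu = EE_PER_CPU + (RAC_PER_CPU if rac else 0)
    return cpus * per_cpu


# Effective processor counts taken from the hardware comparison above.
rac_cost = license_cost(32, rac=True)       # four-node RAC on Dell PE2900
single_cost = license_cost(16, rac=False)   # single instance on Dell PE6950

if __name__ == "__main__":
    print(f"4-node RAC:      ${rac_cost:,}")
    print(f"Single instance: ${single_cost:,}")
    print(f"RAC costs {rac_cost / single_cost:.1f}x as much")
```

Twice the processors at one and a half times the per-CPU price is what produces the threefold difference.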

I can also tell you this: Very few folks need the throughput of even a four-node RAC. In our testing, a four-node RAC is maxing out at the high side of 10,000 users and over 500 transactions per second. That’s a lot of throughput. Very, very few customers need that much. Or even a third of that much. So the question is: How much scalability do you need? And is RAC overkill for that level of scalability, given its cost and other issues? In other words, if you can provide high availability and enough scalability for your application without RAC, do you really need it?

What will interest me greatly is how much we will get in terms of throughput on a single Dell PE6950 as opposed to the four-node RAC solution running on the Dell PE2900. I suspect that the price-performance of the Dell PE6950 solution using single instance Oracle is going to be better.

Stay tuned.

In my last post, I explained why I think thin provisioning is fairly useless for most Oracle database environments. I have been taking a bit of heat for this post, mostly from my good friend Barry Burke (AKA The Storage Anarchist). Given the respect and affection I have for Barry, I thought it might be worthwhile to put up a clarifying post, pointing out some of the areas where Barry believes thin provisioning is a really, really good thing for many usage cases he commonly sees. I think he makes some reasonably good points.

Bear in mind that I work in the Commercial program within EMC. That’s the set of customers who have revenues of between $25 million and $1 billion. These are most definitely low-end to mid-range customers. Barry works for the Symmetrix group, which largely addresses what we call the Enterprise customers (revenues of greater than $1 billion). This is a vastly different type of customer from mine. Admittedly, in my defense, the Commercial space accounts for the vast majority of business activity and technology spending in this country. So I have a bit of a bias here. However, EMC has traditionally served the Enterprise customer base more, and thus our customers are clustered in that direction. And there is also no question that Barry’s customers are the more well-heeled, household name type customers. In other words, most of the global financial services, telecommunications, insurance and manufacturing companies you probably know and love.

So let’s examine the usage cases for the two sets of customers, taking the Commercial space first since that is what I am most familiar with. These customers have some of the following characteristics:

  1. Small IT organizations with minimal politics and bureaucracy
  2. Limited budget
  3. Smaller allocations of storage due to cost constraints
  4. Granularity of the typical database space allocation is a significant percentage of the total space on the array 

Take an example of an application where the DBA has requested 100 GB of space, and has stated he or she will need 500 GB of space eventually. The storage administrator in a thin provisioning context would tend to allocate space as shown in the following graphic:

Thinprovisioning21small_2

In this graphic, you see that the storage administrator has given the DBA a thinly provisioned LUN of 500 GB in size. On an array with a few TB, that’s a significant amount of storage. I see that kind of allocation all the time. The DBA has created a database on the LUN with 100 GB of datafiles, again not an atypical size. The thinly provisioned LUN is backed up by 150 GB of space.

On the next occasion where the DBA needs additional storage, he or she would tend to do something like the following:

Thinprovisioning23small

Here, the DBA has allocated an additional 100 GB of space, probably by creating one or more datafiles. This pushes the thinly provisioned LUN over the edge. The storage administrator must now add storage to the LUN. Further, the DBA’s request to create files has been denied, much to his or her chagrin. After all, the DBA thinks he or she has 500 GB of space.

This is the dark side of thin provisioning. Like I said, thin provisioning is lying. Sometimes you get caught. The times you get caught are the times when the granularity of the additional space required by the user (in this case the DBA) is a very significant fraction of the total space in the LUN. Since Commercial DBAs tend to do that, that makes thin provisioning a risky proposition at best, and of questionable value, since the DBA will end up interfacing with the storage administrator in this case anyway. No saving in terms of administration here.
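The scenario in the two graphics above can be replayed in a few lines. This is a sketch using the numbers from the example (500 GB advertised, 150 GB of backing storage, 100 GB increments). The `ThinLun` class and its methods are invented for illustration and do not correspond to any array's interface.

```python
class ThinLun:
    """Toy thin LUN: advertises more space than is physically behind it."""

    def __init__(self, advertised_gb: int, backing_gb: int):
        self.advertised_gb = advertised_gb  # what the DBA thinks he or she has
        self.backing_gb = backing_gb        # what physically exists
        self.used_gb = 0

    def allocate(self, gb: int) -> bool:
        """Try to create `gb` of datafiles; return False if the request is denied."""
        if self.used_gb + gb > self.advertised_gb:
            raise ValueError("request exceeds even the advertised size")
        if self.used_gb + gb > self.backing_gb:
            return False  # caught: the physical space is not there yet
        self.used_gb += gb
        return True

    def grow_backing(self, gb: int) -> None:
        """Storage administrator adds physical space behind the LUN."""
        self.backing_gb += gb


if __name__ == "__main__":
    lun = ThinLun(advertised_gb=500, backing_gb=150)
    print(lun.allocate(100))   # initial database fits
    print(lun.allocate(100))   # second request is denied at 200 GB > 150 GB
    lun.grow_backing(100)      # the storage administrator steps in
    print(lun.allocate(100))   # now it succeeds
```

Note that the DBA never comes close to the 500 GB he or she was promised before the first denial. The denial is triggered by the backing storage, which the DBA cannot see.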

Now let’s look at a usage case that Barry pointed out to me where thin provisioning works very well for a large Enterprise customer. These customers have the following characteristics:

  1. Large IT organizations with lots of politics and bureaucracy
  2. Larger budgets with less stringent cost constraints
  3. Larger amounts of storage available on larger arrays
  4. Granularity of the space demanded by the DBA for a typical Oracle database space allocation (as in creating a new datafile) is a small percentage of the total space on the array

Point 4 is the key. In my first post, I said that thin provisioning works very well for file systems with unstructured data where the size of the marginal file is a small percentage of the array. In the case of an Enterprise class customer, the array is so huge that even a database allocation looks like a file on a file system. Let’s look at one such scenario, as illustrated by the following graphic:

Thinprovisioning22small_2

In this scenario, we have many, potentially hundreds, of apps being stored on the array. Each app has far more provisioned space than utilized space. They are being stored on a fairly huge pool of physical storage as well. These may be testing or development databases, where careful configuration of the physical storage is less important. Further, the marginal granularity of each additional database space allocation is tiny compared to the total magnitude of the storage available. In this case, thin provisioning actually works very well, and provides some important benefits. In large organizations where the lead time to get additional space allocated is very long, it will save the DBA lots of time and headaches. Further, where there are a lot of political issues and the storage and database groups do not like or trust each other, this will significantly smooth the way for the DBA in getting storage for his or her database. Another benefit is free space pooling. While the apps each grow, they grow at different, somewhat unpredictable, rates. Giving them each a big thin provisioned LUN allows them to grow somewhat unfettered. As they grow physically, they are able to share the free space. This improves utilization significantly. This is the real main thrust of thin provisioning.

Thus, the critical issue is the size of the average additional space allocation from the DBA compared with the size of the array. If this is a small percentage, then thin provisioning works well, and provides important benefits. This tends to be true with larger customers who have larger arrays. In smaller customers, thin provisioning may be of less benefit.
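One way to make the granularity argument concrete is to ask how many typical requests the free physical space can absorb before the storage administrator has to act. The Commercial numbers below come from the earlier example (150 GB of backing storage, 100 GB used, 100 GB requests). The Enterprise pool size is a purely hypothetical figure chosen for illustration.

```python
def requests_absorbed(free_physical_gb: int, request_gb: int) -> int:
    """Whole requests of `request_gb` that fit in the free physical space."""
    return free_physical_gb // request_gb


# Commercial example from this post: 150 GB backing, 100 GB already used.
commercial = requests_absorbed(free_physical_gb=150 - 100, request_gb=100)

# Hypothetical Enterprise pool with 20 TB of shared free physical space.
enterprise = requests_absorbed(free_physical_gb=20_000, request_gb=100)

if __name__ == "__main__":
    print(f"Commercial: {commercial} requests before someone must add storage")
    print(f"Enterprise: {enterprise} requests before someone must add storage")
```

When the answer is zero, the very next datafile gets you caught. When it is in the hundreds and shared across many apps, the storage administrator has plenty of time to add disks before anyone notices.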

In this post, I will discuss the concept of thin provisioning and database storage. There has been a lot of confusion on this issue recently, and I think that we need to clear the air and get some things straight.

First I would like to explain why thin provisioning is pretty much useless for Oracle datafiles right now. Then I will turn to how thin provisioning could be integrated into Oracle at which point it would be very interesting indeed.

OK, to start, I should define what I mean by thin provisioning. Assume you are a storage administrator, and you have a bunch of customers who need to store tons and tons of data. Many of you folks reading this post are probably in that category. What consumers of data on networked storage devices tend to do is the following:

  1. Figure out how much space they need to store their data.
  2. Double that because they know the data will grow over time.
  3. Double it again for ducks.

The effect is that storage consumers tend to be very liberal in their demands for storage. This becomes a problem for the storage administrator because costs balloon and utilization on the array is very low. In other words, a formula for aggressively and stupidly wasting money. The following graphic illustrates this:

[Graphic: Thin1small]
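The arithmetic behind that habit is simple enough to sketch. The 250 GB starting figure below is an assumption picked for illustration, and the two doubling factors just mirror the list above.

```python
def requested_gb(needed_gb, growth_factor=2, safety_factor=2):
    """Space a consumer asks for: real need, doubled for growth, doubled again."""
    return needed_gb * growth_factor * safety_factor


def utilization(needed_gb, provisioned_gb):
    """Fraction of the provisioned space that actually holds data."""
    return needed_gb / provisioned_gb


need = 250                 # hypothetical real requirement, in GB
ask = requested_gb(need)   # what lands on the storage administrator's desk
print(ask, utilization(need, ask))  # prints: 1000 0.25
```

Two doublings mean the array is at most 25% utilized on the day the storage is handed over.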

What you would like to do is the following:

  1. Give the customer a block of storage which is the size they are demanding.
  2. Under the covers, allocate a more realistic amount of storage, closer to what you think they really need right away.
  3. Monitor the storage, and add physical space to the device when you need it, in order to meet the demands for what your customer really needs.

The following graphic shows how this would work:

[Graphic: Thin2small]

Note that the light grey, plus the yellow, represents the total space on the device. The over-provisioned space is space which the array tells the customer he or she has, but which does not exist physically. It is then the storage administrator’s responsibility to make sure that physical space (i.e. disks) is added to the array in a timely manner in order to meet the customer’s expectations for space.
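The three steps above can be sketched as a toy model. This is not how any particular array implements thin provisioning; the `ThinPool` class, its method names, and the 80% alert threshold are all hypothetical, chosen only to show the bookkeeping.

```python
class ThinPool:
    """Toy thin pool: LUNs advertise a virtual size, but physical space
    is consumed only when data is actually written."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.virtual_gb = {}  # LUN name -> size reported to the host
        self.used_gb = {}     # LUN name -> physical space consumed so far

    def create_lun(self, name, virtual_gb):
        # Step 1: hand over the size the customer demanded.
        # Nothing physical is reserved here.
        self.virtual_gb[name] = virtual_gb
        self.used_gb[name] = 0

    def write(self, name, gb):
        # Step 2: physical space is allocated only as data arrives.
        if self.used_gb[name] + gb > self.virtual_gb[name]:
            raise ValueError("write exceeds the advertised LUN size")
        if self.total_used() + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical space")
        self.used_gb[name] += gb

    def total_used(self):
        return sum(self.used_gb.values())

    def oversubscription(self):
        """Advertised space divided by real space; above 1.0 means over-provisioned."""
        return sum(self.virtual_gb.values()) / self.physical_gb

    def needs_more_disk(self, threshold=0.8):
        # Step 3: the administrator watches this and adds disks in time.
        return self.total_used() / self.physical_gb >= threshold

    def add_disk(self, gb):
        self.physical_gb += gb


pool = ThinPool(physical_gb=1000)
for name in ("app_a", "app_b", "app_c"):
    pool.create_lun(name, virtual_gb=1000)

pool.write("app_a", 300)
pool.write("app_b", 300)
pool.write("app_c", 250)

print(pool.oversubscription())  # 3 TB promised on 1 TB of disk
print(pool.needs_more_disk())   # 850 GB used out of 1000 GB
pool.add_disk(1000)
print(pool.needs_more_disk())
```

The `RuntimeError` branch is the failure the storage administrator is on the hook to prevent: the host believes it has space, and the array has none left to give.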

If this sounds like lying, you’re right. It certainly is. It’s a shell game. A very useful and cost effective shell game for many folks, though, given storage consumers’ propensity to ask for a lot of space they don’t need.

One assumption that is required in order for this to work at all, however, is the ability of the array to lie to the host operating system about the size of a file system or LUN which is served up to the host. For a file system served over NFS or CIFS from a NAS box, this works well, as long as the data being stored is unstructured. Any given file is of a certain size, and that file’s space must be available for it to be stored. But the entire file system can certainly lie to the host and say its capacity is a terabyte, when it is really only 100 GB. No problem there, as long as the physical size is adequate to store the amount of data represented by the files actually in the file system.

The issue comes when you try to store structured data like Oracle. A DBA on an Oracle database will request the amount of space he or she thinks the database will need, with the same propensity for over-provisioning as any other consumer of data. The difference, though, is that the DBA will actually create datafiles which fill that space. Oracle then makes a physical file of that size on the device, creates extents in this file, and writes zeros to it.
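The contrast between the two kinds of data can be sketched with another toy model. It assumes the array backs a block with physical storage the first time anything is written to it, zeros included. The `ThinLun` class and both workload functions are hypothetical stand-ins, not Oracle or array code.

```python
class ThinLun:
    """Toy thin LUN: a block gets physical backing the first time it is written."""

    def __init__(self, virtual_blocks):
        self.virtual_blocks = virtual_blocks
        self.backed = set()  # block numbers that now have physical storage

    def write_block(self, n):
        if not 0 <= n < self.virtual_blocks:
            raise IndexError("block is outside the advertised LUN size")
        self.backed.add(n)

    def physical_blocks(self):
        return len(self.backed)


def store_files(lun, file_sizes_in_blocks):
    """Unstructured data: only blocks that hold file contents are written."""
    next_block = 0
    for size in file_sizes_in_blocks:
        for n in range(next_block, next_block + size):
            lun.write_block(n)
        next_block += size


def create_datafile(lun, datafile_blocks):
    """Database-style datafile: every block is written at creation time,
    before a single row exists."""
    for n in range(datafile_blocks):
        lun.write_block(n)


nas = ThinLun(virtual_blocks=1000)
store_files(nas, [40, 60])      # two files, 100 blocks of real data

db = ThinLun(virtual_blocks=1000)
create_datafile(db, 1000)       # the DBA sizes the datafile to fill the LUN

print(nas.physical_blocks(), db.physical_blocks())  # prints: 100 1000
```

Both LUNs advertise the same size, but the datafile forces the array to back every block up front, so the over-provisioned space it was counting on never materializes.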

There is a concept in Oracle of auto extension of files. This concept seems like it would align well with thin provisioning. And it would if DBAs used it. Problem is, extending a file is a very expensive operation. Again, because Oracle likes to lock down that file and zero it out completely. DBAs hate that. A huge performance hit kicking the database in the teeth at any unpredictable time. Simply because the datafile ran out of space. Not good. Not good at all.

DBAs avoid auto extending files like the plague for this reason. They will allocate the space they need, always, when they create the database. And future expansions in space will be made intelligently, carefully, and methodically. That’s the way DBAs think. Believe me, I know. I am one of them.

This makes thin provisioning completely useless nonsense for Oracle data. Anyone who tells you otherwise should be viewed with deep suspicion. I say this without any bias whatsoever, since my employer sells arrays that provide thin provisioning too. I am simply telling you the way it is here.

Now, how could thin provisioning be made to work with Oracle? That’s a very interesting question. It would require integration between the storage array and the Oracle kernel. Then Oracle could avoid zeroing out a file, and simply allow the array to provide the storage Oracle needs to store the blocks presently in the file. This would probably occur within ASM. (ASM stands for Automatic Storage Management, Oracle’s storage layer.) There is some discussion of that type of integration between storage and Oracle, but do not expect to see it anytime soon, if ever.