Oracle

I read Jason Kotsaftis's recent blog post with interest, since I too am very biased. Rather than commenting on Jason's post (everyone knows comments very seldom get read), I thought I would wade in with my own take on this issue.

When I came to EMC, I was shocked at how Microsoft-centric we are. Not that Microsoft is an uninteresting technology space. Far from it. I simply maintain that Oracle is more interesting, and that Oracle has been woefully neglected by EMC in the past (although that is turning around rapidly due to the great work of folks in Jason's organization).

Looking at the two spaces: Microsoft and Oracle, which is more interesting for us? I maintain that Oracle is more interesting because of several factors:

  • First is the halo effect. I talk about this a lot, but for those who don't know me, here is the idea. Oracle is the most expensive piece of software ever written for general purpose use. (It is also the most complex.) To give you an idea of the cost of Oracle, I recently visited a major Fortune 500 company and met with their DBA group. They shared the cost of their Oracle license with me. That cost for them is $22 per GB per month. They have petabytes of this stuff. Think about that for a minute. That is an astounding number. This company is launching an entire project to reduce the cost of the Oracle license, because it is such a huge component of their IT budget. The effect of that cost is simply this: The customer cares, and cares deeply, about the Oracle infrastructure. It is their crown jewel. They have paid dearly for it. In order to play in that space, you need to be the most enterprise-ready, robust, reliable, resilient technology in their entire environment. What does this mean for you? If you can store and manage the Oracle database data, you are by definition the best vendor in their datacenter. You are handling and helping manage the customer's most important, expensive, precious environment. Are you then good enough to store and manage the Microsoft data? I would say, virtually by definition, yes. The reverse does not hold: if you qualify yourself to store Microsoft data, you do not automatically qualify yourself to store Oracle data. On the other hand, if you have qualified yourself to store the Oracle database data, you are almost all of the way there in building credibility in the Microsoft space, especially with the higher level managers in the customer's organization who have visibility into both those technology spaces. This is the halo effect. It is very real. I have seen this work in exactly this manner many, many times in my career.
  • Of the two companies, Microsoft and Oracle, which is more aligned with EMC? Microsoft still makes the vast majority of its revenue from the sale of consumer-oriented desktop software. Oracle is truly the enterprise software company. They dwarf Microsoft in that space. This is exactly the business we are in. If you look at the Symmetrix and compare it to Oracle, it is amazing how many similarities there are in terms of the market they address. We own this space and so do they.

 

Is Microsoft easier? Absolutely. Oracle is a tough company to work with. They have a killer technology and they know it. They want to own the market. They do not believe that they need us. They know we need them. Their technology is far more complex and difficult than Microsoft's.

All of that is true. But people make money doing things that are hard. Doing a great job of addressing the Oracle market will yield huge rewards, far beyond the costs of doing so. I strongly support the effort of Jason, Jeff, and Vince in building the Oracle relationship, and I am in for the long haul in helping them to do so.

I have received an email comment concerning my previous post on this subject, which expresses a common misconception. This comment is as follows:

I wanted to clarify I understand your design in your latest blog post entitled "Blended FCP/NFS Oracle Solution".

In your design, you have the Celerra unit acting like a NetApp gFiler/V-series would, formatting a LUN on the CX-3 and then servicing NFS requests to the Oracle hosts in addition to the Oracle hosts connecting directly to the CX-3 for FCP operation. This design implementation required two products (Celerra and CLARiiON) to implement versus a NetApp filer that could serve both NFS and FCP LUNs from the same storage engine.

More to the core of the issue is the performance issues that you mention. The tradeoff is that in the NetApp environment you have a filesystem designed for NFS and then the tacked on LUNs implemented as a file instead of EMC LUN environment with a unit tacked on for NFS.  On the other side of the coin you have the disk space reservations needed for LUNs on the NetApp filers that you would not see with the LUNs on the CLARiiON, but you have increased filesystem usage on CLARiiON from the Celerra that you would not have with WAFL.

Let me know if I am on the right track.

On the first issue, the writer of this comment is both correct and incorrect. Yes, Celerra uses CLARiiON as the back end. No, you do not need to purchase an additional product. This is because a Celerra, in either the integrated or multi-protocol versions, includes the CLARiiON back end.

Take the Celerra NS40. This is a Celerra head, consisting of two data movers and a control station (think of these as being similar to the NetApp head), connected via a FCP network to a CLARiiON CX3-40 back end. The following graphic makes this clear.

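[Graphic: Celerra NS40 head (two data movers and a control station) connected via FCP to a CLARiiON CX3-40 back end]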

What EMC NAS Engineering did was actually quite brilliant. They simply exposed the FCP ports of the back-end CLARiiON for connection to hosts. Again, this provides both NFS (and CIFS) access via the Celerra front-end and FCP access via the CLARiiON back-end. This is what we now call the Celerra NS Multi-Protocol series array.

The way that this is bundled is also nice. I do not do pricing, as that is not really my area within EMC. However, I have been present when pricing was presented to many customers. It turns out that the cost of a Celerra NS40 and a CLARiiON CX3-40 are basically the same. The effect of that is that the customer gets the additional functionality that the Celerra provides for free.

It's kind of like buying a Lexus for the price of a Toyota. The FCP access to the CLARiiON CX3-40 is identical, but the Celerra NS40 provides additional functionality at minimal additional cost. That's a good deal for the customer.

Hence the blended solution. You can take a Celerra NS40, which inherently includes the CLARiiON CX3-40. You use the Celerra for NFS access to Oracle objects that do not need high-performance, low-latency I/O. The rest you place directly onto the CLARiiON CX3-40.

In the process, by the way, you actually install and configure less software on the database host. This is because, as I pointed out in my last post, you must configure a shared storage layer for the CRS files, which cannot be managed by ASM. Typically, this would require OCFS2, or even raw devices. You get this for free with NFS, which is already installed for you.

The commenter is actually on the right track on the second point. I have covered this fairly thoroughly in other posts on this blog, but the issues with NetApp FCP access include:

  1. A LUN is actually a file in the WAFL file system. This can be easily proven by mounting the volume where a LUN resides via NFS or CIFS. You will find there a file of the same size and name as the LUN.
  2. WAFL has some very troubling aspects with respect to sequential read performance. See my previous post on this point.

With EMC CLARiiON, again, a LUN is a LUN, i.e. a storage object you configure directly onto a RAID group. No file system in between you and the disk, in other words.

Keep the comments coming!

In a recent post, I stated that I would circle back around and give you an update on Oracle 11g performance once we had tested the 64 bit version. I have done so at this point, and I must admit that I am rather disappointed.

To remind you of our approach to performance, we do not do any exotic tunings. We simply tune in a manner which a typical customer would find reasonable. No hidden leading underbar parameters, or that sort of thing. This is very, very different from the typical TPC-C you see posted on the TPC.org website.

On 10g, what we found was that tuning memory is easy. We set the parameters SGA_TARGET and SGA_MAX_SIZE and we are off to the races. Of course, we tune Huge Pages appropriately as well. For more information on that, see the excellent article on the Puschitz website.
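As a rough illustration of the Huge Pages sizing involved, here is a minimal Python sketch. It assumes the common 2 MiB huge page size on x86-64 Linux (check Hugepagesize in /proc/meminfo on your own system) and a hypothetical 16 GiB SGA. The function name is mine, and the result is only a starting point for the vm.nr_hugepages kernel setting, not a tuned value.

```python
def hugepages_needed(sga_bytes, hugepage_bytes=2 * 1024 * 1024):
    """Smallest number of huge pages that covers an SGA of sga_bytes."""
    # Integer ceiling division, so a partially used page still counts as one.
    return -(-sga_bytes // hugepage_bytes)


sga = 16 * 1024**3  # hypothetical 16 GiB SGA, not a figure from our testing
print(hugepages_needed(sga))
```

In practice you would leave some headroom above this number, since anything that does not fit in Huge Pages falls back to regular pages.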

We got a lot of improvement by tuning Huge Pages. Even as a storage company, which makes money by selling disks, we recognize the scalability improvements from adding memory. In general, memory is our number one scaling bottleneck. We never recommend that customers add disk before we have investigated memory.

So, needless to say the issue of memory is very central to our methodology on tuning Oracle.

When the 32 bit version of Oracle 11g came out, as I stated previously, the memory model was the main reason that we were not able to scale as well as we did on the 64 bit version of 10g. Hence, I had high hopes that the 64 bit version of 11g would scale well. In fact, I hypothesized that we could transition our entire program to 11g.

Wrong. The memory model on the 64 bit version of 11g is no better than the 32 bit version. I find this very strange, and honestly consider it to be a bug. The bottom line is that we have not been able to scale the 64 bit version of 11g any better than we did the 32 bit version.

For this reason, again, we cannot really assess the performance impact of dNFS in a real-world performance workload. (Very contrived testing could be conducted to prove that dNFS saves CPU cost, but what would that prove? Others have done that testing and we have no interest in doing so.) Our focus is on a real-world workload running on a real-world configuration. In this situation, 11g over dNFS does not perform better than 10g over kernel NFS.

We have gone back to 10g Release 2 for now. I will come back to 11g later, perhaps when Release 2 is available (although I have been told I may wait a while for that). In the meantime, I will continue to update you on what we find out in our testing. Our current focus is on the use of VMware in the Oracle context. More on that later.

In one of my previous posts in this blog I covered the different types of snapshots and their effect on storage performance. Basically, snapshots fall into two categories:

  1. Metadata based snapshots which have no write penalty but exact a read penalty due to the sequential read after random write (SRaRW) issue.
  2. Reserve LUN Pool (RLP) based snapshots which have no read penalty but exact a write penalty due to the Copy on First Write (CoFW) issue.
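The two penalties can be made concrete with a toy Python model. This is a sketch under simplifying assumptions of my own: a volume is just a list of block numbers, a CoFW copy costs one extra read plus one extra write, and a metadata based snapshot redirects every overwrite to the next free physical block. It is not how any particular array is implemented.

```python
def cofw_write_ios(writes):
    """Count back-end I/Os for host writes while a CoFW snapshot is active."""
    already_copied = set()
    ios = 0
    for block in writes:
        if block not in already_copied:
            ios += 2  # read the old data, write it to the reserve pool
            already_copied.add(block)
        ios += 1      # the host write itself
    return ios


def redirected_layout(n_blocks, writes):
    """Logical-to-physical map after redirecting each overwrite to new space."""
    mapping = list(range(n_blocks))
    next_free = n_blocks
    for block in writes:
        mapping[block] = next_free
        next_free += 1
    return mapping


def discontiguities(mapping):
    """Number of seeks a sequential logical read would incur."""
    return sum(1 for a, b in zip(mapping, mapping[1:]) if b != a + 1)


random_writes = [5, 2, 7, 2]
print(cofw_write_ios(random_writes))                         # extra I/O at write time
print(discontiguities(redirected_layout(8, random_writes)))  # extra seeks at read time
```

The first number shows the write-side cost of CoFW (ten back-end I/Os for four host writes in this example), while the second shows how random writes leave a redirected volume physically scattered for later sequential reads.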

It is noteworthy that NetApp is well known for its metadata based snapshots using the WAFL file system, and EMC is well known as being a leader in the RLP based snapshots space.

I have recently learned that EMC has now largely solved the CoFW performance penalty, using a new feature called Asynchronous Copy on First Write (ACoFW). This means that EMC snapshot technology is now ideal for the IT professional who wants great write performance combined with great read performance.

ACoFW works like this:

  1. The storage array marks a write track as "versioned write pending", and accepts it without a CoFW penalty to the host.
  2. When the storage array's processor proceeds to write this track to the disk, it notices the flag and performs the CoFW in the background before writing the new data to the disk.

This still creates extra disk I/O on the array, but for the vast majority of database applications, it also means that a snap or a clone does not enforce any write performance penalty.

My friends on the EMC SPEED site (for performance gurus only) have shown me the data: ACoFW dramatically reduces the write performance penalty for database applications using snapshots.

As usual, there is no free lunch. If the CoFW write activity on the array exceeds a threshold, the system will start using synchronous (i.e. old-fashioned) CoFW to better manage the pending writes. Thus, you need to manage the workload in such a way that there is excess capacity on the storage array to enable this performance benefit.
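To make the mechanism and its fallback easier to picture, here is a toy Python model. Everything in it is an assumption for illustration: the class name, the async_limit threshold, and the idea that the threshold is a simple count of pending tracks. The real array logic is certainly more involved.

```python
class ToyArray:
    """Toy model of ACoFW with a fallback to synchronous CoFW."""

    def __init__(self, snapshot_blocks, async_limit):
        self.unsaved = set(snapshot_blocks)  # old data not yet copied aside
        self.pending = set()                 # flagged "versioned write pending"
        self.async_limit = async_limit       # hypothetical threshold
        self.background_copies = 0

    def host_write(self, block):
        """Return how the write was handled from the host's point of view."""
        if block not in self.unsaved:
            return "plain"   # nothing to preserve, ordinary write
        if block in self.pending:
            return "async"   # already flagged, no further work needed
        if len(self.pending) < self.async_limit:
            self.pending.add(block)
            return "async"   # accepted immediately, copy deferred
        # Too much pending work: copy the old data now and make the host wait.
        self.unsaved.discard(block)
        return "sync"

    def destage(self):
        """Background step: copy old data for flagged tracks, then write."""
        copied = len(self.pending)
        self.background_copies += copied
        self.unsaved -= self.pending
        self.pending.clear()
        return copied


array = ToyArray(snapshot_blocks=range(4), async_limit=2)
print([array.host_write(b) for b in (0, 1, 2)])
print(array.destage())
```

With a limit of two pending tracks, the third first-write is handled synchronously, which is the behavior you want to avoid by keeping headroom on the array.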

ACoFW is currently available on EMC Symmetrix arrays. I am investigating the status of this feature on other EMC arrays, and will report back on this blog as I learn more.

So I noticed that 11g shipped in 64 bit on the same day as I issued my last post. Rather embarrassing. Oh, well. Means I have lots of work to do. That's job security, though.

I will keep you posted on our progress in testing dNFS with EMC NAS. Should have something interesting by OOW, in a few weeks from now.

This post is an update on the 11g new feature called Direct NFS (dNFS). A lot of marketing buzz has been issued on this feature recently. I hope to add a bit of reality to the discussion.

Here is the theory behind dNFS. I/O on a database server occurs in a combination of user space and kernel space. Context switches between the two spaces are expensive in terms of CPU cost. If you can move a part of that activity from kernel space into user space, you can save CPU cost due to reduced context switching. One example is the difference between Linux kernel IP port bonding and the ability of dNFS to make use of multiple paths. Linux kernel bonding is expensive in terms of CPU cost and less efficient than dNFS, according to the dNFS white paper referenced below.

Further, dNFS avoids double-caching by skipping file system NFS client-side caching. The data is only cached in the database buffer cache, thus avoiding wasted memory and CPU cost in caching the data twice.

These benefits are thoroughly covered in the dNFS white paper on Oracle’s web site. This paper was co-authored by Kevin Closson, and it does a pretty good job of stating the potential performance benefits of dNFS.

Fundamentally, I agree with these potential benefits. Problem is, that is all they are right now: potential. This is true because in our testing Oracle RAC 10g Release 2 provides far better performance than Oracle RAC 11g Release 1 does, even considering that 10g is running over plain-vanilla kernel NFS (kNFS) and 11g is running over dNFS. The reason for this is simple: 11g is only available in 32 bit. 10g is running in full-blown 64 bit.

You can say that comparing 10g 64 bit to 11g 32 bit is apples-to-oranges, and I will not necessarily disagree with you. However, bear in mind that in all other respects, the configuration was identical.

Here are the numbers. On 10g using Release 2 for x86-64, we are getting 10,200 users with a bit over 500 TPS. (Again, bear in mind that these are TPC-C style transactions. Do not compare these to normal Oracle transactions.) On 11g using Release 1 for x86-32 on identical hardware, after an excruciating level of tuning, we ultimately maxed out at 7,300 users at around 380 TPS.
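For anyone who wants the comparison as ratios, the short Python sketch below works it out from the figures quoted above. The TPS values are the approximate ones stated ("a bit over 500" and "around 380"), so treat the results as rough.

```python
runs = {
    "10g R2 x86-64 over kNFS": {"users": 10_200, "tps": 500},
    "11g R1 x86-32 over dNFS": {"users": 7_300, "tps": 380},
}

baseline = runs["10g R2 x86-64 over kNFS"]
candidate = runs["11g R1 x86-32 over dNFS"]

# Fraction of the 10g result that the 11g configuration reached.
user_ratio = candidate["users"] / baseline["users"]
tps_ratio = candidate["tps"] / baseline["tps"]

print(f"users: {user_ratio:.1%} of the 10g result")
print(f"TPS:   {tps_ratio:.1%} of the 10g result")
```

In other words, the 32 bit 11g configuration reached roughly 72 percent of the users and 76 percent of the TPS of the 64 bit 10g configuration.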

Now could we get more on 11g R1 using dNFS than 11g R1 on kNFS? Maybe. I didn’t test it that thoroughly. Honestly, I don’t consider it particularly interesting. 10g is readily available and provides far more performance than 11g on dNFS.

So when does dNFS become interesting? Or stated in a larger context, when does 11g become interesting? From a performance standpoint (which is something near and dear to my heart), 11g becomes interesting when it ships in 64 bit. Until then, this is all simply theoretical. That will happen, from what I am told, sometime in Q4. Stay tuned on that.

Will we jump on 11g performance testing as soon as we have a 64 bit version with reasonable performance? Most definitely. Do I expect to see a performance benefit for dNFS at that point? Yes, I certainly do.

The other misconception which has arisen with respect to dNFS has to do with the press release issued by NetApp stating that the dNFS feature of 11g was developed collaboratively by NetApp and Oracle together. This press release is true but misleading.

It is certainly true that NetApp was directly involved in the development of dNFS. While I was not directly involved in this development, much of it occurred while I was at NetApp. I think I can even claim some minor amount of credit for the initial idea behind this feature. This occurred way back in the late 90s. There was a series of discussions among several members of the NetApp staff, myself included (most of whom are no longer there), around how you would optimize Oracle I/O to a NAS device. The outcome of those discussions was the idea of taking the NFS client and putting that inside the Oracle kernel. This is what dNFS effectively does.

So, yes, NetApp was certainly integrally involved in the development of the dNFS protocol. I do not question that at all.

The misleading aspect of the press release is the unspoken implication that dNFS will somehow work differently or better on NetApp NAS gear than the equipment manufactured by other vendors, including EMC. That implication is completely false.

Remember what we are talking about here: CPU cost on the database server. File system caching on the database server. All of this is about optimizing the use of resources on the database server. dNFS is an NFS version 3 implementation, as the Oracle white paper clearly states. As such dNFS will work in exactly the same manner, with identical performance benefits, on any NAS device from any vendor. Including Celerra by EMC.

I have just finished installing CRS (cluster ready services) for Oracle RAC 11g for the first time. If you would like to share my pain, take a look at this SR on metalink. As you can see, I meticulously followed the pre-installation instructions in the clusterware installation guide, including running the configuration checker runcluvfy. It ran perfectly. All tests passed. Green lights all the way.

Then I ran the Oracle Universal Installer, and it ran perfectly until it got to the point where you run the root.sh scripts.

At which point, those scripts hung, spewed out nasty little error messages and fell on the floor. Ouch!

Turns out this was related to a permissions mismatch. Last week was my week for this sort of thing. However, I have found in the past that issues installing CRS are depressingly common. Hence the need for a preinstallation checker. Which obviously does not check for everything you need to have a successful install.

Don’t get me wrong. I love the Oracle database product. It’s definitely the best database in the business as far as I am concerned. It is a mind boggling product. Absolutely the most reliable, recoverable, and high performing database on the planet when correctly configured, tuned and such.

But the cluster layer of RAC has many issues, not the least of which is installing it. One of those is cost. On the Enterprise Edition version of Oracle software, you get to pay a 50% up charge for the privilege of running this beast. Then there is the “RAC tax”. That’s the CPU cost of running CRS, which I am told by the folks at Hotsos is around ½ of a CPU for each node in the RAC. (Haven’t heard of Hotsos? You should. They are the best in the business as far as I am concerned in the area of Oracle database consulting, especially for performance tuning.)

Given all these issues, the question then is: Why run RAC? That is the question I am struggling with. The rest of Oracle, absent RAC, is fairly simple, and works extremely well. Yes, there are a few issues to installing and running Oracle Database 11g in single instance mode. But they are nowhere near as daunting as the issues of RAC.

RAC gives you two things:

  1. High availability. This is the heart and soul of RAC. You can encounter a failure of any component, and the overall database will stay up and accessible to users.
  2. Scalability. You can add nodes to the RAC, in order to scale up the solution. Actually, this is technically referred to as “scale out”, not “scale up”. Scale up is the concept of adding capacity to a single server. Frequently, this involves downtime or a forklift upgrade. The Sun Enterprise Series was an attempt to solve that problem while maintaining a single monolithic server solution. It was pretty much of a failure. Loosely coupled clusters are the current conventional wisdom on the way to solve this problem. However, there are much more mature and easier to manage cluster solutions than RAC.

Looking at the first of these items, the first question you would ask is: Can you provide high availability to Oracle in another way? Given the economic and CPU costs, complexity, and difficulty of CRS, alternatives should be relatively attractive. Expect to see a series of posts from me in the next few weeks exploring alternatives to RAC/CRS in providing high availability. Not surprisingly, a few of these solutions may come from EMC.

On the scalability side, I am not sure that the level of scalability that RAC provides is really required for most customers. There are two data points here: Memory and CPU. On the memory side, there is no question you can scale very high on a RAC solution. The following table compares a four node RAC solution to a single node solution. The RAC solution uses our current powerhouse, the Dell PowerEdge 2900. The single instance solution uses the high-end Dell server, the PowerEdge 6950.

| Platform | Memory | CPU |
|---|---|---|
| Dell PE2900 | 48 GB (192 GB for a 4 node RAC) | 2 x 2.66 GHz Quad-Core Intel Xeon processors (32 effective processors in a 4 node RAC) |
| Dell PE6950 | 64 GB | 4 x 3.0 GHz Quad-Core AMD Opteron processors (16 effective processors) |

As you can see, the four-node RAC does have three times the memory and twice the CPU of the single instance solution. However, the following table, taken directly from the Oracle price list, points out the price difference:

| Oracle Enterprise Edition | RAC Up Charge | Standard Edition |
|---|---|---|
| $40,000 per CPU | $20,000 per CPU | $15,000 per CPU |

Given the price difference, a single instance solution which works even a third as well for a third the price is a fairly good deal. That would assume that RAC did not cost you more in terms of CPU, memory and such. Which it does.
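Here is the arithmetic behind "a third the price", as a small Python sketch. It uses the list prices quoted above and, as a simplifying assumption of mine, treats every "effective processor" from the hardware comparison (32 for the four-node RAC, 16 for the single server) as one licensed CPU. Real Oracle licensing rules are more involved, so take this as a ratio check rather than a quote.

```python
ENTERPRISE_PER_CPU = 40_000   # Enterprise Edition list price per CPU
RAC_UPCHARGE_PER_CPU = 20_000  # RAC up charge per CPU


def license_cost(cpus, rac):
    """List-price license cost for Enterprise Edition, with or without RAC."""
    per_cpu = ENTERPRISE_PER_CPU + (RAC_UPCHARGE_PER_CPU if rac else 0)
    return cpus * per_cpu


rac_cost = license_cost(cpus=32, rac=True)      # four-node PE2900 RAC
single_cost = license_cost(cpus=16, rac=False)  # single PE6950

print(f"RAC:             ${rac_cost:,}")
print(f"Single instance: ${single_cost:,}")
print(f"Ratio:           {rac_cost / single_cost:.1f}x")
```

Note that the three-to-one ratio does not depend on how a "CPU" is counted, as long as the same rule is applied to both configurations: twice the processors at one and a half times the per-CPU price.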

I can also tell you this: Very few folks need the throughput of even a four-node RAC. On our testing, a four-node RAC is maxing out at the high side of 10,000 users and over 500 transactions per second. That’s a lot of throughput. Very, very few customers need that much. Or even a third of that much. So the question is: How much scalability do you need? And is RAC overkill for that level of scalability, given its cost and other issues? In other words, if you can provide high availability and enough scalability for your application without RAC, do you really need it?

What will interest me greatly is how much we will get in terms of throughput on a single Dell PE6950 as opposed to the four-node RAC solution running on the Dell PE2900. I suspect that the price-performance of the Dell PE6950 solution using single instance Oracle is going to be better.

Stay tuned.

In my last post, I explained why I think thin provisioning is fairly useless for most Oracle database environments. I have been taking a bit of heat for this post, mostly from my good friend Barry Burke (AKA The Storage Anarchist). Given the respect and affection I have for Barry, I thought it might be worthwhile to put up a clarifying post, pointing out some of the areas where Barry believes thin provisioning is a really, really good thing for many use cases he commonly sees. I think he makes some reasonably good points.

Bear in mind that I work in the Commercial program within EMC. That’s the set of customers who have revenues of between $25 million and $1 billion. These are most definitely low-end to mid-range customers. Barry works for the Symmetrix group, which largely addresses what we call the Enterprise customers (revenues of greater than $1 billion). This is a vastly different type of customer from mine. Admittedly, in my defense, the Commercial space accounts for the vast majority of business activity and technology spending in this country. So I have a bit of a bias here. However, EMC has traditionally served the Enterprise customer base more, and thus our customers are clustered in that direction. And there is also no question that Barry’s customers are the more well-heeled, household name type customers. In other words, most of the global financial services, telecommunications, insurance and manufacturing companies you probably know and love.

So let’s examine the use cases for the two sets of customers, taking the Commercial space first since that is what I am most familiar with. These customers have some of the following characteristics:

  1. Small IT organizations with minimal politics and bureaucracy
  2. Limited budget
  3. Smaller allocations of storage due to cost constraints
  4. Granularity of the typical database space allocation is a significant percentage of the total space on the array 

Take an example of an application where the DBA has requested 100 GB of space, and has stated he or she will need 500 GB of space eventually. The storage administrator in a thin provisioning context would tend to allocate space as shown in the following graphic:

[Graphic: a 500 GB thinly provisioned LUN backed by 150 GB of physical space, holding 100 GB of datafiles]

In this graphic, you see that the storage administrator has given the DBA a thinly provisioned LUN of 500 GB in size. On an array with a few TB, that’s a significant amount of storage. I see that kind of allocation all the time. The DBA has created a database on the LUN with 100 GB of datafiles, again not an atypical size. The thinly provisioned LUN is backed by 150 GB of physical space.

On the next occasion where the DBA needs additional storage, he or she would tend to do something like the following:

[Graphic: the same LUN after the DBA adds another 100 GB of datafiles, exceeding the 150 GB of physical space behind it]

Here, the DBA has allocated an additional 100 GB of space, probably by creating one or more datafiles. This pushes the thinly provisioned LUN over the edge. The storage administrator must now add storage to the LUN. Further, the DBA’s request to create files has been denied, much to his or her chagrin. After all, the DBA thinks he or she has 500 GB of space.

This is the dark side of thin provisioning. Like I said, thin provisioning is lying. Sometimes you get caught. The times you get caught are the times when the granularity of the additional space required by the user (in this case the DBA) is a very significant fraction of the total space in the LUN. Since Commercial DBAs tend to do that, that makes thin provisioning a risky proposition at best, and of questionable value, since the DBA will end up interfacing with the storage administrator in this case anyway. No saving in terms of administration here.
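The scenario above can be replayed with a few lines of Python. This is a toy model of my own, not any array's actual behavior: a thin LUN advertises one size, is backed by a smaller physical size, and refuses an allocation that would exceed the physical space.

```python
class ThinLun:
    """Toy thin LUN: advertised size versus physical backing, in GB."""

    def __init__(self, advertised_gb, backing_gb):
        self.advertised_gb = advertised_gb
        self.backing_gb = backing_gb
        self.used_gb = 0

    def allocate(self, gb):
        """Try to create gb of datafiles; return True on success."""
        if self.used_gb + gb > self.advertised_gb:
            raise ValueError("request exceeds the advertised LUN size")
        if self.used_gb + gb > self.backing_gb:
            return False  # the lie is exposed: no physical space behind it
        self.used_gb += gb
        return True

    def add_backing(self, gb):
        """Storage administrator adds physical space to the LUN."""
        self.backing_gb += gb


lun = ThinLun(advertised_gb=500, backing_gb=150)
print(lun.allocate(100))  # initial 100 GB of datafiles fits in 150 GB
print(lun.allocate(100))  # second 100 GB would need 200 GB of backing
lun.add_backing(100)      # administrator has to step in
print(lun.allocate(100))  # now the request succeeds
```

The second request fails even though the DBA believes 400 GB are still free, and it only succeeds after the storage administrator intervenes, which is exactly the interaction thin provisioning was supposed to avoid.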

Now let’s look at a usage case that Barry pointed out to me where thin provisioning works very well for a large Enterprise customer. These customers have the following characteristics:

  1. Large IT organizations with lots of politics and bureaucracy
  2. Larger budgets with less stringent cost constraints
  3. Larger amounts of storage available on larger arrays
  4. Granularity of the space demanded by the DBA for a typical Oracle database space allocation (as in creating a new datafile) is a small percentage of the total space on the array

Point 4 is the key. In my first post, I said that thin provisioning works very well for file systems with unstructured data where the size of the marginal file is a small percentage of the array. In the case of an Enterprise class customer, the array is so huge that even a database allocation looks like a file on a file system. Let’s look at one such scenario, as illustrated by the following graphic:

[Graphic: many applications, each with a large thinly provisioned LUN, sharing one large pool of physical storage]

In this scenario, we have many, potentially hundreds, of apps being stored on the array. Each app has far more provisioned space than utilized space. They are being stored on a fairly huge pool of physical storage as well. These may be testing or development databases, where careful configuration of the physical storage is less important. Further, the marginal granularity of each additional database space allocation is tiny compared to the total magnitude of the storage available. In this case, thin provisioning actually works very well, and provides some important benefits. In large organizations where the lead time to get additional space allocated is very long, it will save the DBA lots of time and headaches. Further, where there are a lot of political issues and the storage and database groups do not like or trust each other, this will significantly smooth the way for the DBA in getting storage for his or her database. Another benefit is free space pooling. While the apps each grow, they grow at different, somewhat unpredictable, rates. Giving them each a big thin provisioned LUN allows them to grow somewhat unfettered. As they grow physically, they are able to share the free space. This improves utilization significantly. This is the real main thrust of thin provisioning.

Thus, the critical issue is the size of the average additional space allocation from the DBA compared with the size of the array. If this is a small percentage, then thin provisioning works well, and provides important benefits. This tends to be true with larger customers who have larger arrays. In smaller customers, thin provisioning may be of less benefit.
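A short Python sketch shows both halves of this argument. All of the numbers are hypothetical ones I picked for illustration (three apps, a 470 GB shared pool, a 50,000 GB enterprise pool), apart from the 100 GB allocation and 150 GB backing used in the Commercial example.

```python
def utilization(used_gb, physical_gb):
    """Fraction of physical space that actually holds data."""
    return sum(used_gb) / physical_gb


def marginal_fraction(allocation_gb, physical_gb):
    """Size of one additional allocation relative to the physical space."""
    return allocation_gb / physical_gb


used = [40, 120, 75]    # GB actually used by three apps
provisioned_each = 500  # GB each app was told it has

thick = utilization(used, provisioned_each * len(used))  # dedicated space
thin = utilization(used, 470)                            # shared pool

print(f"thick: {thick:.1%}, thin pool: {thin:.1%}")
print(f"commercial: {marginal_fraction(100, 150):.1%} of backing per allocation")
print(f"enterprise: {marginal_fraction(100, 50_000):.1%} of pool per allocation")
```

When one allocation is two thirds of the physical space, a single request can exhaust it; when it is a fraction of a percent, the pool absorbs it and the utilization gain from sharing free space is what remains.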

In this post, I will discuss the concept of thin provisioning and database storage. There has been a lot of confusion on this issue recently, and I think that we need to clear the air and get some things straight.

First I would like to explain why thin provisioning is pretty much useless for Oracle datafiles right now. Then I will turn to how thin provisioning could be integrated into Oracle at which point it would be very interesting indeed.

OK, to start, I should define what I mean by thin provisioning. Assume you are a storage administrator, and you have a bunch of customers who need to store tons and tons of data. Many of you folks reading this post are probably in that category. What consumers of data on networked storage devices tend to do is the following:

  1. Figure out how much space they need to store their data.
  2. Double that because they know the data will grow over time.
  3. Double it again for ducks.

The effect is that storage consumers tend to be very liberal in their demands for storage. This becomes a problem for the storage administrator because costs balloon and utilization on the array is very low. A formula for aggressively and stupidly wasting money in other words. The following graphic illustrates this:

[Graphic: space requested by storage consumers compared with the space they actually use]
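The sizing habit described in the three steps above is easy to put into numbers. In this small Python sketch the 100 GB starting need is a hypothetical figure, and the function name is mine.

```python
def requested_gb(needed_gb, growth_factor=2, safety_factor=2):
    """Space a typical consumer asks for: need, doubled, then doubled again."""
    return needed_gb * growth_factor * safety_factor


needed = 100  # hypothetical amount of data on day one, in GB
asked = requested_gb(needed)
initial_utilization = needed / asked

print(f"asked for {asked} GB, initial utilization {initial_utilization:.0%}")
```

A consumer who follows this recipe starts out using only a quarter of what was allocated, which is the low utilization the storage administrator ends up paying for.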

What you would like to do is the following:

  1. Give the customer a block of storage which is the size they are demanding.
  2. Under the covers, allocate a more realistic amount of storage, closer to what you think they really need right away.
  3. Monitor the storage, and add physical space to the device when you need it, in order to meet the demands for what your customer really needs.

The following graphic shows how this would work:

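[Graphic: a thin device showing physically allocated space (light grey plus yellow) and over provisioned space]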

Note that the light grey, plus the yellow, represents the total space on the device. The over provisioned space is space which the array tells the customer he or she has, but it does not exist physically. It is then the storage administrator’s responsibility to make sure that physical space (i.e. disks) is added to the array in a timely manner in order to meet the customer’s expectations for space.

If this sounds like lying, you’re right. It certainly is. It’s a shell game. A very useful and cost effective shell game for many folks, though, given storage consumers’ propensity to ask for a lot of space they don’t need.

One assumption that is required in order for this to work at all, however, is the ability for the array to lie to the host operating system about the size of a file system or LUN which is served up to the host. For a file system like NFS or CIFS on a NAS box, this works well, as long as the data being stored is unstructured. Any given file is of a certain size, and that file’s space must be available for it to be stored. But the entire file system can certainly lie to the host and say its capacity is a terabyte, when it is really only 100 GB. No problem there, as long as the physical size is adequate to store the amount of data represented by the files actually in the file system.

The issue comes when you try to store structured data like Oracle. A DBA on an Oracle database will request the size of data he or she thinks the database will need, with the same propensity for over provisioning as any other consumer of data. The difference, though, is that the DBA will actually create datafiles which fill that space. And Oracle then makes a physical file on the device of that size, creates extents in this file, and writes zeros to it.
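To see why this matters, here is a minimal simulation of the difference. It assumes a thin LUN that allocates a physical block on any write, zeros included (that is, no zero detection on the array); the class and function names are hypothetical.

```python
class ThinLun:
    """Toy thin LUN: a block gets physical backing the first time it is written."""

    def __init__(self, advertised_blocks):
        self.advertised_blocks = advertised_blocks
        self.allocated = set()  # block numbers that have physical backing

    def write_block(self, block_no, data):
        if not 0 <= block_no < self.advertised_blocks:
            raise IndexError("block outside the advertised size")
        # Assumption: the array cannot tell zeros from real data.
        self.allocated.add(block_no)

    def physical_blocks(self):
        return len(self.allocated)


def store_unstructured(lun, file_blocks):
    """Unstructured data: only the blocks of files that exist get written."""
    for block_no in range(file_blocks):
        lun.write_block(block_no, b"data")


def create_datafile(lun, datafile_blocks):
    """Datafile creation as described above: every block is zeroed up front."""
    for block_no in range(datafile_blocks):
        lun.write_block(block_no, b"\x00")


nas_lun = ThinLun(advertised_blocks=1000)
store_unstructured(nas_lun, file_blocks=100)

oracle_lun = ThinLun(advertised_blocks=1000)
create_datafile(oracle_lun, datafile_blocks=1000)

nas_physical = nas_lun.physical_blocks()
oracle_physical = oracle_lun.physical_blocks()
```

In this model the unstructured case consumes 100 physical blocks behind a 1000-block promise, while the zeroed datafile consumes all 1000 on day one, leaving nothing to over provision.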

There is a concept in Oracle of auto extension of files. This concept seems like it would align well with thin provisioning. And it would if DBAs used it. Problem is, extending a file is a very expensive operation. Again, because Oracle likes to lock down that file and zero it out completely. DBAs hate that. A huge performance hit kicking the database in the teeth at any unpredictable time. Simply because the datafile ran out of space. Not good. Not good at all.

DBAs avoid auto extending files like the plague for this reason. They will allocate the space they need, always, when they create the database. And future expansions in space will be made intelligently, carefully, and methodically. That’s the way DBAs think. Believe me, I know. I am one of them.

This makes thin provisioning completely useless nonsense for Oracle data. Anyone who tells you otherwise should be viewed with deep suspicion. I say this without any bias whatsoever, since my employer sells arrays that provide thin provisioning too. I am simply telling you the way it is here.

Now, how could thin provisioning be made to work with Oracle? That’s a very interesting question. It would require integration between the storage array and the Oracle kernel. Then Oracle could avoid zeroing out a file, and simply allow the array to provide the storage Oracle needs to store the blocks presently in the file. This would probably occur within ASM. (ASM stands for Automatic Storage Management, Oracle’s storage layer.) There is some discussion of that type of integration between storage and Oracle, but do not expect to see it anytime soon if ever.

In my past few posts, I have explored the risks and benefits of snapshot technologies from both NetApp and EMC. This series has covered:

  •  Part 1: The nature of snapshots and their benefits to the Oracle user
  •  Part 2: Snapshot performance overhead
  •  Part 3: Writable snapshots

In this post, the last of this series, I will discuss the manner in which a snapshot can consume so much space that it will cause writes to the active file system to fail, as well as the mechanisms which NetApp and EMC have created to avoid this fate.

Yes, it is true. You can get an ENOSPACE error when you are using a metadata approach for creating snapshots, which is the way WAFL manages snapshots on a NetApp filer. Recall a couple of posts ago, when I included this diagram:

Snap1small

Note that the additional blocks required by the snapshot are invading the free space in the active file system. It is actually the light-colored blocks (the “before” images of the blocks) which are held by the snapshot. At NetApp, we used to have debates over whether the snapshot occupied the space, or whether it was the active file system that did so. Whatever. The effect is exactly the same. The storage space cost of a snapshot is equal to the number of blocks which have been updated since the creation of the snapshot. Thus, you can think of the storage space overhead of snapshots in this way:

Snap31small

From this diagram, you see that we are running a file system that is about 70% full. We have another 10% of snapshot overhead. This creates a file system which has about another 15% before it runs out of space.

Absent space reservations, you could do this:

Snap32small

All available space has now been fully occupied by snapshot storage overhead, even though there has been no increase in the amount of data in the active file system. This is because we kept this snapshot around too long: A sufficient number of blocks were updated after creating the snapshot to exhaust all empty space. The next write to this file system will get an ENOSPACE error. This includes updates to files already in the active file system, that require no additional space to be allocated.
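The scenario can be reproduced with a toy model of a snapshot that shares free space with the active file system. This is a sketch of the general idea only, not of WAFL internals; the class is hypothetical, and note that the errno Python exposes for this condition is spelled `ENOSPC`.

```python
import errno


class SnapshotFS:
    """Toy file system whose snapshot shares free space with the active data."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.active = {}     # logical block number -> number of times written
        self.shared = set()  # blocks still shared between snapshot and active
        self.pinned = 0      # "before" images now held only by the snapshot

    def free_blocks(self):
        return self.total_blocks - len(self.active) - self.pinned

    def create_snapshot(self):
        # A snapshot costs nothing at creation: every block is shared.
        self.shared = set(self.active)
        self.pinned = 0

    def write(self, block_no):
        is_new = block_no not in self.active
        preserved = block_no in self.shared
        # New data, or the first overwrite of a snapshotted block, needs a free block.
        if (is_new or preserved) and self.free_blocks() == 0:
            raise OSError(errno.ENOSPC, "no space left on device")
        if preserved:
            self.shared.discard(block_no)
            self.pinned += 1  # the old image stays behind for the snapshot
        self.active[block_no] = self.active.get(block_no, 0) + 1


fs = SnapshotFS(total_blocks=100)
for block_no in range(70):   # file system is 70% full
    fs.write(block_no)
fs.create_snapshot()
for block_no in range(30):   # updates consume all remaining free space
    fs.write(block_no)

try:
    fs.write(30)             # an overwrite of an existing block, no new data
    overwrite_error = None
except OSError as exc:
    overwrite_error = exc.errno
```

The active file system never grew past 70 blocks, yet the final overwrite fails, which is exactly the behavior described above.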

Hence the common NetApp heuristic: “Old snapshots are dear; new snapshots are cheap.”

This was a depressingly common issue at NetApp while I was there, particularly with storage administrators who migrated to NetApp NAS from a more traditional SAN storage environment (typically EMC). Those folks would behave like good storage professionals: They would utilize all available space. They regarded free space as wasted space. Further, these folks tended to think that if they had created an Oracle datafile of 100 GB in size, then that file was locked down and in place. They regarded a storage device returning an ENOSPACE error as a result of an update to that file as naughty, irrational, and strange.

For these well-behaved storage professionals, the good habits they had developed in the SAN context were a formula for disaster when dealing with NetApp snapshots in an NAS context. By running with little or no free space, they allowed no headroom for the snapshot overhead. Thus, ENOSPACE errors were common.

I used to refer to snapshots as having a “dark side”. This is the dark side I was talking about. The space allocated to a datafile is no longer guaranteed. When you make a snapshot, you can run out of space on that file anyway, although it is already allocated in the file system.

This led NetApp to introduce the notion of space reservations. The architect of this concept was Bruce Gordon, the SAN marketing guy hired by Rich Clifton during the 2000 to 2001 period. I will readily admit that I fiercely resisted this concept. Basically, what space reservations do is simple. If there are not enough free blocks in the file system to completely duplicate all of the existing data, then the snapshot creation fails. An illustration will help. With space reservations, if you had this:

Snap33small

You could not create a snapshot at all. You do not have enough free space to duplicate the existing data. You must either free some space or add capacity. Assuming you add capacity, you could at that point create a snapshot:

Snap34small

Snapshot overhead then begins to invade the reserved space. As you begin to accumulate updated blocks, the snapshot overhead looks like this:

Snap35small

Since you have reserved enough space to duplicate all of the data that existed at the time of the creation of the snapshot, theoretically an ENOSPACE error is impossible.
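The rule itself fits in a few lines. The function below is a hypothetical sketch of the check as I have described it (free space must be able to hold a full copy of the existing data), not NetApp's actual code, and the block counts are made up.

```python
def can_create_snapshot(total_blocks, used_blocks):
    """Allow a snapshot only if every existing block could be duplicated."""
    free_blocks = total_blocks - used_blocks
    return free_blocks >= used_blocks


# 70 blocks of data in a 100-block file system: only 30 free, so refuse.
before_adding_capacity = can_create_snapshot(total_blocks=100, used_blocks=70)

# Grow the file system to 140 blocks: 70 free, enough to duplicate the data.
after_adding_capacity = can_create_snapshot(total_blocks=140, used_blocks=70)
```

Under this rule a file system can never be more than half full at the moment a snapshot is taken.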

I said previously that I resisted this concept. I used to tell Bruce Gordon that as far as I was concerned, he was an EMC plant. Why? Because space reservations destroy the one primary benefit of snapshots: Space efficiency.

Go all the way back to my first post in this series. I stated that the gold standard for Storage Layer Instantaneous Copy (SLIC) technologies is BCVs. BCVs have lots and lots of advantages. They have absolutely no performance penalty. They work beautifully. They have only one downside: They require another set of disks. Before space reservations, snapshots did not. By providing the same basic functionality as BCVs (instantaneous copy) without the storage overhead of another set of disks, snapshots became the best way to do the job of Oracle database instantaneous hot backup.

With space reservations, the cost of snapshots became effectively the same as BCVs. In that case, BCVs win. They do not have the performance issues that metadata-based snapshots do. (This performance trade-off is discussed in detail in Part 2 of this series.) Removing the cost advantage of snapshots over BCVs was a major erosion in NetApp’s core value proposition.

But, as Bruce Gordon said, “No customer will ever have an ENOSPACE error on my watch.” Bruce attempted to establish a principle that space would always be reserved such that a snapshot could never exhaust the active file system free space.

Unfortunately, FlexClones, covered in detail in my previous post, violate this principle. That is because FlexClones create another write thread. Remember that each write thread has the potential to double the space requirements, by overwriting every block in the snapshot. That was illustrated by the following diagram from my previous post:

Snap21small

Note how FlexClone increases the space requirements by adding another set of “after” image blocks to the mix. Simply reserving space for one set of additional blocks is now insufficient. You would now need to reserve space for two. Thus FlexClones make the following scenario possible:
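The arithmetic is worth spelling out. The helper below is a hypothetical worst-case calculation, assuming every write thread (the active file system plus each FlexClone) overwrites every block captured in the snapshot.

```python
def reserve_needed(data_blocks, flexclones=0):
    """Worst-case extra blocks needed beyond the data held by the snapshot."""
    # The active file system is one write thread; each clone adds another.
    write_threads = 1 + flexclones
    # Each thread can produce a full set of "after" image blocks.
    return data_blocks * write_threads


snapshot_only = reserve_needed(data_blocks=70)
with_one_clone = reserve_needed(data_blocks=70, flexclones=1)
```

With 70 blocks of data, the reservation covers 70 extra blocks, but a single clone pushes the worst case to 140.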

Snap36small

You are now out of space again. The next write will get an ENOSPACE error.

EMC snapshots make all of this impossible. By using a reserved LUN pool approach, EMC simply allocates the space required for the snapshot. The snapshot space is not shared with the active file system space. Thus, it is impossible for the active file system to receive ENOSPACE from a snapshot. The following graphic illustrates this:

Snap37small_2

The snapshot space is contained within the RLP. It is not shared with the active file system. Running out of space within the RLP will cause the snapshot to become invalidated. But it will not affect the active file system at all. An ENOSPACE error can never be returned to the active file system with this design, unless the user exhausts the space in the active file system itself. Further, you decide how much space you want to allocate to the snapshot. Unlike WAFL-based snapshots, you are not writing a blank check for snapshot overhead, up to the full amount of data in the active file system. Rather, you can decide that the snapshot will only be allowed to take up 10% of that space if you want to. This adds discipline to the whole proposition of snapshot space overhead.
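Here is a toy model of that behavior. It assumes a copy-on-first-write scheme in which the "before" image is saved into the pool ahead of each first overwrite; the class and its fields are hypothetical and are not an EMC API.

```python
class ReservedPoolSnapshot:
    """Toy snapshot whose overhead lives in a separate, fixed-size pool."""

    def __init__(self, data_blocks, pool_blocks):
        self.active = {block_no: 0 for block_no in range(data_blocks)}
        self.pool_blocks = pool_blocks  # size chosen by the administrator
        self.saved = {}                 # block number -> "before" image
        self.valid = True               # False once the pool overflows

    def write(self, block_no, value):
        if block_no not in self.active:
            raise IndexError("block outside the active file system")
        if self.valid and block_no not in self.saved:
            if len(self.saved) >= self.pool_blocks:
                # Pool exhausted: the snapshot is sacrificed, not the write.
                self.valid = False
                self.saved.clear()
            else:
                self.saved[block_no] = self.active[block_no]
        # The write to the active file system always succeeds.
        self.active[block_no] = value


# 100 blocks of data, snapshot allowed to take up only 10% of that.
lun = ReservedPoolSnapshot(data_blocks=100, pool_blocks=10)
for block_no in range(11):   # the 11th first-time overwrite overflows the pool
    lun.write(block_no, 1)

snapshot_still_valid = lun.valid
last_write_landed = lun.active[10] == 1
```

The failure mode is moved rather than removed: in this model you lose the snapshot instead of the write, and you choose in advance how much space you are willing to spend before that happens.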

Once again, it is for you as the customer to judge the relative merits of these approaches. In my series on snapshots, I have attempted to bring clarity to the debate between EMC and NetApp on the benefits and risks of snapshots for Oracle database backup. Based upon the number of comments this series has received, I think you are hearing me.

Future posts on this blog will cover how EMC NAS compares to NetApp NAS for Oracle database storage.