Every company in the industry has its own brand of Kool-Aid. This is the set of tenets that are believed inherently, regardless of the facts: articles of faith, if you will. Call me a skeptic, but this drives me crazy. At NetApp the Kool-Aid was largely around the notion of snapshots and the idea that only NetApp could create simple, fast, and easy snapshots. Well, I drank that Kool-Aid big time, and I dispensed it as well. When I was talking to a customer, I would try to sell them this brand of Kool-Aid, often with great success. Once in a while, however, the pitch would backfire. I well remember a trip to the UK where this fell apart, visiting a major telecom customer in that region. Their DBA and storage networking staff were extremely competent and familiar with EMC CLARiiON arrays. The conversation went something like this:
Me: “NetApp snapshots are great! We have no write penalty. We have the best snapshots in the industry.”
Customer: “We use CLARiiON SnapView snapshots all the time. There is minimal, if any write penalty. We hardly notice it.”
Oops!
At this point, I have had the opportunity to thoroughly explore both NetApp and EMC snapshot technology. In this post, I will compare these technologies and discuss how they have been used by Oracle customers to implement instantaneous backup. I will conclude with a (hopefully fairly objective) discussion of the relative advantages and disadvantages of each approach. I promise to keep the Kool-Aid dispenser turned off.
EMC and NetApp took radically different approaches in creating their respective snapshot technologies. NetApp's snapshots actually came about through serendipity (i.e., a happy accidental discovery): they are an artifact of the design of NetApp's file system, WAFL (Write Anywhere File Layout).
WAFL is so named because it never overwrites existing blocks in place when updates occur. Instead it writes new blocks containing the updated data, and then frees the old blocks. Therefore, a snapshot can be assembled by simply retaining the old blocks rather than freeing them. No additional I/O is required to do this, which leads to NetApp’s accurate claim that their snapshots have no write penalty.
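If you like to think in code, here is a minimal sketch of the write-anywhere idea. It is a toy model, not NetApp's implementation: the active file system is just a map from logical block numbers to physical blocks, a snapshot is a frozen copy of that map, and an update always lands in a freshly allocated block.

```python
# Toy model of a write-anywhere (WAFL-style) file system -- illustrative only,
# not NetApp's implementation.

class WriteAnywhereFS:
    def __init__(self, num_logical_blocks):
        # Active map: logical block number -> physical block number.
        self.active = {n: n for n in range(num_logical_blocks)}
        self.next_free = num_logical_blocks          # next unused physical block
        self.snapshots = []                          # each snapshot is a frozen map

    def take_snapshot(self):
        # A snapshot is just a copy of the current block map; no data is moved.
        self.snapshots.append(dict(self.active))

    def write(self, logical_block):
        # Never overwrite in place: allocate a fresh physical block...
        new_phys = self.next_free
        self.next_free += 1
        old_phys = self.active[logical_block]
        self.active[logical_block] = new_phys
        # ...and free the old block only if no snapshot still points at it.
        if not any(snap.get(logical_block) == old_phys for snap in self.snapshots):
            self.free(old_phys)
        return 1  # one write I/O, snapshot or not

    def free(self, phys):
        pass  # return the block to the free list (omitted in this sketch)


fs = WriteAnywhereFS(num_logical_blocks=9)   # blocks A..I
fs.take_snapshot()
fs.write(0)   # update "A" -- still exactly one write I/O
```

Notice that taking the snapshot copies no data at all, and an update is still a single write; the old block simply is not freed while a snapshot points at it.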
On CLARiiON and Symmetrix arrays, EMC has no file system. Rather, the arrays present LUNs to hosts, which in turn run file systems or volume managers such as Veritas or Oracle's ASM. Therefore EMC has no visibility into the meta-data that makes up a file system. In order to create a snapshot technology, EMC had to take a different approach. EMC copies the before image of each updated storage block into a set of special LUNs called the reserved LUN pool (RLP). EMC then writes the after image into the normal LUN, in the same location as the original block. This preserves the integrity of the host file system, while allowing a point-in-time instantaneous copy of that file system to be assembled.
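Again purely as an illustrative sketch (not EMC's actual code), the copy-on-first-write mechanism can be modeled in a few lines: on the first update to a block after a snapshot is taken, the before image is copied into the RLP, and the new data then overwrites the block in place.

```python
# Toy model of copy-on-first-write (COFW) snapshots using a reserved LUN pool --
# illustrative only, not EMC's implementation.

class CofwLun:
    def __init__(self, num_blocks):
        self.data = ["orig-%d" % n for n in range(num_blocks)]  # production LUN
        self.rlp = {}                # block number -> before image saved in the RLP
        self.snapshot_active = False

    def take_snapshot(self):
        self.snapshot_active = True
        self.rlp = {}

    def write(self, block, new_value):
        io_count = 0
        if self.snapshot_active and block not in self.rlp:
            # First update after the snapshot: read the before image and
            # copy it into the reserved LUN pool (one read + one extra write).
            self.rlp[block] = self.data[block]
            io_count += 2
        # The after image always goes to the original location in the LUN,
        # so the host file system layout is untouched.
        self.data[block] = new_value
        io_count += 1
        return io_count

    def read_snapshot(self, block):
        # Snapshot view: before image from the RLP if the block has changed,
        # otherwise the unchanged block straight from the production LUN.
        return self.rlp.get(block, self.data[block])


lun = CofwLun(num_blocks=9)
lun.take_snapshot()
print(lun.write(2, "C'"))   # 3 I/Os: first write after the snapshot
print(lun.write(2, 'C"'))   # 1 I/O: subsequent writes pay no penalty
```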
Interestingly, when EMC created the Celerra SnapSure checkpoint (Celerra's version of snapshots), they used the same approach. By then, EMC had some experience with snapshots. Even though the file system was now under EMC's control, they made the same choice to use an RLP mechanism to store the snapshot data. The reasons for this will become clear from the discussion below.
The following graphics illustrate the differences between the NetApp approach and the EMC approach. First look at the NetApp approach:
The pink, buff and green file folders represent three different files in the WAFL file system. Since snapshots are being used on this file system, the old versions of blocks A, B, C, D, F and I are being retained. These are the light colored blocks. The darker versions of the blocks, marked with either a single or double quote mark, are the after images of the updated storage blocks. The double quotes represent blocks which have been updated more than once. Note that the first update of block C (block C') has been freed; it is not referenced by any snapshot. The normal colored (neither dark nor light) blocks have not been updated at all; they are shared by both the snapshot and the active file system.
Follow the I/O pattern. Writes that update existing blocks require only one I/O; that's true. But subsequent reads of the file system are messy. If you read a newly created file system sequentially (block A followed by block B, and so forth), the I/O pattern is perfectly sequential. Once the file system has been updated, as in the second diagram, it's not: the most recent versions of the blocks have become scattered throughout the disks. Do it mentally right now. Note the amount of head movement required to read the blocks in this order: A', B', C", D', E, F', G, H, and I'.
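If you would rather not do it mentally, here is a rough back-of-the-envelope calculation. The physical block addresses are invented for illustration, not taken from a real array; the point is only the shape of the result.

```python
# Back-of-the-envelope seek-distance comparison -- the block addresses are
# made up for illustration, not measured from an actual array.

def total_seek_distance(physical_positions):
    # Sum of head movements when reading blocks in logical order.
    return sum(abs(b - a) for a, b in zip(physical_positions, physical_positions[1:]))

# Update in place: logically sequential blocks A..I stay physically sequential.
in_place = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Write-anywhere after random updates: the current versions of A', B', C", D',
# F' and I' have migrated to newly allocated blocks elsewhere on disk.
write_anywhere = [20, 23, 31, 27, 4, 35, 6, 7, 40]

print(total_seek_distance(in_place))        # 8  -> essentially one long sequential read
print(total_seek_distance(write_anywhere))  # much larger -> lots of head movement
```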
This is the “sequential read after random write” (SRARW) performance problem of WAFL. It’s not just an artifact of snapshots. It’s inherent in the entire file system. This performance issue is real. It affects any NetApp customer who must do sequential I/O of database data after that data has been updated. That’s a lot of customers. Many, many databases are mixed use, involving OLTP-style data entry combined with DSS-style reports. Those customers will hit the SRARW issue big time.
Now consider the EMC snapshot approach. Examine the following diagram:
First note the write I/O pattern. Two writes and a read were required for the first update to each of blocks A, B, C, D, F and I, and only for that first update following the creation of the snapshot. Intermediate versions are never preserved: the first update to block C (C' in our diagram) was not written to the RLP, and only the latest update (C") remains in the production LUN. Thus, each subsequent update to these blocks incurs one write I/O only, exactly as if no snapshot existed. This is the “copy on first write” (COFW) performance issue. The “first write” wording indicates that the penalty exists only on the first update to a block after the snapshot is taken.
Now examine the read I/O pattern. Blocks in the RLP are not organized sequentially, but that’s fine. You only read these blocks when you are accessing a snapshot. The blocks in the production LUN are perfectly laid out. Sequential I/O to these blocks is optimal. That’s because the blocks have been updated in place, not scattered throughout the disks.
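To put the write-side difference into numbers before weighing the trade-off, here is a rough tally of the I/O cost of updating the same block repeatedly under each scheme. It is a simplification that ignores caching, write coalescing, and RAID overhead.

```python
# Rough I/O tallies for updating the same block N times after one snapshot --
# a simplification that ignores caching, coalescing and RAID overhead.

def cofw_write_ios(n_updates):
    # First update: read the before image + write it to the RLP + write new data.
    # Every later update: a single write in place.
    return (3 if n_updates > 0 else 0) + max(n_updates - 1, 0)

def write_anywhere_write_ios(n_updates):
    # Every update is a single write to a freshly allocated block.
    return n_updates

for n in (1, 2, 10, 100):
    print(n, cofw_write_ios(n), write_anywhere_write_ios(n))
# The COFW penalty is a constant two extra I/Os per block, paid once;
# the write-anywhere cost shows up later, as scattered reads.
```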
That, then, is the trade-off. With snapshots, as with life, there is no free lunch. A choice must be made between COFW and SRARW: in other words, a write performance penalty for RLP-based snapshots versus a read performance penalty for WAFL-based snapshots. The decision you must make as a customer is: which of these is the more serious issue? I have spent some time seriously pondering this question. Consider the following (hopefully free of Kool-Aid):
- The COFW issue applies only to the first update to a block after the snapshot is taken, so the issue is temporary. The SRARW issue applies to every block that is updated after it was first written, so the issue is permanent, absent some operation to reorganize the file system. And that reorganization could potentially require reading and rewriting every single block in the file system: a very expensive operation, in other words.
- Most workloads are read intensive. Even TPC-C, certainly a very write-intensive workload, is mostly reads. Index scans frequently involve sequential read I/O. Undo I/O is largely sequential. Even temp is largely written and read sequentially. Retaining the sequential I/O advantage is very important for many workloads.
- The testing my team is conducting at RTP indicates that the COFW issue is modest, and very temporary. Performance declines slightly for a few iterations, then returns to the previous level. After that, it’s exactly the same as if you had never taken the snapshot at all. The following chart illustrates this (taken from our Commercial Solutions testing effort):
- Oracle expects storage vendors to honor a simple promise: locality of reference. The promise is that database blocks laid out near each other by Oracle in the tablespace will also be stored near each other on disk. Many Oracle storage tuning mechanisms rely on this promise. For example, Oracle has a tuning concept called a cluster, in which tables are stored together; for two tables in a parent/child relationship, the child rows are stored near the parent row. Index-organized tables and materialized views are similar concepts. All of these tuning mechanisms within Oracle rely on the notion that data stored near each other in the datafile are also stored near each other on disk. Some minor level of fragmentation can be tolerated, but wholesale scattering of the storage blocks throughout the disks will damage the effectiveness of these features. In that case, the promise of locality of reference is not kept.
In my opinion, having worked with both snapshot technologies, EMC makes the correct trade-off. The ability to perform optimal sequential read I/O must be preserved. The promise of locality of reference must be honored. A temporary, modest write penalty is a reasonable price to pay to do that. RLP-based snapshots, the common implementation behind CLARiiON SnapView snapshots, Celerra SnapSure checkpoints, and TimeFinder Snap, do that. NetApp's WAFL-based snapshots do not.
What is extremely clear, though, is that drinking Kool-Aid (anyone's Kool-Aid) is a fool's choice. You need to look carefully at both types of snapshots and make an informed decision about which is appropriate for you in the context of your particular workload. Snapshots are wonderful, and they benefit the Oracle user greatly. But they are not without risks and trade-offs. To contend otherwise would be dishonest.
My next post will discuss the issue of writable snapshots, and the competing technologies in that space.
Hi Jeff:
This is really interesting stuff.
I was hoping you could clarify NetApp's RAID choices for your readers. It used to be that NetApp said RAID 4 was good because they used an NVRAM card with a journaling file system, and this made things safe and secure even though there was a hot disk. A few years ago they introduced RAID-DP to make things more secure. Now they seem to be using RAID 1 for their MetroCluster. Can you help us understand the background to these RAID choice permutations?
Posted by: Mike | July 27, 2007 at 01:23 PM
These two types of snapshots do two different jobs. Copy on write snapshots are indeed higher performance, however it's at a cost. Their performance hit becomes much more serious when more than a couple of concurrent snapshots exist at the same time. A couple of percent degraded performance is fine for a single snapshot, however multiply that by 8 and see how fast your writes go.
Since most people only use snapshots as an intermediate step on the way to a real copy (either in the same array or elsewhere), there's no issue. Netapps, however, bills their snapshot as lower performance on average, but no degradation in performance if you take 100 snaps instead of 4. They are usually upfront about their WAFL technology not being performance driven, however they can do things that nobody else can.
-------
Response to this comment by the Oracle Storage Guy:
We actually tested this scenario in our Commercial Solutions Group, and the effect described by the commenter is not consistent with what we saw.
We kicked off six simultaneous snapshots and saw a bigger performance hit. Still an acceptable performance hit, but definitely bigger than a single snapshot. Bear in mind that this was six snapshots being generated simultaneously.
We then backed off from there and generated six snapshots with a four-hour delay between them; over a twenty-four-hour period, in other words. In this scenario, which I find to be far more real-world, each snapshot had exactly the same COFW overhead as a single snapshot did. That's because the COFW penalty had already been paid: by the time the next snapshot hit, there were no further before images left to copy for the earlier snapshot. In this situation, each snapshot was exactly symmetrical, and identical in performance impact to a single snapshot kicked off by itself.
So, no, in most real-world customer situations, multiple snapshots using COFW technology are not a problem. I will discuss this a bit more in the blog when I post the next time.
Posted by: open systems storage guy | July 27, 2007 at 01:39 PM
@mike:
There is no *hot disk* with RAID 4; if you believe there is, you do not understand RAID 4.
@op:
What you seem to have failed to mention is that the netapp *read* issue isn't an issue at all. It's easy to say *from a theoretical standpoint the mixed blocks are trouble*, I would expect more from you though as you came from netapp. We both know in reality, you still get more than enough IOPS for pretty much any database workload even with the *moving head* "problem".
------
Response by Oracle Storage Guy:
I did not say that RAID 4 causes a hot disk, so far as I know. Perhaps you can point out where you think I said that? That is not my intention.
In terms of the sequential read after random write I/O performance problem, this issue was well documented at NetApp while I was there. It is not the case that this issue is imaginary. I cannot give specific details without violating my confidentiality agreement with NetApp, which I will of course avoid doing. (Everything you see on my blog, other than personal stuff of course, is documented from public sources.) So, no, I do not agree with the commenter that this is a spurious issue.
I will always post comments to my blog, though, regardless of whether they are favorable, reasonable, or so forth. I may, as I am doing in this case, provide a clarifying response.
Posted by: TimC | August 09, 2007 at 03:32 PM
OMG, this is the best article on snapshots ever. It's easy to grasp and explains everything that EMC and NetApp never cared/wanted to mention.
Posted by: David | August 23, 2007 at 02:31 AM
Re: "A couple of percent degraded performance is fine for a single snapshot, however multiply that by 8 and see how fast your writes go."
CoFW (Copy on First Write) does not need to write the data 8 times, using your example. It would only need to copy the data to the RLP (reserved LUN pool) once, then update metadata for the 8 snapshots.
Nice write up, thanks for sharing.
-----------------------
I am very glad this was helpful. Thanks for your kind comments.
Regards,
TOSG
Posted by: Korwin | October 17, 2007 at 11:35 PM
@op:
"In terms of the sequential read after random write I/O performance problem, this issue was well documented at NetApp while I was there."
You've apparently forgotten, or are dismissing, the impact of a little thing called NVRAM. Your example of "look at all those head seeks" is completely invalidated when the data is sitting in NVRAM. That is exactly where it would be sitting if you did a sequential read immediately after taking a snapshot of that data, and it's where it would be sitting after you read the first block of that data as well.
For someone trying to be honest and not drinking kool-aid from either vendor, you did a pretty poor job of it here.
---------------------------
Tim:
Sorry to burst your bubble, but NVRAM has no impact whatsoever on read performance. NVRAM is a write cache. It does improve write performance by buffering writes before they need to go to disk.
In terms of reads, the laws of physics apply. A block which is in the cache (not the NVRAM; as I said, that is only used for writes) will not have to be read from disk, granted. In the case of a large sequential I/O, like a database full table scan, the cache will very quickly be exhausted and subsequent reads will have to go to disk. This is a very well-known performance issue. Memory cache has minimal impact on large-scale sequential I/O operations like full table scans; the sequential ordering of the data on disk is where the real optimization is obtained.
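To put some rough numbers on it (assumed figures for illustration, not measurements from any particular filer or array): if the table being scanned is far larger than the read cache, only a small fraction of the scan can possibly be served from memory.

```python
# Hypothetical back-of-the-envelope numbers -- assumed for illustration,
# not measurements from any particular filer or array.

table_size_gb = 300.0     # size of the table being scanned
cache_size_gb = 16.0      # read cache available

# Best case: everything that fits in cache is a hit; the rest must come from disk.
hit_fraction = min(cache_size_gb / table_size_gb, 1.0)
print("at most %.0f%% of the scan can be served from cache" % (hit_fraction * 100))
# -> roughly 5%; the remaining ~95% of the blocks are read from disk,
#    so the on-disk layout (sequential vs. scattered) dominates the scan time.
```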
I have personally seen and documented SRARW. It is not Kool-Aid.
Regards,
Jeff
Posted by: TimC | March 23, 2008 at 01:44 AM
I'm sorry, that's just not true. The caching more than makes up for the "scattered" blocks on sequential IO. It's 16GB of cache... with even a LITTLE bit of intelligence it can easily get the data read ahead of time. If you're suffering that badly you've GROSSLY undersized the back end spindle count, and would be suffering regardless of vendor.
I can only assume you haven't touched a filer since you left NetApp. I'll gladly set up any *test* you'd like in my lab with a FAS3XXX and a Clariion CX3-20. I've beaten on both multiple times, and this phantom performance hit you talk about with sequential reads is a fallacy at best.
On the other hand, I can easily create a severe performance degradation on the clariion by leaving multiple snapshots out there on a volume. It's easily enough engineered around, but to even begin to try to compare those two *problems* as equal is ludicrous.
There's PLENTY of things you can pick on netapp about, but this is pathetic. Anyone who's spent a day with a filer in production or lab will call your BS.
------------------------
Tim:
I assume you agree with me that NVRAM is not used for reads, since you are now focused on the main cache, so I will drop that point.
Remember that you are talking about full table scans on large Oracle database files. 16 GB is simply not enough cache to avoid going to disk.
Do this for me: Go talk to Rich Clifton. Remind him of my conversation with him in which I discussed doing a TPC-H with NetApp storage. This occurred when I was the head of Database Performance Engineering at NetApp. We had recently completed the TPC-C that I oversaw there. Ask him why he told me that doing a TPC-H would be very, very difficult and would basically require a rewrite of WAFL.
Give my regards to Rich while you are at it.
Regards,
Jeff
Posted by: TimC | March 24, 2008 at 05:48 PM
I'm interested in the complementary effects of things like read-ahead algorithms, striping, and mirroring.
I'm of the opinion that these additional processes generally flatten out the significant differences between the two methods.
At the application layer (large row updates, or media file updates), are the differences consistent over time?
thanks
brian
-----------------------------
Brian:
In some respects, the features you mention will worsen, not help, the sequential read after random write issue. Read ahead is a good example. In a storage context, read ahead assumes that the next set of physical blocks on disk contains contiguous data. In a fragmented file system like WAFL, this is not necessarily the case, and in that case read ahead will crowd out the cache and thus hurt performance. In an Oracle context, read ahead (the multiblock read count) will attempt to read the next database blocks on disk, again assuming that these blocks are contiguous and therefore easily accessible, and that they will be used by a future query. If both of those assumptions are correct, then this will help performance. If the blocks are not contiguous, then this will cause an increase in random I/O and hurt performance.
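If it helps, here is a toy simulation of the read-ahead point; the prefetch policy and the block addresses are invented purely for illustration.

```python
# Toy read-ahead simulation -- illustrative assumptions only (fixed-size
# prefetch window, no re-ordering of requests by the array).

def readahead_hits(physical_layout, window=4):
    """Count prefetches that actually deliver the next logical block.

    physical_layout[i] is the physical address of logical block i.
    After reading logical block i, the cache prefetches the `window`
    physically-next blocks; a 'hit' means one of them is logical block i+1.
    """
    hits = 0
    for i in range(len(physical_layout) - 1):
        prefetched = {physical_layout[i] + k for k in range(1, window + 1)}
        if physical_layout[i + 1] in prefetched:
            hits += 1
    return hits

contiguous = [0, 1, 2, 3, 4, 5, 6, 7, 8]
fragmented = [20, 3, 31, 7, 4, 35, 6, 40, 11]   # made-up scattered addresses

print(readahead_hits(contiguous))   # 8 -> every prefetch is useful
print(readahead_hits(fragmented))   # few or no hits -> prefetched blocks just crowd the cache
```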
Mirroring in the sense of RAID 1 is not relevant to NetApp storage, because they don't support it; my blog discusses this. EMC has a much richer set of RAID configuration options than NetApp does. Mirroring does enhance read performance, because twice as many spindles are available to read from. It costs you a little on write performance and uses more disk, of course.
Striping does not really enter into the discussion as much as that is inherent to both technologies.
Hope this explanation helps.
Regards,
Jeff
Posted by: brianh | August 29, 2008 at 02:15 PM
Jeff ,
gr8 article and very cool-headed responses, backed only by facts. You should be doing freelancing for techstorage or networkworld.
I know EMC storage technologies pretty well, but this article just reinforces the known. In fact, there are very few articles/blogs free from tech bias & marketing spin.
----------------------------
Ranjit:
Thanks very much for your kind comments. It certainly makes what I do worthwhile.
Regards,
Jeff
Posted by: Ranjit | September 25, 2008 at 03:09 PM
Jeff,
First, thanks for the post...it's quite informative. Second, the way you explained it, sequential reads on NetApp can cause huge performance degradation when multiple snapshots are retained. So my nightly snapshot of a 300GB db with 2 weeks' retention could cause my db to crawl to its knees, given that a NetApp block could have changed 14 times and be needed at some point on the 15th day. Can you please confirm my understanding? And finally, my scenario above is somewhat exaggerated, but it seems having a 14-day retention is a bad idea on NetApp. Can you please comment on data retention, or share any recommendations you have?
Thanks,
--Hoang
------------------------
Response:
The period during which you retain the snapshot should not matter; the WAFL fragmentation issue will occur regardless of how long a snapshot is in place. If your file system is beginning to slow down after retaining a snapshot for a very long period of time, that may be a space issue: the amount of space consumed by a snapshot tends to increase over time, and NetApp WAFL is notorious for slowing down when it becomes crowded for space.
Posted by: Hoang | December 30, 2008 at 05:18 PM
Pretty good description of the two methodologies. Perhaps I've drunk the kool-aid, but my experience does not match what you're saying.
I have first hand experience that the Clariion grinds to a halt with as few as 2 snapshots active. Caused a Production issue based entirely on the overall SP performance (not on the snapshotted LUN itself). This was a 3/40, so perhaps it is better now with the new processors. But then again, NetApp could say the same - and I think you're saying that it isn't really a CPU issue.
On the NetApp we see no performance degradation with several snapshots - I'll admit that the read performance is perhaps harder to measure...
Have you taken into account how NetApp uses large RAID groups and multiple RAID groups in an aggregate, with the entire aggregate available for WAFL to use? I also think you're underestimating how much 16GB of read cache will negate the issue for most applications. Databases might be the only concern, which is why NetApp introduced FlashCache.
What EMC'ers tend to always fail to mention is how freakin' hard everything on EMC products is to administer - and more importantly how simplifying the storage environment can make a company more responsive and agile. Again, this sounds like kool-aid, but I've lived it. EMC products are disjointed and all these "features" are more difficult to get at. Clariion snapshots are entirely different administratively than DMX BCV's. If it's too hard to understand, the service won't be used properly - or at all.
I happen to think NetApp strikes a nice balance between these views. Not perfect, but nothing is.
Posted by: David Ross | April 14, 2011 at 02:18 PM
It's 2011. My 3240 comes with 512GB of flash cache. That's bigger than my largest tables combined.
Posted by: Weiyi Yang | August 17, 2011 at 11:29 PM