Sticky Policies & Data Classifications

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 28, 2010 @ 12:02 AM

If anything is ominously absent from the legacy backup products of today (and yesterday), it is sticky policies that follow data wherever it goes, and automatic data classification that occurs the moment a file is created.

The capabilities to serve these needs are so obviously missing from the old backup paradigms (still with us today, mind you), that 2 things are starting to happen:

1) People know this stuff is missing. They feel it in their bones. They also feel it when they need to look at piles and piles of data, and want to somehow make sense of it. But they can’t. They also want protection to happen the moment it is needed, with a simple policy.  Not when a backup product tells it to.

2) Vendors know this stuff is missing as well. Many of them operate in the world of block and volume data, and simply have no chance to manage information while they are backing up or replicating blocks of bits, instead of information.  Others have no way to manage metadata intelligently or actively. So they try to “market” their way around it.

The problem is impossible to fix within today’s legacy backup infrastructures. And it’s not going away. It will simply grow, exponentially, unless you start getting after it.

AIMstor from Cofio was created with policy based data management, and classification of data in mind.

Giving users control of what they want to do with data (Live Backup, CDP, Real-Time Replication, Tracking, Archive, etc.) is one thing. Giving users the ability to do it intelligently, such as with workflow and data flow, is something else altogether.
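To make the idea of "sticky" policies and classification at creation a bit more concrete, here is a minimal sketch. It is not AIMstor's implementation. The extension rules, class names, policy names, and the `TrackedFile` record are all hypothetical, chosen only to show a policy being assigned when a file first appears and then travelling with every copy.

```python
import os
from dataclasses import dataclass, replace

# Hypothetical rules: file extension -> data class (assumption for illustration).
CLASS_BY_EXTENSION = {
    ".docx": "user-document",
    ".xlsx": "user-document",
    ".mdb": "database",
    ".tmp": "temporary",
}

# Hypothetical policies keyed by data class (assumption for illustration).
POLICY_BY_CLASS = {
    "user-document": "version-and-archive",
    "database": "cdp",
    "temporary": "ignore",
}


@dataclass(frozen=True)
class TrackedFile:
    path: str
    data_class: str
    policy: str


def classify_on_create(path: str) -> TrackedFile:
    """Classify a file the moment it is created and attach its policy."""
    ext = os.path.splitext(path)[1].lower()
    data_class = CLASS_BY_EXTENSION.get(ext, "unclassified")
    policy = POLICY_BY_CLASS.get(data_class, "backup")
    return TrackedFile(path, data_class, policy)


def copy_to(record: TrackedFile, new_path: str) -> TrackedFile:
    """Copy or move a file: the class and policy stick to the data."""
    return replace(record, path=new_path)


original = classify_on_create("/home/ann/report.docx")
moved = copy_to(original, "/mnt/share/report.docx")
print(moved.policy)
```

The point of the sketch is only that the policy is a property of the data record, not of the schedule or the volume the file happens to sit on.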

CDP is a Dog -> Unless it’s UNIFIED with Backup

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 14, 2010 @ 03:32 PM

It’s true.  CDP is tremendous technology, offering granular point-in-time restore that backups simply cannot do. But CDP (Continuous Data Protection) has severe retention and data management limitations, so backup is absolutely necessary.

But – why do CDP if you cannot get it FULLY UNIFIED with your backup solution??  I don’t mean “integrated”.  Any moron can “integrate” a CDP product with their legacy backup product (and, many have, mind you). You just tell the people in marketing to make the box look the same, and update the user manual.

THE TRUTH: CDP is ONLY worth doing, if it comes FULLY UNIFIED with a next-generation backup solution (optimally, with inherent deduplication). That way, they share the same data mover, the same repository, the same metadata, the same underlying data structure and supporting infrastructure.

I don’t know any other product besides Cofio’s AIMstor that does this. You get the granularity of CDP, with the smart retention flexibility of AIMstor’s next-gen Backup, and all the great policy driven capabilities that come together with it. You can also empower Bare Metal Restore from your backup and CDP sets, which are fully single-instanced for huge capacity savings.

More importantly, because AIMstor auto-classifies data, you can SELECT what you want to CDP, and what you want to Backup, and what retention you want for very specific types of data, or whole categories of data. Standalone CDP  products are kinda, well, dumb. They like to move . . . everything. Optimal? Uhm, not.

So what happens if you buy CDP that is NOT unified with your backup solution? Triple the data movers, double the repository setup and capacity usage, double the overhead to servers and clients, double the admin time, double the infrastructure. Plus, you probably can’t select what you really want, so you will just end up wasting even more resources.  Why do it?

The Legacy Backup Bubble (Part II)

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 12, 2010 @ 05:14 PM

Legacy Backup is a major market in the data protection space, and is still going strong. Regardless of its inefficiencies, people still buy it, and add onto their existing Legacy Backup environment. However, users are starting to take notice.

User backup forums often point to the failure of Legacy Backup products to deliver any upstream value, their typical failure rates as a result of server-dependent architectures, and their terrible storage inefficiency.

In addition, many environmental factors have crept into the woodwork at user sites (business intelligence, eDiscovery needs, compliance requirements, etc.), and now that the paint is off, people are finally getting a look at what’s underneath the hood of Legacy Backup products. It won’t be long.

Deduplication was a key first mover that really made people question the insanity of Legacy Backup. Why create something so inherently inefficient that it required such a huge level of clean-up? (remember, 20X or greater is the typical deduplication cleanup rate).

Cloud architectures will soon expose even more inadequacies in the Legacy Backup camp, forcing many vendors to accommodate Cloud storage in strange, non-optimal ways.

Virtual machine sprawl has added more headaches to the Legacy Backup camp because of I/O and overhead issues created by Legacy Backup, and multiplied by VMs.

Additionally, users are becoming more reliant on other tools within the market to make up for the lack of flexible recovery capability of Legacy Backup. CDP, Replication, Bare Metal Restore, and others, are coming into play in the mid-market.  As are technologies that help manage information: index/search tools, data classification, policy management, and tools that control data for added layers of security or monitoring.

There are many others, but these ones stick out. When things be

The Legacy Backup Bubble (Part I)

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 05, 2010 @ 07:10 PM

The terrible inefficiency of Legacy Backup has created new markets and new companies over the past decade in the storage backup space.  Many are fixes applied to Legacy Backup itself, many others are another form of Legacy Backup, that solve some issues for a key market or vertical. Many have been proven to solve real world problems, caused, of course, by Legacy Backup.

So, what is Legacy Backup?  You are probably using it right now in your data center, your remote office, or your SMB, and most certainly, in your enterprise.  It’s a product that protects your data by doing several things based on a schedule, then sends a copy of some processed data to disk or tape. Unfortunately, it batch copies data, creates massive and unnecessary duplication of data, and has no ability to share its repository, its processes, policies, metadata, data movement, or any of its significant infrastructure with other data protection products (like CDP, Replication, Archive, etc.).

The great thing about inefficiency is that it creates need.  And where there is need, there is opportunity. But the reason for the need, it is now being learned, is that Legacy Backup is the problem.  Like any boom or bubble, Legacy Backup will . . . ultimately . . . pop.

How Underdogs Win: Real-Time versus Batch Data Protection

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Dec 02, 2009 @ 06:02 PM

The New Yorker has a great read for anyone considering the strategic aspects of real-time versus batch processes (from databases to running a girls’ basketball team) in an article titled “How David Beats Goliath: When underdogs break the rules”.

In the world of storage and information management and protection, the parallels to current legacy point products are impressive. Today’s leading backup products reside completely upon legacy architectures.  They are still, by and large, run as batch processes, are not searchable, do not provide real-time differencing, and have no real-time capability to tie into other data movement or data management capabilities. You could say many of the same things about many other tools used within the IT dept.

It would be nice to turn a key, and make it all real-time, but that won’t happen. Fundamentally, it requires changes in the way systems, physical or Virtual Machines, are managed, and how responsibilities are distributed (if they are).  The legacy Client/Server approaches rely completely on outdated policy distribution communications (batch), where connectivity must remain intact to execute their “server -to- slave server -to- media server -to- client” laundry list of batch “stuff to do”.  They need a lot of hand holding in order for things to happen, and for policies to be executed. A short list of issues with legacy products:

o- Batch methods require scans, trawls, polls, etc., all of which drag down resources

o- Batch I/O stacks up fast on VMs, and goes medieval on their host systems

o- Data changes can be discerned, but data touches cannot be tracked

o- Data classification, if any, is after the fact, instead of at “point of creation”

o- Compliance is via “batch” time slices, not real world “second-by-second” views

o- Metadata consistency is always a day late and a dollar short

o- Repository data always has a “window” of difference with primary data

o- Deduplication remains after the fact and separate
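The first few items on that list can be illustrated with a toy comparison. This is only a sketch under stated assumptions: `BatchScanner` and `EventTracker` are hypothetical names, files are modelled as an in-memory dictionary of path to version number, and a real product would hook into the file system rather than receive `on_write` calls by hand. It shows why a scan costs work proportional to everything on the machine, while event-driven tracking costs work proportional to what actually changed.

```python
class BatchScanner:
    """Finds changes by comparing every file against the previous scan."""

    def __init__(self):
        self.last_seen = {}
        self.files_examined = 0

    def scan(self, files):
        changed = []
        for path, version in files.items():
            self.files_examined += 1  # every file is touched on every run
            if self.last_seen.get(path) != version:
                changed.append(path)
        self.last_seen = dict(files)
        return changed


class EventTracker:
    """Records a change at the moment it happens; no scan is needed."""

    def __init__(self):
        self.pending = []
        self.events_handled = 0

    def on_write(self, path):
        self.events_handled += 1
        self.pending.append(path)

    def drain(self):
        changed, self.pending = self.pending, []
        return changed


# Hypothetical workload: 1000 files, one of which changes.
files = {f"file{i}": 0 for i in range(1000)}
scanner, tracker = BatchScanner(), EventTracker()
scanner.scan(files)            # baseline scan examines all 1000 files

files["file7"] = 1             # a single write happens
tracker.on_write("file7")

batch_changes = scanner.scan(files)   # examines all 1000 files again
event_changes = tracker.drain()       # handled exactly one event
print(scanner.files_examined, tracker.events_handled)
```

Both approaches find the same changed file. The scanner only finds it when the next batch run comes around, which is the "window" of difference mentioned above.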

If users want to explore the road of real-time, they will need to seek new solutions that are outside of the realm of their current vendor portfolio, because vendor leaders  just have too much invested in existing legacy code bases. New architectures, which provide self-managing nodes, together with scalable and distributed storage, are the key to deploying more value across the enterprise, on a granular, simple and cost effective basis.  . . . Did I mention . . uhm . . . AIMstor?

Don’t Shop with Barbarians: More on Volume CDP/Replication

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Fri, Nov 27, 2009 @ 05:21 PM

Our CTO’s last blog (Why Volume CDP and Replication products are so Wasteful) drew a few comments from a few Volume Replication Vendors. Because it is Black Friday, a day with a serious shopping theme, I thought I would answer them here and keep things better organized:

Comment 1:  I guess it depends on what the volume contains. And the purpose of doing it in the first place. Certainly replicating or CDPing a system volume doesn’t seem to make much sense unless the reason for it is Disaster Recovery at a remote site. But replicating or CDPing a volume that only contains business critical data could be meaningful particularly in compliant heavy environments. Were the donkeys nodding?

Answer 1: There are some noisy applications.  In this particular example it was Sophos anti-virus.   But the OS can do very well all on its own too. OS vendors even call out noisy directories that should be avoided during backups, because there is no value in a restore, and there is an obvious cost to replicate it.  It is also not untypical for an application to want to create temporary files that have no business value on the same volume as the database. You want to replicate all that too?

With Volume CDP grabbing it all, that extra 40GB-50GB per machine gets expensive.  Multiply that by a large number of machines, and the overhead is very large.  Plus, the extra CPU, energy, and bandwidth spent sending wasteful and unneeded data is another big cost that adds up quickly, and then goes exponential once you consider the enterprise.  That is the essence of the problem with Volume CDP and Replication. It is indiscriminate by nature and grabs everything.
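As a back-of-the-envelope illustration of that multiplication, here is a tiny sketch. The 40GB-50GB range comes from the example above; the machine count of 200 is a made-up figure, and the function name is hypothetical.

```python
def fleet_overhead_gb(machines, low_gb=40, high_gb=50):
    """Return the (low, high) range of unneeded data for a fleet, in GB."""
    return machines * low_gb, machines * high_gb


# Hypothetical fleet of 200 machines.
low, high = fleet_overhead_gb(200)
print(f"{low / 1000:.0f} TB to {high / 1000:.0f} TB of data with no business value")
```

This counts only the capacity. The CPU, energy, and bandwidth spent capturing and sending that data come on top of it.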

Kind of like a starving barbarian with a big shopping cart at the grocers on double-coupon day, she can’t even resist taking the trash with her.

The donkeys weren’t nodding, but they were chuckling.

Comment 2: Couple of things that are puzzling to me in your blog are the fact that there were 40GB of wasted capacity in a single server during 1 week? That would certainly not be the norm and if it was there would be other useful conversations to have with a client.  As for CDP being intelligent enough to distinguish useful data. Great idea and most enterprise CDP solutions will have this ability now or in the near future. Even more important when considering replication is evaluating solutions that will compare data on the local and remote site and deduplicate before replicating the changes across the wire. We have customer examples that were able to shave 70%+ off replicated data!

Answer 2: We are simply saying that it’s a good idea to avoid sending all that unneeded data, in the name of simple logic, speed and efficiency.  The only effective way of combating this is by understanding the data (which is what AIMstor has solved).

I’d be interested in seeing how the Volume replication vendors address this.  I suggest that they can’t.

Volume replication arguments have generally been that the “customer” ought to reconfigure their system to suit the replication technology’s inability to address data types or data classifications.  Have a volume for one thing, another volume for another, etc.  While it certainly may make sense to partition your system, the point is, the customer shouldn’t be forced to because of the failings of the CDP product. Let the customer partition storage based on what makes sense to his application, not because of the inability of the volume CDP product.

The fact is also, CDP shouldn’t just be for the application.  Why shouldn’t it be used for the system volume, as it provides a good DR image as well?  Or something even more radical: why not provide a hybrid, with periodic transfers of parts of the system but CDP granularity for other parts of the system?  Imagine you have a volume that holds both the OS and the application (an OK example, normally for smaller setups); you could take periodic images of the OS, but then CDP the application data.  This will minimize data transmitted and provide very nice and granular application restore, with a safe set of periodic images of the OS. You also get big overhead reduction, plus savings on CPU, energy, bandwidth, etc.
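A hybrid like that can be sketched as a simple routing rule on a single volume. This is an illustration only, not how AIMstor is configured: the path prefixes, the mode names, and the `protection_mode` function are all assumptions made for the example.

```python
# Hypothetical rules for one volume holding both the OS and an application.
# The first matching prefix wins; all names here are illustrative assumptions.
RULES = [
    ("c:/appdata/db/", "cdp"),
    ("c:/windows/", "periodic-image"),
    ("c:/program files/", "periodic-image"),
]
DEFAULT_MODE = "periodic-image"


def protection_mode(path: str) -> str:
    """Pick a protection mode per path instead of per volume."""
    normalized = path.replace("\\", "/").lower()
    for prefix, mode in RULES:
        if normalized.startswith(prefix):
            return mode
    return DEFAULT_MODE


print(protection_mode("C:\\Windows\\System32\\kernel32.dll"))
print(protection_mode("C:\\AppData\\db\\orders.mdf"))
```

The OS files fall through to periodic images, while the application data gets CDP granularity, without the customer having to repartition anything.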

Bringing up the de-duplication topic is interesting too.  Understanding the data you are de-duplicating, as we do, substantially increases the de-duplication rates.  That is also why Data Domain excels: it distinguishes the data boundaries and doesn’t treat everything as a dumb block.  It would be good to know how much of that 70% replicated data savings you mention was just white space elimination, which should never have been transferred in the first place.  If so, I am puzzled, because that approach, which is typical among all Volume-approach vendors, seems to be making a mistake, and then the vendor congratulates himself for later correcting his mistakes.

And that’s supposed to be a “solution”?

Why Volume CDP and Replication products are so Wasteful

March 19th, 2010
  • Originally Posted by Fabrice Helliker on Tue, Nov 03, 2009 @ 11:24 AM

I’m often bewildered by the prevalence of volume CDP  or volume replication products.  This is the type of replication that works at either the whole disk or the partition level.    At this level, everything that is replicated is a dumb block.  There is no context as to “what” the blocks are . . . so, everything is replicated.

So let’s talk about something fundamental – wasted data transfers, wasted storage, and unnecessary system loads.

First let me describe a real world problem.  We had AIMstor set up to Backup, Version and CDP an assortment of machines.  We’d select the whole machine so that we could perform point in time bare metal restores in conjunction with file versioning of user documents.  Many of the machines were office systems, although what we’ve observed would have been exactly the same for a file server.

We decided to analyze the weekend traffic.  Note: because it was the weekend, we really didn’t expect an awful lot of traffic as the systems weren’t in use.  What surprised us however, is the amount of useless data that collected over this period.  We know operating systems can generate noise in the way of unwanted, temporary files, but for this test, we turned “off” all of the filtering within AIMstor.  What shocked us though was the incredible amount of useless data that was generated that has absolutely zero value.

One system alone, generated a staggering 40GB of temporary files.  A large amount of this was created by a virus checker.  Fortunately, because AIMstor works at a very granular level, this type of waste and noise can be easily filtered out.

Take your average Windows OS and you will find a lot of data written to disk that has no value to the business.  The system’s pagefile and prefetch files are constantly being written to.  This is before you apply virus checkers or user applications like Skype (yes it writes a lot to disk), Temporary Internet Files, etc.
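Filtering this kind of noise at the file level can be sketched in a few lines. The sketch below is not AIMstor's filter. The pattern list is an assumption based on the examples named above (pagefile, prefetch, Temporary Internet Files) plus a generic `*.tmp` rule, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Illustrative noise patterns (assumptions), matched against lowercased
# paths that use forward slashes.
NOISE_PATTERNS = [
    "*/pagefile.sys",
    "*/windows/prefetch/*",
    "*/temporary internet files/*",
    "*.tmp",
]


def is_noise(path: str) -> bool:
    """True if the path matches a pattern with no business value."""
    normalized = path.replace("\\", "/").lower()
    return any(fnmatchcase(normalized, pattern) for pattern in NOISE_PATTERNS)


def filter_changes(changes):
    """Split (path, bytes_written) records into kept records and skipped bytes."""
    kept = [(path, size) for path, size in changes if not is_noise(path)]
    skipped_bytes = sum(size for path, size in changes if is_noise(path))
    return kept, skipped_bytes


changes = [
    ("C:\\pagefile.sys", 4096),
    ("C:\\Users\\ann\\report.docx", 100),
    ("C:\\Windows\\Prefetch\\APP.pf", 50),
]
kept, skipped = filter_changes(changes)
print(kept, skipped)
```

A block-level product has no path to match against, which is why it cannot make this distinction.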

And this is where volume level replication is so wasteful.   With Volume Replication everything is transferred and stored.  Factor a CDP system and then you are looking at capturing, transferring and storing a lot of unnecessary data.

Consider also that every block transferred is a load on the source system, the network and storage subsystem. There is an awful lot of energy and resources that goes into supporting Volume Replication and Volume CDP products . . . for no good reason.

The Fallacy of Integrated Solution Marketing

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Sep 16, 2009 @ 10:30 AM

So . . . Company A (which ships one of these, maybe a Backup, or a Replication, or a File Archive product) acquires Company X (the creator of say, a CDP Product), and then announces their sincere plans to combine both solutions and to deliver huge value to their existing user base.  A few weeks later, they have a brand spanking new CDP software box on the website, a new data sheet showing what seems to be tight integration of Company X Product into Company A Product, and a press release that extols the virtues of this newly integrated solution, promising “Single Pane of Glass” yadayada yada.  Hey, they might even have gotten the GUI from the CDP product to work a tad bit with the Company A Product (always in a meaningless way, but, working together none-the-less).

Don’t be fooled.  The game of product integration, the headaches it creates, and the expenses and risks associated with it are all still there.  Understand this: Integrated solutions are not bad. They are necessary, and they are your only choice in many circumstances.

What is bad is the “Marketing Spin” you get from some vendors, that things are “fully integrated” to a level where the products look and work in a “seamless” fashion.

Don’t they know you can go to hell for lying?

Sure, for the customer, now there is one vendor and one throat to choke, but those solutions are still separated, in every material way that matters.  And sure enough, too many calls to support will soon mean that your most recent investment in the new product from your old supplier will eventually turn into shelf-ware. Your investment is lost, and your problem and pain remain.

Stove-piped solutions that are forced together by the sheer will and cost of vendor provided professional services, are lessons in complexity, poor ROI, and overworked IT staff. Sometimes you have no other choice, and must deal with it, regardless of the cost and pain. The desire customers have, to believe vendors who acquire products, and buy into their claims of getting a platform, instead of several separate products, is what gets them into trouble.

“Hey, wait a minute” you say, “these are big, big companies, with hundreds of engineers.  They will make the solutions work together, and I will get the solution that I want.”

That makes sense, until you look at the track records of all major vendors in the storage/data management space.  After hundreds of acquisitions, billions of dollars spent, thousands of infrastructures uprooted and redone, it is still hard to show tight integration between any of the solutions.

It is however, easy to show the disparities between them, the incompatible metadata, the separate business processes, the redundant repositories, the conflicting data movers, the contradictory data classification schemes, the incompatible policy schemes, and the archaic mindsets that emanate from legacy architectures that date back 15 to 20 years.

I have no beef with honest vendors who go out and give it their best to integrate with other products, or multiple products they offer, in order to deliver a solution.  We do the same thing at Cofio with some solutions, and of course we have the advantage of a TRULY UNIFIED set of solutions in a single product, AIMstor.  What  I have a hard time understanding, is how some vendors (you know who you are, big and small) can lie so boldly, mislead customers, and claim unified or tight integration, where none exists.

The 4-Step Dedupe for Backup

March 19th, 2010
  • Originally Posted by Tony Cerqueira on Sat, Aug 15, 2009 @ 08:36 PM

So, we all know now that the old legacy backup solutions create HUGE waste and require deduplication appliances. But now, for users considering upgrading their environments for “intelligent backup”, DR, replication, or archive, or simply to herd the cats of unstructured data, there seems to be confusion among users about the issues of:

-Source Deduplication (doing the dedupe at primary data), versus

-Target Deduplication (doing the dedupe at the repository).

A number of articles and postings have been put out there, but it typically comes down to the same question: What is best, doing it at Source or Target?

We asked the same question. Then we said, heck, why not BOTH?

This is why AIMstor was originally architected to provide 4-step Dedupe.  Source level dedupe via 2 methods, and Target level dedupe via 2 other separate methods:

Source Side Dedupe:
Step 1-Duplicate Transfer Avoidance: At the initial sync between node and repository, if the repository has the data already (from a previous node), it tells the node to transfer only “new” data. This saves a lot of time and network bandwidth, and minimizes initial repository capacity.

Step 2-Real-Time Changed Byte Transfer: At the same time, with subsequent Backups, CDP, Replication and Versions, AIMstor will only transfer changed bytes from the node. That reduces network traffic and load on the node. Because AIMstor is real-time, there is no scan or trawl.  So data constantly trickles from the node when it changes, and hits the repository according to the RPO settings for the backup.

Target Dedupe:
Step 3-Multi-Level Single Instance Storage: Because the AIMstor repository is unified across Backups, Replication, CDP and Versions, it allows only a single instance of a file, no matter where it came from.

Step 4-Global Object Deduplication: Also in the repository, AIMstor runs a final post-processing deduplication algorithm across all data sets from all machines, thereby finalizing the deduplication with four complete steps to best reduce the total amount of data capacity used in repositories.
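For readers who want to see the shape of the four steps, here is a toy sketch. It is emphatically not AIMstor's code: the `Repository` class, the SHA-256 hashing, the 4-byte block size, and the use of fixed-size blocks in place of true changed-byte tracking are all assumptions made to keep the example short and runnable.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow (assumption)


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


class Repository:
    def __init__(self):
        self.objects = {}   # content hash -> bytes, one instance per unique file
        self.catalog = {}   # (node, path) -> content hash

    # Step 1: the node offers a hash first; data moves only if it is new.
    def needs_transfer(self, content_hash: str) -> bool:
        return content_hash not in self.objects

    # Step 3: identical content from any node or policy is stored once.
    def store(self, node: str, path: str, data: bytes) -> None:
        content_hash = digest(data)
        self.objects.setdefault(content_hash, data)
        self.catalog[(node, path)] = content_hash

    # Step 4: post-processing pass counting unique blocks across all objects.
    def global_block_dedupe(self):
        unique, total = set(), 0
        for data in self.objects.values():
            for block in blocks(data):
                total += 1
                unique.add(digest(block))
        return total, len(unique)


def initial_sync(repo: Repository, node: str, path: str, data: bytes) -> int:
    """Step 1 in action: return the number of bytes actually transferred."""
    content_hash = digest(data)
    if repo.needs_transfer(content_hash):
        repo.store(node, path, data)
        return len(data)
    repo.catalog[(node, path)] = content_hash  # reference only, no data sent
    return 0


# Step 2: after the initial sync, send only the blocks that differ.
def changed_blocks(old: bytes, new: bytes):
    old_b, new_b = blocks(old), blocks(new)
    return {i: b for i, b in enumerate(new_b) if i >= len(old_b) or old_b[i] != b}
```

A short usage run, with two hypothetical nodes holding the same file: `initial_sync` transfers 12 bytes from the first node and 0 from the second, `changed_blocks(b"AAAABBBBCCCC", b"AAAAXXXXCCCC")` yields only the one modified block, and the final pass reports how many blocks across the repository are unique.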

The AIMstor repository automates all this as part of any policy.  The good news is that it is downloadable now, and available for Windows environments.