kyle Hailey

# The Principle of Least Storage

**We’re copying and moving redundant bits**

In any application environment, we’re moving a lot of bits. We move bits to create copies of Prod for BI, Warehousing, Forensics and Production Support. We move bits to create Development, QA, and Testing Environments. And, we move bits to create backups. Most of the time most of the bits we’re moving aren’t unique, and as we’ll discover, that means they we’re wasting time and resources moving data that doesn’t need to be moved.

**Unique Bits and Total Bits**

Radically reducing the bulk and burden of caring for all of the data in the enterprise has to start with two fundamental realizations: First, the bits we store today are often massively redundant. Second, we’ve designed systems and processes to ship this redundant data in a way that makes data consolidation difficult or impossible. Let’s look at a few examples:

**Backup Redundancy**

Many IT shops at major companies follow the Weekly Full Daily Incremental Model and will keep 4 weeks full of their data on hand for recovery. If we assume that for a data store (such as a database) the daily churn rate is 5% per day, then we could describe the total number of bits in the 4 week backup cycle as follows: (Using X as the current size of the database and ignoring annual growth):

Total Bits: 4*X + 24*5%*X = 5.20*X

But how much of that data is really unique? Again, using X as the current size of the database and ignoring annual growth:

Unique Bits: X + 27*5%*X = 2.35*X

The ratio of total to unique bits is 5.2 / 2.35 or 2.09. That is, our backups are 51% redundant at a bit level. Moreover, the key observation is that the more full backups you perform, the more redundant your data is.

**Environment Redundancy**

According to Oracle, the average application has 8 copies of their production database, and this number is expected to rise to 20 in the next year or two. In my experience, when backups have about a 5% daily change rate, Dev/Test/QA classes of environments have about a 2% daily change rate, and are in general 95% similar to their production parent database even when accounting for data masking and obfuscation.

If we assume an environment with 8 copies that are being refreshed monthly, start out 5% divergent and churn at a rate of 2% per day, then we could describe the total number of bits in these 8 environments as follows: (Using X as the current size of the database and ignoring annual growth):

Total Bits: 8*95%*X + 2%*30*8*X = 10*X

But how much of that data is really unique? Again, using X as the current size of the database and ignoring annual growth:

Unique Bits: X + 2%*30*8*X = 3.40*X

The ratio of total to unique bits is 10 / 3.4 or 2.94. That is, our copies are 65% redundant at the bit level. Moreover, the key observation is that the more copies you make, the more redundant your data is.

**Movement is the real redundancy**

Underlying this discussion of unique bits vs. total bits is the fact that most of the time, the delta in bits between the current state of our environment and the state we need it to be in is actually very small. In fact, if we eliminate the movement of bits to make operations happen, we can reduce the total work in any operation to almost nothing. If you’re hosting not just one copy but every copy from a shared data footprint, you have a huge multiplying effect on your savings.

The power of a shared data footprint is that it makes a variety of consolidations possible. If the copy of production data is shared in the same place as the data from the backup, redundant bits can be removed. If that same data is shared with each development copy, then even more redundant bits can be removed. (And, in fact, we see a pattern of only storing unique bits emerging.) Finally, if we need to fresh development, we can move almost NO bits. Since every bit that we want already exists in the production copy, we just have to point to those bits and do a little renaming. And because its a shared footprint, we don’t have to export huge amounts of data to a distant platform; we can just present those bits (e.g., via NFS).

Consider a developer who needs to refresh his 1 TB database from his production host to his development host in concert with his 2-week agile sprints. In a world without thin clones, this means we transmit 1 TB over network every 2 weeks. In a world with thin clones and a shared footprint, we copy 8 GB locally and don’t have to transmit anything to achieve the same thing.

**The better answer**

Regardless of our implementation, we reach maximum efficiency when we achieve our data management operations at the lowest cost. Reducing the cost of movement is part of that, so I offer the: Principle of Least Movement:

Move the minimum bits necessary the shortest distance possible to achieve the task.

**So what?**

There’s a workload attached to moving these bits around – a cost measured in bits on disk or tape, network bandwidth consumed, and hours spent. Since we’re moving a lot of redundant bits, much of that work is unnecessary. There’s money to be saved, and it isn’t a small amount of money. And, that cost doesn’t just end in IT. It costs the business every time a Data Warehouse can’t get the fresh data it needs to so that real time decisions can be made. (Should I increase my discount now, or wait until tomorrow? Should I stock more of Item X because there is a trend that people are buying it?) It costs the business when a production problem continues for an extra 4 or 6 or 8 hours because that’s how long it takes to restore a forensic copy. In fact, in my experience, the business benefit to applications for outweighs the cost advantage, which is not insignificant.