Kyle Hailey

Big data or Agile data?




The big data phenomenon threatens to break the existing data supply chain (DSC) of many information providers, particularly those whose chains are neither flexible nor scalable and include too many error-prone, manual touch points.  – Cognizant

Big data is getting big hype, but what exactly is driving it? The driving force is business demand for answers to revenue-driving questions. Questions that drive revenue-generating decisions depend on the ability to get the right data to the right people at the right time. That ability unfortunately remains elusive with big data: according to Gartner, 85% of Fortune 500 organizations in 2015 will still be unable to exploit big data for competitive advantage. How can companies get the competitive edge? Through a new technology called data virtualization, which uses existing relational databases but accelerates access to the data.

Data is a hot topic.

Big Data is an even hotter topic.

But data agility? I don’t hear much about it.

It is time to stop the stampede to create capacity to analyze big data and instead pursue a more balanced approach that focuses on finding more data sets and understanding how to use them to improve your business.

What are we all striving for? We are striving to access the data we want, when we want it, both quickly and efficiently. That’s what I call data agility.

Data agility is the big pink elephant in the room. Everyone is talking about big data, but no one is talking about how you get the data where you want it, when you want it. If you want to do big data, how do you get terabytes (TB) of data onto your Hadoop cluster from wherever it was collected? How long does it take? How much work is it?

The bigger big data gets, the more challenging it becomes to manage and analyze to deliver actionable business insight. That’s a little ironic, given that the main promise of big data is the ability to make better business decisions based on compute-intensive analysis of massive data sets. The solution is to create a supply chain that identifies business goals from the start — and deploy the agile infrastructure necessary to make good on those objectives.

Getting results from big data, even after you have the data, is difficult. Most Fortune 500 companies don’t know how to get results from big data:

Through 2015, 85% of Fortune 500 orgs will be unable to exploit big data for competitive advantage. http://www.gartner.com/technology/topics/big-data.jsp

Unlike with big data, most companies already have burning questions on their current relational databases that they know how to answer, if only they could get to the data faster. The questions are clear and the methods for answering them are known; the problem is getting the right data to the right place at the right time. That is why companies invest millions in ERP every year: they want answers to important business questions faster and with fresher data. Fresh data means getting the right data to the right people at the right time.

That is data agility: getting the right data to the right people at the right time. Data is the lifeblood of more and more of the economy, and the economy itself is becoming a data economy. Data agility is crucial for companies to succeed in it, now and in the future.

Agile data is the solution to getting the right data to the right place at the right time. Its technological core is data virtualization, which requires tracking data blocks at the storage level. By tracking blocks at the storage level, duplicate blocks can be shared across many different copies of the data, while any changed block is stored so that only the copy that made the change sees it. Tracking block changes and sharing duplicate blocks is the core of data virtualization, and that core has been around for almost 20 years in the form of storage system snapshots. But like the internet without the web or gasoline without a car, agile data can’t happen without an agile data platform.
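To make the block-sharing idea concrete, here is a minimal Python sketch, entirely my own illustration and not any vendor's implementation, of thin copies that share unchanged blocks and keep private copies only of the blocks they write:

```python
# Minimal sketch of the block-sharing idea behind data virtualization:
# clones share unchanged blocks with their source and keep private
# copies only for blocks they modify.

class BlockStore:
    """Holds every unique block once; copies reference blocks by id."""
    def __init__(self):
        self.blocks = {}      # block_id -> bytes
        self.next_id = 0

    def put(self, data: bytes) -> int:
        self.blocks[self.next_id] = data
        self.next_id += 1
        return self.next_id - 1

class VirtualCopy:
    """A 'thin clone': a map of logical block number -> shared block id."""
    def __init__(self, store: BlockStore, block_map: dict):
        self.store = store
        self.block_map = dict(block_map)   # copy of the map, not of the data

    def clone(self) -> "VirtualCopy":
        # Cloning copies only the map; all data blocks stay shared.
        return VirtualCopy(self.store, self.block_map)

    def read(self, blkno: int) -> bytes:
        return self.store.blocks[self.block_map[blkno]]

    def write(self, blkno: int, data: bytes):
        # Only this copy's map points at the new block; other copies
        # still see the original, unchanged block.
        self.block_map[blkno] = self.store.put(data)

# Usage: a 'production' image and a thin clone sharing its blocks.
store = BlockStore()
prod = VirtualCopy(store, {i: store.put(b"prod block %d" % i) for i in range(4)})
dev = prod.clone()
dev.write(2, b"dev change")
assert prod.read(2) == b"prod block 2"   # production is unaffected
assert dev.read(2) == b"dev change"      # clone sees its private change
```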

A virtual data platform automates all the pieces of tracking and provisioning the data, and encapsulates them in hardware-agnostic software with a user-friendly, self-service interface.

For example, how would one supply data to each of the following? (A self-service provisioning sketch follows the list.)

  1. a developer who needs a copy of production as it was yesterday at noon, when bugs were seen

  2. a QA team that needs a copy of the development database with all of its current changes

  3. a BI team that needs 24×7 access to production data for ETL batch jobs

  4. a production database team that needs to recover data as it was before it was inadvertently deleted or incorrectly modified on production.
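A self-service interface for these four cases might look something like the sketch below. Everything in it is hypothetical: the DataPlatform class, its method and parameter names, and the timestamps are invented for illustration and do not correspond to any real product's API.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical self-service API; the class, methods, and parameters are
# invented for illustration only, not a real product's interface.
class DataPlatform:
    def provision_clone(self, source: str, target_host: str,
                        point_in_time: Optional[datetime] = None,
                        refresh_continuously: bool = False) -> str:
        """Provision a thin clone of `source` onto `target_host`.
        point_in_time=None means 'the latest data available'."""
        when = point_in_time or datetime.now()
        # A real platform would share unchanged blocks over NFS/FC and
        # recover the clone to `when`; here we just report the request.
        print(f"clone of {source} at {when:%Y-%m-%d %H:%M} -> {target_host}")
        return f"{target_host}:/clones/{source}"

platform = DataPlatform()
noon_yesterday = datetime.now().replace(hour=12, minute=0, second=0) - timedelta(days=1)

platform.provision_clone("prod-erp", "dev-host-01", point_in_time=noon_yesterday)  # 1. developer
platform.provision_clone("dev-erp", "qa-host-01")                                  # 2. QA
platform.provision_clone("prod-erp", "bi-host-01", refresh_continuously=True)      # 3. BI / ETL
platform.provision_clone("prod-erp", "restore-host-01",                            # 4. recover data
                         point_in_time=datetime(2014, 6, 3, 9, 41))                #    before the bad delete
```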

The problem with cloning a database using file system snapshots

Let’s look at a concrete example of cloning a database with file system snapshots. Almost any experienced storage admin can take a snapshot of an Oracle database running on specialized storage capable of supplying snapshots. The snapshot itself is easy, though it may still be necessary to shut the database down first to ensure the data is consistent. Even when the database can remain running during the snapshot, it may require specialized functionality or extra vendor packages if the database spans multiple LUNs whose snapshots have to be synchronized. Once the snapshot is made, an experienced DBA, in coordination with the storage admin, can start up a database on it. Starting up such a database requires renaming the database, changing the location of files that were not part of the snapshot (typically log files, trace files, and initialization files), and then recovering the database. If the database is being started on another machine, that machine may need the snapshot files made accessible over fibre channel or mounted via NFS. If the NFS or fibre channel configuration changes the paths to the datafiles, then the parameter files, and possibly other files, have to be updated with the new datafile locations before the database can start. If the copy is required up to the most recent point in time, redo files may have to be pulled from the source database to recover the clone up to the last minute or second. And all of that is the easy case.
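As a rough illustration of the storage-side half of those steps, here is a sketch driving standard ZFS commands from Python. The pool, dataset, and snapshot names are made up, and the Oracle-side work (renaming the database, fixing file locations, recovery) is only indicated in comments because it is highly environment-specific.

```python
import subprocess

def run(cmd):
    """Run a shell command and fail loudly; the commands shown are standard ZFS CLI."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Names below (pool/oradata, clone_dev) are illustrative only.
SOURCE_SNAP   = "pool/oradata@dev_copy_2014_06_03"
CLONE_DATASET = "pool/clone_dev"

# 1. Quiesce or hot-backup the source database first (DBA step, not shown).

# 2. Take the snapshot and create a writable thin clone from it.
run(f"zfs snapshot {SOURCE_SNAP}")
run(f"zfs clone {SOURCE_SNAP} {CLONE_DATASET}")

# 3. Export the clone to the target machine over NFS.
run(f"zfs set sharenfs=on {CLONE_DATASET}")

# 4. On the target host (DBA steps, environment-specific, not shown here):
#    - mount the NFS share,
#    - rename the database and point the parameter file at the new datafile paths,
#    - relocate the log/trace/init files that were not in the snapshot,
#    - recover the database, applying redo from the source if an
#      up-to-the-minute copy is needed.
```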

In the easy case we end up with a clone database, a “thin copy”, that shares duplicate blocks with the original database but stores changed blocks separately. The problem is that we now have a development, test, reporting or other copy of the source database running on the same storage as the source. If the source database is an important one, our copy will impact its performance, and protecting the source database from performance impact is one of the main reasons for creating copies in the first place. If the storage snapshot is copy-on-write, the impact is even bigger, since every write induces a read and two writes (read the original block, write it somewhere else, then write out the new block). To solve this we want to get the clone files onto separate storage. We can copy the entire database to a separate filer, call it the development filer, and make our thin copies there. The next problem arises when someone wants a copy of the source database as of tomorrow and all we have is the copy from yesterday. In that case we have to copy the entire source database across to the development storage array again, which defeats the purpose of thin cloning: fast, storage-efficient clones. How do we solve all these complications with thin cloning technology?
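The copy-on-write penalty mentioned above is simple arithmetic. The toy count below, my own simplification that ignores caching and metadata I/O, shows why a copy-on-write snapshot roughly triples the physical I/O for the first overwrite of each block, while a redirect-on-write design does not:

```python
# Toy comparison of physical I/Os per first overwrite of a snapshotted block.
# This ignores caching, coalescing, and metadata writes; it only illustrates
# the "1 read + 2 writes" pattern described above.

def copy_on_write_ios(blocks_overwritten: int) -> dict:
    # For each block: read the original, write it to the snapshot area,
    # then write the new data in place.
    return {"reads": blocks_overwritten, "writes": 2 * blocks_overwritten}

def redirect_on_write_ios(blocks_overwritten: int) -> dict:
    # The new data is simply written to a new location; the original block
    # stays where it is for the snapshot to keep referencing.
    return {"reads": 0, "writes": blocks_overwritten}

n = 1_000_000  # a million 8 KB blocks overwritten, roughly 8 GB of changes
print("copy on write:    ", copy_on_write_ios(n))      # 1M reads, 2M writes
print("redirect on write:", redirect_on_write_ios(n))  # 0 reads,  1M writes
```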

Solution to thin cloning: data virtualization

Thin cloning obstacles are solved with data virtualization, which consists of three technologies. The first continuously collects all the changes from a data source and writes them to storage capable of file system snapshots. The second manages that storage, keeping all changes within a time window and purging data older than the window once it is no longer needed. The third harnesses the file system snapshots and the time window to provision database thin clones to target machines over fibre channel or NFS. All of this can be rolled into a software stack that runs on commodity hardware and maps a file system onto any storage. It can be cobbled together with a file system such as open source ZFS plus scripting, or it comes repackaged as self-contained software such as Delphix.
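The second piece, the retention window, is easy to sketch. Assuming snapshots are named with a timestamp suffix, which is a convention I am inventing here, a small housekeeping job could purge anything older than the window:

```python
import subprocess
from datetime import datetime, timedelta

DATASET = "pool/oradata"          # illustrative dataset name
WINDOW  = timedelta(days=14)      # keep two weeks of changes

def list_snapshots(dataset):
    """Return snapshot names for the dataset (standard 'zfs list' invocation)."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", dataset],
        capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

def purge_old_snapshots(dataset, window):
    cutoff = datetime.now() - window
    for snap in list_snapshots(dataset):
        # Assumes a naming convention like pool/oradata@2014-06-03T12:00.
        stamp = snap.split("@", 1)[1]
        taken = datetime.strptime(stamp, "%Y-%m-%dT%H:%M")
        if taken < cutoff:
            # Note: zfs refuses to destroy a snapshot that still has clones,
            # so snapshots backing live thin clones are naturally kept.
            subprocess.run(["zfs", "destroy", snap], check=True)
            print("purged", snap)

purge_old_snapshots(DATASET, WINDOW)
```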

Data Virtualization is exploding

Four years ago data virtualization technology was non-existent. Since then hundreds of companies have moved to virtual data platforms. When evaluating agile data technology, some of the key functionalities to look for are:

  1. Databases – support for cloning major databases such as Oracle, SQL Server, and PostgreSQL

  2. Applications – support for thin cloning application stacks

  3. Self Service – an interface that application developers, QA staff, and BI teams can easily use to provision their own clones on demand

  4. Branching – support for branching clones, meaning making clones of clones, which is crucial for multi-version development or even just patching a previous version (see the sketch after this list)

  5. Synchronization – support for cloning multiple related databases such that each clone comes from the exact same point in time. Any Fortune 500 company with multiple revenue-tracking databases will need synchronized clones of each to analyze financial close discrepancies

  6. Cloud Ready – support for any storage and efficient, low-bandwidth replication across heterogeneous storage types; installs on any commodity hardware

  7. Any Time – make clones from any point in the timeflow window, down to the second

  8. Live Archive – save specific points in the timeflow forever to support compliance and auditing
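Branching, item 4 above, falls out naturally from snapshot-based clones: a clone is itself a dataset that can be snapshotted and cloned again. A rough sketch, reusing the illustrative ZFS dataset names from earlier:

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Illustrative names only; 'pool/clone_dev' is the thin clone created earlier.
# A clone is just another dataset, so it can be snapshotted and cloned again,
# giving a branch: a clone of a clone.
run(["zfs", "snapshot", "pool/clone_dev@before_patch_branch"])
run(["zfs", "clone", "pool/clone_dev@before_patch_branch", "pool/clone_dev_patch_1"])

# 'pool/clone_dev_patch_1' shares all unchanged blocks with both the
# development clone and the original source, so the branch costs almost
# no additional storage until it diverges.
```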

Now that data virtualization is maturing, the next frontier is arriving quickly: the data supply chain.

Data Supply Chain


It’s time to start treating your data less as a warehouse and more as a supply chain. Having identified your sources of data, you must corral it for analysis, in the same way that the various components come together on an assembly line. Recognize that the data won’t be static—it will be manipulated as it goes through the supply chain, added to other pieces of data, updated as more recent data comes along, and transformed into new forms as you look at different pieces of data in aggregate.      – Syed Rasheed, Red Hat

Data Supply Chain features are quickly evolving but include:


  1. Security
     * Masking (a minimal masking sketch follows this list)
     * Chain of custody

  2. Self Service
     * Login and Roles
     * Restrictions

  3. Developer
     * Data Versioning and Branching
     * Refresh, Rollback

  4. Audit
     * Live Archive

  5. Modernization
     * Unix to Linux conversion
     * Data Center migration
     * Federated data cloning
     * Consolidation
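Of these, masking is the easiest to sketch. The minimal Python illustration below uses hypothetical table rows and column names, and a far weaker transformation than a production masking tool would apply; it only shows where masking sits, namely before a clone is handed to developers or QA:

```python
import hashlib

# Hypothetical rows and column names; a real masking step would run inside
# the provisioning workflow, before developers or QA ever see the clone.
SENSITIVE = {"ssn", "credit_card", "email"}

def mask_value(value: str) -> str:
    # Deterministic, irreversible stand-in so joins still line up across tables.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def mask_row(row: dict) -> dict:
    return {col: (mask_value(str(val)) if col in SENSITIVE else val)
            for col, val in row.items()}

customers = [
    {"id": 1, "name": "Ann", "ssn": "123-45-6789", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "ssn": "987-65-4321", "email": "bob@example.com"},
]
print([mask_row(r) for r in customers])
```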

Data Supply Chain re-invents data management and provisioning by virtualizing, governing, and delivering data on demand.





Yes, data technologies are evolving rapidly, but most have been adopted in piecemeal fashion. As a result, enterprise data is vastly underutilized. Data ecosystems are complex and littered with data silos, limiting the value that organizations can get out of their own data by making it difficult to access. To truly unlock that value, companies must start treating data more as a supply chain, enabling it to flow easily and usefully through the entire organization—and eventually throughout each company’s ecosystem of partners too.
