Data virtualization solutions, also known as Copy Data Management (CDM), Virtual Copy Data (VCD) and Virtual Data Appliances (VDA), are rising rapidly: over 100 of the Fortune 500 adopted data virtualization solutions between 2010 and the end of 2015. The adoption is hardly surprising, given that virtual data reduces the time to provision copies of large data sets from days down to minutes and eliminates most of the space required for those copies. How many copies of large data sets do companies have? Database vendor Oracle claims that on average a customer has 12 copies of each production database in non-production environments such as development, QA, UAT, backup, business intelligence and sandboxes, and Oracle expects that number to double by the time its latest version, Oracle 12c, is fully adopted. With Fortune 500 companies often having thousands of databases, many of them multiple terabytes in size, the downstream storage costs of these data copies can be staggering.
There are a number of virtual data solutions coming onto the market and several already in the marketplace, such as Oracle, Delphix and Actifio. Delphix and Actifio are listed in The 10 Coolest Virtualization Startups Of 2014, Delphix appears in TechTarget's Top Ten Virtualization Companies in 2014, and Forbes Magazine named Delphix one of America's Top 25 Most Promising Companies of 2014. Oracle, for its part, is filling its product line with data virtualization offerings such as Clone DB, Snap Clone, the Snapshot Management Utility for ZFSSA and ACFS thin cloning in Oracle 12c, and new vendors will come to market over the next year.
Questions to ask when looking at data virtualization solutions:
What unique features does each vendor provide to help achieve my business goals?
Does the solution support my full IT environment, or is it niche/vendor specific?
How much automation, self-service and application integration is pre-built, and how much requires customization?
Are customers similar in size and nature to my company using the solution?
Is the solution simple and powerful or just complicated?
Picking between the available solutions is further complicated by the similar claims all vendors in the market make, so we've come up with a list of the top 3 criteria for choosing between these solutions.
Top 3 criteria for choosing a virtual data solution
The top 3 questions to ask when looking at a virtual data solution are:
Does the solution address your business goals?
Does the solution support your entire IT landscape?
Is the solution automated, complete and simple?
1. Address business goals
The first step is to identify the business problems and clarify if the solutions meet your business goals. The top use cases for data virtualization in the industry are:
Storage savings
Application development acceleration
Data protection & production support
Deciding which of the above use cases apply will help in determining the best solution.
Storage savings
All data virtualization solutions offer storage savings by the simple fact that virtual data provides thin clones of data, meaning that each new copy initially takes up no new space. New space is only consumed once a copy begins to modify data; only the modified data requires additional storage.
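To make the thin clone idea concrete, here is a minimal copy-on-write sketch in Python. It is a toy illustration of the general technique, not any vendor's implementation: a clone shares every block with its source until the block is written, so only modified blocks consume new space.

```python
class ThinClone:
    """Toy copy-on-write clone: shares source blocks until they are written."""
    def __init__(self, source_blocks):
        self.source = source_blocks   # shared, read-only base image
        self.delta = {}               # block number -> privately modified block

    def read(self, block_no):
        # Serve this clone's private copy if the block was modified,
        # otherwise fall through to the shared source image.
        return self.delta.get(block_no, self.source[block_no])

    def write(self, block_no, data):
        # Copy-on-write: only now does the clone consume new space.
        self.delta[block_no] = data

    def space_used(self):
        return len(self.delta)        # blocks of new storage, not a full copy

source = ["block %d" % i for i in range(1000)]  # a 1,000-block "database"
clone = ThinClone(source)
print(clone.space_used())   # 0 -- a fresh clone takes no new space
clone.write(42, "updated")
print(clone.space_used())   # 1 -- only the changed block is stored
```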
Comparing storage savings
To compare the storage savings of various solutions, find out how much storage is required to store new modifications and how much storage is required to initially link to a data source. Of the solutions we've looked at, the initial required storage ranges from 1/3 the size of the source data up to 3x the size of the source data. Some can store newly modified data in 1/3 the actual space thanks to compression; others have no compression, and some have to store redundant copies of changed data blocks.
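As a back-of-the-envelope comparison, the sketch below plugs the ranges quoted above into hypothetical numbers (the 10 TB source with 0.5 TB of modified data is an assumption for illustration, as is the 2x factor for redundant changed blocks):

```python
# Hypothetical storage footprints using the initial-link and change-storage
# ranges quoted above, for an assumed 10 TB source with 0.5 TB of changes.
SOURCE_TB, CHANGED_TB = 10.0, 0.5

solutions = [
    # (name, initial link ratio, modified-data ratio)
    ("compressed solution",     1 / 3.0, 1 / 3.0),  # best case quoted above
    ("uncompressed solution",   1.0,     1.0),
    ("redundant-copy solution", 3.0,     2.0),      # duplicates changed blocks
]
for name, link_ratio, change_ratio in solutions:
    total = SOURCE_TB * link_ratio + CHANGED_TB * change_ratio
    print("%-24s %5.1f TB" % (name, total))
# compressed solution        3.5 TB
# uncompressed solution     10.5 TB
# redundant-copy solution   31.0 TB
# Note the initial link cost is paid once; every additional virtual copy
# afterwards starts at essentially zero extra space.
```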
Data agility more important than storage savings
Storage savings can be massive, but surprisingly, of the hundreds of virtual data adopters we've talked to, most say that data agility is far more important than storage savings. Agility means that virtual copies can be made in minutes, instead of the hours, days or even weeks that traditional full physical copies of large databases can take.
Application development acceleration
The agility that virtual data provides, such as provisioning a full read-writable copy of a multi-TB database in minutes, can improve the efficiency of many parts of a company, but the area where we see the biggest improvement is application development. Companies report 20-80% improvements in application development timelines after moving to data virtualization solutions. Application development typically requires many copies of source data when developing and/or customizing an application, and these copies are required not only by developers but also by QA.
User-friendly self-service interface
When it comes to identifying the best data virtualization solution for application development, look for solutions that provide user-friendly, developer-specific self-service interfaces. Some solutions only provide interfaces for DBAs or storage administrators. Administrator-specific interfaces will continue to impede developers, who have to request copies from those administrators and incur wait time, especially when the administrators are already busy. The improvements to application development come when the solution gives users self-service interfaces through which they can directly make copies of data, eliminating the costly delays of waiting for data.
Developer Centric Interface
When looking at application development acceleration, make sure the solution has a developer-centric interface with per-developer logins that can enforce the right security level: limiting which data developers have access to, how many copies they can make and how much extra storage they can use when modifying data. Data typically has sensitive content that should be masked before it is given to developers; where data is sensitive, look for solutions that include data masking (see the sketch after this paragraph). Also look for developer interfaces that provide standard development functionality such as data versioning, refreshing, bookmarking and rollback. Can one developer bookmark a certain version of a database, and can another developer branch a copy from that bookmark to investigate a particular use case or bug?
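Where masking is needed, the core idea is a deterministic, irreversible transform of sensitive columns before data reaches developers. A minimal sketch, assuming hypothetical column names and using a simple hash (real masking engines preserve formats and referential rules far more carefully):

```python
import hashlib

SENSITIVE = {"ssn", "email", "credit_card"}   # hypothetical column names

def mask_row(row):
    """Replace sensitive values with deterministic, irreversible tokens."""
    return {
        col: hashlib.sha256(str(val).encode()).hexdigest()[:12]
        if col in SENSITIVE else val
        for col, val in row.items()
    }

masked = mask_row({"name": "Pat", "ssn": "123-45-6789"})
print(masked["name"], masked["ssn"])
# The raw SSN is gone, but the same input always masks to the same token,
# so joins across masked tables still line up.
```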
Branching of virtual data copies crucial for QA support
The most important feature for application development acceleration is the ability of the solution to branch data copies. Branching means making a new thin clone from an existing thin clone. Some solutions have this feature and some do not. Why is branching important? For one, a developer's copy of data can be branched from a time before an error was made, such as dropping a table. More importantly, branching is essential for spinning up copies of data for QA directly from development. One of the biggest bottlenecks in development is supplying QA with the correct version of the data or database to run QA tests. If there is a development database with schema changes and/or data modifications, then instead of having to build a new copy for QA to use, with data virtualization and branching one can branch a new clone, or many clones for that matter, and give them to QA in minutes, all the while development continues to use the branch it was working on.
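A toy sketch of branching, extending the copy-on-write idea above (an illustration of the technique, not any vendor's implementation): a branch freezes the parent's changes at branch time and layers its own copy-on-write delta on top, so QA gets development's exact state while development keeps writing.

```python
class Branch:
    """Toy branchable thin clone: a branch captures its parent's state at
    branch time and layers its own copy-on-write delta on top."""
    def __init__(self, base, parent_delta=None):
        self.base = base                            # shared source blocks
        self.inherited = dict(parent_delta or {})   # parent's changes, frozen
        self.delta = {}                             # this branch's own writes

    def read(self, block_no):
        if block_no in self.delta:
            return self.delta[block_no]
        if block_no in self.inherited:
            return self.inherited[block_no]
        return self.base[block_no]

    def write(self, block_no, data):
        self.delta[block_no] = data

    def branch(self):
        # A new thin clone of this branch's current state. Only block
        # references are copied, never the blocks themselves.
        merged = dict(self.inherited)
        merged.update(self.delta)
        return Branch(self.base, merged)

source = ["block %d" % i for i in range(1000)]
dev = Branch(source)
dev.write(7, "schema change")       # development modifies its copy

qa = dev.branch()                   # QA branches dev's exact state in minutes
dev.write(7, "later dev change")    # dev keeps working on its own branch

print(qa.read(7))                   # "schema change" -- QA stays isolated
```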
Data protection for developer virtual copies
Finally, some data virtualization solutions offer data protection for development databases by default. Development databases are often not backed up because they are considered "just development", yet we see an order of magnitude more incidents of developers inadvertently corrupting data on development databases than of production DBAs accidentally damaging data on production databases. Ask whether the solution can provide a branch of a damaged development database down to the second, at a point in time before the developer's mistake. Some solutions offer no protection, others offer manual snapshots of points in time, and the best simply and automatically provide a time window extending multiple days into the past from which a virtual database can be branched off after any mistake or data corruption.
Data protection & production support
Data virtualization solutions can provide powerful data protection. For example, if someone corrupts data on production, such as dropping a table, or a batch job errors out after modifying only some of the data, a virtual database can be spun up in minutes and the uncorrupted data exported from the virtual database and imported into the production database. We have heard numerous stories of the wrong table being dropped on production, or of a batch job deleting and/or modifying the wrong data, with the change propagated immediately to the standby and thus unrecoverable from the standby. Data virtualization can save the day, recovering the data in minutes. Data virtualization can offer impressively fine-grained and wide time windows for Recovery Point Objectives along with fast Recovery Time Objectives.
Time window size and granularity
When looking at data virtualization solutions for data protection, make sure the solution provides a time flow, i.e. a time window of changes from the source data from which virtual copies can be made. Some solutions have no time window, others have only occasional snapshots of past states of data, and the best offer recovery to any point in time, down to the second, within the time window.
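The mechanics behind "any point in time, down to the second" can be sketched as snapshot-plus-change-log recovery: start from the nearest earlier snapshot and replay logged changes up to the requested second. A toy Python illustration under that assumption (not any particular vendor's recovery engine):

```python
# Toy point-in-time materialization: nearest snapshot + change-log replay.
snapshots = {0: {"t1": "v0"}, 3600: {"t1": "v1"}}       # state at t=0, t=3600
change_log = [(120, "t1", "v0a"), (3700, "t1", "v1a")]  # (second, key, value)

def materialize(t):
    """Reconstruct the state at any second t within the time window."""
    base_time = max(s for s in snapshots if s <= t)
    state = dict(snapshots[base_time])          # start from nearest snapshot
    for ts, key, value in change_log:
        if base_time < ts <= t:                 # replay changes up to second t
            state[key] = value
    return state

print(materialize(200))    # {'t1': 'v0a'} -- between snapshots, to the second
print(materialize(3650))   # {'t1': 'v1'}  -- just after the later snapshot
```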
Time window storage requirements
The larger the time window of changes collected, the more storage is required. Find out how much storage is needed to maintain the time window. Some solutions require significant storage for it, while others can store an entire multi-week time window in roughly the size of the original data source thanks to compression.
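As a hedged sizing example (the 10 TB source, 2% daily change rate and 3x compression below are assumptions for illustration):

```python
# Hypothetical sizing of a two-week time window for a 10 TB source that
# changes 2% per day, with and without 3x compression of the change stream.
source_tb, daily_change, window_days = 10.0, 0.02, 14

raw_window_tb = source_tb * daily_change * window_days
print("uncompressed window:  %.1f TB" % raw_window_tb)        # 2.8 TB
print("3x-compressed window: %.1f TB" % (raw_window_tb / 3))  # ~0.9 TB
# With compression, weeks of history fit comfortably within the
# size of the original data source itself.
```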
Time and ease of provisioning
Finally, look into how easy or difficult it is to provision the required data. If the data required is a database, then provisioning can be a complicated task without automation. Does the solution offer point-and-click provisioning of a running database, down to the second, at a past point in time? How easy or difficult is it to choose the point in time from which the data is provisioned? Is choosing a point in time a simple UI widget, or does it require manual application of database logs or manual editing of scripts?
2. Support your entire IT landscape
Is the solution a point solution or does it expand to all the needs of the IT department?
Is the solution specific to a few use cases or does it scale to the full Enterprise requirements?
Is the solution a single data type solution such as only Oracle databases?
Is the solution software that runs on any hardware, or does it require specialized hardware?
Does the solution use any storage system in your IT landscape, or is it restricted to specialized storage systems? Will the solution lock you into a specific storage type, or will it allow full flexibility to adopt new storage types as they become market leaders, such as newer, better and more affordable flash storage systems?
Does your IT landscape use the cloud, and does the solution support your IT department's cloud requirements?
Does the solution support all of your data types and operating systems? For example, does your IT landscape use any of the following databases, and does the solution automate support for them?
Oracle
Oracle RAC
MySQL
DB2
Sybase
PostgreSQL
Hadoop
Mongo
Does your IT landscape require data virtualization for any of the following, and does the solution automate support for these data types?
web application tiers
Oracle EBS
SAP
regular files
other datatypes
Does your IT landscape use, and does the solution support, all of your operating system types?
Linux
HP/UX
Solaris
Windows
AIX
3. Fully Automated, Complete and Simple
Automated
How automated is the solution? Can an end user provision data, or does it require a specialized technician such as a storage admin or DBA? When provisioning databases such as Oracle, SQL Server or MySQL, does the solution fully and automatically provision a running database, or are manual steps required? For example, some solutions only provision data from a single point in time at the data source. What if a user requires a different point in time? How much manual intervention is required? Some solutions only support provisioning data from specific snapshots in the past. What if a user requires a point in time that falls between snapshots? How much manual intervention is required then? Does the solution collect changes automatically from the data source, or does it require other tools or manual work to collect changes from the source or get newer copies of the source data?
Complete
How complete is the solution?
Is the solution a point solution for a specific database like Oracle or does it support multiple database vendors as well as application stacks and other file types?
Does the solution include masking of data?
Does the solution include replication or other backup and fail over support?
Does the solution sync with a data source and collect changes, or is it simply an interface to manage storage array snapshots?
Does the solution offer point in time recovery down to the second or is it limited to occasional snapshots?
Does the solution provide interfaces for your end user self-service?
Does the solution offer performance monitoring and analytics?
Does the solution provide data sharing on disk only, or does it share data at the caching layer as well?
Simple
How long does it take to install the solution? We’ve seen systems set up in 15 minutes and others take 5 days.
How easy or hard is it to manage the solution? Can the solution be managed by a junior DBA or junior IT person or does it require expert storage admins and DBAs?
Does the solution come with an alerting framework to make administration easier?
Does the interface provide a "single pane of glass" that scales to thousands of virtual data copies across potentially hundreds of separate locations in your IT landscape?
Is it easy to add more storage to the solution? Is it easy to remove unneeded storage from the solution?
In Summary
Find out how powerful, flexible and complete the solution is.
Is the solution a point solution or a complete solution? Some solutions are specific point solutions, for example only for Oracle databases. Some are tied to specific hardware or storage systems, while others are complete software solutions. Complete, flexible solutions sync automatically with source data, collect all changes from the source, provide data provisioning down to the second from anywhere within the time window, support any data type or database on any hardware, and support the cloud.
Does the solution provide self-service and end-user functionality?
Point-in-time provisioning
Reset, branch and rollback of environments
Refresh parent and children environments with the latest data
Provision multiple source environments to the same point in time
Automation / self-service / auditing capabilities
Some simple technical differentiators: does the solution
support your data and database types on your systems and OS?
run on your existing data center resources, or does it require specialized hardware or storage?
sync automatically with source data, or does it leave syncing as a manual exercise or require other solutions?
provision data copies down to the second from an extended time window into the past?
branch virtual data copies?
include cloud support?
But when it comes down to it, even after asking all these questions, don't believe the answers alone. Ask the vendor to prove it. Ask the vendor to provide in-house access to the solution and see how easy or hard it is to install, manage and execute the functionality required.