
The Silent Threat of Dark Data

[su_spoiler title="Disclosure" icon="plus-circle" anchor="disclosure"]
This post is based on observations I made while attending Tech Field Day 12. I wasn’t compensated for writing this post, nor was I compensated for attending the event. My travel and incidentals to attend the event were covered by Gestalt IT. For more info, see our Disclosures page.
[/su_spoiler]
If you’re responsible for IT operations in almost any business today, there’s a pretty decent chance that you’re charged with storing large amounts of valuable business data. Exactly what “large” means will depend on the organization in question, but it seems that the storage capacity needs of almost every business are growing.
In a day and age where security breaches are so common they're rarely even newsworthy, it's critical that all this data is stored safely. Additionally, as the quantity of stored data grows, it becomes more and more critical (and difficult) to analyze and report on what's being stored. There are strategic, operational, and compliance reasons for wanting (or being required) to do so.

The Problem with Dark Data

A term often used to describe a lack of visibility into the data that's being stored is “dark data.” Think of intelligence and context as a beam of light: wherever the beam shines, the data is illuminated. Without the light shining on it, the data is “dark.” I'll define dark data as data that lacks context and metadata; we don't know what it is or where it came from, and we don't know anything about the value of the information it contains. When it comes to dark data, for all we know, a given file could be a VP of Product's sketches for a top-secret project that's underway, or it could be a log file from your trendy janitor's IoT urinal. [I was totally joking when I wrote that, but apparently this is a real thing. See: http://www.peelytics.com/]
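To make that definition a little more concrete, here's a minimal Python sketch (the share path is hypothetical) of everything a plain file-system crawl can tell you about a file: a name, a size, and a timestamp. Without content-aware indexing layered on top, that's the entire beam of light you get; everything else about the file stays dark.

```python
import os
import time

# Illustrative only: the share path below is hypothetical.
SHARE_ROOT = "/mnt/user-share"

def surface_metadata(root):
    """Yield the little a plain crawl can tell you: name, size, timestamp.
    Nothing here says whether a file is a product sketch or a log dump."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable entries stay dark, too
            yield {
                "path": path,
                "bytes": st.st_size,
                "modified": time.ctime(st.st_mtime),
            }

if __name__ == "__main__":
    for record in surface_metadata(SHARE_ROOT):
        print(record)
```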

Dark Data is Risky

So here's the major issue with dark data: not knowing what you don't know is a big problem. If an organization can't say for sure whether it's storing personally identifiable information, such as patient health records or credit card numbers, in a safe and secure way, data that should be serving the business becomes nothing short of an imminent threat. At any given moment, a breach could cause the loss of control of that sensitive data. It could be passed to nefarious operators, and the fallout from such an event could literally put the company out of business.
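To make the risk tangible, here's a rough, hypothetical sketch of the kind of check you'd want to be able to run: a naive scan for credit-card-like and SSN-like strings sitting in clear text. The patterns, the mount point, and the lack of validation are all placeholders; real PII detection is far more involved. The point is that you can't run even something this crude until you know where the data lives.

```python
import os
import re

# Naive illustration only; real PII detection needs validation (e.g. Luhn checks),
# file-type awareness, and far better patterns. The path below is hypothetical.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_file(path):
    """Return the PII-like pattern labels found in one file's text."""
    hits = set()
    try:
        with open(path, "r", errors="ignore") as handle:
            for line in handle:
                for label, pattern in PATTERNS.items():
                    if pattern.search(line):
                        hits.add(label)
    except OSError:
        pass  # unreadable files are exactly the ones you can't vouch for
    return hits

def scan_tree(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            found = scan_file(path)
            if found:
                print(f"{path}: {sorted(found)}")

if __name__ == "__main__":
    scan_tree("/mnt/secondary-storage")  # hypothetical mount point
```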

Dark Data is Inefficient

“Secondary storage” includes data like user shares, backup/archive data, test/dev, and anything else that doesn't have absolutely critical performance requirements. Dark data in secondary storage can make accurate planning next to impossible. Because of the variety of data types being stored, and the number of silos across which that storage is deployed and allocated, it can be a very tall order to figure out whether funds in next year's budget need to go to test/dev storage or to file share capacity. When you have no idea what the data is, it's just a big blob of bits, and it's impossible to make a sound business decision about it.
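As an illustration of what's missing, here's a small, hypothetical sketch of the capacity breakdown you'd want in hand before writing next year's budget: bytes grouped by a rough guess at data category. The extension-to-category mapping is invented for the example; the takeaway is that anything you can't classify lands in an "unknown" bucket, and that bucket is exactly the dark data the budget question trips over.

```python
import os
from collections import defaultdict

# Hypothetical mapping from file extension to budget category.
CATEGORY_BY_EXT = {
    ".vmdk": "test/dev", ".qcow2": "test/dev",
    ".bak": "backup/archive", ".tar": "backup/archive", ".gz": "backup/archive",
    ".docx": "file shares", ".xlsx": "file shares", ".pdf": "file shares",
}

def capacity_by_category(root):
    """Sum bytes per rough category; anything unrecognized is effectively dark."""
    totals = defaultdict(int)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            category = CATEGORY_BY_EXT.get(ext, "unknown (dark)")
            try:
                totals[category] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical mount point for the example.
    for category, size in capacity_by_category("/mnt/secondary-storage").items():
        print(f"{category}: {size / 1e9:.1f} GB")
```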

Dark Data is Lost Opportunity

The crisis facing many organizations today is that they have more data than they know what to do with. The rise of unstructured data processing platforms like Hadoop and Spark shows that businesses know there's plenty of intelligence to be gleaned from all the data they're storing. The problem is that without context and meaning attached to that data, you can't actually do anything useful with it. But what does it take to create that context and meaning?

Move Compute-Intensive Analytics to the Data

One way to get some intelligence out of all of your dark data is to stand up an analytics cluster to do the processing. You create a copy of the data you want insight into, move it to the analytics cluster, and run jobs against it to learn what's there. That's a laborious and time-consuming process, however.
It would be substantially more efficient to process the data in place, on a platform built to do exactly that. And that's what you'll find with the Cohesity Data Platform. The Cohesity Indexing Engine, which runs continuously on the converged storage cluster, can provide useful insight such as storage-level, VM-level, and file-level metrics. But many modern storage platforms do this to some degree.
What's more impressive is that the Cohesity Data Platform has MapReduce natively built in. Cohesity Clusters make it possible for even the most complex MapReduce computations to run natively within the Cohesity Data Platform, without the external compute or data migrations that are typical of Hadoop-style analytics implementations.
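To ground what "MapReduce natively built in" means, here's a toy, platform-agnostic Python sketch of the map/shuffle/reduce pattern itself, tallying capacity by file extension over a tiny invented inventory. This isn't Cohesity's API, and their jobs run distributed inside the cluster rather than in a single process; it's simply the shape of computation the platform runs against data in place.

```python
from collections import defaultdict

# Toy map/shuffle/reduce over an invented file inventory; not Cohesity's API,
# just the pattern the paragraph above is describing.
INVENTORY = [
    {"path": "/shares/finance/q3.xlsx", "bytes": 48_000},
    {"path": "/shares/eng/build.log", "bytes": 1_200_000},
    {"path": "/backups/vm01.vmdk", "bytes": 40_000_000_000},
]

def map_phase(record):
    """Emit (extension, bytes) pairs for each record."""
    ext = record["path"].rsplit(".", 1)[-1].lower()
    yield ext, record["bytes"]

def shuffle(pairs):
    """Group values by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all bytes seen for one extension."""
    return key, sum(values)

if __name__ == "__main__":
    pairs = (pair for record in INVENTORY for pair in map_phase(record))
    for key, values in shuffle(pairs).items():
        print(reduce_phase(key, values))
```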

Illuminate Your Data

In tandem with the built-in analytics features, users can take advantage of Cohesity's Analytics WorkBench (AWB). AWB is an interface for easily creating precise analytics jobs across large datasets.

Here are some examples of the way illuminating data with MapReduce and AWB might provide real value to a business:

  • eDiscovery: Rapid content analysis to find relevant case information for legal requests or holds.
  • Compliance: Ensure compliance with Personally Identifiable Information (PII) requirements through cluster-wide content scans for names, phone numbers, and credit card numbers that may have been stored in clear text.
  • Threat Analysis: Log correlation across disparate packet capture solutions to identify potential threats or the origin of a security breach (a rough sketch of this kind of correlation follows this list).
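As a hypothetical illustration of that last bullet, here's a sketch of correlating two log sources on source IP within a time window. The records and field names are invented; real packet-capture correlation means parsing and normalizing far messier data. But the basic join is the same, and it's the sort of job you'd rather run where the logs already live.

```python
from datetime import datetime, timedelta

# Hypothetical log records; a real job would parse firewall/IDS/pcap exports.
FIREWALL = [
    {"time": datetime(2016, 11, 20, 3, 14), "src": "203.0.113.9", "event": "blocked port scan"},
]
AUTH = [
    {"time": datetime(2016, 11, 20, 3, 16), "src": "203.0.113.9", "event": "failed admin login"},
]

def correlate(primary, secondary, window=timedelta(minutes=5)):
    """Pair events from two sources that share a source IP within the window."""
    for a in primary:
        for b in secondary:
            if a["src"] == b["src"] and abs(a["time"] - b["time"]) <= window:
                yield a["src"], a["event"], b["event"]

if __name__ == "__main__":
    for src, first, second in correlate(FIREWALL, AUTH):
        print(f"{src}: '{first}' followed by '{second}' within 5 minutes")
```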

Final Thoughts

As I mentioned in the disclosure, this post is based on observations I made while attending Tech Field Day 12. If you'd like to see all of the presentations from the event, they're freely available for your viewing pleasure on the event page. This specific post draws on Cohesity CEO Mohit Aron's opening remarks in Cohesity Strategy & Vision with Mohit Aron. A few of my fellow delegates wrote about Cohesity as well; here are links to their articles:

Although we talked about quite a few different things with Cohesity at TFD12, this topic caught my attention because most organizations today simply can't do this sort of analytics without a lot of overhead. A solution that I LOVE from a technical standpoint, DataGravity, aims to do some of these same sorts of in-place analytics for primary storage. Unfortunately, they've been pretty quiet lately and seem to be working some kinks out without being too vocal about it. Hopefully they'll right the ship and be front and center in the near future; I love the concept they're working with.