Back when data was little and simple, self-service analysis advocates started the chant, “Free the data!” IT stood in the way, they said. Fast forward to 2016: “democratized data” has become common, but so has public concern over privacy.
That nettlesome struggle drove what now stands as the data industry’s most important discussion of 2016. Around the conference table at last summer’s Pacific Northwest BI Summit, the annual, invitation-only confab held in Grants Pass, Oregon, two dozen data leaders pondered the issue for almost two hours. They concluded with an idea that broke open industry assumptions.
UK-based consultant Mike Ferguson told of a meeting held in continental Europe. He expected the usual stakeholders, such as those from marketing and HR. But alongside them were two staff members from the corporate counsel’s office. The lawyers made it clear that anything decided there would need their approval.
Their worry? Compliance with the European Union’s impending data privacy law, the General Data Protection Regulation. When it takes effect in 2018, fines for privacy violations (including failure to erase individuals’ online presence on request) could amount to 4 percent of global revenue. Other regions will likely comply, eager to ensure continuing access to the EU market and to EU data. Across the globe, the old democratization-versus-privacy struggle is about to grow some big, sharp teeth.
It poses a dilemma. Everyday business now requires ready access to data. Even complying with the new privacy regulations requires access to data, even as those regulations seek to limit it.
At the problem’s root, says Ferguson, is data integration. Multiple platforms and tools have evolved to serve big data’s proliferating, specific workloads. Streaming data, Hadoop, the enterprise data warehouse, NoSQL and others chug away, each one possibly processing another platform’s data. And all that data keeps coming in faster and faster.
Data integration’s too expensive
“What I hear from clients,” said Ferguson, who is managing director of Manchester-based Intelligent Business Strategies, “is that the cost of data integration is way too high.” Skills are spread across lots of tools, everything gets reinvented continually, metadata is fractured or lost entirely as it runs through multiple tools, and there’s just too much repetition all around. Data integration among platforms seems to become more complex all the time.
Self-service data integration is cheap. Many in IT like it. But, says Ferguson, it quickly results in “a kind of Wild West.” Data moves uncontrollably, with no one guarding the sources. Users apply countless tools for data prep, ETL, data integration, and other functions, and silos proliferate.
“There’s got to be a better way,” said Ferguson. He suggests supplanting the “data lake” with what he calls a “data reservoir”: a governed collection of raw, in-progress and trusted data that incorporates multiple stores and streams. The “reservoir” would define data once to run anywhere and supply info fast.
“The smart thing is to offer virtual views, Amazon-like,” said Ferguson. Instead of being copied, the data would be offered in virtualized form, ready to use. Data’s “Wild West” would be tamed with riding stables: Ride a trusted horse on a known trail.
Local policies could be applied as the data’s dispensed. Users with proper rights would see the data. But those without rights would be told, “Sorry, Dude, you can’t see it. Wrong jurisdiction!”
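One way to picture a policy applied "as the data's dispensed" is a read-only view that checks the requester before handing anything over. The following is a minimal sketch, not Ferguson's design or any vendor's product: the `User`, `Policy` and `VirtualView` names, the jurisdiction labels and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    jurisdiction: str      # hypothetical label, e.g. "EU" or "US"
    roles: frozenset


@dataclass(frozen=True)
class Policy:
    allowed_jurisdictions: frozenset
    required_role: str


class VirtualView:
    """Read-only projection over a source; rows are produced lazily, not copied up front."""

    def __init__(self, source_rows, columns, policy):
        self._source = source_rows
        self._columns = columns
        self._policy = policy

    def read(self, user):
        # The policy is evaluated at read time, so it can differ per requester.
        if user.jurisdiction not in self._policy.allowed_jurisdictions:
            raise PermissionError("Sorry, you can't see it. Wrong jurisdiction!")
        if self._policy.required_role not in user.roles:
            raise PermissionError("Sorry, you can't see it. Missing role.")
        # Expose only the governed columns, one row at a time.
        return ({c: row[c] for c in self._columns} for row in self._source)


# Illustrative data and users (all made up for this sketch).
customers = [
    {"id": 1, "name": "Ana", "email": "ana@example.com", "country": "DE"},
    {"id": 2, "name": "Luc", "email": "luc@example.com", "country": "FR"},
]
view = VirtualView(customers, ["id", "country"],
                   Policy(frozenset({"EU"}), "analyst"))
eu_analyst = User("eu_analyst", "EU", frozenset({"analyst"}))
us_analyst = User("us_analyst", "US", frozenset({"analyst"}))
```

Calling `list(view.read(eu_analyst))` yields the two rows with only the `id` and `country` columns, while `view.read(us_analyst)` raises `PermissionError`. A real data virtualization layer would also need auditing, lineage and row-level rules, which this sketch leaves out.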
Urgency
Underscoring the urgency of controlling data, Harriet Fryman, vice president of marketing at IBM, told of an unsettling tweet and a crashed drone on her roof. The tweet read, “I think my drone is on your roof. Can I have it back?” Fryman went to her roof and, sure enough, there was a crashed drone. As the owner explained later, his drone was equipped to send one last photo home before it crashed. From that image, he matched the visible roofline with a Google Maps satellite view, and from there he followed a circuitous path to Fryman’s Twitter account.
Meanwhile, explained SAS vice president of best practices Jill Dyché, executives are fed up waiting for a solution to the problem Ferguson described. Dyché has observed “an utter lack of confidence” among executives in the ability of organizations to govern data.
Donald Farmer, principal at TreeHive Strategy, raised another problem. “It’s incredibly difficult to prove something’s been deleted,” especially when the data’s already been propagated. “How do you track it back?”
Solutions
The typically voluble group went quiet for a moment, attesting to the challenge.
A surprising suggestion came from Farmer: The solution may be organizational, he said, not technological. The risk of violating privacy laws could be minimized if companies isolated risk with spinoffs. The parent company would grow as far as it could with the current technology, governance, and practices. Then it would spin off a subsidiary that would own the risky data along with the liability. Eventual innovations would transfer homeward, abandoning the risk with the spinoff’s shell.
Merv Adrian, however, disagreed. “I don’t believe that for a minute,” he said. “They’ll find a way around it.” Later, he wrote to me in an email, “Companies don’t do spinouts lightly. It’s disruptive, complex and costly.” The incentive would have to be strong.
Farmer had a second, even more intriguing idea. “One of the myths is that we need more information,” he said. If we think again about the data we use and why we use it, he explained, we might find just about the same value with Bayesian noise added. Data can be slightly wrong, with enough noise inserted to prevent hacks, and still have equal benefit to business users.
That is, the data doesn’t have to be right, just slightly wrong. At first glance, it’s an outlandish idea. It invited quips, perhaps a natural response to an implicit admission that technology may not be the answer. But who can even hope that until-now unknown difficulties, founded on a new world of unheard-of complexity and an aroused public, could be solved with technology alone? Farmer’s idea, or something like it, may prove itself yet.
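The arithmetic behind "slightly wrong but still useful" can be sketched briefly. The example below is an illustration under stated assumptions, not the method Farmer described: it swaps in zero-mean Laplace noise, a common choice in differential privacy, and the salary figures, noise scale and `add_laplace_noise` helper are all made up for the demonstration.

```python
import random
import statistics


def add_laplace_noise(values, scale, rng):
    """Return a copy of values with zero-mean Laplace noise added to each one.

    The difference of two independent exponential draws with the same
    rate is Laplace-distributed with scale 1/rate.
    """
    rate = 1.0 / scale
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical individual records: 10,000 made-up salaries.
salaries = [rng.uniform(40_000, 90_000) for _ in range(10_000)]

# Each individual value is perturbed by a few thousand on average...
noisy = add_laplace_noise(salaries, scale=5_000, rng=rng)

# ...yet the aggregate a business user cares about barely moves,
# because the zero-mean noise largely cancels out across many rows.
true_mean = statistics.fmean(salaries)
noisy_mean = statistics.fmean(noisy)
mean_shift = abs(noisy_mean - true_mean)
```

With these assumed numbers, any single noisy salary is an unreliable guide to the real one, while `mean_shift` stays small relative to the noise scale. Noise of this kind does not stop a breach; it limits what leaked records reveal about any one person, and choosing the scale is a trade-off between privacy and accuracy.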
“These are hard problems,” observed Robert Eve, director of product management, data and analytics software at Cisco Systems, with one last quip before lunch. With a colloquialism denoting the need for deep thought and newfound finesse, he added, “At run time, you have to understand the kung fu.”