Data lake or data swamp?

The 'data lake' concept - storing as much as possible of an organisation's data in one system - seems to be gaining traction. The idea is that it provides economies of scale through consolidation thanks to better utilisation and simpler management, as well as allowing the data to be used for more purposes without duplication.

For example, storage vendor EMC positioned the Isilon systems it announced last month as the place to store a data lake.

But analyst group Gartner raised a warning flag at the end of July, suggesting its clients should beware of the data lake fallacy.

"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organisation," explained research director Nick Heudecker.

But the most important risk surrounding data lakes "is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch," according to a Gartner statement.

"The fundamental issue with the data lake is that it makes certain assumptions about the users of information," said Mr Heudecker. "It assumes that users recognise or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without 'a priori knowledge' and that they understand the incomplete nature of datasets, regardless of structure."

Gartner subscribers can access the company's report The Data Lake Fallacy: All Water and Little Substance.
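To make the metadata point concrete, here is a minimal sketch of the kind of descriptive metadata and lineage tracking Gartner says a data lake lacks by default. It is not taken from the Gartner report or from any vendor's product: the `DatasetRecord` and `Catalogue` names, their fields, and the sample datasets are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetRecord:
    """Descriptive metadata for one dataset in the lake (hypothetical schema)."""
    name: str
    source: str                      # where the data was captured or produced
    captured_on: date
    known_gaps: str = ""             # notes on incompleteness or capture bias
    derived_from: list[str] = field(default_factory=list)  # parent datasets


class Catalogue:
    """A toy in-memory catalogue that refuses data without basic metadata."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        if not record.source:
            raise ValueError(f"{record.name}: source is required")
        missing = [p for p in record.derived_from if p not in self._records]
        if missing:
            raise ValueError(f"{record.name}: unknown parent datasets {missing}")
        self._records[record.name] = record

    def lineage(self, name: str) -> list[str]:
        """Return every ancestor of a dataset, nearest first."""
        ancestors: list[str] = []
        queue = list(self._records[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in ancestors:
                ancestors.append(parent)
                queue.extend(self._records[parent].derived_from)
        return ancestors


# Made-up sample entries, for illustration only.
catalogue = Catalogue()
catalogue.register(DatasetRecord(
    "call_records_raw", "billing switch export", date(2014, 7, 1),
    known_gaps="prepaid customers not included"))
catalogue.register(DatasetRecord(
    "calls_by_region", "analyst aggregation", date(2014, 8, 1),
    derived_from=["call_records_raw"]))
catalogue.register(DatasetRecord(
    "churn_findings", "analyst model output", date(2014, 8, 15),
    derived_from=["calls_by_region"]))

print(catalogue.lineage("churn_findings"))
```

With records like these, a later analyst could trace a finding back to the raw data and its known gaps rather than starting from scratch.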

The industry is naturally more optimistic.

EMC Isilon division CTO for APJ Charles Sevior told iTWire that organisations typically purchase storage to support a particular workload, and then find they can use the data with another application. The trick is to do that without disturbing the original application and without duplicating the data - hence the idea of a data lake that's accessible to multiple applications.

People are realising there is value in 'digital exhaust' - the massive data sets of call records, log files, surveillance video and much more - if they have the ability to extract useful and meaningful information from it.

But there's no substitute for human intelligence and curation, said Mr Sevior, adding that if he ran a business that depended on data, consulting a data scientist would be high on his agenda, if only for a preliminary investigation that would reveal how the data could be brought to bear on the business's processes.

"We are getting tremendous support" for the data lake concept, he claimed, as people wanted an easier way into this type of analysis than Hadoop offers.

Amr Awadallah, co-founder and CTO of Hadoop vendor Cloudera, told iTWire that "governance and innovation go against each other." With a traditional system, a business user who needs an additional data column for a particular analysis has to jump through all sorts of hoops to get approval, and then plenty of technical work is needed behind the scenes to make it happen. This makes the change difficult to justify, as its value cannot be known until the new analysis has been performed: it could prove worthless, or it might lead to millions of dollars of new revenue.

Cloudera's concept of the enterprise data hub appears to be the equivalent of a data lake.

This approach makes it easy for that user to access the data he or she needs, and at the same time "our SQL is good enough for 90% of the workloads in the enterprise."
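The "additional data column" scenario can be illustrated with a small schema-on-read sketch. This is not Cloudera's software or API. It assumes, purely for illustration, that raw records are kept as JSON lines in their original form; the `read_columns` helper and the sample records are hypothetical.

```python
import json

# Raw events kept in their original format (made-up sample lines).
RAW_LINES = [
    '{"customer": "A17", "amount": 120.0, "channel": "web"}',
    '{"customer": "B42", "amount": 75.5, "channel": "store", "coupon": "WINTER"}',
    '{"customer": "A17", "amount": 30.0}',
]


def read_columns(lines, columns):
    """Apply a schema at read time: pick the requested fields, None if absent."""
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}


# The original analysis only needed two columns.
original = list(read_columns(RAW_LINES, ["customer", "amount"]))

# A later analysis asks for an extra column; the stored data is unchanged.
extended = list(read_columns(RAW_LINES, ["customer", "amount", "coupon"]))

for row in extended:
    print(row)
```

Because the schema is applied only when the data is read, the extra column needs no change to what is stored. The trade-off, as Gartner's warning suggests, is that nothing here tells the user what a missing `coupon` value actually means.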

Organisations need to be agile and innovative as well as efficient and well-governed, Mr Awadallah observed.

Image: incorporates a public domain photograph via Wikimedia Commons







Stephen Withers

Stephen Withers is one of Australia's most experienced IT journalists, having begun his career in the days of 8-bit 'microcomputers'. He covers the gamut from gadgets to enterprise systems. In previous lives he has been an academic, a systems programmer, an IT support manager, and an online services manager. Stephen holds an honours degree in Management Sciences and a PhD in Industrial and Business Studies.

