
Dipping in the Data Lake

Published on 04 September 15

Over the past couple of years, the term "data lake" has come increasingly into prominence. It was coined by James Dixon, Chief Technology Officer of Pentaho, back in 2010, and more recently the concept has been heavily hyped by technology vendors promoting their offerings in this space.

It's well known by now that although most enterprises collect and store a lot of data in warehouses, much more data is never collected, and certainly never managed. With so much focus on data being "the new oil", the data lake concept was proffered as a way of centrally storing every possible kind of data in its native format, so that if it was ever needed it could be accessed from the lake. A data lake stores all types of data as they are ingested from whatever sources are hooked up to it, without requiring them to conform to a model or structure. In addition, since the data in the lake is centrally available, it would be visible to all departments rather than siloed in the traditional manner. The availability of Hadoop makes this feasible: data is stored cheaply at scale and processed in a massively parallel fashion where it sits, with structure imposed only when the data is read.
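The "ingest raw, impose structure at read time" idea above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: a local directory stands in for the lake's object store, and the record fields and function names are hypothetical.

```python
import json
import os
import tempfile

# A local directory stands in for the lake's object store (hypothetical setup).
lake = tempfile.mkdtemp(prefix="lake_")

# Ingest: records from different sources land as-is, with no upfront schema.
raw_records = [
    {"source": "crm", "customer": "Acme", "revenue": "1200"},
    {"source": "web", "page": "/home", "hits": 57},
]
for i, record in enumerate(raw_records):
    with open(os.path.join(lake, f"event_{i}.json"), "w") as f:
        json.dump(record, f)

# Schema-on-read: a consumer imposes structure only when it queries the lake.
def read_revenue_events(path):
    events = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            record = json.load(f)
        if "revenue" in record:  # interpret only the fields this query needs
            events.append({"customer": record["customer"],
                           "revenue": float(record["revenue"])})
    return events

print(read_revenue_events(lake))  # → [{'customer': 'Acme', 'revenue': 1200.0}]
```

The point of the sketch is that the CRM record's `revenue` arrives as a string and is only typed (and the web record only skipped) at query time — the lake itself never enforced a schema.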

There have been debates about whether all of this makes sense and whether the hype has gone too far in focusing on the creation of the lake rather than on how it could be made useful. Some advocate the concept of data reservoirs, where data is not completely ungoverned, but is subject to a level of curation, with the purpose of becoming available to downstream applications. There is obvious merit in doing this as well.

Personally, I prefer to go back to basics in order to make the choice. A data lake is an implementation of the concept of big data storage. As such, it is something that technology has made available, with the message "whatever data you want, you'll find it in there". But technology must serve a purpose, otherwise it could well be just a white elephant. And that, to me, is why we have to go back to the business and start from there with a question that needs an answer.

From there, we go to data science, which brings together a business person, who must creatively figure out which bits of data to bring together for study and a solution; an IT person, who does the technical work of extracting, cleansing, and integrating the needed data; and a statistician, who actually performs the mathematical analysis on the data made available. In theory, it would be nice if all the data needed were already available in a data lake, but in practice, would it really be efficient to maintain one? It would probably be better either to source the data that's actually needed (if it isn't already available), or to maintain a limited variety of datasets, each carrying metadata and managed in terms of security and ageing.
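What "metadata managed in terms of security and ageing" might look like can be sketched as a small dataset registry. This is a hypothetical illustration of the idea, not a real catalogue product; the entry fields (owner, access level, retention period) are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical registry entry: each curated dataset carries governance
# metadata (ownership, access level) and an ageing rule (retention period).
@dataclass
class DatasetEntry:
    name: str
    owner: str
    access: str          # e.g. "public" or "restricted"
    ingested: date
    retention_days: int

    def expired(self, today: date) -> bool:
        # A dataset ages out once its retention window has passed.
        return today > self.ingested + timedelta(days=self.retention_days)

registry = [
    DatasetEntry("sales_2014", "finance", "restricted", date(2014, 1, 1), 365),
    DatasetEntry("web_clicks", "marketing", "public", date(2015, 8, 1), 90),
]

today = date(2015, 9, 4)
stale = [d.name for d in registry if d.expired(today)]
print(stale)  # → ['sales_2014']
```

Even a registry this simple answers the questions a lake alone cannot: who owns each dataset, who may see it, and when it should be archived rather than left to accumulate.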

Even if Hadoop has made the collection and storage of Big Data affordable, it might still be a waste to keep ingesting every type of data simply because it's now possible to do so. As new data continuously flows in, older data would need to be archived at regular intervals. Once that's done, what are the chances it will ever really be looked up again? Slim to none, in all likelihood. So why collect so much of it unless it's required? With data from a growing number of IoT streams already around the corner, and the volume of multimedia data from social media alone already enormous, the risk of drowning in a data lake set up with no strategic objective beyond data availability becomes rather real.

A more prudent first step might be to create a data reservoir for the kinds of data already needed, and have it work as a source for existing applications. After that, new data sources can be added as and when needed.



This blog is listed under Development & Implementations and Data & Information Management Community
