So, what do you need? A data warehouse or a shiny, modern big data lake? “Both,” says Ingo Steins, *UM deputy director of operations. He even prefers to go one step further, from data lakes to data analytic platforms.
Since the dawn of time, at least from an information technology point of view, companies have used data warehousing.
“These are valuable, but expensive systems to manage, run and expand,” Steins explains.
If you have collected data for 15 to 20 years and are still collecting, you are surely running into storage space issues. The options are to buy costly machines and licenses, or to archive or even delete data. Some companies compress their data and keep only summaries.
Another headache for data engineers is the demand from other departments to blend in data from common sources like web shops and websites, and even from public sources like Facebook and LinkedIn.
“Colleagues want to store all kinds of unstructured data, which might even contain infections or other faults, in the engineers’ dearest and well-kept data warehouses. This is not tempting, and not even a sensible thing to do,” he says.
The solution is not to abandon your data warehouse strategy, but to supplement it with big data lake solutions.
“Data lakes don’t replace anything, but are a new and additional part of the data infrastructure that solves new problems. Data warehouses are still very important for companies, but not as flexible as data lake systems,” Steins says, aware that this is well known to any data engineer but not yet to all managers and decision makers. You know, the ones who always want to cut the data warehouse expense from their budgets.
Data lakes take in all kinds of information, such as data from your ERP system, spreadsheets, CSV files, XML files, DOC files, PDFs and even mail files.
“It would take weeks or months to structure this kind of information in a data warehouse, which depends on highly structured file systems, and it would include rather extensive and complicated restructuring of your database schemas. In big data systems you can simply copy the data into the lake and process it later,” he says.
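The "copy now, process later" idea Steins describes is often called schema-on-read. The following is only a minimal sketch of it, assuming a lake that is a plain directory on disk; the function names `land` and `read_on_demand` and the handled file types are illustrative choices, not part of any product mentioned in this article.

```python
import csv
import json
import shutil
from pathlib import Path


def land(source: Path, landing_zone: Path) -> Path:
    """Copy a file into the landing zone unchanged; no schema is needed yet."""
    landing_zone.mkdir(parents=True, exist_ok=True)
    target = landing_zone / source.name
    shutil.copy2(source, target)
    return target


def read_on_demand(path: Path) -> list[dict]:
    """Apply structure only at the moment the data is read."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        return data if isinstance(data, list) else [data]
    # Unknown formats stay in the lake as raw bytes until someone needs them.
    return [{"raw_bytes": path.read_bytes()}]
```

Note that `land` never inspects the file. All parsing decisions are deferred to `read_on_demand`, which is what makes loading fast and what makes documentation of the lake's contents so important.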
Big data swamp
“That might sound like a dream. And it is,” Steins continues. “But it might also become your worst nightmare. Structure and documentation are key even for data lakes. You simply must know what’s in the lake. Otherwise, what you get is not a beautiful data lake but something more like a data swamp.”
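One simple way to "know what's in the lake" is to record a catalog entry every time a file lands. The sketch below is an assumption about how such a catalog could look, not a description of any specific tool: it appends one JSON line per file, and the names `register` and `load_catalog` as well as the recorded fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def register(catalog_path: Path, data_file: Path, source: str, description: str) -> dict:
    """Append one catalog entry describing a file that sits in the lake."""
    entry = {
        "file": data_file.name,
        "source": source,
        "description": description,
        "size_bytes": data_file.stat().st_size,
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    # One JSON object per line keeps the catalog append-only and easy to scan.
    with catalog_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def load_catalog(catalog_path: Path) -> list[dict]:
    """Read every catalog entry back, oldest first."""
    with catalog_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Even a catalog this small answers the questions that separate a lake from a swamp: where a file came from, what it is supposed to contain, and whether it has changed since it landed.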
In data lake terminology, the area that holds the raw data is called the landing zone, and it is the starting point for analysis. From the landing zone, data engineers prepare the data for analysis by taking it through a cleanup and enrichment phase (where they might even delete some of it).
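As a rough illustration of the cleanup and enrichment phase, the sketch below takes a few raw records, discards the unusable ones and adds a derived field. The record layout (`email`, `amount`) and the rules applied are made up for this example; as Steins notes, the real steps depend entirely on your sources and goals.

```python
from typing import Iterable, Iterator


def clean(rows: Iterable[dict]) -> Iterator[dict]:
    """Drop incomplete or faulty rows and normalize the fields that remain."""
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        amount_text = (row.get("amount") or "").strip()
        if not email or not amount_text:
            continue  # cleanup may mean deleting records that cannot be used
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # non-numeric amounts are treated as faulty
        yield {"email": email, "amount": amount}


def enrich(rows: Iterable[dict]) -> Iterator[dict]:
    """Add a derived field that makes later analysis easier."""
    for row in rows:
        domain = row["email"].rpartition("@")[2]
        yield {**row, "email_domain": domain}


# Hypothetical raw records as they might sit in the landing zone.
raw_rows = [
    {"email": " Anna@Example.com ", "amount": "12.50"},
    {"email": "", "amount": "3.00"},
    {"email": "bob@example.org", "amount": "n/a"},
]
prepared = list(enrich(clean(raw_rows)))
```

Only the first record survives here: the second has no email address and the third has an amount that cannot be parsed. The raw files in the landing zone stay untouched, so the rules can be changed and rerun later.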
“In general terms, I really cannot describe how you should handle your data or even what the outcome will be. This depends entirely on what data sources you have, what you want to achieve, what kind of business you run and those kinds of issues,” explains Steins.
“But for sure, you can use the data for statistics or for analysis, feed it back into your production system or even load your data warehouse.”
Extending a data warehouse system usually takes a lot of time: you add another machine to your existing one or two and split the data manually. Big data lakes are born into the world of distributed systems, where you can scale data and processing systems in no time, even to thousands of machines from public clouds.
One point of data
“‘Big data lake’ is a hyped expression whose meaning is unclear to many. We prefer to use the term data analytic platform. Data analytic platforms go even further than data lakes, as the term and the solutions also cover data warehouses and other systems. It is your one point of data for the different data sources in your company,” says Steins.