It’s humorous to me that when we talk about data, we somehow end up referencing water, whether it’s data flow, data lakes, or waterfall (not really a data term, but I had to throw it in). As a Pisces and a data lover, I’m rather down with this water concept. But with all these terms out there, I thought it best to get my bodies of data water defined. So here goes…
Let’s start off with waterfall. This isn’t specific to data, mind you. It’s a project management methodology that’s very linear. For data warehouse development, this would mean you have to get all your requirements documented up front before you can model the data. You have to wait for the data to be modeled before you can write your transformation scripts. And on and on. It reminds me of that children’s book, If You Give a Mouse a Cookie. Everything flows in a downward direction and is dependent on the previous step. Agile has largely replaced waterfall with its approach of failing fast and making course changes, with development done in parallel when possible. And if you think about it, you never really deliver what was initially promised anyway. Customers will typically change their minds once they start seeing their data in the flesh.
Data flow reminds me of a river, and it makes sense. You are essentially talking about data flowing from a source system to a destination system. Some data moves in a quicker current (near real-time, like the Columbia River before they dammed her up with the Grand Coulee) while other data is more sluggish (like monthly financial data, where I picture Huck Finn chilling down the Mississippi). Data flow diagrams are really important for conveying how your data gets into your data warehouse. So always have an up-to-date data flow diagram handy!
The term data lake has surfaced recently to describe storing data in its raw form. That raw form could be:
- Structured data from relational databases
- Semi-structured data like CSV, JSON files
- Unstructured data like emails, PDFs
- Binary data like images, audio
Data lakes are sort of a dumping ground. Not all data in the lake will make its way to the data warehouse. But since it might be used some day, why not just store it? It allows users to noodle on data and answer ad hoc data requests. The analyst who swims in a data lake will need the skills to link and cleanse disparate data, and will need to be comfortable working with data in its native format. There probably will not be any doggy paddlers swimming in a data lake! If the data lake analysis has value and needs to be repeated or produced for a larger audience, then provisions should be made to bring it into the data warehouse.
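To make "link and cleanse disparate data" a bit more concrete, here is a minimal Python sketch of what that swim might look like. Everything in it is made up for illustration: the two raw "files" (a CSV of customers and a JSON list of orders), the field names, and the helper functions `load_customers` and `total_by_customer` are all hypothetical, and the data is inlined so the example runs on its own. Real lake data is rarely this tidy.

```python
import csv
import io
import json

# Hypothetical raw files as they might sit in a lake
# (inlined as strings so the sketch is self-contained).
RAW_CSV = """customer_id,name
1, Ada Lovelace
2,grace hopper
3,Alan Turing
"""

RAW_JSON = """[
  {"customer": "1", "amount": "19.99"},
  {"customer": "2", "amount": "5.00"},
  {"customer": "2", "amount": "7.50"},
  {"customer": "9", "amount": "1.00"}
]"""


def load_customers(text):
    """Read the CSV and cleanse names: trim whitespace, normalize capitalization."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["customer_id"].strip(): row["name"].strip().title() for row in reader}


def total_by_customer(customers, orders_text):
    """Link JSON orders to CSV customers by id and sum the amounts per customer name."""
    totals = {}
    unmatched = []  # orders whose customer id has no match in the CSV
    for order in json.loads(orders_text):
        name = customers.get(order["customer"])
        if name is None:
            unmatched.append(order)
            continue
        totals[name] = totals.get(name, 0.0) + float(order["amount"])
    return totals, unmatched


customers = load_customers(RAW_CSV)
totals, unmatched = total_by_customer(customers, RAW_JSON)
print(totals)
print(unmatched)
```

Notice that the two sources don't even agree on what to call the customer id, the names need tidying, and one order points at a customer who doesn't exist. That kind of detective work is exactly why the lake is no place for doggy paddlers.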
A swamp is essentially a forested wetland. So I like to think of a data swamp as a data lake so overgrown that the data becomes inaccessible. Beware, any lake can turn into a swamp! If people start dumping everything into the lake without planning for its downstream use, it can get quite swampy. And I don’t know about you, but swamps bring up images of leeches and alligators, so I won’t be swimming there!
Sound off in the comments about other waterlogged data terms that you have heard in use.