Intermediate Datasets and Complex Problems

We often build data processing systems by starting from simple cases and slowly adding functionality. Usually, a system like this is thought of as a set of operations that take raw data and create meaningful output. These operations tend to organically grow into monoliths, which become very hard to debug and reason about. As such a system expands, development of new features tends to slow down and the cost of maintenance dramatically increases. One way to manage this complexity is to produce denormalized intermediate datasets which can then be reused for both automated processes and ad hoc querying. This separates the knowledge of how the data is connected from the process of extracting information, and allows these parts to be tested separately and more thoroughly. While there are disadvantages to this approach, there are many reasons to consider it. If this technique applies to you, it makes the hard things easy, and the impossible things merely hard.


Target Audience


schedule Submitted 11 months ago