The central data warehouse often becomes a bottleneck in larger data-driven organizations. Paradoxically, the more data-driven a company is, the worse this gets: the faster it introduces, evaluates and potentially drops new products, services or tools, the more it suffers from the long time it takes to integrate their data. This may lead to hiring excessive numbers of analysts to manually gather and join data in ad-hoc analyses. Data contracts and data mesh are new organizational measures to decentralize data and to unblock data collection, transformation and analysis. (tl;dr)
A very brief history of data warehousing
Back in the last century, it was mainly certain departments in large corporations, such as finance, that generated and relied on data and built the first data stores to analyze it. Over time, as other departments adopted tools and digitized their data collection, they also started producing data that could be meaningful for business development.
The departments produced and managed their data essentially in silos. To get the big picture, companies needed a central unit for processing and analyzing data that provided a single source of truth with common metrics. This is how data warehouses came into existence. A data warehouse houses data from different operational units like sales, marketing, design, production, research, finance, etc. and translates it into a format that makes the data usable centrally.
As organizations became more data-driven, more business units began relying on data. As a result, these organizations started generating copious amounts of data in every department or team, all of which must be formatted into facts that a central data warehouse can process.
With recent advances in cloud data warehouse technology, in particular almost unlimited and independently scalable compute and storage resources, the central data warehouse can in principle deliver the desired un-siloed, holistic view of an organization’s data.
This marks the current state of the art, and many organizations have managed to implement such a data infrastructure. But it comes at a cost.
New problems arise
A central data warehouse modularizes the data stack only vertically, in the sense that all data from all business units needs to pass through the central data warehouse. This increases dependencies and thereby causes bottlenecks. For each data source, the central team needs to assign resources and understand the details of the inner workings of the source’s business. This takes time. Still, some organizations have managed this complexity by investing a lot of human and monetary resources. However, this is where the problem only begins.
Today’s most successful organizations move fast and break things. Being data-driven means using data to drive innovation. So product teams use clickstream data to optimize conversion funnels, finance identifies high-performing products, audiences and business models, and marketing optimizes campaigns. As a consequence, new products, business models and tools are created, bought, evaluated and abandoned all the time, generating new data streams, data formats and analytics requirements.
If building up the initial infrastructure for extracting, transforming, harmonizing and modeling the data from all sources already took several months or often even years, the central data warehouse team will now be continuously congested with change requests. The teams may have to wait longer for their data than the new product or feature to be analyzed even exists.
The absurd consequence is that business teams start having their analysts manually join data that they extract from their new tools for ad-hoc analyses. This takes more time, makes metrics less comparable across business units and causes additional headaches if this type of analysis needs to be performed regularly. Organizations try to compensate by hiring an excessive number of analysts to stay ahead of their data. In some cases we have seen trust in the central data warehouse team diminish and new initiatives start to bring a new central data infrastructure to life (better, of course, because it uses new tools). Unfortunately, due to resource shortages, the new infrastructure will have similar problems and may even coexist with the old structures.
The most common method for handling complexity is modularization. Giving each business unit ownership of its own data stack allows it to react to its changing needs more quickly. Building a data domain for a single business unit, consisting of a full stack of ETL, data warehouse and BI / data science tools, is of course less complex than building the same for a whole organization. Instead of modularizing only vertically (operations, ETL, DWH, analytics), ownership is now additionally split horizontally along business unit boundaries.
But wouldn’t that throw us back to the old silo times? Yes, it would, and therefore another ingredient is needed: data contracts. As is common in modularization, the modules (the data domains) exchange data through defined interfaces (the data contracts). These ensure that company-wide KPIs can be defined and that whole-picture analytics is still possible.
But how does that really help? Wouldn’t first building the data domains and then building a data exchange infrastructure between the domains using contracts make the process even longer? No, it actually helps in two ways:
So data contracts are the key for modularization. They should define
Note: whether you allow one business object/table per contract or more is a matter of taste
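To make the idea of a contract as a defined interface more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class names `Field` and `DataContract`, the attributes (`name`, `owner`, `version`, `fields`) and the example `orders` table are all hypothetical and not taken from any particular tool or standard. This sketch uses one business object per contract.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any


@dataclass(frozen=True)
class Field:
    """One column the producing domain promises to deliver (hypothetical)."""
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    """Illustrative contract; all attribute names are assumptions."""
    name: str                  # business object / table covered by the contract
    owner: str                 # data domain that owns and publishes the data
    version: str               # lets consumers detect changes to the interface
    fields: tuple[Field, ...]  # the agreed schema

    def violations(self, row: dict[str, Any]) -> list[str]:
        """Return readable contract violations for a single row."""
        problems = []
        for f in self.fields:
            if f.name not in row:
                problems.append(f"missing field: {f.name}")
                continue
            value = row[f.name]
            if value is None:
                if not f.nullable:
                    problems.append(f"null not allowed: {f.name}")
            elif not isinstance(value, f.dtype):
                problems.append(
                    f"wrong type for {f.name}: expected "
                    f"{f.dtype.__name__}, got {type(value).__name__}"
                )
        return problems


# Hypothetical contract published by a sales domain.
orders_contract = DataContract(
    name="orders",
    owner="sales",
    version="1.0.0",
    fields=(
        Field("order_id", str),
        Field("order_date", date),
        Field("net_revenue_eur", float),
        Field("campaign_id", str, nullable=True),
    ),
)

good_row = {
    "order_id": "A-1",
    "order_date": date(2023, 1, 5),
    "net_revenue_eur": 19.99,
    "campaign_id": None,
}
bad_row = {"order_id": 42, "order_date": date(2023, 1, 5), "campaign_id": None}

print(orders_contract.violations(good_row))
print(orders_contract.violations(bad_row))
```

A producing domain could run such a check before publishing data, and a consuming domain could run it on ingestion, so that a breaking change surfaces at the interface rather than in a company-wide KPI report.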
Data domains and data contracts together form the concept of a data mesh, an architecture for data platforms that decentralizes and distributes data ownership among teams who treat data as an independent product based on the business domain. It was first described by Zhamak Dehghani of ThoughtWorks and comprises four core principles: domain-oriented data ownership, data as a product, a self-serve data platform, and federated computational governance.
Pitfalls of data contracts to watch out for
While this new approach to enterprise data infrastructure is promising, some challenges must be navigated carefully.
Data contracts and data mesh have become a hot topic in enterprises weighing the benefits, costs, and risks of moving to a decentralized architecture for data sharing and analysis. It’s a move away from the central data warehouse, which has been the de facto way of leveraging organizational data for key business decisions. While data mesh is still in its early stages, it addresses some of the fundamental problems of centralized data warehousing, so it could benefit enterprises that rely heavily on data and are evolving fast at the same time.
This shift to treating data as a product is a socio-technical concept that comes with its own challenges. However, when implemented successfully, it has the potential to improve how enterprises collect, use, and analyze data. If you have any questions or comments, let us know.