The central data warehouse often becomes a bottleneck in larger data-driven organizations. Absurdly, this is all the more true the more data-driven a company behaves, and consequently the faster it introduces, evaluates, and potentially drops new products, services, or tools, since integrating their data takes a long time. This may lead to hiring excessive numbers of analysts to manually gather and join data in ad-hoc analyses. Data contracts and data mesh are new organizational measures to decentralize data and to unblock data collection, transformation, and analysis. (tl;dr)
A very brief history of data warehousing
Back in the last century, it was essentially certain departments in large corporations, like finance, that generated and relied on data and built the first data stores to analyze it. However, over time, as other departments adopted tools and digitized their data collection, these departments also started producing data that could be meaningful for business development.
These departments produced and managed data essentially in silos. To get the big picture, companies needed a central data processing and analysis unit that provided a single source of truth with common metrics. This is how data warehouses came into existence. A data warehouse houses data from different operational units like sales, marketing, design, production, research, finance, etc. and translates it into a format that makes the data usable centrally.
As organizations became more data-driven, more business units began relying on data. As a result, these organizations started generating copious amounts of data from every department or team, all of which must be formatted into facts that a central data warehouse can process.
With recent advances in cloud data warehouse technology, in particular through almost unlimited and independently scalable compute and storage resources, the central data warehouse can in principle deliver the desired un-siloed, holistic view on an organization's data.
This marks the current state of the art, and many organizations have managed to implement such data infrastructure. But it comes at a cost.
New problems arise
A central data warehouse modularizes the data stack only vertically, in the sense that all data from all business units needs to pass through the central data warehouse. This increases dependencies and thereby causes bottlenecks. For each data source, the central team needs to assign resources and to understand details of the inner workings of the source's business. This takes time. Still, some organizations have managed this complexity by investing a lot of human and monetary resources. However, at that point the problems are only beginning.
Today's most successful organizations move fast and break things. Data-driven means using data to drive innovation. So product teams use click-stream data to optimize conversion funnels, finance identifies high-performing products, audiences, and business models, and marketing optimizes campaigns. As a consequence, new products, business models, and tools are created, bought, evaluated, and abandoned all the time, generating new data streams, data formats, and analytics requirements.
If building up the initial infrastructure for extracting, transforming, harmonizing, and modeling the data from all sources already took several months or often even years, the central data warehouse team will now be continuously congested with change requests, and teams may wait so long for the data that the new product or feature to be analyzed may already have been retired.
The absurd consequence of this is that business teams start having their analysts manually join data extracted from their new tools to do ad-hoc analyses. This takes more time, makes metrics less comparable across business units, and causes additional headaches if this type of analysis needs to be performed regularly. Organizations try to compensate by hiring an excessive number of analysts to stay ahead of their data. In some cases we have seen trust in the central data warehouse team diminish and new initiatives started to bring a new (of course better, because it uses new tools) central data infrastructure to life, which, due to resource shortages, will unfortunately run into similar problems and may even end up coexisting with the old structures.
Data contracts come to the rescue
The most common method to handle complexity is modularization. Giving business units ownership of their own data stack allows them to react to their changing needs more quickly. Building their own data domain, consisting of a full stack of ETL, data warehouse, and BI / data science tools, is of course less complex than building the same for a whole organization. Instead of only vertically, i.e. operations, ETL, DWH, analytics, modularization now additionally splits up ownership horizontally along business unit borders.
But wouldn't that throw us back to the old silo times? Yes, it would, and therefore another ingredient is needed: data contracts. As is common in modularization, the modules (the data domains) exchange data through defined interfaces (the data contracts). These ensure that company-wide KPIs can be defined and whole-picture analytics is still possible.
But how does that really help? Wouldn't first building the data domains and then building data exchange infrastructure between the domains using contracts make the process even longer? No, it actually helps in two ways:
- First, ownership of cleaning, validation, and monitoring now lies with decentralized teams, which are in fact the experts for their own business models, distributing the load of change requests more broadly. This will scale much better with new domains.
- Second, not all business objects need to be exchanged with other data domains. This allows, for example, the marketing team to quickly onboard a new A/B testing tool for conversion optimization and integrate it into their existing data infrastructure, for example to enrich it with lifetime value data from the e-commerce domain. The specific marketing KPIs may not be interesting to other domains. Only KPIs with value to multiple domains will be exchanged, and this happens on a slower time scale.
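The enrichment described in the second point can be sketched in a few lines. This is an illustrative toy only: the table names and fields (A/B test assignments from a hypothetical new tool, lifetime value figures exchanged via a data contract from the e-commerce domain) are assumptions, with plain Python structures standing in for warehouse tables.

```python
# A/B test assignments from marketing's (hypothetical) new testing tool
ab_assignments = [
    {"customer_id": "c1", "variant": "A"},
    {"customer_id": "c2", "variant": "B"},
    {"customer_id": "c3", "variant": "A"},
]

# Lifetime value per customer, delivered by the e-commerce domain's
# export layer under a data contract
ltv_by_customer = {"c1": 120.0, "c2": 340.0, "c3": 95.0}

# Enrich the marketing-internal data with the contracted business object
enriched = [
    {**row, "ltv": ltv_by_customer.get(row["customer_id"])}
    for row in ab_assignments
]

# Marketing-internal KPI: total LTV per test variant -- never exported
ltv_per_variant = {}
for row in enriched:
    ltv_per_variant[row["variant"]] = (
        ltv_per_variant.get(row["variant"], 0.0) + row["ltv"]
    )
```

Only `ltv_by_customer` crosses a domain boundary here; the enriched table and the variant KPI stay inside the marketing domain.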
So data contracts are the key to modularization. They should define
- a business object to be exchanged (e.g. transactions, customers, events, etc.)
- the attributes of the business object
- the business object + attributes basically define the schema of a table in your domain's data warehouse's export layer
- the delivery timing
- historicization mode
- and data quality and monitoring parameters, with respect to accuracy, uniqueness, validity, completeness, or timeliness of the data.
Note: whether you allow one business object/table per contract or more is a matter of taste.
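The elements listed above can be expressed in code. The following is a minimal sketch, not a standard; all class and field names are illustrative assumptions about how one might model a contract.

```python
from dataclasses import dataclass, field
from enum import Enum

class Historicization(Enum):
    SNAPSHOT = "snapshot"        # full copy of current state per delivery
    INCREMENTAL = "incremental"  # only new/changed rows per delivery
    SCD2 = "scd2"                # full history with validity ranges per row

@dataclass
class Attribute:
    name: str
    dtype: str                   # e.g. "STRING", "NUMERIC", "TIMESTAMP"
    nullable: bool = True

@dataclass
class DataContract:
    business_object: str         # e.g. "transactions", "customers", "events"
    attributes: list              # business object + attributes = export schema
    delivery_schedule: str       # delivery timing, e.g. a cron expression
    historicization: Historicization
    quality_checks: dict = field(default_factory=dict)
    # e.g. {"uniqueness": "transaction_id is unique",
    #       "timeliness": "delivered by 06:00 UTC"}

# Example: a minimal contract for an e-commerce domain's transactions export
contract = DataContract(
    business_object="transactions",
    attributes=[
        Attribute("transaction_id", "STRING", nullable=False),
        Attribute("customer_id", "STRING", nullable=False),
        Attribute("amount", "NUMERIC"),
    ],
    delivery_schedule="0 6 * * *",
    historicization=Historicization.SNAPSHOT,
    quality_checks={"uniqueness": "transaction_id is unique"},
)
```

In practice such a definition would live in a shared, versioned repository so both producing and consuming domains can validate their tables against it.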
Data domains and data contracts together form the concept of a data mesh, an architecture for data platforms that decentralizes and distributes data ownership among teams who treat data as an independent product based on the business domain. It was first described by Zhamak Dehghani of ThoughtWorks and comprises four core principles:
- Unify different data sources to provide a single source of truth regardless of the difference or lack of communication between scattered data sets.
- Protect data through governance.
- Provide the highest quality data, regardless of volume.
- Enable self-service without the help of the data team, allowing independent data management.
Pitfalls of data contracts to watch out for
While this new approach to enterprise data infrastructure is promising, some challenges must be navigated carefully.
Data contracts and data mesh do not reduce the complexity of data in your organization. They make it more manageable, though, through modularization. A data mesh still requires a lot of people, in particular decentralized data engineers. However, it will improve data availability and speed and will reduce the need for certain roles, like analysts doing ad-hoc analyses.
While individual domains will have internal key performance indicators (KPIs), there will be a need for centralized KPIs, similar to the data warehousing architecture, for a bird's eye view of the entire enterprise. Defining these should be one of the first steps in setting up the data mesh, because centralized KPIs are a prerequisite for the data contracts.
It may be a good idea to create a central data domain that collects business objects from source domains and provides them to other domains. This domain is then capable of harmonizing KPIs and also providing joined reporting layers and KPIs.
Data contracts should be defined at the beginning, but this can cause implementation delays while different units wait for the contracts to be finalized. I would recommend a more agile approach: start implementation early to discover potential errors in the contracts before all contracts are finalized.
Maintaining autonomy for individual business units is crucial. Don't restrict some tools in the data stack to a central team because you think the domains lack the knowledge (like only team X can manage Airflow). Use education to enable the domains.
Also, don't just assign people in a central team to do tasks for a specific business unit, e.g. that one data engineer for marketing. The people building marketing's data stack must be part of that unit, attending its all-hands, talking with its squad leads, sharing its OKRs, etc. They must fully live and understand the unit's business model.
Furthermore, don't make only the domain's data engineers responsible for data quality. The whole domain, including the product developers, is responsible.
The data mesh in principle allows full autonomy for the data domains, also with respect to the tools being used. There are technical implementations like Trino/Starburst that allow federated queries across different data technologies. There may be situations where a certain technology makes sense for a particular unit, but in general, and in particular when starting with new infrastructure, I would strongly recommend using the same technical basis for all units. This makes education and common guidelines much easier and can also drastically improve performance when data needs to be joined across domains. Data mesh is an organizational concept, not a technical one.
Of course, one of the most prominent benefits of data warehouses is the availability of historical data. Still, you may decide to keep the history only inside the domains' warehouses and not put it into the data contracts. This would only allow whole-picture analysis based on the current status, which may be fine. However, plan carefully here. You may find out that a consumer domain, like a reverse ETL to your CRM, needs not only today's status of the customers but also which customers have been deleted since yesterday. Be aware that needing to change the historicization mode later means a lot of hassle.
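The deleted-customers problem above can be made concrete with a small sketch. Assuming the contract delivers full snapshots of current state, a consumer can only recover deletions by diffing consecutive snapshots, which in turn requires that the previous snapshot is still available. The customer IDs here are made up for illustration.

```python
# Customer IDs present in two consecutive snapshot deliveries
# of a contracted "customers" business object
yesterday_snapshot = {"c1", "c2", "c3"}
today_snapshot = {"c1", "c3", "c4"}

# Deletions are only recoverable as the set difference of snapshots --
# if the historicization mode never exposes history, and the consumer
# did not retain yesterday's delivery, this information is lost.
deleted_since_yesterday = yesterday_snapshot - today_snapshot
created_since_yesterday = today_snapshot - yesterday_snapshot
```

A contract with an incremental or SCD2-style historicization mode would deliver this change information directly instead of forcing every consumer to reconstruct it.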
Data contracts will obviously need to change over time as new data is produced or consumers develop different needs. Include this in your planning and prepare for versioning. You may need to keep different versions active, as not all consumers will switch at the same time. This requires including versioning in your project/dataset/table_name naming schemes. Make sure to have a clearly communicated deprecation scheme, like only providing the latest two versions or abandoning old versions after X months.
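A naming and deprecation scheme like the one described could be sketched as follows. The `<domain>.<business_object>_v<version>` pattern and the keep-latest-two rule are illustrative assumptions, not a standard.

```python
# Hypothetical naming scheme for versioned export-layer tables:
# <domain>.<business_object>_v<version>
def export_table_name(domain: str, business_object: str, version: int) -> str:
    return f"{domain}.{business_object}_v{version}"

# Hypothetical deprecation rule: only the latest `keep` versions stay active,
# so consumers have a window in which to migrate
def active_versions(latest: int, keep: int = 2) -> list:
    return list(range(max(1, latest - keep + 1), latest + 1))
```

With `latest=3` this yields the tables `..._v2` and `..._v3` as active, while `..._v1` is dropped once its migration window closes.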
It may be tempting to use 1:1 contracts, i.e. between one data-producing domain and one consuming domain. This also makes versioning simpler, as you won't really need old versions to remain active. If you are working with a central domain and most consumer domains connect to it, 1:1 contracts may be feasible. However, if you encourage direct point-to-point connections, the number of contracts scales quadratically and may quickly become infeasible. In general, I would go for export-level contracts, defining all business objects a domain exports. Export-level contracts have one more advantage: if you have already defined a contract with one consumer domain, you can usually reuse it right away for a second domain without defining a new contract and implementing new tables in the export layer. This can be a strong speedup in the process.
Data contracts and data mesh have become a hot topic in enterprises weighing the benefits, costs, and risks of moving to a decentralized data sharing and analysis architecture. It’s a move away from central data warehousing that has been a de facto way of leveraging organizational data for key business decisions. While data mesh is still in its early stages, it’s addressing some of the fundamental problems of centralized data warehousing, so it could benefit enterprises that heavily rely on data and are evolving fast at the same time.
This shift to treating data as a product is a socio-technical concept with its work cut out for it. However, when implemented successfully, it has the potential to improve how enterprises collect, use, and analyze data. If you have any questions or comments, let us know.