This is the second part of our data mesh blog post series. For a brief introduction to data mesh and a discussion of the sorts of organisation it can benefit, see the first post in the series: When should organisations consider data mesh. This post focuses on the main technological challenges of the data mesh paradigm. For each challenge we look at the suggested approach, the main complexities and how ready current cloud technologies are to support it.
Zhamak Dehghani argues that in shifting the paradigm from traditional to data mesh architecture, technology is irrelevant. Let’s take a look at this claim from the following perspectives:
- Sharing data products across domains
- Cost management with a self-serve infrastructure platform
- Data versioning
1. Sharing data products across domains
Dividing data assets among separate business domain teams adds more pressure to create technical capabilities that enable easy sharing and integration of data products between individual domains. Traditional ETL workloads that move data between each team can lead to long development times and outdated data compared to having a centralised platform. As the domain teams are mostly concerned about new data products for their own domain, the sharing can easily be seen as an extra burden.
Dehghani proposes the architectural decision of offering data between domains through predefined and guaranteed interfaces. The core idea is to have a precisely standardised way in which all domains agree to share their data products for easier integration. The domain team is allowed to change their data model as long as they support earlier versions for as long as they have users. Technically these interfaces can be anything from REST APIs to specifically formatted CSV files. In the end, these standards are just well-documented practices.
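To make the idea of a guaranteed interface more concrete, here is a minimal Python sketch of a registry that a domain team could keep for one data product. Everything in it is an assumption made for illustration: the class names `InterfaceVersion` and `DataProduct`, the fields, and the example bucket paths are hypothetical and do not come from Dehghani's writing or from any specific tool. The one rule it encodes is the one described above: a version can only be retired once it has no users left.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InterfaceVersion:
    """One published version of a data product interface (hypothetical model)."""
    version: str    # for example "1.0.0"
    fmt: str        # for example "rest" or "csv"
    location: str   # where consumers read the data from
    columns: tuple  # field names the producer guarantees


@dataclass
class DataProduct:
    """A data product plus the consumers registered against each version."""
    name: str
    owner_domain: str
    versions: dict = field(default_factory=dict)   # version -> InterfaceVersion
    consumers: dict = field(default_factory=dict)  # version -> set of consumer domains

    def publish(self, interface: InterfaceVersion) -> None:
        if interface.version in self.versions:
            raise ValueError(f"version {interface.version} is already published")
        self.versions[interface.version] = interface
        self.consumers[interface.version] = set()

    def subscribe(self, consumer_domain: str, version: str) -> InterfaceVersion:
        interface = self.versions[version]  # KeyError if the version was never published
        self.consumers[version].add(consumer_domain)
        return interface

    def unsubscribe(self, consumer_domain: str, version: str) -> None:
        self.consumers[version].discard(consumer_domain)

    def retire(self, version: str) -> None:
        # Earlier versions stay supported for as long as they have users.
        if self.consumers[version]:
            raise RuntimeError(f"version {version} still has consumers")
        del self.versions[version]
        del self.consumers[version]


# Hypothetical example: the sales domain changes its data model in version 2.0.0.
orders = DataProduct(name="orders", owner_domain="sales")
orders.publish(InterfaceVersion("1.0.0", "csv", "s3://example-bucket/orders/v1/", ("order_id", "amount")))
orders.publish(InterfaceVersion("2.0.0", "csv", "s3://example-bucket/orders/v2/", ("order_id", "net_amount")))
orders.subscribe("marketing", "1.0.0")
```

In this sketch, calling `orders.retire("1.0.0")` raises an error until the marketing domain has unsubscribed, which is the guarantee the consumer relies on.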
The main difficulty is that many existing systems won’t be able to produce data in the desired format. This leads to intermediate data transformation components, which require development work and make the overall architecture more complex. Even though the domain teams are free to use whatever technology they see fit, the defined interface might mean that only a handful of tools can easily handle the requirements.
The ecosystem of domain teams across the organisation must ensure that the workload of creating the interfaces from a database or object storage is as small and automated as possible. Otherwise domains will either fail to make their data available or take significantly longer to create data products.
Out-of-the-box solutions aren’t the answer
Some technologies already offer out-of-the-box solutions for easy ways to share datasets, often with close to no development required. For example, Snowflake Data Share makes it possible to share data between Snowflake accounts securely without creating any ETL pipelines or APIs to expose the datasets. The technical and security aspects are handled by Snowflake and the data queried is always up to date. The data producer simply creates a sharable dataset and grants read access to the consumer account. Many cloud providers already have their own similar data sharing tools, but these usually require both the provider and the customer to use the same technology or tooling. These types of tooling restrictions should be avoided when using data mesh methodology. Currently the cloud market doesn’t offer a straightforward way to share data products in a technology-agnostic way.
2. Cost management with a self-serve infrastructure platform
At the core of the data mesh paradigm is the freedom for domain teams to choose whatever tooling they prefer. This can easily increase costs, as different business domains end up running their own data solutions separately. According to Dehghani, this should be overcome with shared infrastructure that’s provided by the self-serve data infrastructure platform. Dehghani also points out that the tooling and techniques for this sort of shared infrastructure are not yet very mature in the data landscape.
A self-serve platform can be achieved with an infrastructure team providing APIs for each domain team to build, deploy, monitor and maintain their required software components, for example APIs to create infrastructure for a certain computing service or database instance. Infrastructure remains centralised, but is utilised and developed independently by each domain team based on their individual needs.
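As a rough illustration of what such an API could look like from a domain team's point of view, the following Python sketch models the platform as an in-memory catalogue. The class `SelfServePlatform`, its methods and the component names are hypothetical and only stand in for whatever the infrastructure team would really expose. A real platform would call a cloud provider or an infrastructure-as-code tool where this sketch just records a resource.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Resource:
    """A provisioned component, tagged with the domain that asked for it."""
    resource_id: str
    domain: str
    component: str


class SelfServePlatform:
    """Hypothetical self-serve API offered by a central infrastructure team."""

    def __init__(self, catalogue):
        self._catalogue = set(catalogue)  # components the central team supports
        self._ids = count(1)
        self._resources = []

    def provision(self, domain: str, component: str) -> Resource:
        # Domains can only create what the central team has already made available.
        if component not in self._catalogue:
            raise LookupError(f"{component!r} is not in the platform catalogue")
        resource = Resource(f"{component}-{next(self._ids)}", domain, component)
        self._resources.append(resource)
        return resource

    def resources_for(self, domain: str) -> list:
        return [r for r in self._resources if r.domain == domain]


platform = SelfServePlatform(catalogue={"postgres", "spark"})
db = platform.provision("sales", "postgres")
print(db.resource_id, db.domain)  # postgres-1 sales
```

Tagging every resource with its domain at creation time is a deliberate choice in this sketch: it keeps ownership visible even though the infrastructure itself is centralised.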
However, the current technical capabilities make creating a self-serve platform a manual process. Effort from the centralised infrastructure team is required to get access to new technologies that don’t yet have an API. This takes some autonomy away from the domain teams as they’re required to use the centrally provided component selection. In a way, this centralisation works against the core advantages of data mesh by slowing down the lean creation of services. A centralised infrastructure platform can also require a change in mindset when it comes to attributing costs, as the divisions between domain teams can become less defined with technologies that have a shared resource pool.
On a positive note, centralising infrastructure creation can bring major cost savings through economies of scale, even in a data mesh architecture. With a correctly utilised self-serve infrastructure platform, domain teams can focus on creating features efficiently with minimal infrastructure burden. A centralised infrastructure platform can also help to make the mesh a whole rather than separate data products spread across different domain teams.
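One simple way to attribute the cost of a shared resource pool is to split the bill in proportion to each domain's metered usage. The Python sketch below shows that arithmetic only. The function name, the usage unit (compute hours) and the figures are assumptions made up for the example, and proportional allocation is just one possible policy among several.

```python
def allocate_shared_cost(total_cost: float, usage_by_domain: dict) -> dict:
    """Split a shared bill between domains in proportion to their metered usage."""
    total_usage = sum(usage_by_domain.values())
    if total_usage <= 0:
        raise ValueError("no usage recorded, so the cost cannot be attributed")
    return {
        domain: total_cost * usage / total_usage
        for domain, usage in usage_by_domain.items()
    }


# Hypothetical month: a bill of 1000 for a shared warehouse, usage in compute hours.
shares = allocate_shared_cost(1000.0, {"sales": 30, "logistics": 10})
print(shares)  # {'sales': 750.0, 'logistics': 250.0}
```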
In the end, the question of how a self-serve infrastructure platform should look comes down to organisational culture, the starting situation and views on data platform governance. In its purest form, data mesh allows domain teams to use whichever tools and technologies they prefer. This will most likely mean an increase in operational expenses, but it will also shorten the time to test and develop new data solutions, as well as validate new technologies. Meanwhile we might see versions of data mesh ideology where new infrastructure is requested from a centralised team and technology options are limited to include, for example, only a certain cloud provider’s products. While this might slow down the cycle of innovation, it will allow tighter governance and potentially result in lower running costs.
3. Data versioning
According to the principles of the data mesh paradigm, domain teams have the freedom to focus on adopting new data products and features as they best see fit. In practice, these features can mean anything, such as changing the data aggregation level or adding new fields in datasets. However, data mesh principles also outline that consumers of data products are not required to adapt to constant changes. Instead they can rely on the fact that earlier versions of the data product are still supported by the producing domain team with semantic data versioning.
Semantic data versioning makes it possible for the domain team to publish new versions of the data products according to their own preferred schedule. Data consumers can migrate to newer versions when they see fit – and with meaningful version numbers they can easily understand what sort of change they’re dealing with. If an organisation wants tighter governance on go-live dates, the responsibility for releasing new versions can be transferred to a separate team. This way domain teams can focus directly on the next deliverable after completing the previous one.
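As a sketch of how meaningful version numbers could be derived, the Python function below compares two versions of a dataset schema and bumps a MAJOR.MINOR.PATCH number. The mapping it uses is an assumption borrowed from common software practice and is not prescribed by data mesh: removing a field or changing its type is treated as a breaking (major) change, adding a field as a minor change, and anything else as a patch. The schemas are plain dictionaries of hypothetical field names and types.

```python
def parse(version: str) -> tuple:
    """Turn "1.2.4" into (1, 2, 4)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def next_version(current: str, old_schema: dict, new_schema: dict) -> str:
    """Suggest the next version number from the difference between two schemas."""
    major, minor, patch = parse(current)
    removed = old_schema.keys() - new_schema.keys()
    added = new_schema.keys() - old_schema.keys()
    retyped = {
        name for name in old_schema.keys() & new_schema.keys()
        if old_schema[name] != new_schema[name]
    }
    if removed or retyped:
        return f"{major + 1}.0.0"         # consumers may break
    if added:
        return f"{major}.{minor + 1}.0"   # existing consumers keep working
    return f"{major}.{minor}.{patch + 1}"  # no change to the schema itself


old = {"order_id": "int", "amount": "float"}
with_currency = {"order_id": "int", "amount": "float", "currency": "str"}
without_amount = {"order_id": "int"}

print(next_version("1.2.4", old, with_currency))   # 1.3.0
print(next_version("1.2.4", old, without_amount))  # 2.0.0
```

A consumer reading `amount` can see from the jump to 2.0.0 that migrating needs attention, while the move to 1.3.0 can safely be ignored until the new field is needed.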
Versioning is quite often overlooked in centralised monoliths, or at least done only on a small scale, supporting the current and preceding version using version paths. With data mesh most data products will need to run multiple versions all the time. This creates pressure to design failsafe versioning and to serve data from the same location whenever possible. With carefully chosen development principles, changing to a new version can be made as easy as possible for the consumer of the product.
Technically there usually isn’t anything that prevents data versioning in most modern solutions. However, versioning and backwards compatibility are not as common in data teams as they are in other areas of software development. In the data landscape the learning curve is more about understanding why versioning is required and how it helps. As domain teams are often both producers and consumers of different data products, in data mesh they should quickly encounter the benefits of versioning and include it in their everyday development work. Specific tooling and services focused on semantic data versioning could also help in this area.
One of the major effects of data mesh could be reducing technology-related discussion, bringing the team’s business purpose to the fore. In the end, technology is just a way to create the desired outcomes for a business. In the data mesh paradigm this already starts with the organisation of domain teams, which are organised based on their shared goal and domain, not on their technological capabilities and interests. Overall, Dehghani’s data mesh principles are very vague when it comes to technology. The idea is more about giving trust and freedom to individual teams to make the decisions that benefit and advance their goals the most. The important thing is that their products can be consumed in the agreed way – the level of technology is less important than the business and shared goals.
We can recognise some details of the data mesh paradigm where technological development, or at least a change in practices, is required. Even so, the main barriers to adopting the ideology are far more on the cultural and human side. The cloud market will see new services focusing on helping certain aspects of the paradigm in the near future. These could include tool standardisation for sharing and versioning data products, or out-of-the-box self-serve infrastructure and data security platforms.