In this post I will define the concept of data governance from first principles and explain its place in the larger area of data management.
Modern companies require an ever-increasing amount of data to stay competitive. It is not uncommon for data lakes to reach sizes measured in petabytes. But it's not just about the volume.
A few decades ago, businesses mainly dealt with data retrieved from their own production databases. Since then, we have witnessed a boom of SaaS tools that help with all aspects of the business. Most companies have contracts with several SaaS vendors. They share data with those service providers and retrieve data from them.
As a result, data comes from several different sources and reaches multiple destinations within and outside the company. With this increased complexity, it is more important than ever to have the right strategy for dealing with data.
The Elder Scripts blog covers the opportunities and challenges that software engineers and data engineers encounter. I am sharing insights gained from years of work in the industry as an engineer and as a manager of engineering teams.
Definition of data management
The data that companies collect about their business, customers, and market has become a vital asset. For many companies, it is the main asset and the source of most of their value.
It is not surprising, then, that a variant of asset management is used to manage this class of assets. Data management is the process of collecting, storing, distributing, and disposing of data. Every organization has to deal with these concerns. Successful companies handle them proactively, as a systematic process with defined owners, rules, and goals.
Let's break down data management into the concerns this process deals with:
- Retrieving the state of business entities and events from operational databases
- Collecting reports, metrics, and other information from external SaaS vendors
- Storing the data
- Transforming the data according to business needs and technical implementation details
- Reading the data for various purposes, such as analytics, decision support, machine learning, etc.
- Ensuring the privacy and security of data, and correct access policies within the company
- Sharing the data with third-party processors, such as marketing campaign providers
- Archiving and deleting data according to retention rules and legal requirements
The role of data governance
In a small startup, all the work on this list of data management concerns might be done by a small team without a strict separation of roles. Medium and large companies, however, tend to organize separate departments for different areas of data management. Such an organization might look like this:
- The Data Engineering team sets up the extract, transform, and load (ETL) pipelines
- The Operations team is responsible for the underlying infrastructure that stores and processes data
- The Data Analytics team manages reports and business-level metrics defined on the basis of prepared data
- The Legal team defines the constraints and guidelines based on the applicable regulations
As a result, dozens or hundreds of people perform their roles in the extensive process of data management. If this work is not well organized, the company faces risks related to incorrect data handling. These risks include delays in data processing, business metrics degraded by low data quality, and even legal consequences for breaching compliance regulations.
It follows that companies have to orchestrate and regulate the management of data assets. They need to define a system of ownership, accountability, and decision rights for all processes involved in data management. This system is commonly called data governance.
A company's data governance program may define:
- What is the acceptable level of data quality for each dataset
- Which teams have the right to access or modify data in each stage of the data lifecycle
- Who is responsible for monitoring and enforcing compliance and supporting the other teams in building compliant designs
- Which rules regulate the retention of collected data
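To make the list above more concrete, here is a minimal sketch of how such rules could be written down for a single dataset. It is an illustration under assumptions, not a description of any particular governance tool: the `DatasetPolicy` class, its field names, the team names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetPolicy:
    """Governance rules for one dataset (all names here are hypothetical)."""
    dataset: str
    owner_team: str          # team accountable for monitoring compliance
    min_completeness: float  # acceptable quality level, from 0.0 to 1.0
    retention_days: int      # how long collected records may be kept
    readers: frozenset = field(default_factory=frozenset)
    writers: frozenset = field(default_factory=frozenset)

    def can_access(self, team: str, action: str) -> bool:
        """Return True if `team` may perform `action` ('read' or 'write')."""
        if action == "read":
            return team in self.readers
        if action == "write":
            return team in self.writers
        raise ValueError(f"unknown action: {action}")

    def is_expired(self, collected_on: date, today: date) -> bool:
        """Return True if a record collected on `collected_on` is past retention."""
        return today - collected_on > timedelta(days=self.retention_days)


# Example policy with made-up values.
orders_policy = DatasetPolicy(
    dataset="orders",
    owner_team="data-engineering",
    min_completeness=0.99,
    retention_days=365,
    readers=frozenset({"analytics", "data-engineering"}),
    writers=frozenset({"data-engineering"}),
)
```

With this sketch, `orders_policy.can_access("analytics", "write")` returns `False`, and `orders_policy.is_expired(...)` tells an archiving job whether a record has outlived its retention period. Keeping the rules as plain data is one possible design choice: it lets every team read the same definition instead of encoding the rules separately in each pipeline.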
Difference between data management and data governance
Data management is the whole process of handling data described above; data governance is the core regulatory component of that process. It is a collection of practices that defines the decision-making process, the responsibilities, and the rights to access and modify data for each team and each business process that deals with the data.
Two concepts related to data governance are data quality and data strategy.
Data quality refers to a set of requirements applied to the data, including completeness, freshness, accuracy, etc. Data governance programs define these requirements based on the company's business goals.
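As a rough illustration of two of these requirements, the sketch below measures completeness and freshness for a small in-memory dataset. The function names, the sample rows, and the way each metric is computed are assumptions made for this example; real definitions depend on the company's business goals.

```python
from datetime import datetime, timedelta


def completeness(rows, required_fields):
    """Share of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(name) is not None for name in required_fields)
        for row in rows
    )
    return complete / len(rows)


def is_fresh(last_updated, now, max_age):
    """Return True if the dataset was updated within `max_age` before `now`."""
    return now - last_updated <= max_age


# Made-up sample data: one of the four rows is missing an email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]
score = completeness(rows, ["id", "email"])  # 3 of 4 rows are complete
```

A governance program would then compare `score` against the threshold agreed for the dataset and decide what happens when the data falls short.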
Data strategy is a concept with multiple popular definitions. Most commonly, it refers to a set of practices that the company uses to extract value from collected data.
As we've seen, data asset management is a complicated process where multiple teams and systems work to deliver and process the data assets. Data governance ensures that the participants of this process collaborate according to a single consistent set of rules.