Data vault modeling concepts through real-world examples (2024)

The two predominant modeling techniques for applications are:

(1) fully normalized models (based on the normalization principles developed in the 1970s), usually following 2NF for batch-based databases and 3NF for DML-intensive hot OLTP databases
(2) read-optimized data warehouse models known as the star schema or dimensional model (based on Kimball’s methodology)

All other techniques (for instance, the top-down data warehouse design known as the Inmon methodology, snowflake models, etc.) are a combination or slight variation of the above two.

Data Vault modeling solves a problem that these two techniques don’t solve: rigidity. It is better explained with an example.

Consider a subscription business (it could sell any product). Imagine two entities (i.e., tables), Customer and Subscription. Each customer signs up with an email-id and has exactly one subscription.

Let’s imagine the company has 5 types of subscriptions, varying in cost and level of service offered. Each subscription type has many customers. So:

Customer to Subscription is a one-to-one relationship.
Subscription to Customer is a one-to-many relationship.

After a few years, the business found that there is a revenue opportunity in allowing a customer to purchase multiple subscriptions, for instance when a customer wants a second subscription for a family member, but under the same account (email-id). So now the data model has to change such that:

Customer to Subscription is a one-to-many relationship.
Subscription to Customer continues to be a one-to-many relationship.
Hence this has become a many-to-many relationship.

What used to look like this:

[Figure: original data model]

should now look like this:

[Figure: revised data model with a bridge table]

The second data model introduces a bridge table that allows the new subscription model. But this design change is costly: when the data model changes, all applications that write to and read from the database have to be changed, and the ETLs that load the database need to be rewritten. This could easily become a multi-month project for complex data models.
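To make the cost of this change concrete, here is a minimal sketch of the two schemas in Python with SQLite. The table and column names (`customer`, `subscription`, `customer_subscription`, `plan_name`) are assumptions made for illustration; the article does not name them.

```python
import sqlite3

# Hypothetical schemas: table and column names are illustrative assumptions.
ORIGINAL_SCHEMA = """
CREATE TABLE subscription (
    subscription_id INTEGER PRIMARY KEY,
    plan_name       TEXT NOT NULL
);
CREATE TABLE customer (
    customer_id     INTEGER PRIMARY KEY,
    email           TEXT NOT NULL UNIQUE,
    -- "exactly one subscription per customer" is baked into the table itself
    subscription_id INTEGER NOT NULL REFERENCES subscription(subscription_id)
);
"""

REVISED_SCHEMA = """
CREATE TABLE subscription (
    subscription_id INTEGER PRIMARY KEY,
    plan_name       TEXT NOT NULL
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
-- the bridge table now carries the relationship
CREATE TABLE customer_subscription (
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    subscription_id INTEGER NOT NULL REFERENCES subscription(subscription_id)
);
"""


def table_columns(schema, table):
    """Create the schema in a throwaway in-memory database and list a table's columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    conn.close()
    return columns
```

Any query that reads `customer.subscription_id` stops working against the revised schema, which is the kind of rewrite described above.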

Data Vault as a solution

The problem here is that the 3NF model (or 2NF, depending on the attributes on those tables) rigidly expresses specific business rules in the database schema. When these rules change, the model has to change. Since the relationships between data can change over time, the model should accommodate such changes without forcing applications to rewrite their code.

What if the bridge table (or associative table, or relationship table) that resolved the new many-to-many relationship was always present there? It adds an overhead for sure (additional joins for example), but it will work for one-to-one, one-to-many and many-to-many relationships. At its core this is the idea of data vault modeling.

It introduces links between business entities such that changes in rules don’t require changes in software.

[Figure: data vault model with Links between business entities]

The core entities of a business (Customer and Subscription, for instance) are called Hubs. Hubs only hold business keys, no descriptive attributes (for example, no customer name or address). Descriptive attributes like name and address go to Satellite tables. Link tables hold the relationships between Hubs. For instance, the bridge table in a many-to-many relationship is a Link. Descriptive attributes pertaining to a Link (e.g., the subscription start date for a customer) go to the Link’s Satellite tables.
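The Hub, Link and Satellite roles can be sketched with SQLite from Python. This is only an illustration under stated assumptions: every table and column name is hypothetical, the business keys are assumed to be the sign-up email and a plan code, and real Data Vault tables usually also carry load dates and record sources, which are left out here for brevity.

```python
import sqlite3

# Hypothetical names throughout; load dates and record sources are omitted.
VAULT_SCHEMA = """
CREATE TABLE hub_customer (
    customer_key TEXT PRIMARY KEY            -- business key only (sign-up email)
);
CREATE TABLE hub_subscription (
    subscription_key TEXT PRIMARY KEY        -- business key only (plan code)
);
CREATE TABLE sat_customer (                  -- descriptive attributes of the Hub
    customer_key TEXT NOT NULL REFERENCES hub_customer(customer_key),
    name         TEXT,
    address      TEXT
);
CREATE TABLE link_customer_subscription (    -- the relationship between two Hubs
    link_id          INTEGER PRIMARY KEY,
    customer_key     TEXT NOT NULL REFERENCES hub_customer(customer_key),
    subscription_key TEXT NOT NULL REFERENCES hub_subscription(subscription_key)
);
CREATE TABLE sat_customer_subscription (     -- descriptive attributes of the Link
    link_id    INTEGER NOT NULL REFERENCES link_customer_subscription(link_id),
    start_date TEXT
);
"""


def build_vault():
    conn = sqlite3.connect(":memory:")
    conn.executescript(VAULT_SCHEMA)
    return conn


def subscribe(conn, email, plan, start_date):
    """Register the keys in their Hubs, then record the relationship as data."""
    conn.execute("INSERT OR IGNORE INTO hub_customer VALUES (?)", (email,))
    conn.execute("INSERT OR IGNORE INTO hub_subscription VALUES (?)", (plan,))
    cur = conn.execute(
        "INSERT INTO link_customer_subscription (customer_key, subscription_key) "
        "VALUES (?, ?)",
        (email, plan),
    )
    conn.execute(
        "INSERT INTO sat_customer_subscription VALUES (?, ?)",
        (cur.lastrowid, start_date),
    )


def plans_for(conn, email):
    rows = conn.execute(
        "SELECT subscription_key FROM link_customer_subscription "
        "WHERE customer_key = ? ORDER BY subscription_key",
        (email,),
    )
    return [row[0] for row in rows]
```

Calling `subscribe` a second time for the same email with a different plan simply adds another Link row; the schema stays the same whether a customer holds one subscription or several.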

Because of the presence of Link tables, changes in business rules are easy to absorb (no design change, no application rewrite). The Link also tracks business changes (using, for example, the subscription start date or other relevant attributes), which can be audited either for internal reporting or for legal purposes.

The above example shows how data relationships are explicitly tracked through data records as opposed to foreign keys. This is one use case, but DV models can be useful in other areas too; two of them are given below.

(1) Data warehouses
In data warehouses, the Links could represent transactions that go into fact tables in an otherwise dimensionally modeled warehouse. Unlike fact tables, Links in DV only hold the keys from Hubs (which equate to dimension tables). All measures and descriptive attributes of the transaction record go to Satellites of the Link. Similarly, all descriptive attributes of Hubs go to their own Satellites.

Essentially, the attributes that make up a data warehouse (measures and dimensions) sit in Satellite tables. The Type-2 or Type-4 changes are implemented on Satellites.
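As a rough sketch of Type-2 style history on a Satellite, the snippet below appends a new row whenever an attribute changes instead of updating in place. It assumes an insert-only Satellite keyed by business key plus load date, which is one common way to do this rather than something the article prescribes; the table, columns and helper functions are all hypothetical.

```python
import sqlite3


def make_satellite():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE sat_customer (
               customer_key TEXT NOT NULL,
               load_date    TEXT NOT NULL,   -- ISO dates sort correctly as text
               address      TEXT,
               PRIMARY KEY (customer_key, load_date)
           )"""
    )
    return conn


def current_address(conn, customer_key):
    """Return the most recently loaded address, or None if there is no row yet."""
    row = conn.execute(
        "SELECT address FROM sat_customer WHERE customer_key = ? "
        "ORDER BY load_date DESC LIMIT 1",
        (customer_key,),
    ).fetchone()
    return row[0] if row else None


def record_address(conn, customer_key, load_date, address):
    """Append a new version only when the address differs from the latest one."""
    if current_address(conn, customer_key) != address:
        conn.execute(
            "INSERT INTO sat_customer VALUES (?, ?, ?)",
            (customer_key, load_date, address),
        )


def address_history(conn, customer_key):
    return conn.execute(
        "SELECT load_date, address FROM sat_customer WHERE customer_key = ? "
        "ORDER BY load_date",
        (customer_key,),
    ).fetchall()
```

Old versions are never overwritten, so the full history stays available for reporting as of any past date.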

The challenge here is that this is not in a read-optimized format, requiring multiple joins. For analytics, data vault warehouses usually have a reporting layer or data mart on top of the Hub-Link-Satellite raw data. This means that if the sole purpose of a data warehouse is analysis of data and decision support, data vault is the wrong solution.
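The reporting layer can be as simple as a view that hides the joins. The sketch below builds a tiny raw vault and one flat view on top of it; the table names, the view name `mart_customer_subscription`, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical raw vault plus one reporting view; names and rows are illustrative.
SETUP = """
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY);
CREATE TABLE sat_customer (customer_key TEXT, name TEXT);
CREATE TABLE hub_subscription (subscription_key TEXT PRIMARY KEY);
CREATE TABLE sat_subscription (subscription_key TEXT, monthly_price INTEGER);
CREATE TABLE link_customer_subscription (customer_key TEXT, subscription_key TEXT);

-- The data mart layer: one flat, read-friendly relation over the raw tables.
CREATE VIEW mart_customer_subscription AS
SELECT sc.name, l.subscription_key AS plan, ss.monthly_price
FROM link_customer_subscription AS l
JOIN sat_customer AS sc ON sc.customer_key = l.customer_key
JOIN sat_subscription AS ss ON ss.subscription_key = l.subscription_key;

INSERT INTO hub_customer VALUES ('a@example.com');
INSERT INTO sat_customer VALUES ('a@example.com', 'Asha');
INSERT INTO hub_subscription VALUES ('basic'), ('family');
INSERT INTO sat_subscription VALUES ('basic', 10), ('family', 25);
INSERT INTO link_customer_subscription VALUES
    ('a@example.com', 'basic'),
    ('a@example.com', 'family');
"""


def build_mart():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def report(conn):
    """Analysts query the view and never see the Hub-Link-Satellite joins."""
    return conn.execute(
        "SELECT name, plan, monthly_price FROM mart_customer_subscription "
        "ORDER BY name, plan"
    ).fetchall()
```

In a real warehouse this layer would more likely be materialized as dimension and fact tables, but the idea is the same: the joins are paid once, in the mart, rather than in every report.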

(2) Problem of multiple sources of truth
Imagine the above subscription business selling products that can have different definitions or descriptions between different vendors/resellers of the same product. You get data from these vendors: one of them has ProductA, a different vendor calls it ProdA, and yet another calls it product-A.

Your application will have to interact with these vendors in their vernacular. But your reporting system should resolve this difference and show the sales of these products as the sales of a single item.

A data vault style model can accumulate each vendor’s data in its own Hub and use a Link to resolve the dependencies. In the above example, a Link table could be used to load one record each for ProductA, ProdA and product-A, all of them set to product_id = 100, an inorganic (surrogate) key that the ETL generates. Obviously it’s up to us (not the vendors) to resolve the matching products to one item through matching techniques (similar to the record-linkage techniques used in master data systems).
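A minimal sketch of that Link, again with SQLite from Python: the three vendor names and `product_id = 100` come from the example above, while the vendor identifiers, table names and unit counts are invented for illustration. The mapping rows are assumed to have been produced by our own matching process.

```python
import sqlite3


def build_product_vault():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Link resolving each vendor's product name to one ETL-generated key
        CREATE TABLE link_product (
            vendor              TEXT NOT NULL,
            vendor_product_name TEXT NOT NULL,
            product_id          INTEGER NOT NULL,
            PRIMARY KEY (vendor, vendor_product_name)
        );
        -- Sales as received from vendors, in their own vocabulary (made-up data)
        CREATE TABLE vendor_sales (
            vendor              TEXT NOT NULL,
            vendor_product_name TEXT NOT NULL,
            units               INTEGER NOT NULL
        );
        INSERT INTO link_product VALUES
            ('vendor_1', 'ProductA', 100),
            ('vendor_2', 'ProdA', 100),
            ('vendor_3', 'product-A', 100);
        INSERT INTO vendor_sales VALUES
            ('vendor_1', 'ProductA', 5),
            ('vendor_2', 'ProdA', 3),
            ('vendor_3', 'product-A', 2);
        """
    )
    return conn


def units_by_product(conn):
    """Report sales per resolved product, regardless of vendor naming."""
    rows = conn.execute(
        """
        SELECT l.product_id, SUM(s.units)
        FROM vendor_sales AS s
        JOIN link_product AS l
          ON l.vendor = s.vendor
         AND l.vendor_product_name = s.vendor_product_name
        GROUP BY l.product_id
        """
    ).fetchall()
    return dict(rows)
```

The application can keep talking to each vendor using the vendor's own product name, while reporting joins through the Link and sees a single item.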

Data Vaults are neither write-optimized (as in OLTP 3NF databases) nor read-optimized (as in OLAP or dimensional models), and this is their main drawback. Unless you have a specific need for flexibility and/or auditing, a Data Vault isn’t worth implementing.

DV is usually implemented in data warehouses. This is because data warehouses are huge, vertically scaled-up systems that can afford the lack of write-optimization, since write-intensive applications usually have their own small database. To solve for read-optimization, we are already used to building data marts in warehouses, and these can be built on top of DVs as well.


FAQs

What are the concepts of data vault modeling?

At the heart of Data Vault modeling are three primary concepts: Hubs, Links, and Satellites. Each plays a vital role in the structure and function of a Data Vault model. Hubs represent the unique business keys within the organization.

What are the use cases for Data Vault?

In a medallion (Bronze/Silver/Gold) architecture, the Data Vault methodology can be applied to the Silver layer, where data is transformed into Hubs, Links and Satellites. In the Gold layer, multiple data marts or data warehouses can then be built following dimensional modeling (the Kimball methodology).

Is Data Vault 2.0 still relevant?

The Data Vault 2.0 modeling methodology has gained popularity since its launch in 2013. It is a hybrid model that combines aspects of Third Normal Form (3NF) and star schema architectures, which makes it attractive to data warehousing engineers.

What are the types of data modeling?

Data Modeling Examples
  • ER (Entity-Relationship) Model. This model is based on the notion of real-world entities and relationships among them. ...
  • Hierarchical Model. ...
  • Network Model. ...
  • Relational Model. ...
  • Object-Oriented Database Model. ...
  • Object-Relational Model.

What is the major concept for data modeling?

Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.

What problems does data vault solve?

6 Typical problems that data vault architecture can solve
  • Data integration from multiple sources.
  • Mergers and acquisitions.
  • Compliance and auditing requirements.
  • Real-time analytics.
  • Scaling issues.
  • Complexity and change management.

What are the benefits of the data vault model?

The Data Vault provides flexibility in data storage in a number of ways. New sources and entities can be added easily, without modifying the existing structure. More data is stored too: even wrong or incomplete data is kept.

What is the purpose of the data vault?

Data Vault is designed specifically for organisations that need to run agile data projects where scalability, integration of multiple source systems, development speed and business orientation are important.

What is the primary key in data vault?

Primary key is the hashed value of the business key or a sequence number (surrogate key). In Data Vault 2.0 [4], primary keys based on sequence number are replaced by hash-based primary keys. Load date indicates the date and time when the business key initially arrived in the hub.
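A hash-based key can be sketched in a few lines of Python. The details here are assumptions, not a standard: the choice of SHA-256, the trim-and-uppercase normalization, and the `||` delimiter for multi-part keys are all illustrative, and `hub_hash_key` is a hypothetical helper name.

```python
import hashlib


def hub_hash_key(*business_key_parts, delimiter="||"):
    """Derive a deterministic key from one or more business key parts.

    Normalizing first (trim, uppercase) means the same business key always
    yields the same hash, whichever source system delivered it.
    """
    normalized = delimiter.join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because the key is computed from the business key itself, independent load jobs can derive it without a lookup against a central sequence generator.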

How to load data in data vault?

There are two main types of data loading in a data vault: initial load and incremental load. Initial load is the process of populating the data vault with historical data from the source systems. Incremental load is the process of updating the data vault with new or changed data from the source systems.
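One way to picture the two load types is a Hub loader that only inserts business keys it has not seen before, so the same routine serves both the initial and the incremental load. This is a simplified sketch with hypothetical table and function names; real loaders also handle Links, Satellites and record sources.

```python
import sqlite3


def make_hub():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hub_customer ("
        "customer_key TEXT PRIMARY KEY, "
        "load_date TEXT NOT NULL)"  # when the key first arrived
    )
    return conn


def hub_size(conn):
    return conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]


def load_hub(conn, business_keys, load_date):
    """Insert unseen business keys only; return how many rows were added."""
    before = hub_size(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?)",
        [(key, load_date) for key in business_keys],
    )
    return hub_size(conn) - before
```

Running `load_hub` against an empty Hub is the initial load; running it again with a later batch is the incremental load, and keys that already exist keep their original load date.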

What are data modelling techniques?

Data Modeling in software engineering is the process of simplifying the diagram or data model of a software system by applying certain formal techniques. It involves expressing data and information through text and symbols.

What are the disadvantages of Data Vault?

One drawback is the learning curve: in the same way that Third Normal Form, entity-relationship modelling and dimensional design are specific skills that take time to master, Data Vault takes time to learn.

What are the layers of Data Vault?

It consists of 3 layers: The landing zone is where all source data initially enters the data platform. The data maintains its source format and model. The Raw Data Vault contains raw, historical, unfiltered data from the sources.

What is the difference between data warehouse and Data Vault?

Data vaults store raw data as-is without applying business rules. Data transformation happens on-demand, and the results are available for viewing in a department-specific data mart. While a traditional data warehouse structure relies on extensive data pre-processing, the data vault model takes a more agile approach.

What are core business concepts in data vault?

Core Business Concepts are related to other ones through Natural Business Relationships (Links) and are described using Context (Satellites) entities. Any Hub entity in Data Vault can therefore have multiple active relationships to other entities through Links.

What are the four important components of data modelling?

What Are the Key Data Model Components?
  • Entities are the objects we want to represent in our data model and are usually represented by a table. ...
  • Attributes appear as columns in specific tables. ...
  • Records are shown in rows in each table. ...
  • Relationships define the associations between entities.

What are data modelling concepts in SQL?

SQL data modelling describes the structure of, and relationships between, data within a system. It improves understanding of the data and results in more efficient data usage. Key components of SQL data modelling include entities and tables: identify the entities (objects or concepts) that need to be represented in the database.

What are the five steps of data modeling?

The steps include:
  • Requirements analysis.
  • Conceptual modeling.
  • Logical modeling.
  • Physical modeling.
  • Maintenance and optimization.

Article information

Author: Edmund Hettinger DC