Getting Started with Data Vault 2.0 (2024)

With the ever-changing landscape of source systems, modeling requirements, and data acquisition and integration options, the Data Vault 2.0 model provides the necessary patterns to adapt to these shifts.

However, even if you are familiar with Data Vault 2.0 (DV2), there is more to it than telling Hubs, Links, and Satellites apart. You need to rethink your approach to modeling before you can appreciate DV2 and gain experience building with it.

In this article, I will reflect on how I have gained experience with DV2 in order to appreciate the agility of its solutions and the breadth of possibilities that it offers for various scenarios. I hope that sharing my insights and experiences with you provides you with a guided path as you learn more about the possibilities that Data Vault has to offer.

Get Introduced to Data Vault

Before you jump into modeling, I recommend that you understand the basics by going over Datavault’s video series: A Brief Introduction to Data Vault.

Building a Scalable Data Warehouse with Data Vault 2.0

Instead of jumping straight into experimenting, I recommend that you deepen your understanding of DV2 by reading chapters 1 and 2 of Dan Linstedt’s book: Building a Scalable Data Warehouse with Data Vault 2.0.

Note: The book was written in 2016, so its examples rely on the features and functionality of SQL Server. If you are thinking of adopting DV2 on Snowflake, Databricks, BigQuery, Redshift, etc., many of these SQL statements are of little value beyond providing hints. Depending on your cloud data warehouse of choice, expect to rethink how you would implement them.

Also, you can skip/fast-forward through chapters 7-10, 13, and 15 — they offer some insights, but they are not relevant for modern cloud data warehouses.

It’s very easy to be misled about the perils of adopting DV2 by the number of blog posts written by disgruntled developers. Their frustration could stem from a lack of knowledge or hands-on experience, an improper implementation, poor guidance, or an inability to adapt and learn modern approaches.

These mistakes happen to the best of us — myself included. When I first started with DV2, I remember thinking that it was too hard, there were too many joins, etc. However, once I started doing hands-on experimentation, I began to understand and appreciate the modeling. One article, in particular, Data Vault Issues Resolved, helped me re-think my approach and got me back on the road to experimentation.

Choosing a Dataset to Experiment With

Implementing DV2 requires one to have a good understanding of the business, business processes, and its datasets. Unfortunately, poking around to get these datasets, bothering business analysts to take time to help you, and getting a dedicated data platform (like Snowflake) can be a time-consuming process that not everyone is able to do. This is especially true if you are learning on your own or outside of your organization.

The airline example presented in Linstedt’s book did not help much, as it offered only bits and pieces of the business processes. It did not let me think beyond what I already understood.

Ultimately, I ended up using the datasets from GitHub Archives. My reasons are:

  • GitHub is widely adopted in the developer community, so its processes (push, pull, commit, etc.) are well understood
  • Blogs like Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) offer ready-to-implement queries that can be used to demonstrate a Business Vault or Information Mart
  • The dataset is large (terabyte-sized). This is important because you can use it to load hubs, links, and satellites, and to understand loading patterns
  • A large dataset also lets you gain experience in improving performance with query assistance tables like PITs (point-in-time tables) and Bridges
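
To make the hub and link idea concrete, here is a minimal Python sketch of how a single GitHub event might be split into hub and link rows. It is only an illustration under assumptions: the table and column names (hub_actor, hub_repo, link_actor_repo, hk_*) are hypothetical, the event is assumed to carry actor.login and repo.name fields, and MD5 is just one common choice of hash function.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Hash one or more business keys into a single surrogate key."""
    # Normalise (trim, upper-case) and join with a delimiter so that
    # ("a", "bc") and ("ab", "c") do not produce the same hash.
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def to_vault_rows(event: dict, record_source: str = "gharchive") -> dict:
    """Split one event into hub and link rows (hypothetical table layout)."""
    actor = event["actor"]["login"]   # assumed business key for the actor hub
    repo = event["repo"]["name"]      # assumed business key for the repo hub
    audit = {
        "load_ts": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
    }
    return {
        "hub_actor": {"hk_actor": hash_key(actor), "actor_login": actor, **audit},
        "hub_repo": {"hk_repo": hash_key(repo), "repo_name": repo, **audit},
        # The link records the relationship and carries both parent hash keys.
        "link_actor_repo": {
            "hk_actor_repo": hash_key(actor, repo),
            "hk_actor": hash_key(actor),
            "hk_repo": hash_key(repo),
            **audit,
        },
    }


event = {
    "type": "PushEvent",
    "actor": {"login": "octocat"},
    "repo": {"name": "octocat/Hello-World"},
}
rows = to_vault_rows(event)
```

Because the keys are derived from the business keys alone, hubs and links can be computed independently of each other, which is one reason hash keys suit parallel loading.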

I was also able to build a DV2 model from scratch, which you can see below. This led me to gain deep insights into concepts like multi-active satellites, same-as links, non-historized links, and model evolution.

[Figure: DV2 model built from the GitHub Archives dataset]

I highly recommend exploring and building the models on your own. The version you end up with might not necessarily match up with mine, and that’s perfectly okay. It’s the thought process that goes into completing this activity that can help deepen your understanding and allow it to evolve.

Furthering Your Knowledge

In conjunction with performing the hands-on experiment, I suggest following the wonderful, insightful articles from Patrick Cuba. His articles are well-articulated and explained clearly — they offer fantastic, deep insights into handling various scenarios like Ghost-Key adoptions.

There are many articles to list, and his contributions are ever-growing, so I suggest you explore and read the ones that interest you as your journey continues.

I have yet to buy his book, The Data Vault Guru: A Pragmatic Guide on Building a Data Vault, but he explains how the book can benefit you in this video: Meet Patrick Cuba author of “The Data Vault Guru”.

Automation and Tooling

During your journey, you will hear a lot about DV2 automation tools like WhereScape, Datavault Builder, or dbtvault. So which one should you choose? Well, my recommendation is that you simply ignore them for now.

The reason I say this is that if you adopt these tools early on, you will miss out on core concepts, such as choosing the right construct (for example, a non-historized link) for a given scenario. You will also not get to understand loading patterns or appreciate practices like hash-diffs, insert-only operations, and templatization.
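
As an illustration of what hand-coding teaches you, here is a minimal Python sketch of a hash-diff combined with an insert-only satellite load. The names (hash_diff, load_satellite, sat_repo) and the in-memory list standing in for a table are assumptions made for the example, not part of any tool or of the book.

```python
import hashlib


def hash_diff(attributes: dict) -> str:
    """Hash the descriptive attributes in a fixed (sorted) column order."""
    payload = "||".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list, hub_key: str, attributes: dict, load_ts: str) -> bool:
    """Append a row only if the attributes changed; never update or delete."""
    new_diff = hash_diff(attributes)
    history = [row for row in satellite if row["hub_key"] == hub_key]
    # ISO-8601 timestamps in the same format sort correctly as strings.
    latest = max(history, key=lambda row: row["load_ts"], default=None)
    if latest is not None and latest["hash_diff"] == new_diff:
        return False  # nothing changed, so no new row
    satellite.append(
        {"hub_key": hub_key, "load_ts": load_ts, "hash_diff": new_diff, **attributes}
    )
    return True


sat_repo = []  # hypothetical satellite, kept in memory for the sketch
first = load_satellite(sat_repo, "repo-1", {"language": "Python", "stars": 10}, "2024-01-01T00:00:00Z")
same = load_satellite(sat_repo, "repo-1", {"language": "Python", "stars": 10}, "2024-01-02T00:00:00Z")
changed = load_satellite(sat_repo, "repo-1", {"language": "Python", "stars": 12}, "2024-01-03T00:00:00Z")
```

Comparing one hash instead of every column keeps the change check cheap, and because rows are only ever appended, the full history of the attributes is preserved.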

I have nothing against these tools. I feel they should be adopted after you have worked heavily with your datasets, identified business keys, and understood the base business processes. Until then, I recommend that you hand-code the loading of your tables and experiment with automating it yourself to gain some great experience and insights.

NOTE: While it is not strictly an automation tool, I personally prefer dbt, which lets me implement “macros”. Defining custom templates takes some effort, but I have certainly appreciated how much it eases re-use.
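
To show what templatization means in practice, here is a small sketch that uses Python's standard string.Template as a stand-in for a macro. It is not dbt's macro syntax, and the table and column names (hub_repo, stg_github_events, hk_repo, repo_name) are hypothetical.

```python
from string import Template

# One template serves every hub; only the names change per table.
# Assumes each staging batch carries a single load_ts per business key.
HUB_LOAD_TEMPLATE = Template("""\
INSERT INTO $hub ($hash_key, $business_key, load_ts, record_source)
SELECT DISTINCT s.$hash_key, s.$business_key, s.load_ts, s.record_source
FROM $staging AS s
WHERE NOT EXISTS (
    SELECT 1 FROM $hub AS h WHERE h.$hash_key = s.$hash_key
)""")


def render_hub_load(hub: str, staging: str, hash_key: str, business_key: str) -> str:
    """Render an insert-only hub load statement for the given names."""
    return HUB_LOAD_TEMPLATE.substitute(
        hub=hub, staging=staging, hash_key=hash_key, business_key=business_key
    )


sql = render_hub_load("hub_repo", "stg_github_events", "hk_repo", "repo_name")
print(sql)
```

Once a hand-written load works for one hub, turning it into a template like this is usually a small step, and the same idea carries over to links and satellites.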

Patrick Cuba's article, “Decided to build your own Data Vault automation tool?”, provides valuable insights on how to go about building an automation tool. Linstedt’s book is another great resource for the steps involved, but be warned that it does take some effort to build one manually.

Learn From Others

Data Vault Training and Certification

Data Vault training and certification are available if you wish to pursue them. From my personal experience, I wouldn’t advise starting the training on day one of your learning journey. I think it is more beneficial to explore and experiment on your own first. This way you can familiarize yourself with the concepts before committing to the cost of the training, whether you are paying for it yourself or an employer is sponsoring you. I am not saying that you shouldn’t take the training; I am recommending that you first work through the steps above and experiment a lot before you ask others.

Also, from what I have observed, not all organizations or client managers are inclined to adopt DV2. In my opinion, DV2 adoption is most successful when your manager, data architects, and at least some other members of the team are prepared to commit to it.

Understanding DV2 is a continuous, evolutionary process, and there are constant interactions in the DV community. Should you begin the process of DV2 adoption, being a part of the community and understanding various patterns/solutions really helps.

I also recommend that you check out my recent blog post on the 10 Capabilities of Data Vault 2.0 You Should Be Using.

  • The 10 Capabilities of DataVault 2.0 You Should Be Using: A Data Engineer’s Guide to Unlocking Creative Solutions with the Capabilities of Data Vault 2.0 (medium.com)
  • 3NF and Data Vault 2.0 — A simple comparison: Data Vault, from its first edition ‘1.0’ as a data modelling specification has evolved into a more elaborate version… (medium.com)
  • Digging into Data Governance and Data Modeling: Testing and validation are where data governance begins to play a massively important role. If you’ve documented where… (medium.com)

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges — we can help you shorten time to value!

We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.

Venkat Sekar is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.

FAQs

Is data vault still relevant? ›

The biggest advantage of having a data vault in place is its adaptability to change. If your source architecture is prone to changes, such as the addition or deletion of columns, new tables, or new/altered relationships, you should definitely implement a data vault.

What is the data vault 2 methodology? ›

To facilitate a more flexible approach, Data Vault 2.0 handles multiple source systems and frequently changing relationships by minimizing the maintenance workload. This means that a change in one source system that creates new attributes can be easily implemented by adding another satellite to the Data Vault model.

What is the difference between data hub and data vault? ›

Hubs represent core business concepts, links represent relationships between hubs, and satellites store information about hubs and relationships between them. The data vault is a data model that is well-suited to organizations that are adopting the lakehouse paradigm.

How to load data in data vault? ›

There are two main types of data loading in a data vault: initial load and incremental load. Initial load is the process of populating the data vault with historical data from the source systems. Incremental load is the process of updating the data vault with new or changed data from the source systems.
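
The two load types can share the same insert-only logic; the initial load is then simply the run that starts from an empty target. The sketch below illustrates this for a hub, using an in-memory dictionary keyed by business key. The function name, row layout, and sample repository names are assumptions for illustration only.

```python
def load_hub(hub: dict, staged_rows: list) -> int:
    """Insert-only hub load: add unseen business keys, return how many were added."""
    inserted = 0
    for row in staged_rows:
        key = row["business_key"]
        if key not in hub:
            # Keep the first arrival; existing rows are never updated.
            hub[key] = {"load_ts": row["load_ts"], "record_source": row["record_source"]}
            inserted += 1
    return inserted


hub_repo = {}  # empty target, so the first run is the initial load
history = [
    {"business_key": "octocat/Hello-World", "load_ts": "2024-01-01", "record_source": "gharchive"},
    {"business_key": "octocat/Spoon-Knife", "load_ts": "2024-01-01", "record_source": "gharchive"},
]
delta = [
    {"business_key": "octocat/Hello-World", "load_ts": "2024-01-02", "record_source": "gharchive"},
    {"business_key": "octocat/linguist", "load_ts": "2024-01-02", "record_source": "gharchive"},
]

initial = load_hub(hub_repo, history)    # loads both historical keys
incremental = load_hub(hub_repo, delta)  # adds only the key not seen before
rerun = load_hub(hub_repo, delta)        # re-running the same batch adds nothing
```

Because the load only inserts keys that are missing, re-running a batch is harmless, and the original load timestamp of each business key is preserved.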

What are the pitfalls of Data Vault? ›

The disadvantages of the Data Vault:

  • Afterwards, you have to “work back” to a dimensional data model.
  • More knowledge required: with the Data Vault comes a third modeling technique. Employees must master these techniques bit by bit.
  • No integrity: the data in the Data Vault lacks integrity and is not always correct.

When did Data Vault 2.0 come out? ›

Data Vault 2.0 is a database modeling method published in 2013. It was designed to overcome many of the shortcomings of data warehouses created using relational modeling (3NF) or star schemas (dimensional modeling). Specifically, it was designed to be scalable and to handle very large amounts of data.

What is the difference between Data Vault and Data Vault 2? ›

The primary difference between Data Vault 1.0 and Data Vault 2.0 lies in their implementation. Data Vault 2.0 employs hash keys as surrogate keys for hubs, links, and satellites, replacing the conventional sequence numbers.

What is Data Vault with an example? ›

“The Data Vault is a detailed oriented, historical tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema.”

Is Data Vault a data lake? ›

Data Vault is a combination of dimensional modeling and third normal form [7] and supports agile project management and use-case-independent modeling [8, 9]. Because it is a simple and flexible modeling technique, Data Vault qualifies for data modeling in data lakes [5].

What does Data Vault solve? ›

A Data Vault (DV) provides you a robust foundation for building and managing enterprise data warehouses, especially in scenarios where data sources are numerous, diverse, and subject to change.

What is the data vault 2.0 methodology? ›

It is a methodology that contains new ways of working (NWOW). The Data Vault focus is designed for solving enterprise level issues such as: agility, scalability, flexibility, auditability and consistency.

What is the primary key in data vault? ›

Primary key is the hashed value of the business key or a sequence number (surrogate key). In Data Vault 2.0 [4], primary keys based on sequence number are replaced by hash-based primary keys. Load date indicates the date and time when the business key initially arrived in the hub.

Are data warehouses still relevant? ›

Traditional data warehousing concepts may be obsolete, but data warehousing itself is not. Cloud computing, deep learning, and big data technologies have changed the way data warehousing is done, making it more accessible and affordable for organizations of all sizes.

What is the purpose of Data Vault? ›

Data vault focuses on ensuring long-term sustainability, and it does so by compartmentalizing the data warehouse into specific types of tables, each serving a unique function. The primary components in a data vault architecture are Hubs, Links, and Satellites.

What is the difference between Data Vault and medallion? ›

Medallion Architecture: While Data Vault focuses on data modeling, Medallion architecture addresses broader aspects of data governance and integration, and can be used together to create a comprehensive data management solution.

What are the advantages of Data Vault over dimensional modeling? ›

Lineage and Audit: As Data Vault includes metadata identifying the source systems, it makes it easier to support data lineage. Unlike the Dimensional Design approach in which data is cleaned before loading, Data Vault changes are always incremental, and results are never lost, which provides an automatic audit trail.

Author: Rueben Jacobs
