My top recommendations for beginning your journey with DV2
Venkat Sekar · Jan 19, 2022
With the ever-changing landscape of source systems, modeling requirements, and data acquisition and integration options, the Data Vault 2.0 model provides the necessary patterns to adapt to these shifts.
However, even if you are familiar with Data Vault 2 (DV2), it is not as simple as just understanding Hubs vs Links vs Satellites. You need to rethink your approach to appreciate DV2 and gain experience building with it.
In this article, I will reflect on how I gained experience with DV2 and came to appreciate the agility of its solutions and the breadth of scenarios it covers. I hope that sharing my insights and experiences gives you a guided path as you learn more about what Data Vault has to offer.
Get Introduced to Data Vault
Before you jump into modeling, I recommend that you understand the basics by going over Datavault’s video series: A Brief Introduction to Data Vault.
Building a Scalable Data Warehouse with Data Vault 2.0
Instead of jumping straight into experimenting, I recommend that you deepen your understanding of DV2 by reading chapters 1 and 2 of Dan Linstedt's book: Building a Scalable Data Warehouse with Data Vault 2.0.
Note: The book was written in 2016, so its examples rely on SQL Server features and functionality. If you are adopting DV2 on Snowflake, Databricks, BigQuery, Redshift, etc. today, many of those SQL statements offer little beyond hints. Depending on your cloud data warehouse of choice, expect to rethink how you would adapt them to modern approaches.
Also, you can skip/fast-forward through chapters 7-10, 13, and 15 — they offer some insights, but they are not relevant for modern cloud data warehouses.
It's easy to be misled about the perils of adopting DV2 by blog posts from disgruntled developers. Those negative experiences often stem from a lack of knowledge, an improper implementation, limited hands-on experience, poor guidance, or a reluctance to adapt and learn modern approaches.
These mistakes happen to the best of us — myself included. When I first started with DV2, I remember thinking that it was too hard, there were too many joins, etc. However, once I started doing hands-on experimentation, I began to understand and appreciate the modeling. One article, in particular, Data Vault Issues Resolved, helped me re-think my approach and got me back on the road to experimentation.
Choosing a Dataset to Experiment With
Implementing DV2 requires a good understanding of the business, its processes, and its datasets. Unfortunately, hunting down those datasets, asking busy business analysts for help, and getting a dedicated data platform (like Snowflake) can be time-consuming and is not feasible for everyone. This is especially true if you are learning on your own or outside of your organization.
The airline example presented in Linstedt's book did not help much, as it offered only bits and pieces of the business processes. It did not push me to think beyond what I already understood.
Ultimately, I ended up using the datasets from GitHub Archives. My reasons are:
- GitHub is widely adopted in the developer community; its processes (push, pull, commit, etc.) are well understood
- Blogs like Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) offer ready-to-implement queries that can be used to demonstrate a Business Vault or Information Mart
- A large, terabyte-sized dataset is available. This is important because you can use it to load hubs, links, and satellites and understand the loading patterns
- A large dataset also lets you gain experience improving query performance with assistance tables like PITs and Bridges (see the sketch after this list)
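To make the last point concrete, here is a minimal sketch of a point-in-time (PIT) table. Every table and column name (hub_repo, sat_repo_details, sat_repo_stats, snapshot_dates) is my own illustrative assumption, not part of the GitHub Archive model. A PIT records, for each hub key and snapshot date, the load date of the satellite row that was current as of that date, so information-mart queries can use fast equi-joins instead of range scans:

```sql
-- Build a daily PIT over two satellites hanging off a repo hub.
-- All table and column names here are illustrative assumptions.
CREATE TABLE pit_repo AS
SELECT
    h.repo_hk,                                   -- hub hash key
    s.snapshot_date,                             -- one row per key per day
    COALESCE(MAX(d.repo_hk), REPEAT('0', 32))    AS sat_repo_details_hk,
    COALESCE(MAX(d.load_date),
             CAST('1900-01-01' AS TIMESTAMP))    AS sat_repo_details_ldts,
    COALESCE(MAX(t.repo_hk), REPEAT('0', 32))    AS sat_repo_stats_hk,
    COALESCE(MAX(t.load_date),
             CAST('1900-01-01' AS TIMESTAMP))    AS sat_repo_stats_ldts
FROM hub_repo h
CROSS JOIN snapshot_dates s                      -- a date-spine table
LEFT JOIN sat_repo_details d
       ON d.repo_hk = h.repo_hk
      AND d.load_date <= s.snapshot_date
LEFT JOIN sat_repo_stats t
       ON t.repo_hk = h.repo_hk
      AND t.load_date <= s.snapshot_date
GROUP BY h.repo_hk, s.snapshot_date;

-- Downstream queries then use equi-joins instead of BETWEEN logic:
--   JOIN sat_repo_details d
--     ON d.repo_hk   = p.sat_repo_details_hk
--    AND d.load_date = p.sat_repo_details_ldts
```

The zero key and the 1900-01-01 date are common conventions that pair with a ghost record in each satellite, so the equi-join always finds a row; more on that below.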
I was also able to build a DV2 model from scratch on top of this dataset. Doing so gave me deep insights into concepts like multi-active satellites, same-as links, non-historized links, and model evolution.
I highly recommend exploring and building the models on your own. The version you end up with might not match mine, and that's perfectly okay. It's the thought process that goes into completing this activity that deepens your understanding and allows it to evolve.
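For illustration only, here is a minimal sketch of what a few of these structures might look like for GitHub push events, expanding on the illustrative names used in the PIT sketch above. Every table and column name is my own assumption, not the article's actual model; your version will and should differ:

```sql
-- A hub for repositories, keyed on the repo's business key.
CREATE TABLE hub_repo (
    repo_hk       CHAR(32)     NOT NULL,  -- hash (e.g. MD5) of repo_name
    repo_name     VARCHAR(255) NOT NULL,  -- business key, e.g. "owner/repo"
    load_date     TIMESTAMP    NOT NULL,
    record_source VARCHAR(50)  NOT NULL,
    PRIMARY KEY (repo_hk)
);

-- A satellite holding the repo's descriptive attributes over time.
CREATE TABLE sat_repo_details (
    repo_hk       CHAR(32)     NOT NULL,
    load_date     TIMESTAMP    NOT NULL,
    hash_diff     CHAR(32)     NOT NULL,  -- hash of the descriptive columns
    description   VARCHAR(1000),
    language      VARCHAR(100),
    record_source VARCHAR(50)  NOT NULL,
    PRIMARY KEY (repo_hk, load_date)
);

-- A non-historized (transaction) link: a push event happens once and is
-- never updated, so its attributes can live on the link row itself.
CREATE TABLE lnk_push_event (
    push_event_hk CHAR(32)     NOT NULL,  -- hash of repo + actor + event id
    repo_hk       CHAR(32)     NOT NULL,
    actor_hk      CHAR(32)     NOT NULL,  -- would reference a hub for actors
    event_id      BIGINT       NOT NULL,
    pushed_at     TIMESTAMP    NOT NULL,
    load_date     TIMESTAMP    NOT NULL,
    record_source VARCHAR(50)  NOT NULL,
    PRIMARY KEY (push_event_hk)
);
```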
Improving Your Knowledge Further
In conjunction with the hands-on experiments, I suggest following the wonderful, insightful articles from Patrick Cuba. His articles are well-articulated and clearly explained, offering fantastic, deep insights into handling various scenarios like Ghost-Key adoption.
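To give one example of such a scenario: a ghost record is a dummy satellite row that PIT equi-joins can land on when no real satellite row exists yet for a snapshot date. A minimal sketch, reusing the illustrative sat_repo_details table from earlier (the zero key, the 1900-01-01 date, and the 'SYSTEM' source are common conventions, assumed here for illustration):

```sql
-- Seed a single ghost record so PIT lookups always find a row to join to.
INSERT INTO sat_repo_details
    (repo_hk, load_date, hash_diff, description, language, record_source)
VALUES
    (REPEAT('0', 32),                      -- zero hash key
     CAST('1900-01-01' AS TIMESTAMP),      -- ghost load date
     REPEAT('0', 32),                      -- zero hash-diff
     NULL, NULL,                           -- no descriptive content
     'SYSTEM');
```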
There are too many articles to list, and his contributions are ever-growing, so I suggest you explore and read the ones that interest you as your journey continues.
I have yet to buy his book, The Data Vault Guru: a pragmatic guide on building a data vault, but he explains how the book benefits you in this video: Meet Patrick Cuba author of “The Data Vault Guru”.
Automation and Tooling
During your journey, you will hear a lot about DV2 automation tools like WhereScape, DataVault Builder, or dbtvault. So which one should you choose? Well, my recommendation is that you simply ignore them for now.
The reason I say this is that if you adopt these tools early on, you will miss out on core concepts, like how to choose the right construct for scenarios such as non-historized links. You will not internalize the loading patterns or come to appreciate techniques like hash-diffs, insert-only loads, and templatization.
I have nothing against these tools. I feel adoption should come after you have played around with your datasets heavily, identified business keys, and understood the underlying business processes. Until then, I recommend that you hand-code the loading of your tables; writing these patterns yourself before reaching for automation yields great experience and insights. A sketch of one such hand-coded load follows.
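As an example of what hand-coding a loading pattern looks like, here is a minimal sketch of an insert-only satellite load driven by a hash-diff. It assumes a staging table (stg_repo, an illustrative name) where the hash key and hash-diff have already been computed:

```sql
-- Insert-only satellite load: insert a new row only when the business
-- key is new or its descriptive attributes changed (hash-diff differs).
INSERT INTO sat_repo_details (repo_hk, load_date, hash_diff,
                              description, language, record_source)
SELECT stg.repo_hk,
       stg.load_date,
       stg.hash_diff,
       stg.description,
       stg.language,
       stg.record_source
FROM stg_repo stg
LEFT JOIN (
    -- latest satellite row per hub key
    SELECT repo_hk, hash_diff,
           ROW_NUMBER() OVER (PARTITION BY repo_hk
                              ORDER BY load_date DESC) AS rn
    FROM sat_repo_details
) cur
  ON cur.repo_hk = stg.repo_hk
 AND cur.rn = 1
WHERE cur.repo_hk IS NULL                -- brand new business key
   OR cur.hash_diff <> stg.hash_diff;    -- descriptive attributes changed
```

Note that nothing is ever updated or deleted; history accumulates purely through inserts, which is what makes the pattern so friendly to cloud data warehouses.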
NOTE: While not strictly an automation tool, I personally prefer using dbt, which allows me to implement “macros”; I certainly appreciate how it eases defining custom templates and re-using them. A sketch of such a macro follows.
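As a taste of what that re-use looks like, here is a minimal dbt sketch. Every name in it (the hash_key macro, the github source, the column list) is my own assumption for illustration, not anything from dbtvault or the article:

```sql
-- macros/hash_key.sql: render a consistent hash-key expression so every
-- staging model computes hash keys and hash-diffs the same way.
{% macro hash_key(columns) %}
    md5(
        {%- for col in columns -%}
        coalesce(upper(trim(cast({{ col }} as varchar))), '')
        {%- if not loop.last %} || '||' || {% endif -%}
        {%- endfor -%}
    )
{% endmacro %}

-- models/stg_repo.sql: the macro in use inside a staging model.
select
    {{ hash_key(['repo_name']) }}               as repo_hk,
    {{ hash_key(['description', 'language']) }} as hash_diff,
    repo_name,
    description,
    language,
    current_timestamp                           as load_date,
    'github_archive'                            as record_source
from {{ source('github', 'repos') }}
```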
Patrick Cuba's article, Decided to build your own Data Vault automation tool?, provides valuable insights on how to go about building one. Linstedt's book is another great resource on the steps involved, but be warned that building an automation tool yourself does take some effort.
Learn From Others
- Start with this video: The things I wish I knew before I started my first Data Vault Project! From there, watch presentations from other DV adopters.
- If your target datastore is Snowflake, there are reviews of specific Snowflake features that make DV adoption easier; look in Datavault's YouTube channel.
- Resolve misconceptions starting with this article: Data Vault Issues Resolved.
Data Vault Training and Certification
Additional Data Vault training and certification are available if you wish to pursue them. In my personal experience, I wouldn't advise starting the training on day one of your learning journey. It is more beneficial to self-explore and experiment first, so you can familiarize yourself with the concepts before making a financial commitment to the cost of the training, whether you are self-sponsoring or an employer is paying. I am not saying you shouldn't take the training; I am recommending that you work through the self-exploration steps above and experiment a lot before you ask others.
Also, from what I have observed, not all organizations or client managers are inclined to adopt DV2. In my opinion, DV2 adoption is most successful when your manager, data architects, and at least some other members of the team are prepared to commit to it.
Understanding DV2 is a continuous, evolutionary process, and the DV community is in constant conversation. Should you begin the process of DV2 adoption, being part of that community and understanding the various patterns and solutions really helps.
I also recommend that you check out my recent blog post on the 10 Capabilities of Data Vault 2.0 You Should Be Using.
At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges — we can help you shorten time to value!
We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.
Venkat Sekar is a Senior Architect at Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.