Physical data modeling is the process of capturing the detailed technical solution. This is the first time we are actually concerning ourselves with technology and technology issues such as performance, storage, and security. For example, conceptual data modeling captures the satellite view of the business requirements, revealing that a Customer places many Orders. Logical data modeling captures the detailed business solution, which includes all of the properties of Customer and Order such as the customer’s name, their address, and the order number. After understanding both the high level and detailed business solution, we move on to the technical solution, and physical data modeling may lead to embedding Order within Customer in MongoDB.
In physical data modeling, we aim to build an efficient MongoDB design, addressing questions such as: Should each relationship be embedded or referenced? How much history do we need to keep? Which fields should be indexed, and should the collection be sharded?
Physical Data Modeling Approach
There are five steps to physical data modeling:

1. Embed or reference
2. Accommodate history
3. Index and shard
4. Review and confirm
5. Implement
Starting with our logical data model, we prepare the initial MongoDB structure by deciding for each relationship whether to embed or reference (Step 1). After making these changes to the model, we can determine where to accommodate history (Step 2). History means keeping track of how things change over time. There will most likely be other structural refinements needed, such as adding indexes or sharding; both techniques are used to further improve retrieval performance (Step 3). In Step 4, we review the work to get approval. Frequently, action items come out of the review and confirm step, where we need to go back to Step 1 and refine the structure. Expect iteration through these steps. Once we have received final confirmation of our work, we can move on to implementation (Step 5). Let’s talk more about each of these five steps.
Step 1: Embed or Reference
MongoDB resolves relationships in one of two ways: embed or reference. Recall from Chapter 3 that embedded documents resolve relationships between data by storing related data in a single document structure, and a reference is a pointer to another document. MongoDB documents make it possible to embed document structures as sub-documents in a field or an array within a document. Embed is equivalent to the concept of denormalization, in which entities are combined into one structure with the goal of creating a more simplistic and performant structure, usually at the cost of extra redundancy and, therefore, storage space. One structure is often easier than five structures from which to retrieve data. Also, the one structure can represent a logical concept such as a survey or invoice or claim.
If I need to retrieve data, it can (most of the time) take more time to get the data out of multiple documents by referencing than by embedding everything into a single document. That is, having everything in one document is often better for retrieval performance than referencing. If I frequently query on both Order and Product, having them in one document would most likely take less time to view than having them in separate documents and having the Order reference the Product.
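As a sketch of this trade-off, here are plain Python dicts standing in for MongoDB documents (the field values and the `Widget` product are hypothetical, invented for illustration). The embedded shape answers the query in one lookup; the referenced shape forces the application to do a second lookup.

```python
# Embedded: the product details live inside the order, so one
# lookup returns everything needed to display the order.
order_embedded = {
    "_id": "1234",
    "orderNumber": "AB-3429-01",
    "product": {"productId": "2837", "productName": "Widget"},
}

# Referenced: the order stores only a pointer; resolving it means a
# second lookup against a separate products collection.
orders = {"1234": {"orderNumber": "AB-3429-01", "productId": "2837"}}
products = {"2837": {"productName": "Widget"}}

def order_with_product(order_id):
    """Resolve the reference with two lookups instead of one."""
    order = dict(orders[order_id])
    order["product"] = products[order["productId"]]
    return order

assert order_embedded["product"]["productName"] == "Widget"
assert order_with_product("1234")["product"]["productName"] == "Widget"
```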
Notice the language though in the preceding paragraph; “most of the time” and “most likely” imply that embedding is not always faster, nor should retrieval speed be the only factor to consider when deciding whether to embed or reference. The top five reasons for embedding over referencing are:
An independent entity is an entity where each occurrence (instance) can be found using only its own attributes such as a Customer ID for Customer. A dependent entity, such as Account in the second example, contains instances that can only be found by using at least one attribute from a different entity such as Customer ID from Customer.
To make the explanation of independent and dependent clearer, here are our two data models with example values for two attributes within Account, Account Number and the foreign key back to Customer, Customer ID.
On the model in Example 1, we see that Account Number is unique, and therefore to retrieve a specific Account requires only knowing the Account Number and not anything from Customer, making Account an independent entity because Account Number belongs to Account. However, in Example 2, Account Number 34 appears three times, so the only way to find a specific account is to know who the customer is. So the combination of Customer ID and Account Number is required to distinguish a particular account, making Account in Example 2 a dependent entity because Customer ID, which is needed to help identify accounts, does not belong to Account.
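The lookup difference can be sketched in code (the account numbers, dates, and customer IDs here are hypothetical sample values, not taken from the figures):

```python
# Example 1: Account Number is unique on its own, so Account is an
# independent entity; its own attribute finds the instance.
accounts_independent = {
    "45": {"customerId": "bob"},
    "87": {"customerId": "mary"},
}

# Example 2: Account Number 34 repeats across customers, so Account is
# a dependent entity; only the (customerId, accountNumber) pair,
# borrowing Customer ID from Customer, identifies one instance.
accounts_dependent = {
    ("bob", "34"): {"accountOpenDate": "2014-02-05"},
    ("mary", "34"): {"accountOpenDate": "2014-03-21"},
    ("sue", "34"): {"accountOpenDate": "2014-04-09"},
}

# One key finds an independent account; a composite key is required
# for the dependent one.
assert accounts_independent["45"]["customerId"] == "bob"
assert accounts_dependent[("bob", "34")]["accountOpenDate"] == "2014-02-05"
```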
Notice also that the relationship line looks different when connecting to an independent versus a dependent entity. The dotted line means non-identifying and the solid line means identifying. An identifying relationship means that the entity on the many side (the child) is always going to be a dependent entity (like Account in Example 2) of the entity on the one side (the parent).
If we are designing a MongoDB database and trying to decide whether we should embed documents or reference other documents, a major factor would be whether the relationship is identifying or non-identifying. An identifying relationship is a much stronger relationship than non-identifying, giving us more weight towards embedding the weak entity within the strong entity so that we have just one document.
Two more points on embedding versus referencing:
EXERCISE 7: Embed or Reference
Below are two scenarios with accompanying data models. Based on the scenario and the sample MongoDB document, decide whether you would embed or reference. Please see Appendix A for my responses.
Scenario 1: We are building an application that allows easy querying of customers and their accounts. Customer has lots of other relationships, and in this scenario Account has no other relationships.

Sample Embedded Document:

Customer : {
  _id : "bob",
  customerName : "Bob Jones",
  customerDateOfBirth : ISODate("1973-10-07"),
  accounts : [
    { accountCode : "3281", accountOpenDate : ISODate("2014-02-05") },
    { accountCode : "12341", accountOpenDate : ISODate("2014-03-21") }
  ]
}

Sample Referenced Documents:

Customer : {
  _id : "bob",
  customerName : "Bob Jones",
  customerDateOfBirth : ISODate("1973-10-07")
}
Account : { customerId : "bob", accountCode : "3281", accountOpenDate : ISODate("2014-02-05") }
Account : { customerId : "bob", accountCode : "12341", accountOpenDate : ISODate("2014-03-21") }

Scenario 2: We are building an application to allow efficient data entry of product order information. Order contains over 100 fields, Order Line contains over 50 fields, and Product contains over 150 fields. An Order, on average, contains 20 Order Lines. Would you embed Order Line into Order?

Sample Embedded Document:

Order : {
  _id : "1234",
  orderNumber : "AB-3429-01",
  orderDate : ISODate("2013-11-08"),
  orderLines : [
    { orderLineSequenceNumber : "1", productId : "2837", orderLineQuantity : 15 },
    { orderLineSequenceNumber : "2", productId : "2842", orderLineQuantity : 25 }
  ]
}

Sample Referenced Documents:

Order : { _id : "1234", orderNumber : "AB-3429-01", orderDate : ISODate("2013-11-08") }
OrderLine : { orderId : "1234", orderLineSequenceNumber : "1", productId : "2837", orderLineQuantity : 15 }
OrderLine : { orderId : "1234", orderLineSequenceNumber : "2", productId : "2842", orderLineQuantity : 25 }
Recall from this section’s introduction that relational data modeling is the process of capturing how the business works by precisely representing business rules, while dimensional data modeling is the process of capturing how the business is monitored by precisely representing business questions. The thought process for this step is the same for both relational and dimensional physical data modeling, so let’s look at an example of each.
For Relational
On the facing page is our relational logical data model from Chapter 6. I would take a pencil and start grouping the logical entities together that I think would make efficient collections.
Here are my thoughts during this grouping process:
Here is the completed relational PDM after making these changes:
For Dimensional
A star schema is the most common physical dimensional data model structure. The term meter from the dimensional logical data model is replaced with the term fact table on the dimensional physical data model. A star schema results when each set of tables that make up a dimension is flattened into a single table. The fact table is in the center of the model, and each of the dimensions relates to the fact table at the lowest level of detail.
Recall the dimensional logical data model example from the last chapter:
Let’s convert this to a star schema:
Notice I used a physical dimensional modeling technique called a degenerate dimension in this example. A degenerate dimension is when you remove the dimensional structure and just store the dimensional attribute directly in the fact table. Most of the time, the degenerate dimension contains just a single field—in this case, an indicator. So we removed those dimensions such as University that contained only a single field.
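A minimal sketch of the resulting star schema, using Python dicts as stand-in tables (the department name, keys, and `studentAttritionCount` measure are hypothetical; `departmentName`, `yearCode`, and `semesterCode` come from the model):

```python
# Each dimension is flattened into a single table; the fact row
# points to each dimension at its lowest level of detail.
department_dim = {10: {"departmentName": "Engineering"}}
calendar_dim = {201409: {"yearCode": "2014", "semesterCode": "FALL"}}

# Degenerate dimension: a single-field dimension (an indicator) is
# stored directly on the fact table instead of in its own table.
attrition_fact = {
    "departmentId": 10,
    "calendarId": 201409,
    "fullTimeIndicator": "Y",  # degenerate dimension
    "studentAttritionCount": 25,
}

# Resolving the star: follow each foreign key to its dimension.
assert department_dim[attrition_fact["departmentId"]]["departmentName"] == "Engineering"
assert calendar_dim[attrition_fact["calendarId"]]["semesterCode"] == "FALL"
```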
Step 2: Accommodate History
There are four options for addressing how field values change over time. Each option is given a number (zero through three) and is often called a Slowly Changing Dimension (SCD for short):

- Type 0 means we are only storing the original state and not storing changes. That is, if the data changes, we just keep the way the data was originally and do not apply any changes to it.
- Type 1 means the most current view. With a Type 1 we are just storing the most current state of the collection by applying the latest changes.
- Type 2 means we are storing all changes to the data in a collection: a complete history whenever anything changes. Type 2 is also known as "auditing" because we are storing all changes and therefore have the ability to analyze the way the world existed at any point in time. Type 2 is like having a time machine.
- Type 3 means we have a requirement for some history, such as the most current view and the previous view, or the most current view and the original view.
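The four types can be sketched side by side (the customer document and email values here are hypothetical):

```python
from copy import deepcopy
from datetime import date

customer = {"_id": "bob", "emailAddress": "bob@old.example"}

# Type 0: keep only the original state; incoming changes are ignored.
original_state = deepcopy(customer)

# Type 1: overwrite with the most current state.
type1 = dict(customer)
type1["emailAddress"] = "bob@new.example"

# Type 2: keep every change -- a complete, auditable history.
type2_history = [
    {"emailAddress": "bob@old.example", "effectiveDate": date(2014, 1, 1)},
    {"emailAddress": "bob@new.example", "effectiveDate": date(2014, 6, 1)},
]

# Type 3: keep some history, e.g. current plus previous values only.
type3 = {
    "_id": "bob",
    "currentEmailAddress": "bob@new.example",
    "previousEmailAddress": "bob@old.example",
}

assert original_state["emailAddress"] == "bob@old.example"
assert type1["emailAddress"] == "bob@new.example"
assert len(type2_history) == 2
```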
We need to determine the type of history needed for each collection by completing the Collection History Template:
Collection History Template

| Collection | Store only the original state (Type 0) | Store only the most current state (Type 1) | Store full history (Type 2) | Store some history (Type 3) |
|---|---|---|---|---|
| Collection A | | | | |
| Collection B | | | | |
| Collection C | | | | |
Note that it is not uncommon for a collection to have more than one type of history requirement, such as Customer requiring a Type 1 for customerLastName and a Type 2 for emailAddress. Also, the thought process for this step is the same for both relational and dimensional physical data modeling, so let’s look at an example of each.
For Relational
Here is the completed Collection History Template for our Account example:
Account Project Collection History

| Collection | Store only the original state (Type 0) | Store only the most current state (Type 1) | Store full history (Type 2) | Store some history (Type 3) |
|---|---|---|---|---|
| Customer | | | ✓ | |
| AccountOwnership | | | ✓ | |
| Account | | | ✓ | |
| WithdrawalDebit | | ✓ | | |
| DepositCredit | | ✓ | | |
| InterestCredit | | ✓ | | |
| CheckDebit | | ✓ | | |
| BankAccountStatement | | ✓ | | |
Note that transactions and documents are often created but rarely change, which is why a Type 1 is chosen. Note also that it is common to have a Type 2 on Customer and Account. There are a number of ways to model a Type 2, the most common being to put an effective date in the entity’s primary key:
There are other ways of modeling a Type 2, some of which are more efficient for child entities, such as WithdrawalDebit, which now has a larger and more complex primary key.
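A sketch of the effective-date approach, with hypothetical account values: each change inserts a new version keyed by (account ID, effective date) rather than overwriting the existing one.

```python
from datetime import date

# Type 2 modeled by adding an effective date to the key: every change
# creates a new entry, so the full history is preserved.
account_versions = {
    ("1234", date(2014, 1, 1)): {"accountName": "Bob's Savings"},
    ("1234", date(2014, 6, 1)): {"accountName": "Jones Family Savings"},
}

# The most current state is the version with the latest effective date.
latest_key = max(k for k in account_versions if k[0] == "1234")
assert account_versions[latest_key]["accountName"] == "Jones Family Savings"
```

The cost the text mentions is visible here: any child entity (such as WithdrawalDebit) referencing this account would have to carry the larger composite key.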
For Dimensional
Here is the completed Collection History Template for our University example:
Student Attrition Collection History

| Collection | Store only the original state (Type 0) | Store only the most current state (Type 1) | Store full history (Type 2) | Store some history (Type 3) |
|---|---|---|---|---|
| Department | | ✓ | | |
| Calendar | | ✓ | | |
Note that because Type 1 is requested for both Department and Calendar, our data model does not need to change and remains with the same structure as our prior step:
Step 3: Index and Shard
The primary and alternate keys from our logical data model are converted into unique indexes on the physical data model. We can also add additional indexes for performance, called secondary or non-unique indexes. For example, if there is a continuous need to see customerLastName and customerFirstName, we can add indexes on these fields to reduce retrieval time.
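To illustrate what a secondary index buys us, here is a toy simulation in Python (the customer data is hypothetical): the index maps each (last name, first name) pair to the matching document IDs, so a lookup avoids scanning the whole collection.

```python
customers = [
    {"_id": 1, "customerLastName": "Jones", "customerFirstName": "Bob"},
    {"_id": 2, "customerLastName": "Jones", "customerFirstName": "Ann"},
    {"_id": 3, "customerLastName": "Smith", "customerFirstName": "Bob"},
]

# A composite, non-unique (secondary) index on last name + first name:
# each key maps to the _ids of every matching document, so a query can
# jump straight to them instead of scanning the whole collection.
name_index = {}
for doc in customers:
    key = (doc["customerLastName"], doc["customerFirstName"])
    name_index.setdefault(key, []).append(doc["_id"])

assert name_index[("Jones", "Ann")] == [2]
```

In mongosh, the equivalent composite index would be created with something along the lines of `db.customer.createIndex({ customerLastName: 1, customerFirstName: 1 })`.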
Partitioning is when a collection is split up into two or more parts. Vertical partitioning is when fields are split up, and horizontal partitioning is when documents are split up; MongoDB’s sharding is horizontal partitioning in which documents are distributed across multiple servers. Frequently, both horizontal and vertical partitioning are used together. That is, when splitting documents apart, in many cases we learn that certain fields only belong with certain documents.
Both vertical and horizontal partitioning are common techniques when building analytics systems. A collection might contain a large number of fields and perhaps only a subset are volatile and change often, so this subset can be vertically partitioned into a separate collection. Or we might have ten years of orders in a table, so to improve query performance we horizontally partition by year so that when queries are run within a given year, the performance will be much faster.
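The ten-years-of-orders case can be sketched as follows (order data is hypothetical): whole documents are routed to a partition by order year, so a query scoped to one year reads only that partition. Sharding applies the same idea across servers.

```python
orders = [
    {"_id": "A1", "orderDate": "2013-11-08"},
    {"_id": "A2", "orderDate": "2014-02-05"},
    {"_id": "A3", "orderDate": "2014-06-01"},
]

# Horizontal partitioning: route each whole document to a bucket
# keyed by year, so a year-scoped query touches one bucket only.
partitions = {}
for doc in orders:
    year = doc["orderDate"][:4]
    partitions.setdefault(year, []).append(doc)

assert sorted(partitions) == ["2013", "2014"]
assert len(partitions["2014"]) == 2
```

Vertical partitioning would instead split the volatile subset of fields into a separate collection sharing the same `_id`.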
For Relational
For our relational physical data model, these attributes are queried frequently and therefore need non-unique indexes: customerFirstName and customerLastName (we added a composite index on both of these attributes), customerTypeCode, accountTypeCode, branchCode, and accountName:
For Dimensional
For our dimensional physical data model, these attributes are queried frequently and therefore need non-unique indexes: departmentName, yearCode, and semesterCode:
Step 4: Review and Confirm
Previously, we identified the person or group responsible for validating the model. Now we need to show them the model and make sure it is correct. Often at this stage after reviewing the model, we go back to Step 1 and make some changes and then show the model again. This iterative cycle continues until the model is agreed upon by the validator and deemed correct so that we can move on to MongoDB implementation.
Step 5: Implement
We’re done with our modeling, and now we can start creating documents within each of these collections!
Key Points