Transform, Optimize and Deploy
1) business requirements (user stories) and
2) competency in physical modeling (for the target of your choice).
At this point in the process, you’ll likely already be quite familiar with the business requirements at hand; they’ve been relevant in every step. We cannot help you with the second requirement. There exist plenty of other good sources, closer to the vendors, which can help you with that.
Here we will examine how visual representations of graph solution data models can help us determine, which transformations we will need.
We’ll also get useful mapping information on the business level, if we visualize the selected transformations.
A good example of the synergy between the three modeling levels is the matter of hierarchies. These exist on the business level, and can be seen in concept maps and in solution data models. However, they tend to disappear (because of denormalization) at the physical level.
The type of physical model you end up using in a given solution can have some effects on requirements for detailed preservation of history (e.g. Type 2 level of versioning as it appears in data warehouses and other places). If you have those kinds of requirements, please make sure your choice of physical data store is compatible with them.
Also remember that most of these products do not “join” in the SQL sense.
Document and graph databases do offer good possibilities for relating things; otherwise, you will have to rely on applications to perform joins. This will influence your design; you should try to cater to frequent query patterns in the design of the physical data model. Identifying these is what the user stories are for. Go back to your concept maps and consult the business level relationships/dependencies before you get too deep into the physical design.
Denormalize with a Smile
To determine the keys and uniqueness criteria, you’ll need the solution data model.
The flat denormalization result contains two main parts:
Data: The selected leaf-node data (selected properties in the solution data model—in this case, the description fields)
Navigation fields: The intermediate and top-level node properties (including identity keys), and the intermediate levels of classification hierarchies (in this case, the TerritoryID and RegionID).
As you can see from the example above, it is important to include all necessary intermediate and top-level keys. Otherwise, you will not be able to uniquely identify each row in the result. This is called “getting the grain right” in star schema modeling; it is absolutely necessary for getting correct aggregated numbers in reports run on top of flat denormalization results.
Denormalization also introduces another basic violation of the normalization principles: the constructs called “repeating groups.” We see them all the time, a customer might well have more than 2 telephone numbers. The list of telephone numbers is a repeating group.
Note that denormalization can be used in different situations:
- Defining key-value structures, possibly containing multiple keys and/or multiple value columns
• Building concatenated data (e.g. JSON) to be used in single-valued keyvalue stores (also called “wide column stores”)
• Building aggregates in the sense used in Domain Driven Design
• Building nested sets for document style structures.
Another option is using repeating groups of columns. If there aren’t too many in a 1:M relationship, repeating groups could work well and give you extra retrieval speed. This is also known as "long row designs".
Key / Value Targets
Column-wise, they include either just one value column (which may become very large) or multiple value columns.
These are called many names: “column family databases,” “big tables,” “wide tables,” or “long row designs”. Essentially, the column-wise physical models are applications of the repeating group paradigm, such that every row contains column(s) for each occurrence of the repeating data. Consider, for example, the multiple phone number example above.
Denormalization will have to cater to that. For example, sketching an aggregate structure (in the domain-driven design paradigm) could look like this:
Aggregates essentially deliver hierarchies for each relevant sub-tree of the solution data model. Note that aggregation might imply denormalization as well; the result will always have redundant data. Consider this when designing your model. If the redundant data is processed at different stages of a lifecycle, it might work well.
Many document stores offer flexibility of choice between embedding or referencing. Referencing allows you to avoid redundancy, but at the cost of having to retrieve many documents. In fact, some products go to great lengths to deliver graph-style capabilities, both in the semantic “RDF triplets” style and in the
general “directed graph” style.
In this manner, what your physical design will likely end up with is a set of related collections that reference each other; these could be arranged in many-to-many relationships, if need be.
RDF Graph Databases (Triple Stores)
Here is a simplified solution data model modeled after the Internet Movie Database (www.imdb.com):
Property Graph Databases
Look at the examples in section 4.1, and later in chapter 5. If you want more “graph-ish” modeling tips, you can look into this (free) book from Neo4j: http://bit.ly/29VFvPZ.
Property graphs are quite rich in their expressiveness, as you have seen already in the solution modeling chapter.
What we did not discuss in depth were properties on relationships, which should be used for describing some important feature(s) of the relationship like strength, weight, proportion, or quality.
Relationships are essentially joins between nodes, with one big difference. A relationship called “produces” may well point to different types of business objects, like “part” and “waste.” This is both a strength and a weakness. It’s positive because it reflects the way people think about reality. It’s negative because people might get confused, if the semantics are unclear. These considerations are very good reasons to define a solution data model, even if your physical data model is going to be based on a “flexible” or even non-existing schema.
Much of the flexibility of the graph model can be used to handle specialization on the fly. That is fine in most cases, but specializations hang off more precisely defined structures. And the precisely defined structures is what you have in your solution data model.
In graph databases you can use ordered, linked lists. Neo4j mentions (in an article) two examples of such lists:
- A “next broadcast” relationship, linking the broadcasts in chronological order
• A “next in production” relationship, ordering the same broadcasts in sequence of production.
In this manner time-series data can be handled elegantly in graphical representations using next / prior linking.
If you find yourself working extensively with these applications that work well with graph databases, you can bite the bullet and go “all graphs.” Alternatively, you can architect a solution that is part-graph and part-something else (SQL, for example).
SQL as Target Model
In the little employee-department data model from the seventies (Peter Chen), the transformations look like this. First, the solution data model: