During the 1980's and early 90's, Information Engineering (IE) was in its prime. Most major corporations were utilizing some form of system development methodology that could be tied back to IE. The first step in any IE project was the Information Strategy Plan or ISP. The ISP would look at the data, process, organization, technology, and interactions of an enterprise. The ISP was top-down analysis at its best. Three key deliverables of an ISP were a data model, functional decomposition, and an interaction (CRUD) matrix. The data model was an entity relationship diagram that encompassed the entire enterprise. The functional decomposition diagram would examine the business functions and decompose to a process level. The CRUD matrix, which stands for CREATE, READ, UPDATE, DELETE, examined the interaction of data and process. These three deliverables provided a basis for top down analysis.
The mid-90's have seen the rise of data warehousing and its related disciplines. One aspect of data warehousing is the data mart. A data mart is a single subject area data warehouse usually developed to support a single business unit. The data warehouse market has realized that there must be a balancing act between the "build it and they will come" approach to data warehousing and the single subject legacy data marts or "legamarts".
The salvaging of some information engineering techniques can provide the contextual models and top down requirements needed to create architected data warehouses. Since many people in a leadership or architecture role in data warehousing are descendents of the information engineering age, they should be familiar with the CRUD matrix and its techniques.
The following series of articles will describe three techniques that can utilize existing information engineering in a data warehouse project. This week, I'll highlight the entity relationship diagram and its use in a three phase data model approach. Next week, I'll look at the functional decomposition diagram and its use in segmenting and defining key performance indicators and dimensions. Finally, I'll go over creating a modified CRUD matrix that deals with logical entities and current systems.
The Entity Relationship Diagram
The entity relationship diagram is the standard data technique for creating data models. The entity relationship diagram enables an analyst to create a graphical view of the data concepts of an organization and their relationships. Tradition system development dictates creation of an entity relationship (Entity Relationship) diagram that is converted to a database design of a relational database.
In a data warehouse environment the traditional normalized Entity Relationship can not be easily translated into a database design. By nature a normalized Entity Relationship diagram tends to separate the data concepts into separate entities. A traditional approach to Entity Relationship modeling is concerned with three concepts: entities, relationships, and attributes.
Components of the Entity Relationship Diagram
There are many works that describe Entity Relationship diagramming in detail. It is not the intention of this work to present exhaustive details, instead a brief description of the component is presented.
Entity: A data concept which has relevance to the enterprise. An entity can be a person, place, thing, or concept. Typically an entity consist of a single identifiable concept such as EMPLOYEE, STUDENT, CLASS, PURCHASE ORDER, or SHIPMENT. An entity can consist of subtypes. Subtypes are a decomposition of an entity into its various types. For example an EMPLOYEE entity can be modeled with subtypes FULL-TIME and PART-TIME. Subtyping is necessary when clarity is required about the data (and to some respect the behavior) of the Supertype entity.
Relationship: A relationship, as the name suggests, is a description about the relationship that exists between two entities. Information about how the entities relate, in particular, the optionality and cardinality of the relationship is modeled. A relationship should only be modeled when the relationship has relevance. If one desired, any entity could loosely be related to any other entity, but this is not the intention of modeling relationships. A special relationship, known as a recursive relationship, exists between and entity and itself, such as an EMPLOYEE to EMPLOYEE related by a REPORTS TO relationship.
Attributes: Attributes are details about a specific entity. These details provide greater clarification about the data that can or will be captured regarding an entity. One must be careful not to confuse entities and attributes. Entities can exist without attributes, but attributes can not exist without entities.
There are many nomenclatures for the actual diagrams. For demonstration purposes, this article will adopt the Crow's Foot diagramming technique.
Data Modeling for a Data Warehouse
There has been significant work done on utilizing specialized data modeling techniques for data warehousing. In particular, the dimensional approach has been adopted to model data warehouses for a relational database. With a dimensional modeling approach, many of the traditional normalization techniques are not utilized. Instead, the model utilized a mixed approach of highly normalized portions of the model and highly denormalized parts of the model.
The model is centered on two types of entities, facts and dimensions. Facts are entities that deal with measurements or indicators. A fact entity for a sales organization could measure revenue per month, or units sold per day. A fact for a manufacturing organization could measure defects per lot per day or units produced per week. Dimensions are entities that represent dimensional information about the facts. Dimensions are ways that the data can be sliced or viewed or segmented. Dimensions typically represent and n-leveled hierarchy such as a product hierarchy or sales organization hierarchy.
For example in a sales organization, the SALES TERRITORY can be a dimension which represents the sales territory, its district, the district's region, and the region's division. In a dimensional model, this is one entity. In a traditional Entity Relationship diagram, this would be four entities, TERRITORY, REGION, DISTRICT, DIVISION. The same is true for a product dimension. The product dimension can represent SKU, its brand, the brand's category, the category's division. In a traditional Entity Relationship diagram, this would be represented as four entities, SKU, BRAND, CATEGORY, and DIVISION.
Shortcomings of Dimensional Modeling
Although the dimensional model is a useful approach for modeling the data needed to create a database for data warehouses and quick queries, there are shortcomings. One of the shortcomings is similar to the problems that traditional Entity Relationship modeling encountered. The data model is designed with an implementation in mind. The dimensional model typically focuses on facts and dimensions for the reporting and analysis needs of the system being designed. The broad brush approach of traditional Entity Relationship modeling is discounted or ignored. This leads to a myopic view of the data as represented in the dimensional Entity Relationship model.
The dimensional model also pushes business rules and representations to a lower level of abstraction. This can lead to overlooking or discounting of business rules that would be much more apparent in a traditional Entity Relationship model. For example, if a sales organization represented a product in the following fashion. A multidivisional company sells some products which are part of consumer brand which is part of category. Some other products, though, do not have a consumer brand, since they are sold through a non consumer channel. Take a pharmaceutical company that sells prescription and over the counter (OTC) drugs. The OTC may have brands associated with the product and the category may be OTC. The pharmaceutical products may not have a name brand, but are associated with the category or PHARM. Two examples are listed below.
|0001||Pain-be-gone Ibuprofen Tablets||Pain-be-gone||OTC||Consumer|
|0002||Prescription Drug for Headache||N/A||PHARM||Pharmaceutical|
For Division A: SKU aggregates to PRODUCT aggregates to BRAND aggregates to CATEGORY aggregates to DIVISION
For Division B: SKU aggregates to PRODUCT aggregates to CATEGORY aggregates to DIVISION.
If this is represented in a traditional Entity Relationship diagram, the diagram could look as follows:
While in a dimensional model, the product would be compressed or denormalized into one table. The table would be encoded with attributes which would represent the entities in a hierarchy. The dimensional Entity Relationship would look as follows:
The Role of the Two Data Models
Based on the evaluation of each method, there are uses for each. Yet, each is not complete enough to represent the logical and dimensional data needs. This drives the requirement for a multiphase data model approach. This approach loosely follows a Zachman model of a Conceptual Model and a Logical Model. The conceptual is represented as a traditional Entity Relationship model and the Logical is represented as the dimensional Entity Relationship model. The importance of these separate, but related models is amplified in a data warehouse environment. In a traditional OLTP environment, normalization is the norm in the conceptual and logical Models. There may be some differences, but generally the models are quite similar in number of entities.
In a data warehouse environment, a conceptual of 25 entities could yield a logical model of 7 entities. The example above compressed the five entity structure of a product into one entity with many attributes. This compression is a double edged sword. The concise logical model is much easier to convert to a database structure which is geared toward data warehousing. This logical model, though, hides the business rules in attributes and their optionality and cardinality. It is only through the full examination of the attributes and their allowable values, optionality, and cardinality that the rules are uncovered. Even then, there is not a simple graphical way to represent the model short of represented as the conceptual model.
This leads to the proposal that in a data warehouse environment, a traditional Entity Relationship model is a business modeling tool while a logical model is a technology tool.
One may state that this is obvious, and in fact was the practice in traditional Information Engineering SDLC projects. The current problem is that much of the industry has shunned the traditional model as unnecessary. Instead the industry and methodologies are proposing to start with the dimensional model. It is true that the system development starts with the dimensional model, but business data understanding is facilitated through the traditional model.
For many dimensional modelers, this is not enough to justify creating a traditional model. There are other uses, though. First, the Entity Relationship model can be used as a contextual map of the data connecting the multiple dimensional modeling subject areas. Second, Entity Relationship model can represent divergent hierarchies in the data. Third, the Entity Relationship model can feed the CRUD matrix, which will be highlighted next time, for a technical scoping and increment management tool.
Anthony Politano is CEO of the US division of MIS AG. He has over 14 years experience in the IT industry, specializing in business intelligence, data warehousing, large-scale application development and systems development methodologies. Mr. Politano was previously a Director at American Home Products, where he was responsible for building an enterprise data warehouse and multiple data marts. Prior to AHP, he was Practice Manager for data warehouse solutions with Oracle Consulting and was a Managing Consultant for James Martin & Co.