The following is excerpted from the W.H. Inmon book "Building the Data Warehouse," courtesy of
Wiley. In Chapter 10, 'A Data Warehouse Design Review Checklist,' Inmon explains in detail how a
proper review can make or break your data warehouse. To return to the review preparation steps, click
here. To purchase the book, click
|A typical data warehouse design review (in 98 steps!)||Return to Table of Contents|
If part of the data warehouse, in fact, does exist outside the data warehouse environment, there should be a plan to bring that part of the DSS world back into the data warehouse environment.
6. Have the major subjects that have been identified been broken down into lower levels of detail?
- Have the keys been identified?
- Have the attributes been identified?
- Have the keys and attributes been grouped together?
- Have the relationships between groupings of data been identified?
- Have the time variances of each group been identified?
ISSUE: There needs to be a data model that serves as the intellectual heart of the data warehouse environment. The data model normally has three levels -- a high-level model where entities and relationships are identified; a midlevel where keys, attributes, and relationships are identified; and a low level, where database design can be done. While not all of the data needs to be modeled down to the lowest level of detail in order for the DSS environment to begin to be built, at least the high-level model must be complete.
7. Is the design discussed in question 6 periodically reviewed? (How often? Informally? Formally?) What changes occur as a result of the review? How is end-user feedback channeled to the developer?
ISSUE: From time to time, the data model needs to be updated to reflect changing business needs of the organization. As a rule, these changes are incremental in nature. It is very unusual to have a revolutionary change. There needs to be an assessment of the impact of these changes on both existing data warehouse data and planned data warehouse data.
8. Has the operational system of record been identified?
- Has the source for every attribute been identified?
- Have the conditions under which one attribute or another will be the source been identified?
- If there is no source for an attribute, have default values been identified?
- Has a common measure of attribute values been identified for those data attributes in the data warehouse environment?
- Has a common encoding structure been identified for those attributes in the data warehouse environment?
- Has a common key structure in the data warehouse environment been identified? Where the system of record key does not meet the conditions for the DSS key structure, has a conversion path been identified?
- If data comes from multiple sources, has the logic to determine the appropriate value been identified?
- Has the technology that houses the system of record been identified?
- Will any attribute have to be summarized on entering the data warehouse?
- Will multiple attributes have to be aggregated on entering the data warehouse?
- Will data have to be resequenced on passing into the data warehouse?
ISSUE: After the data model is built, the system of record is identified. The system of record normally resides in the operational environment. The system of record represents the best source of existing data in support of the data model. The issues of integration are very much a factor in defining the system of record.
9. Has the frequency of extract processing — from the operational system of record to the data warehouse environment — been identified? How will the extract processing identify changes to the operational data from the last time an extract process was run?
- By looking at time-stamped data?
- By changing operational application code?
- By looking at a log file? An audit file?
- By looking at a delta file?
- By rubbing "before" and "after" images together?
ISSUE: The frequency of extract processing is an issue because of the resources required in refreshment, the complexity of refreshment processing, and the need to refresh data on a timely basis. The usefulness of data warehouse data is often related to how often the data warehouse data is refreshed.
One of the most complex issues — from a technical perspective — is determining what data is to be scanned for extract processing. In some cases, the operational data that needs to pass from one environment to the next is straightforward. In other cases, it is not clear at all just what data should be examined as a candidate for populating the data warehouse environment.
10. What volume of data will normally be contained in the DSS environment? If the volume of data is large:
- Will multiple levels of granularity be specified?
- WIll data be compacted?
- Will data be purged periodically?
- Will data be moved to near-line storage? At what frequency?
ISSUE: In addition to the volumes of data processed by extraction, the designer needs to concern himself or herself with the volume of data actually in the data warehouse environment. The analysis of the volume of data in the data warehouse environment leads directly to the subject of the granularity of data in the data warehouse environment and the possibility of multiple levels of granularity.
11. What data will be filtered out of the operational environment as extract processing is done to create the data warehouse environment?
ISSUE: It is very unusual for all operational data to be passed to the DSS environment. Almost every operational environment contains data that is relevant only to the operational environment. This data should not be passed to the data warehouse environment.
12. What software will be used to feed the data warehouse environment?
- Has the software been thoroughly shaken out?
- What bottlenecks are there or might there be?
- Is the interface one-way or two-way?
- What technical support will be required?
- What volume of data will pass through the software?
- What monitoring of the software will be required?
- What alterations to the software will be periodically required?
- What outage will the alterations entail?
- How long will it take to install the software?
- Who will be responsible for the software?
- When will the software be ready for full-blown use?
ISSUE: The data warehouse environment is capable of handling a large number of different types of software interfaces. The amount of breakin time and "infrastructure" time, however, should not be underestimated. The DSS architect must not assume that the linking of the data warehouse environment to other environments will necessarily be straightforward and easy.
13. What software interface will be required for the feeding of DSS departmental and individual processing out of the data warehouse environment?
- Has the interface been thoroughly tested?
- What bottlenecks might exist?
- Is the interface one-way or two-way?
- What technical support will be required?
- What traffic of data across the interface is anticipated?
- What monitoring of the interface will be required?
- What alterations to the interface will there be?
- What outage is anticipated as a result of alterations to the interface?
- How long will it take to install the interface?
- Who will be responsible for the interface?
- When will the interface be ready for full-scale utilization?
14. What physical organization of data will be used in the data warehouse environment? Can the data be directly accessed? Can it be sequentially accessed? Can indexes be easily and cheaply created?
ISSUE: The designer needs to review the physical configuration of the data warehouse environment to ensure that adequate capacity will be available and that the data, once in the environment, will be able to be manipulated in a responsive manner.
15. How easy will it be to add more storage to the data warehouse environment at a later point in time? How easy will it be to reorganize data within the data warehouse environment at a later point in time?
ISSUE: No data warehouse is static, and no data warehouse is fully specified at the initial moment of design. It is normal to make corrections in design throughout the life of the data warehouse environment. To construct a data warehouse environment either where midcourse corrections cannot be made or are awkward to make is to have a faulty design.
16. What is the likelihood that data in the data warehouse environment will need to be restructured frequently (that is, columns added, dropped, or enlarged, keys modified, and so on)? What effect will these activities of restructuring have on ongoing processing in the data warehouse?
ISSUE: Given the volume of data found in the data warehouse environment, restructuring it is not a trivial issue. In addition, with archival data, restructuring after a certain moment in time often becomes a logical impossibility.
17. What are the expected levels of performance in the data warehouse environment? Has a DSS service-level agreement been drawn up either formally or informally?
ISSUE: Unless a DSS service-level agreement has been formally drawn up, it is impossible to measure whether performance objectives are being met. The DSS service-level agreement should cover both DSS performance levels and downtime. Typical DSS service-level agreements include such information as the following:
- Average performance during peak hours per units of data
- Average performance during off-peak hours per units of data
- Worst performance levels during peak hours per units of data
- Worst performance during off-peak hours per units of data
- System availability standards
One of the difficulties of the DSS environment is measuring performance. Unlike the operational environment, where performance can be measured in absolute terms, DSS processing needs to be measured in relation to the following:
- How much processing the individual request is for
- How much processing is going on concurrently
- How many users are on the system at the moment of execution
18. What are the expected levels of availability? Has an availability agreement been drawn up for the data warehouse environment, either formally or informally?
ISSUE: (See the issue for question 17.)
19. How will the data in the data warehouse environment be indexed or accessed?
- Will any table have more than four indexes?
- Will any table be hashed?
- Will any table have only the primary key indexed?
- What overhead will be required to maintain the index?
- What overhead will be required to load the index initially?
- How often will the index be used?
- Can or should the index be altered to serve a wider use?
ISSUE: Data in the data warehouse environment needs to be accessed efficiently and in a flexible manner. Unfortunately, the heuristic nature of data warehouse processing is such that the need for indexes is unpredictable. The result is that the accessing of data in the data warehouse environment must not be taken for granted. As a rule, a multitiered approach to managing the access of data warehouse data is optimal:
- The hashed or primary key should satisfy most accesses.
- Secondary indexes should satisfy other popular access patterns.
- Temporary indexes should satisfy the occasional access.
- Extraction and subsequent indexing of a subset of data warehouse data should satisfy infrequent or once-in-a-lifetime accesses of data.
In any case, data in the data warehouse environment should not be stored in partitions so large that they cannot be indexed freely.
20. What volumes of processing in the data warehouse environment are to be expected? What about peak periods? What will the profile of the average day look like? The peak rate?
ISSUE: Not only should the volume of data in the data warehouse environment be anticipated, but the volume of processing should be anticipated as well.
21. What level of granularity of data in the data warehouse environment will there be?
- A high level?
- A low level?
- Multiple levels?
- Will rolling summarization be done?
- Will there be a level of true archival data
- Will there be a living-sample level of data?
ISSUE: Clearly, the most important design issue in the data warehouse environment is that of granularity of data and the possibility of multiple levels of granularity. In other words, if the granularity of the data warehouse environment is done properly, then all other issues become straightforward. If the granularity of data in the data warehouse environment is not designed properly, then all other design issues become complex and burdensome.
22. What purge criteria for data in the data warehouse environment will there be? Will data be truly purged, or will it be compacted and archived elsewhere? What legal requirements are there? What audit requirements are there?
ISSUE: Even though data in the DSS environment is archival and of necessity has a low probability of access, it nevertheless has some probability of access (otherwise, it should not be stored). When the probability of access reaches zero (or approaches zero), the data needs to be purged. Given that volume of data is one of the most burning issues in the data warehouse environment, purging data that is no longer useful is one of the more important aspects of the data warehouse environment.
23. What total processing capacity requirements are there?
- For initial implementation
- For the data warehouse environment at maturity?
ISSUE: Granted that capacity requirements cannot be planned down to the last bit, it is worthwhile to at least estimate how much capacity will be required, just in case there is a mismatch between needs and what will be available.
24. What relationships between major subject areas will be recognized in the data warehouse environment? Will their implementation:
- Cause foreign keys to be kept up-to-date
- Make use of artifacts
What overhead is required in the building and maintenance of the relationship in the data warehouse environment?
ISSUE: One of the most important design decisions the data warehouse designer makes is that of how to implement relationships between data in the data warehouse environment. Data relationships are almost never implemented the same way in the data warehouse as they are in the operational environment.
25. Do the data structures internal to the data warehouse environment make use of:
- Arrays of data?
- Selective redundancy of data?
- Merging of tables of data?
- Creation of commonly used units of derived data?
ISSUE: Even though operational performance is not an issue in the data warehouse environment, performance is nevertheless an issue. The designer needs to consider the design techniques listed previously when they can reduce the total amount of I/O consumed. The techniques listed previously are classical physical denormalization techniques. Because data is not updated in the data warehouse environment, there are very few restrictions on what can and can't be done.
The factors that determine when one or the other design technique can be used include the following:
- The predictability of occurrences of data
- The predictability of the pattern of access of data
- The need to gather artifacts of data
26. How long will a recovery take? Is computer operations prepared to execute a full data warehouse database recovery? A partial recovery? Will operations periodically practice recovery so that it will be prepared in the event of a need for recovery? What level of preparedness is exhibited by:
- Systems support?
- Applications programming
- The DBA?
- The DA?
For each type of problem that can arise, is it clear whose responsibility the problem is?
ISSUE: As in operational systems, the designer must be prepared for the outages that occur during recovery. The frequency of recovery, the length of time required to bring the system back up, and the domino effect that can occur during an outage must all be considered.
Have instructions been prepared, tested, and written? Have these instructions been kept up-to-date?
27. What level of preparation is there for reorganization of restructing of:
- Systems support
- Applications programming?
- The DBA?
- The DA?
Have written instructions and procedures been set and tested? Are they up-to-date? Will they be kept up-to-date?
ISSUE (See the issue for question 26.)
28. What level of preparation is there for the loading of a database table by:
- Systems support?
- Applications programming?
- The DBA
- The DA?
Have written instructions and procedures been made and tested? Are they up-to-date? Will they be kept up-to-date?
ISSUE: The time and resources for loading can be considerable. This estimate needs to be made carefully and early in the development life cycle.
29. What level of preparation is there for the loading of a database index by:
- Systems support>
- Applications programming?
- The DBA?
- The DA?
ISSUE: (See issue for question 28.)
30. If there is ever a controversy as to the accuracy of a piece of data in the data warehouse environment, how will the conflict be resolved? Has ownership (or at least source identification) been done for each unit of data in the data warehouse environment? Will ownership be able to be established if the need arises? Who will address the issues of ownership? Who will be the final authority as to the issues of ownership?
ISSUE:Ownership or stewardship of data is an essential component of success in the data warehouse environment. It is inevitable that at some moment in time the contents of a database will come into question. The designer needs to plan in advance for this eventuality.
31. How will corrections to data be made once data is placed in the data warehouse environment? How frequently will corrections be made? Will corrections be monitored? If there is a pattern of regularly occurring changes, how will corrections at the source (that is, operational) level be made?
ISSUE: On an infrequent, nonscheduled basis, changes made need to be made in the data warehouse environment. If there appears to be a pattern to these changes, then the DSS analyst needs to investigate what is wrong in the operational system.
32. Will public summary data be stored separately from normal primitive DSS data? How much public summary data will there be? Will the algorithm required to create public summary data be stored?
ISSUE: Even though the data warehouse environment contains primitive data, it is normal for there to be public summary data in the data warehouse environment as well. The designer needs to have prepared a logical place for this data to reside.
33. What security requirements will there be for the databases in the data warehouse environment? How will security be enforced?
ISSUE: The access of data becomes an issue, especially as the detailed data becomes summarized or aggregated, where trends become apparent. The designer needs to anticipate the security requirements and prepare the data warehouse environment for them.
34. What audit requirements are there? How will audit requirements be met?
ISSUE: As a rule, a system audit can be done at the data warehouse level, but this is almost always a mistake. Instead, detailed record audits are best done at the system-of-record level.
35. Will compaction of data be used? Has the overhead of compacting and decompacting data been considered? What is the overhead? What are the savings in terms of DASD for compacting and decompacting data?
ISSUE: On one hand, compaction or encoding of data can save significant amounts of space. On the other hand, both compacting and encoding data require CPU cycles as data is decompacted or decoded on access. The designer needs to make a thorough investigation of these issues and a deliberate trade-off in the design.
36. Will encoding of data be done? Has the overhead of encoding and decoding been considered? What, in fact, is the overhead?
ISSUE: (See the issue for question 35.)
37. Will metadata be stored for the data warehouse environment?
ISSUE: Metadata needs to be stored with any archival data as a matter of policy. There is nothing more frustrating than an analyst trying to solve a problem using archival data when he or she does not know the meaning of the contents of a field being analyzed. This frustration can be alleviated by storing the semantics of data with the data as it is archived. Over time, it is absolutely normal for the contents and structure of data in the data warehouse environment to change. Keeping track of the changing definition of data is something the designer should make sure is done.
38. Will reference tables be stored in the data warehouse environment?
ISSUE: (See the issue for question 37.)
39. What catalog or dictionary will be maintained for the data warehouse environment? Who will maintain it? How will it be kept up-to-date? To whom will it be made available?
ISSUE: Not only is keeping track of the definition of data over time an issue, but keeping track of data currently in the data warehouse is important as well.
40. Will update (as opposed to loading and access of data) be allowed in the data warehouse environment? (Why? How much? Under what circumstances? On an exception-only basis?)
ISSUE: If any updating is allowed on a regular basis in the data warehouse environment, the designer should ask why. The only update that should occur should be on an exception basis and for only small amounts of data. Any exception to this severely compromises the efficacy of the data warehouse environment.
When updates are done (if, in fact, they are done at all), they should be run in a private window when no other processing is done and when there is slack time on the processor.
41. What time lag will there be in getting data from the operational to the data warehouse environment? Will the time lag ever be less than 24 hours? If so, why and under what conditions? Will the passage of data from the operational to the data warehouse environment be a "push" or a "pull" process?
ISSUE: As a matter of policy, any time lag less than 24 hours should be questioned. As a rule, if a time lag of less than 24 hours is required, it is a sign that the developer is building operational requirements into the data warehouse. The flow of data through the data warehouse environment should always be a pull process, where data is pulled into the warehouse environment when it is needed, rather than being pushed into the warehouse environment when it is available.
42. What logging of data warehouse activity will be done? Who will have access to the logs?
ISSUE: Most DSS processing does not require logging. If an extensive amount of logging is required, it is usually a sign of a lack of understanding of what type of processing is occurring in the data warehouse environment.
43. Will any data other than public summary data flow to the data warehouse environment from the departmental or individual level? If so, describe it.
ISSUE: Only on rare occasions should public summary data come from sources other than departmental or individual levels of processing. If much public summary data is coming from other sources, the analyst should ask why.
44. What external data (that is, data other than that generated by a company's internal sources and systems) will enter the data warehouse environment? Will it be specially marked? Will its source be stored with the data? How frequently will the external data enter the system? How much of it will enter? Will an unstructured format be required? What happens if the external data is found to be inaccurate?
ISSUE: Even though there are legitimate sources of data other than a company's operational systems, if much data is entering externally, the analyst should ask why. Inevitably, there is much less flexibility with the content and regularity of availability of external data, although external data represents an important resource that should not be ignored.
45. What facilities will exist that will help the departmental and the individual user to locate data in the data warehouse environment?
ISSUE: One of the primary features of the data warehouse is ease of accessibility of data. And the first step in the accessibility of data is the initial location of the data.
46. Will there be an attempt to mix operational and DSS processing on the same machine at the same time? If so:
- How much processing
- How much data?
ISSUE: For a multitude of reasons, it makes little sense to mix operational and DSS processing on the same machine at the same time. Only where there are small amounts of data and small amounts of processing should there be a mixture. But these are not the conditions under which the data warehouse environment is most cost-effective. See my previous book, Data Architecture: The Information Paradigm (Wellesey, MA: QED/Wiley, 1992) for an in-depth discussion of this issue.
47. How much data will flow back to the operational level from the data warehouse level? At what rate? At what volume? Under what response time constraints? Will the flowback be summarized data or individual units of data?
ISSUE: As a rule, data flows from the operational to the warehouse level to the departmental to the individual levels of processing. There are some notable exceptions. As long as not too much data "backflows," and as long as the backflow is done in a disciplined fashion, there usually is no problem. If there is a lot of data engaged in backflow, then a red flag should be raised.
48. How much repetitive processing will occur against the data warehouse environment? Will precalculation and storage of derived data save processing time?
ISSUE: It is absolutely normal for the data warehouse environment to have some amount of repetitive processing done against it. If only repetitive processing is done, however, or if no repetitive processing is planned, the designer should question why.
49. How will major subjects be partitioned? (By year? By geography? By functional unit? By product line?) Just how finely does the partitioning of the data break the data up?
ISSUE: Given the volume of data that is inherent to the data warehouse environment and the unpredictable use of the data, it is mandatory that data warehouse data be partitioned into physically small units that can be managed independently. The design issue is not whether partitioning is to be done. Instead, the design issue is how partitioning is to be accomplished. In general, partitioning is done at the application level rather than the system level.
The partitioning strategy should be reviewed with the following in mind:
- Current volume of data
- Future volume of data
- Current use of data
- Future use of data
- Partitioning of other data in the warehouse
- Use of other data>
- Volatility of the structure of data
50. Will sparse indexes be created? Would they be useful?
ISSUE: Sparse indexes created in the right place can save huge amounts of processing. By the same token, sparse indexes require a fair amount of overhead in their creation and maintenance. The designer of the data warehouse environment should consider their use.
51. What temporary indexes will be created? How long will they be kept? How large will they be?
ISSUE: (See the issue for question 50, except as it applies to temporary indexes.)
52. What documentation will there be at the departmental and individual levels? What documentation will there be of the interfaces between the data warehouse environment and the departmental environment? Between the departmental and the individual environment? Between the data warehouse environment and the individual environment?
ISSUE: Given the free-form nature of processing in the departmental and the individual environments, it is unlikely that there will be much in the way of available documentation. Documentation of the relationships between the environments is important for the reconcilability of data.
53. Will the user be charged for departmental processing? For individual processing? Who will be charged for data warehouse processing?
ISSUE: It is important that users have their own budgets and be charged for resources used. The instant that processing becomes "free," it is predictable that there will be massive misuse of resources. Achargeback system instills a sense of responsibility in the use of resources.
54. If the data warehouse environment is to be distributed, have the common parts of the warehouse been identified? How are they to be managed?
ISSUE: In a distributed data warehouse environment, some of the data will necessarily be tightly controlled. The data needs to be identified up-front by the designer and metadata controls put in place.
55. What monitoring of the data warehouse will there be? At the table level? At the row level? At the column level?
ISSUE: The use of data in the warehouse needs to be monitored to determine the dormancy rate. Monitoring must occur at the table level, the row level, and the column level. In addition, monitoring of transaction needs to occur as well.
56. Will Class IV ODS be supported? How much performance impact will there be on the data warehouse to support Class IV ODS processing?
ISSUE: Class IV ODS is fed from the data warehouse. The data needed to create the profile in the Class IV ODS is found in the data warehouse.
57. What testing facility will there be for the data warehouse?
ISSUE: Testing in the data warehouse is not the same level of importance as in the operational transaction environment. But occasionally there is a need for testing, especially when new types of data are being loaded and when there are large volumes of data.
58. What DSS applications will be fed from the data warehouse? How much volume of data will be fed?
ISSUE: DSS applications, just like data marts, are fed from the data warehouse. There are the issues of when the data warehouse will be examined, how often it will be examined, and what performance impact there will be for the analysis.
59. Will an exploration warehouse and/or a data mining warehouse be fed from the data warehouse? If not, will exploration processing be done directly in the data warehouse? If so, what resources will be required to feed the exploration and/or data mining warehouse?
ISSUE: The creation of an exploration warehouse and/or a data mining data warehouse can greatly alleviate the resource burden on the data warehouse. An exploration warehouse is needed when the frequency of exploration is such that statistical analysis starts to have an impact on data warehouse resources.
The issues here are the frequency of update and the volume of data that needs to be updated. In addition, the need for an incremental update of the data warehouse occasionally arises.
60. What resources are required for loading data into the data warehouse on an ongoing basis? Will the load be so large that it cannot fit into the window of opportunity? Will the load have to be parallelized?
ISSUE: Occasionally there is so much data that needs to be loaded into the data warehouse that the window for loading is not large enough. When the load is too large, there are several options:
- Creating a staging area where much preprocessing of the data to be loaded can be done independently
- Parallelizing the load stream so that the elapsed time required for loading is shrunk to the point that the load can be done with normal processing
- Editing or summarizing the data to be loaded so that the actual load is smaller
61. To what extent has the midlevel model of the subject areas been created? Is there a relationship between the different midlevel models?
ISSUE: Each major subject area has its own midlevel data model. As a rule, the midlevel data models are created only as the iteration of development needs to have them created. In addition, the midlevel data models are related in the same way that the major subject areas are related.
62. Is the level of granularity of the data warehouse sufficiently low enough to service all the different architectural components that will be fed from the data warehouse?
ISSUE: The data warehouse feeds many different architectural components. The level of granularity of the data warehouse must be sufficiently low to feed the lowest level of data needed anywhere in the corporate information factory (CIF). This is why it is said that the data in the data warehouse is at the lowest common denominator.
63. If the data warehouse will be used to store eBusiness and clickstream data, to what extent does the Granularity Manager filter the data?
ISSUE: The Web-based environment generates a huge amount of data. The data that is generated is at much too low a level of granularity. To summarize and aggregate the data before entering the data warehouse, the data is passed through a Granularity Manager. The Granularity Manager greatly reduces the volume of data that finds its way into the data warehouse.
64. What dividing line is used to determine what data is to be placed on disk storage and what data is to be placed on alternate storage?
ISSUE: The general approach that most organizations take in the placement of data on disk storage and data on alternate storage is to place the most current data on disk storage and to place older data on alternate storage. Typically, disk storage may hold two years' worth of data, and alternate storage may hold all data that is older than two years.
65. How will movement of data to and from disk storage and alternate storage be managed?
ISSUE: Most organizations have software that manages the traffic to and from alternate storage. The software is commonly known as a cross-media storage manager (CMSM).
66. If the data warehouse is a global data warehouse, what data will be stored locally and what data will be stored globally?
ISSUE: When a data warehouse is global, some data is stored centrally and other data is stored locally. The dividing line is determined by the use of the data.
67. For a global data warehouse, is there assurance that data can be transported across international boundaries?
ISSUE: Some countries have laws that do not allow data to pass beyond their boundaries. The data warehouse that is global must ensure that it is not in violation of international laws.
68. For ERP environments, has it been determined where the data warehouse will be located — inside the ERP software or outside the ERP environment?
ISSUE: Many factors determine where the data warehouse should be placed:
- Does the ERP vendor support the data warehouse?
- Can non-ERP data be placed inside the data warehouse?
- What analytical software can be used on the data warehouse if the data warehouse is placed inside the ERP environment?
- If the data warehouse is placed inside the ERP environment, what DBMS can be used?
69. Can alternate storage be processed independently?
ISSUE: Older data is placed in alternate storage. It is often quite useful to be able to process the data found in alternate storage independently of any consideration of data placed on disk storage.
70. Is the development methodology that is being used for development a spiral development approach or a classical waterfall approach?
ISSUE: The spiral development approach is always the correct development approach for the data warehouse environment. The waterfall SDLC approach is never the appropriate approach.
71. Will an ETL tool be used for moving data from the operational environment to the data warehouse environment, or will the transformation be done manually?
ISSUE: In almost every case, using a tool of automation to transform data into the data warehouse environment makes sense. Only where there is a very small amount of data to be loaded into the data warehouse environment should the loading of the data warehouse be done manually.
72. Will unstructured data be entered into the data warehouse?
ISSUE: Unstructured data can be very valuable in the data warehouse. But integrating unstructured data is difficult to do. To be useful, unstructured data must be edited and organized before it can be entered into the data warehouse. The editing includes the removal of stop words and the stemming of words. In addition, identifiers must be found. There are two kinds of useful identifiers: identifiers and close identifiers. Identifiers are those terms which serve to specifically identify an account. Typical identifiers are Social Security number, license number, and employee number. Typical close identifiers are name, address, and other descriptive information.
In addition to text being edited, unstructured data needs to be screened. Much of text is what can be termed "blather." Blather does not contribute to the business intelligence and does not belong in the data warehouse.
73. Will corporate communications be organized before being placed in the data warehouse?
ISSUE: Corporate communications can be very valuable to the data warehouse. Typically, corporate communications are useful for the purpose of complementing CRM data. However, in order for corporate communications to be useful, it must be edited and organized according to identifiers. In addition, it helps to sort through the communications, identifying which ones are critical and which ones are noncritical.
74. Is there ever a need for referring back to the unstructured environment?
ISSUE: On occasion, there is the need to refer back to the unstructured environment. The problem is that there is no guarantee that the data found in the unstructured environment will be there when referenced. Emails are deleted. Text files are erased. Data changes location. What is the contingency plan for finding data once displaced in the unstructured environment?
75. Unstructured data in the data warehouse environment can take up large amounts of space. How can the space required for unstructured data in the structured environment be minimized?
ISSUE: Data warehouses grow large with no help from anyone. But when unstructured data is added to a data warehouse, the potential for very large volumes of data grows. Several techniques can be employed to minimize the space required for unstructured data, including the following:
- The use of simple indexes. If a unit of unstructured data is not required online, simply create an index that points to where the unstructured data is found.
- Use the first n bytes of an unstructured document to allow the user to look at what the document look like from the beginning. In doing so, it may not be necessary to store the entire document.
- Carry context surrounding keywords. The context surrounding key words tells the user little about the document and a lot about the context within which keywords are found and used.
ISSUE: As data warehouses grow large, the volumes of data contained inside them starts to exhibit different patterns of use. Some data is used infrequently; other data is used frequently. It is very important that the data administrator knows when this division in the patterns of use of data starts to occur.
77. Is data monitored by column for use?
ISSUE: It is not enough to monitor the use of data by rows. Occasionally, columns of data that are never used will be included in a data warehouse. It makes sense to remove those columns.
78. How many resources does the data monitor consume?
ISSUE: Typically, the monitor provided by the DBMS vendor uses huge amounts of resources. A common practice is to turn off the monitor during peak-period processing. Unfortunately, the peak period is the very moment when you want the monitor to be turned on. Third-party software is usually much more efficient in monitoring the data warehouse than vendor-supplied software.
79. Is data monitored by row for use?
ISSUE: The most common way of measuring use is by row. Rows of data that are not being accessed should be removed from the data warehouse to near-line storage or archival storage.
80. How will data be managed in terms of transmitting the data from disk storage to near-line storage or from disk storage to archival storage?
ISSUE: Will a CMSM be used? Will the transportation be done by hand? How long will the transportation be done by hand? What sort of operating window is required?
81. What signals the system that a query may be accessing data from nearline storage or archival storage?
ISSUE: If the system waits until a query goes into execution, then the system has only awkward choices as to accessing data from near-line storage or archival storage. It is much better if the end user submits a queuing request before the query is submitted. An alternative is to parse and queue requests that will look at data from the near-line environment.
82. What rate of growth will there be for the data?
ISSUE: Not only is a large amount of data needed for the data warehouse, but the data's growth is an issue as well. It is wise to anticipate growth and make sure that there is space enough before the space is needed.<.p>
83. Will a miltidimensional database design be used for the data warehouse?
ISSUE: Miltidimensional design is for data marts and other analytical structures. Miltidimensional design does not fit with the data warehouse.
84. Will statistical analysis be done to any extent in the data warehouse?
ISSUE: If statistical analysis will be done to any extent in a data warehouse, strong consideration should be given to an exploration warehouse.
85. Will external data be entered into the data warehouse for the purpose of use in statistical analysis?
ISSUE: If external data is being placed in the data warehouse for the purpose of statistical analysis, it may make sense to build a separate exploration warehouse and put the external data in the exploration warehouse.
86. Are data marts being placed on the same physical processor as the data warehouse?
ISSUE: For a variety of reasons, it makes almost no sense to place data marts on the same physical processor as the data warehouse. Because of the differences in work load, the cost of the machine cycles, and the volumes of data that are collected there, moving data marts to a separate processor always makes sense.
In fact, moving data marts to separate processors from other data marts usually makes sense. By moving data marts to separate processors, different organizational units can be assigned to a processor and can take ownership of the data and the processing that occurs inside the data mart.
87. Has the data velocity been calculated for a data warehouse? Does there need to be a higher velocity?
ISSUE: Data has velocity as it passes through the data warehouse. On occasion, it is necessary to push the data through the data warehouse as quickly as possible. When that is the case, the cost of high velocity can be quite high. The business case for high velocity needs to be examined. On occasion, the need for high velocity may be a symptom of the building of operational systems in the data warehouse environment. If that is the case, the operational processing needs to be placed elsewhere.
88. Is clickstream data being entered into the data warehouse? If so, is the clickstream data being passed through the Granularity Manager?
ISSUE: Clickstream data is data that is generated by the Web environment. Approximately 90 percent of clickstream data needs to be eliminated or consolidated before it is entered into the data warehouse.
89. Is clickstream data entered directly into the data warehouse without first passing through a Granularity Manager?
ISSUE: Data should never enter the data warehouse directly from the Web environment without first passing through the Granularity Manager. There is enough data in the data warehouse without contaminating the data there with massive amounts of irrelevant detailed data.
90. Does data flow directly from the data warehouse to the Web environment?
ISSUE: The proper flow of data is from the data warehouse to the ODS environment. Once in the ODS environment, data is collected and then is available to the Web-based environment. In almost every case, the flow of data from the data warehouse environment to the Web is a very bad idea. The data warehouse is not geared to support the response times needed by the Web environment.
91. Is "real-time" data warehousing being done in the data warehouse?
ISSUE: Real-time data warehousing is best done in the ODS environment. The ODS environment is physically separate from the data warehouse environment. While it is true that the data warehouse environment can withstand some small amounts of real-time processing at certain slow periods of the day, to make a real-time processing environment out of the data warehouse is a strategic mistake.
92. Are profile records being collected and created in the ODS environment?
ISSUE: One of the best uses of the ODS environment is to collect detailed transaction data in the data warehouse and then use that detailed data for the creation of a profile record in the ODS. Once the profile record is created in the ODS, the profile data is ready in milliseconds for access, from the Web environment or elsewhere.
93. Is data being used directly from the data warehouse?
ISSUE: Over time, the use of data from the data warehouse shifts from direct use to indirect use. If, after five years, there is still a lot of direct use of data from the data warehouse, then it may be time to ask the question, "Should we build some data marts of other analytical applications?"
94. Are end users being prevented from using data in the data warehouse?
ISSUE: While it is true that data from the data warehouse has few users, it is a bit extreme to say that no one should directly use the data in the data warehouse. Even in the mature days of a data warehouse, there will be a few cases where direct access to the data warehouse is needed and is a good thing to do.
95. Is data being monitored to see if patterns of use are starting to appear?
ISSUE: If patterns of use are starting to appear in the data warehouse, it is time to ask the question, "Should data marts or other forms of analytical processing be created?"
96. What kind of training are your end users receiving in the usage data warehouse?
ISSUE: In many cases, training your user on the capabilities of the data warehouse can pay very big dividends.
97. How are your users being kept abreast of the changes that are being made in the data warehouse?
ISSUE: The data warehouse changes over time. New data and new features of the data warehouse may produce big benefits for your users, but not if your users don't know about what's new.
98. Is the data warehouse being used primarily by farmers or by explorers?
ISSUE: In the well-rounded data warehouse environment, there normally is a mixture of both farmers and explorers who use the data warehouse. If you are not using the data warehouse in multiple ways, then you are not getting your full money's worth from your data warehouse.
. To purchase the W.H. Inmon book "Building the Data Warehouse," click here.
This was first published in February 2006