~Note~

Please note that you can always click on an image in my postings and it will render a clear full sized version in a separate browser page! Also please note that this blog is best viewed with Firefox and Chrome
Google
 

Saturday, June 28, 2008

Data Warehouse Structures: a bridge table

Bridge tables; I see this topic come up quite often. They are also known in some circles as gap tables, associative tables, helper tables, or sometimes just B tables. I have referenced a few very valuable bits about this structure below. Unfortunately, google needs a few more words than just bridge table or helper table, since the card game 'bridge' actually uses a special table, thus making the search for our info a bit harder. Sheesh! Copied here for fear they might be removed for some reason...I cannot always trust that the Internet Wayback Machine will work perfectly. So here in its original glory:

Help for Dimensional Modeling

Helper tables let you design and manage multivalued dimensions successfully.

By Ralph Kimball, DBMS online magazine

The goal of dimensional modeling is to represent a set of business measurements in a standard framework. Dimensional modeling is attractive because end users usually easily understand this framework. The schemas that result from dimensional modeling are so predictable that query tool vendors can build their tools around a set of well-known structures. The classic star join schema that results from a dimensional model usually has between six and 10 dimension tables connected directly to the fact table, as Figure 1 shows. These dimension tables have a very specific and precise relationship to the fact table.

FIG 1

Think of the fact table as a set of measurements made in the marketplace. These measurements are usually numeric and taken before the creation of a given fact table. A fact table record may represent an individual transaction, such as a customer withdrawal made at an ATM, or the fact table record may represent some kind of aggregated total, such as the sales of a given product in a store on a particular day. The dimension tables, on the other hand, usually represent textual attributes that are already known about such things as the product, the customer, or the calendar.

When you are building a fact table, the most important step in the detailed logical design is to declare the grain of the fact table. The grain declares the exact meaning of an individual fact record. In these examples, the first grain is individual customer transaction and the second grain is daily product total in the store. If the database designer is very clear about the grain of the fact table, then choosing the appropriate dimensions for the fact table is usually easy.

The secret of choosing dimensions for a given fact table is to identify any description that has a single value for an individual fact table record. In this way, you can work outward from the grain of the fact table and "decorate" the fact table with as many dimensions as you can imagine. For the ATM customer transaction example, imagine that the following dimensions all have a single value at the instant of the transaction:

  • Calendar Date
  • Time of Day
  • Customer Account
  • Physical Location
  • Transaction Type (deposit, withdrawal, balance inquiry, and so on)
  • Weather Summary for the Day.

These are all excellent dimensions to attach to the transaction-grained fact table record. Each of these dimensions has many descriptive attributes. The descriptive attributes are usually textual and are the basis for constraining and grouping in the user's reports. In the case of Time of Day, the textual attributes could refer to specific periods such as Morning Rush Hour, Mid Morning, Lunch Hour, Mid Afternoon, Afternoon Rush Hour, and so on. In this case, the sole numeric measured fact is the amount of the transaction. This is the basis for Figure 1.

Strict adherence to the grain definition lets you disqualify dimensions that do not have a single value for a given fact record. In the daily product totals example, you can propose the classic retail dimensions of:

  • Calendar Date
  • Product
  • Store
  • Promotion (assuming the promotion remains in effect the entire day).

But you must omit the Customer and the Check Out Clerk dimensions that you might assume would be included because these dimensions have many values at the daily grain we have chosen.

These examples give a very powerful insight. You are more likely to disqualify a dimension if you are dealing with an aggregated or summarized table. The more the fact table is summarized, the fewer the number of dimensions you can attach to the fact records. The converse of this is eye opening. The more granular the data, the more dimensions make sense. The lowest-level data in any organization is the most dimensional. This is why I argue that the "atomic data store" must be presented as a dimensional model.

Having made a pretty tidy argument for single-valued dimensions, perhaps I should consider whether there are ever legitimate exceptions. Are there situations where you might need to attach a multivalued dimension to a fact table? Is this even possible, and what problems might arise?

Consider the following example from healthcare billing. You are handed a data source for which the grain is the individual line item on a doctor bill. The data source could be patient visits to doctor offices or individual charges on a hospital bill. These individual line items have a rich set of dimensions that we can list rapidly:

  • Calendar Date (of incurred charge)
  • Patient
  • Doctor (usually called "Provider")
  • Location
  • Service Performed
  • Diagnosis
  • Payer.
The numeric additive facts in this design (which are the core of every fact table in a dimensional design) would include Amount Charged and perhaps others including Amount Paid, depending on the nature of the data source.

So far, this design seems to be pretty straightforward, with obvious single values for all of the dimensions. But there is a sleeper. In many healthcare situations, there may be multiple values for Diagnosis. What do you do if a given patient has three separate diagnoses at the moment the service was performed? How about really sick people in hospitals that might have up to 10 diagnoses? How do you encode the Diagnosis dimension if you wish to represent this information?

Database designers usually take one of four approaches to this kind of open-ended, multivalued situation:

  • Disqualify the Diagnosis dimension because it is multivalued
  • Choose one value (the "primary" diagnosis) and omit the other values
  • Extend the dimension list to have a fixed number of Diagnosis dimensions
  • Put a helper table in between this fact table and the Diagnosis dimension table.

Let's not take the easy way out by disqualifying the Diagnosis dimension from the design.

Frequently, designers choose a single value (the second alternative). In the healthcare field this often shows up as the primary, or admitting, diagnosis. In many cases, your hands are tied because you are given the data in this form by the production system. If you take this approach, the modeling problem goes away, but you are left doubting whether the Diagnosis data is useful.

The third approach of creating a fixed number of additional Diagnosis dimension slots in the fact table key list is a hack, and you should avoid it. Inevitably, there will be some complicated example of a very sick patient who exceeds the number of Diagnosis slots you have allocated. Also, you cannot easily query the multiple separate Diagnosis dimensions. If "headache" is a diagnosis, which Diagnosis dimension should be constrained? The resulting or logic across dimensions is notorious for running slowly on relational databases. For all these reasons, you should avoid the multiple dimensions style of design.

FIG 2

If you really insist on modeling this multivalued situation, then a "helper" table placed between the Diagnosis dimension and the fact table is the best solution. (See Figure 2.) The Diagnosis key in the fact table is changed to be a Diagnosis Group key. The helper table in the middle is the Diagnosis Group table. It has one record for each diagnosis in a group of diagnoses. If I walk into the doctor's office with three diagnoses, then I need a Diagnosis Group with three records in it. It is up to the modeler to build either these Diagnosis Groups for each individual or a library of "known" Diagnosis Groups. Perhaps my three diagnoses would be called "Kimball's Syndrome."

The Diagnosis Group table is joined to the original Diagnosis dimension on the Diagnosis key. The Diagnosis Group table in Figure 2 contains a very important numeric attribute: the weighting factor. The weighting factor allows reports to be created that don't double count the Billed Amount in the fact table. For instance, if you constrain some attribute in the Diagnosis dimension such as "Contagious Indicator" with the values Contagious and Not Contagious, then you can group by the Contagious Indicator and produce a report with the correct totals. To get the correct totals, we must multiply the Billed Amount by the associated weighting factor. This is a correctly weighted report.

You can assign the weighting factors equally within a Diagnosis Group. If there are three diagnoses, then each gets a weighting factor of 1/3. If you have some other rational basis for assigning the weighting factors differently, then you can change the factors, as long as all the factors in a Diagnosis Group always add up to one.

You can, interestingly enough, deliberately omit the weighting factor and deliberately double count the same report grouped by Contagious Indicator. In this case, you have produced an "impact report" that shows the total Billed Amount implied partially or fully by both values of Contagious Indicator. Although the correctly weighted report is the most common and makes the most sense, the impact report is interesting and is requested from time to time. Such an impact report should be labeled so that the reader is not misled by any summary totals.

Although the helper table clearly violates the classic star join design where all the dimension tables have a simple one-to-many relationship to the fact table, there is no avoiding the issue of what to do with multivalued dimensions that designers insist on attaching to a fact table. Fortunately, designers rarely insist on attaching a multivalued dimension to a set of measurements. In most cases, you have no rational basis for assigning weighting factors, or even worse, you don't have a plausible short list of candidates from the multivalued dimension. Thus you don't waste time trying to attach the retailer's Customer dimension to an aggregated daily or monthly sales fact table. It just isn't interesting or useful.

When you attach a helper table to a fact table, you can preserve the star join illusion in your end-user interfaces by creating a view that prejoins the fact table to the helper table. The resulting view then appears to have a simple Diagnosis key that joins to our Diagnosis table in the healthcare example. The view can also predefine the multiplication of the weighting factor with any additive facts in the fact table.

I have seen two other situations besides healthcare where a multivalued dimension makes sense. In addition to the Diagnosis example discussed in this column, I have built multivalued dimensions for retail banks and standard industry classifications.

Retail Banks. There are often cases where the Account Dimension is "inhabited" by one or more human customers. If the bank wants to associate individual customers with account balances, then the Account dimension must play the same role as the Diagnosis Group dimension in the healthcare example. Banks are quite interested in both correctly weighted reports and impact reports.

Standard Industry Classifications. This term defines the so-called SIC code that is assigned to a commercial enterprise to describe what industry segment they are in. The problem with SIC codes is that all big enterprises are represented by multiple SIC codes. If you encode a fact table with an SIC dimension, you have the same problem as with the multiple diagnoses. SIC codes are really quite useful, however. If you want to summarize all the business you do selling to "Manufacturers" and "Retailers" by using their SIC code, you will need an SIC Group helper table. Here's a case where with some careful thought you may assign unequal weighting factors if you believe the enterprise is mostly a manufacturer and only a little bit a retailer.

Next month I will describe a different kind of helper table for dimensions. In this case, the helper table does not describe a many-to-many relationship but rather describes a hierarchical relationship. We can use this second kind of helper table to navigate organizational hierarchies and manufacturing parts explosions.

Friday, June 20, 2008

On the OTN forums lately...

I have taken a brief reprieve from my blogging for a bit so that I can contribute more to the OTN forums. I have been posting anything from 1-5 replies daily if I think I might be able to help anyone with what I know. I think a lot of my postings have been in the OWB area for now.

I'm likely going to Oracle Open World this year, as I got an invite with the nice free full conference pass. I am not sure I'd go without that thing being somehow covered! :) Hopefully I'll see my Oracle friends out there, and my family who (some of them) live in the bay area, as well as old school friends in SF and Oakland that went to Berkeley and my old U of I buddies that live in the Silicon Valley now.

Woohoo!



Wednesday, June 11, 2008

OWB can be a bit 'blind' when Oracle RAC is involved!

I will update this bit soon I hope. I'll post a little nasty bit that tricked us and cause our mappings to run horribly slow on a RAC environment. Hopefully when I get a little more information on RAC from Dan Norris, I'll update this with a clear explanation!