Transportation 315
Passenger     Frequent      Home      Club Membership   Lifetime Mileage
Profile Key   Flyer Tier    Airport   Status            Tier
1             Basic         ATL       Non-Member        Under 100,000 miles
2             Basic         ATL       Club Member       Under 100,000 miles
3             Basic         BOS       Non-Member        Under 100,000 miles
...           ...           ...       ...               ...
789           MidTier       ATL       Non-Member        100,000-499,999 miles
790           MidTier       ATL       Club Member       100,000-499,999 miles
791           MidTier       BOS       Non-Member        100,000-499,999 miles
...           ...           ...       ...               ...
2468          WarriorTier   ATL       Club Member       1,000,000-1,999,999 miles
2469          WarriorTier   ATL       Club Member       2,000,000-2,999,999 miles
2470          WarriorTier   BOS       Club Member       1,000,000-1,999,999 miles
...           ...           ...       ...               ...
Figure 12-3: Passenger mini-dimension sample rows.
The aircraft dimension contains information about each plane flown. The origin
and destination airports associated with each flight are called out separately to
simplify the user’s view of the data and make access more efficient.
The class of service flown describes whether the passenger sat in economy, pre-
mium economy, business, or first class. The fare basis dimension describes the terms
surrounding the fare. It would identify whether it’s an unrestricted fare, a 21-day
advance purchase fare with change and cancellation penalties, or a 10 percent off
fare due to a special promotion.
The sales channel dimension identifies how the ticket was purchased, whether
through a travel agency, directly from the airline (by phone, at a city ticket office, or
on its website), or via another internet travel services provider. Although the sales channel
relates to the entire ticket, each segment should inherit ticket-level dimensional-
ity. In addition, several operational numbers are associated with the flight activity
data, including the itinerary number, ticket number, flight number, and segment
sequence number.
The facts captured at the segment level of granularity include the base fare rev-
enue, passenger facility charges, airport and government taxes, other ancillary
charges and fees, segment miles flown, and segment miles awarded (in those cases
in which a minimum number of miles are awarded regardless of the flight distance).
Linking Segments into Trips
Despite the powerful dimensional framework you just designed, you cannot easily
answer one of the most important questions about your frequent flyers, namely,
“Where are they going?” The segment grain masks the true nature of the trip. If
you fetch all the segments of a trip and sequence them by segment number, it is still
nearly impossible to discern the trip start and endpoints. Most complete itinerar-
ies start and end at the same airport. If a lengthy stop were used as a criterion for
a meaningful trip destination, it would require extensive and tricky processing at
the BI reporting layer whenever you try to summarize trips.
The answer is to introduce two more airport role-playing dimensions, trip origin
and trip destination, while keeping the grain at the flight segment level. These are
determined during data extraction by looking on the ticket for any stop of more
than four hours, which is the airline’s official definition of a stopover. You need to
exercise some caution when summarizing data by trip in this schema. Some of the
dimensions, such as fare basis or class of service flown, don’t apply at the trip level.
On the other hand, it may be useful to see how many trips from San Francisco to
Minneapolis included an unrestricted fare on a segment.
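The stopover rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segments of one ticket are already sorted by segment sequence number, their timestamps are expressed in a common time standard, and the names Segment, assign_trips, and STOPOVER_THRESHOLD are hypothetical rather than part of any airline system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed threshold: the four-hour stopover definition quoted in the text.
STOPOVER_THRESHOLD = timedelta(hours=4)

@dataclass
class Segment:
    origin: str
    destination: str
    departs: datetime  # assumed to be in a common standard such as UTC
    arrives: datetime
    trip_origin: str = ""
    trip_destination: str = ""

def assign_trips(segments):
    """Stamp each segment with its trip origin and trip destination.

    A new trip starts whenever the layover before a segment exceeds the
    stopover threshold. The grain stays at the flight segment level.
    """
    trips, current = [], []
    for seg in segments:
        if current and seg.departs - current[-1].arrives > STOPOVER_THRESHOLD:
            trips.append(current)
            current = []
        current.append(seg)
    if current:
        trips.append(current)
    for trip in trips:
        for seg in trip:
            seg.trip_origin = trip[0].origin
            seg.trip_destination = trip[-1].destination
    return segments

# Hypothetical ticket: SFO to MSP via DEN, two days in MSP, then home.
ticket = assign_trips([
    Segment("SFO", "DEN", datetime(2013, 5, 1, 8, 0), datetime(2013, 5, 1, 10, 30)),
    Segment("DEN", "MSP", datetime(2013, 5, 1, 11, 45), datetime(2013, 5, 1, 13, 40)),
    Segment("MSP", "SFO", datetime(2013, 5, 3, 17, 0), datetime(2013, 5, 3, 21, 5)),
])
for seg in ticket:
    print(seg.origin, seg.destination, seg.trip_origin, seg.trip_destination)
```

The short connection in Denver stays inside the outbound trip, so both outbound segments carry SFO as trip origin and MSP as trip destination, while the return segment forms its own trip.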
In addition to linking segments into trips on the segment flight activity schema,
if the business users are constantly looking at information at the trip level, rather
than by segment, you might create an aggregate fact table at the trip grain. Some of
the earlier dimensions discussed, such as class of service and fare basis, obviously
would not be applicable. The facts would include aggregated metrics like trip total
base fare or trip total taxes, plus additional facts that would appear only in this
complementary trip summary table, such as the number of segments in the trip.
However, you would go to the trouble of creating this aggregate table only if there
were obvious performance or usability issues when you use the segment-level table
as the basis for rolling up the same reports. If a typical trip consists of three
segments, the aggregate table would hold roughly one-third as many rows, so you
might see barely a threefold performance improvement, meaning it may not be
worth the bother.
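As a rough sketch of such a trip-grain aggregate, the following uses Python's built-in sqlite3 module to roll segment rows up to one row per trip. The table and column names are invented for illustration, and identifying a trip by ticket number plus trip origin and destination is an assumption; a real design would more likely carry an explicit trip identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segment_fact (
    ticket_number TEXT,
    segment_seq   INTEGER,
    trip_origin   TEXT,
    trip_dest     TEXT,
    base_fare     REAL,
    taxes         REAL
);
INSERT INTO segment_fact VALUES
    ('T1', 1, 'SFO', 'MSP', 120.0, 14.0),
    ('T1', 2, 'SFO', 'MSP',  95.0, 11.0),
    ('T1', 3, 'MSP', 'SFO', 210.0, 25.0);

-- Aggregate to the trip grain; segment_count exists only at this grain.
CREATE TABLE trip_fact AS
SELECT ticket_number, trip_origin, trip_dest,
       SUM(base_fare) AS trip_total_base_fare,
       SUM(taxes)     AS trip_total_taxes,
       COUNT(*)       AS segment_count
FROM segment_fact
GROUP BY ticket_number, trip_origin, trip_dest;
""")

trip_rows = conn.execute(
    "SELECT trip_origin, trip_dest, trip_total_base_fare, "
    "trip_total_taxes, segment_count "
    "FROM trip_fact ORDER BY trip_origin DESC"
).fetchall()
for row in trip_rows:
    print(row)
```

Class of service and fare basis are deliberately absent from the aggregate because they do not apply at the trip level.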
Related Fact Tables
As discussed earlier, you would likely create a leg-grained flight activity fact table
to satisfy the more operational needs surrounding the departure and arrival of each
flight. Metrics at the leg level might include actual and blocked flight durations,
departure and arrival delays, and departure and arrival fuel weights.
In addition to the flight activity, there will be fact tables to capture reservations
and issued tickets. Given the focus on maximizing revenue, there might be a rev-
enue and availability snapshot for each flight; it could provide snapshots for the
final 90 days leading up to a flight departure with cumulative unearned revenue and
remaining availability per class of service for each scheduled flight. The snapshot
might include a dimension supporting the concept of “days prior to departure” to
facilitate the comparison of similar flights at standard milestones, such as 60 days
prior to scheduled departure.
Extensions to Other Industries
Using the airline case study to illustrate a voyage schema makes intuitive sense
because most people have boarded a plane at one time or another. We’ll briefly touch
on several other variations on this theme.
Cargo Shipper
The schema for a cargo shipper looks quite similar to the airline schemas just
developed. Suppose a transoceanic shipping company transports bulk goods in
containers from foreign to domestic ports. The items in the containers are shipped
from an original shipper to a final consignee. The trip can have multiple stops at
intermediate ports. It is possible the containers may be off-loaded from one ship to
another at a port. Likewise, it is possible one or more of the legs may be by truck
rather than ship.
As illustrated in Figure 12-4, the grain of the fact table is the container on a spe-
cific bill-of-lading number on a particular leg of its trip. The ship mode dimension
identifies the type of shipping company and specific vessel. The container dimen-
sion describes the size of the container and whether it requires electrical power or
refrigeration. The commodity dimension describes the item in the container. Almost
anything that can be shipped can be described by harmonized commodity codes,
which are a kind of master conformed dimension used by agencies, including U.S.
Customs. The consignor, foreign transporter, foreign consolidator, shipper, domestic
consolidator, domestic transporter, and consignee are all roles played by a master
business entity dimension that contains all the possible business parties associated
with a voyage. The bill-of-lading number is a degenerate dimension. We assume the
fees and tariffs are applicable to the individual leg of the voyage.
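One way to picture the role-playing business entity dimension is as a single physical table exposed through one view per role. The sketch below uses Python's sqlite3 module and shows only two of the seven roles; all table, view, and column names, along with the sample rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE business_entity_dim (
    business_entity_key INTEGER PRIMARY KEY,
    entity_name TEXT,
    country     TEXT
);
INSERT INTO business_entity_dim VALUES
    (1, 'Acme Exports', 'JP'),
    (2, 'Harbor Imports', 'US');

-- One view per role, each with uniquely labeled columns.
CREATE VIEW consignor_dim AS
SELECT business_entity_key AS consignor_key,
       entity_name         AS consignor_name,
       country             AS consignor_country
FROM business_entity_dim;

CREATE VIEW consignee_dim AS
SELECT business_entity_key AS consignee_key,
       entity_name         AS consignee_name,
       country             AS consignee_country
FROM business_entity_dim;

CREATE TABLE shipping_transport_fact (
    consignor_key         INTEGER,
    consignee_key         INTEGER,
    bill_of_lading_number TEXT,
    leg_fee               REAL
);
INSERT INTO shipping_transport_fact VALUES (1, 2, 'BL-0001', 850.0);
""")

row = conn.execute("""
SELECT cr.consignor_name, ce.consignee_name, f.leg_fee
FROM shipping_transport_fact f
JOIN consignor_dim cr ON cr.consignor_key = f.consignor_key
JOIN consignee_dim ce ON ce.consignee_key = f.consignee_key
""").fetchone()
print(row)
```

Because each view renames its columns, a query that touches several roles at once never has to disambiguate two columns that are both called entity_name.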
Travel Services
If you work for a travel services company, you can complement the flight activity
schema with fact tables to track associated hotel stays and rental car usage. These
schemas would share several common dimensions, such as the date and customer.
For hotel stays, the grain of the fact table is the entire stay, as illustrated in Figure
12-5. The grain of a similar car rental fact table would be the entire rental episode.
Of course, if constructing a fact table for a hotel chain rather than a travel services
company, the schema would be much more robust because you’d know far more
about the hotel property characteristics, the guest’s use of services, and associated
detailed charges.
Shipping Transport Fact
    Voyage Departure Date Key (FK)
    Leg Departure Date Key (FK)
    Voyage Origin Port Key (FK)
    Voyage Destination Port Key (FK)
    Leg Origin Port Key (FK)
    Leg Destination Port Key (FK)
    Ship Mode Key (FK)
    Container Key (FK)
    Commodity Key (FK)
    Consignor Key (FK)
    Foreign Transporter Key (FK)
    Foreign Consolidator Key (FK)
    Shipper Key (FK)
    Domestic Consolidator Key (FK)
    Domestic Transporter Key (FK)
    Consignee Key (FK)
    Bill-of-Lading Number (DD)
    Leg Fee
    Leg Tariffs
    Leg Miles
Dimensions joined to the fact table
    Date Dimension (views for 2 roles)
    Port Dimension (views for 4 roles)
    Business Entity Dimension (views for 7 roles)
    Ship Mode Dimension
    Container Dimension
    Commodity Dimension
Figure 12-4: Shipper schema.
Travel Services Hotel Stay Fact
    Reservation Date Key (FK)
    Arrival Date Key (FK)
    Departure Date Key (FK)
    Customer Key (FK)
    Hotel Property Key (FK)
    Sales Channel Key (FK)
    Confirmation Number (DD)
    Ticket Number (DD)
    Number of Nights
    Extended Room Charge
    Tax Charge
Dimensions joined to the fact table
    Date Dimension (views for 3 roles)
    Customer Dimension
    Hotel Property Dimension
    Sales Channel Dimension
Figure 12-5: Travel services hotel stay schema.
Combining Correlated Dimensions
We stated previously that if a many-to-many relationship exists between two groups
of dimension attributes, they should be modeled as separate dimensions with sepa-
rate foreign keys in the fact table. Sometimes, however, you encounter situations
where these dimensions can be combined into a single dimension rather than treat-
ing them as two separate dimensions with two separate foreign keys in the fact table.
Class of Service
The Figure 12-2 draft schema includes the class of service flown dimension.
Following a design checkpoint with the business community, you learn the
users also want to analyze the booking class purchased. In addition, the business users
want to easily filter and report on activity based on whether an upgrade or down-
grade occurred. Your initial reaction might be to include a second role-playing
dimension and foreign key in the fact table to support both the purchased and
flown class of service. In addition, you would need a third foreign key for the
upgrade indicator; otherwise, the BI application would need to include logic to
identify numerous scenarios as upgrades, including economy to premium economy,
economy to business, economy to first, premium economy to business, and so on.
In this situation, however, there are only four rows in the class dimension table
to indicate first, business, premium economy, and economy classes. Likewise, the
upgrade indicator dimension also would have just three rows in it, corresponding to
upgrade, downgrade, or no class change. Because the row counts are so small, you
can elect instead to combine the dimensions into a single class of service dimension,
as illustrated in Figure 12-6.
Class of Service  Class         Class         Purchased-Flown            Class Change
Key               Purchased     Flown         Group                      Indicator
1                 Economy       Economy       Economy-Economy            No Class Change
2                 Economy       Prem Economy  Economy-Prem Economy       Upgrade
3                 Economy       Business      Economy-Business           Upgrade
4                 Economy       First         Economy-First              Upgrade
5                 Prem Economy  Economy       Prem Economy-Economy       Downgrade
6                 Prem Economy  Prem Economy  Prem Economy-Prem Economy  No Class Change
7                 Prem Economy  Business      Prem Economy-Business      Upgrade
8                 Prem Economy  First         Prem Economy-First         Upgrade
9                 Business      Economy       Business-Economy           Downgrade
10                Business      Prem Economy  Business-Prem Economy      Downgrade
11                Business      Business      Business-Business          No Class Change
12                Business      First         Business-First             Upgrade
13                First         Economy       First-Economy              Downgrade
14                First         Prem Economy  First-Prem Economy         Downgrade
15                First         Business      First-Business             Downgrade
16                First         First         First-First                No Class Change
Figure 12-6: Combined class dimension sample rows.
The Cartesian product of the separate class dimensions results in a 16-row
dimension table (4 class purchased rows times 4 class flown rows). You also have
the opportunity in this combined dimension to describe the relationship between
the purchased and flown classes, such as a class change indicator. Think of this
combined class of service dimension as a type of junk dimension, introduced in
Chapter 6. In this case study, the attributes are tightly correlated. Other airline fact
tables, such as inventory availability or ticket purchases, would invariably reference
a conformed class dimension table with just four rows.
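The combined dimension is small enough to generate rather than maintain by hand. The following Python sketch builds the 16 rows as a Cartesian product and derives the class change indicator from an assumed cabin ranking; the column names and the RANK lookup are illustrative choices, not something prescribed by the case study.

```python
from itertools import product

# Cabins ordered from lowest to highest. The ranking is an assumption used
# only to derive the class change indicator.
CLASSES = ["Economy", "Prem Economy", "Business", "First"]
RANK = {name: position for position, name in enumerate(CLASSES)}

def class_change(purchased, flown):
    """Compare the flown cabin with the purchased cabin."""
    if RANK[flown] > RANK[purchased]:
        return "Upgrade"
    if RANK[flown] < RANK[purchased]:
        return "Downgrade"
    return "No Class Change"

# Cartesian product of purchased and flown classes: 4 x 4 = 16 rows.
class_of_service_dim = [
    {
        "class_of_service_key": key,
        "class_purchased": purchased,
        "class_flown": flown,
        "purchased_flown_group": f"{purchased}-{flown}",
        "class_change_indicator": class_change(purchased, flown),
    }
    for key, (purchased, flown) in enumerate(product(CLASSES, repeat=2), start=1)
]

for row in class_of_service_dim:
    print(row["class_of_service_key"], row["purchased_flown_group"],
          row["class_change_indicator"])
```

Storing the indicator in the dimension means BI applications filter on a single attribute instead of re-implementing the comparison for every pair of classes.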
NOTE In most cases, role-playing dimensions should be treated as separate logi-
cal dimensions created via views on a single physical table. In isolated situations,
it may make sense to combine the separate dimensions into a single dimension,
notably when the data volumes are extremely small or there is a need for additional
attributes that depend on the combined underlying roles for context and meaning.
Origin and Destination
Likewise, consider the pros and cons of combining the origin and destination airport
dimensions. In this situation the data volumes are more significant, so separate role-
playing origin and destination dimensions seem more practical. However, the busi-
ness users may need additional attributes that depend on the combination of origin
and destination. In addition to accessing the characteristics of each airport, business
users also want to analyze flight activity data by the distance between the city-pair
airports, as well as the type of city pair (such as domestic or trans-Atlantic). Even
the seemingly simple question regarding the total activity between San Francisco
(SFO) and Denver (DEN), regardless of whether the flights originated in SFO or
DEN, presents some challenges with separate origin and destination dimensions.
SQL experts could surely answer the question programmatically with separate air-
port dimensions, but what about the less empowered? Even if experts can derive
the correct answer, there’s no standard label for the nondirectional city-pair route.
Some reporting applications may label it SFO-DEN, whereas others might opt for
DEN-SFO, San Fran-Denver, Den-SF, and so on. Rather than embedding inconsis-
tent labels in BI reporting application code, the attribute values should be stored
in a dimension table, so common standardized labels can be used throughout the
organization. It would be a shame to go to the bother of creating a data warehouse
and then allow application code to implement inconsistent reporting labels. The
business sponsors of the DW/BI system won’t tolerate that for long.
To satisfy the need to access additional city-pair route attributes, you have two
options. One is merely to add another dimension to the fact table for the city-pair
route descriptors, including the directional route name, nondirectional route name,
type, and distance, as shown in Figure 12-7. The other alternative is to combine
the origin and destination airport attributes, plus the supplemental city-pair route
attributes, into a single dimension. Theoretically, the combined dimension could
have as many rows as the Cartesian product of all the origin and destination air-
ports. Fortunately, in real life the number of rows is much smaller than this theo-
retical limit because airlines don’t operate flights between every airport where they
have a presence. However, with a couple dozen attributes about the origin airport,
plus a couple dozen identical attributes about the destination airport, along with
attributes about the route, you would probably be more tempted to treat them as
separate dimensions.
City-Pair  Directional  Non-Directional  Route Distance  Route Distance        Dom-Intl       Transocean
Route Key  Route Name   Route Name       in Miles        Band                  Ind            Ind
1          BOS-JFK      BOS-JFK          191             Less than 200 miles   Domestic       Non-Oceanic
2          JFK-BOS      BOS-JFK          191             Less than 200 miles   Domestic       Non-Oceanic
3          BOS-LGW      BOS-LGW          3,267           3,000 to 3,500 miles  International  Transatlantic
4          LGW-BOS      BOS-LGW          3,267           3,000 to 3,500 miles  International  Transatlantic
5          BOS-NRT      BOS-NRT          6,737           More than 6,000 miles International  Transpacific
6          NRT-BOS      BOS-NRT          6,737           More than 6,000 miles International  Transpacific
Figure 12-7: City-pair route dimension sample rows.
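A small Python sketch shows why a stored nondirectional route name simplifies the SFO and DEN question. The labeling convention used here (airport codes in alphabetical order) is an assumption that happens to match the sample rows; the function names and activity figures are made up for illustration.

```python
from collections import defaultdict

def directional_route(origin, destination):
    """Label that preserves the direction of travel."""
    return f"{origin}-{destination}"

def nondirectional_route(origin, destination):
    """Label shared by both directions of a city pair.

    Assumed convention: airport codes in alphabetical order.
    """
    first, second = sorted((origin, destination))
    return f"{first}-{second}"

# Hypothetical segment activity: (origin, destination, passengers).
activity = [
    ("SFO", "DEN", 120),
    ("DEN", "SFO", 95),
    ("BOS", "JFK", 60),
    ("JFK", "BOS", 70),
]

# Total activity per city pair, regardless of direction.
totals = defaultdict(int)
for origin, destination, passengers in activity:
    totals[nondirectional_route(origin, destination)] += passengers

print(dict(totals))
```

In a warehouse the label would be computed once when the dimension is loaded, so every report groups on the same stored value instead of deriving its own.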
Sometimes designers suggest using a bridge table containing the origin and
destination airport keys to capture the route information. Although the origin
and destination represent a many-to-many relationship, in this case, you can
cleanly represent the relationship within the existing fact table rather than
using a bridge.
More Date and Time Considerations
From the earliest chapters in this book we’ve discussed the importance of having a
verbose date dimension, whether at the individual day, week, or month granular-
ity, that contains descriptive attributes about the date and private labels for fiscal
periods and work holidays. In this final section, we’ll introduce several additional
considerations for dealing with date and time dimensions.
Country-Specific Calendars as Outriggers
If the DW/BI system serves multinational needs, you must generalize the standard
date dimension to handle multinational calendars in an open-ended number of coun-
tries. The primary date dimension contains generic calendar attributes about the date,
regardless of the country. If your multinational business spans Gregorian, Hebrew,
Islamic, and Chinese calendars, you would include four sets of days, months, and
years in this primary dimension.
Country-specific date dimensions supplement the primary date table. The key to
the supplemental dimension is the primary date key, along with the country code.
The table would include country-specific date attributes, such as holiday or season
names, as illustrated in Figure 12-8. This approach is similar to the handling of
multiple fiscal accounting calendars, as described in Chapter 7: Accounting.
Fact
    Date Key (FK)
    More FKs ...
    Facts ...
Date Dimension
    Date Key (PK)
    Date
    Day of Week
    Day Number in Epoch
    Week Number in Epoch
    Month Number in Epoch
    Day Number in Calendar Month
    Day Number in Calendar Year
    Day Number in Fiscal Month
    Last Day in Fiscal Month Indicator
    Calendar Month
    Calendar Month Number in Year
    Calendar Year-Month (YYYY-MM)
    Calendar Quarter
    Calendar Year-Quarter
    Calendar Year
    Fiscal Month
    Fiscal Month Number in Year
    Fiscal Year-Month
    Fiscal Quarter
    Fiscal Year-Quarter
    Fiscal Year
    ...
Country-Specific Date Outrigger
    Date Key (FK)
    Country Key (FK)
    Country Name
    Civil Name
    Civil Holiday Flag
    Civil Holiday Name
    Religious Holiday Flag
    Religious Holiday Name
    Weekday Indicator
    Season Name
Figure 12-8: Country-specific calendar outrigger.
You can join this table to the main calendar dimension as an outrigger or directly
to the fact table. If you provide an interface that requires the user to specify a coun-
try name, then the attributes of the country-specific supplement can be viewed as
logically appended to the primary date table, allowing users to view the calendar
through the eyes of a single country at a time. Country-specific calendars can be
messy to build in their own right; things get even more complicated if you need to
deal with local holidays that occur on different days in different parts of a country.
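The outrigger join can be illustrated with a minimal sqlite3 sketch in Python. The table layout follows the idea of a composite key made of the date key and a country code, but the specific column names, the sample rows, and the calendar_for helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key    INTEGER PRIMARY KEY,
    full_date   TEXT,
    day_of_week TEXT
);
CREATE TABLE country_date_outrigger (
    date_key           INTEGER REFERENCES date_dim(date_key),
    country_key        TEXT,
    country_name       TEXT,
    civil_holiday_flag TEXT,
    civil_holiday_name TEXT,
    PRIMARY KEY (date_key, country_key)
);
INSERT INTO date_dim VALUES (20130704, '2013-07-04', 'Thursday');
INSERT INTO country_date_outrigger VALUES
    (20130704, 'US', 'United States', 'Holiday', 'Independence Day'),
    (20130704, 'FR', 'France', 'Non-Holiday', 'Not Applicable');
""")

def calendar_for(country_name):
    """View the calendar through the eyes of a single country."""
    return conn.execute(
        """
        SELECT d.full_date, d.day_of_week,
               c.civil_holiday_flag, c.civil_holiday_name
        FROM date_dim d
        JOIN country_date_outrigger c ON c.date_key = d.date_key
        WHERE c.country_name = ?
        """,
        (country_name,),
    ).fetchall()

print(calendar_for("United States"))
print(calendar_for("France"))
```

Constraining on a single country keeps the join from returning one row per country for each date, which would otherwise overcount any facts joined through the date key.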
Date and Time in Multiple Time Zones
When operating in multiple countries or even just multiple time zones, you’re faced
with a quandary concerning transaction dates and times. Do you capture the date and
time relative to local midnight in each time zone, or do you express the time period
relative to a standard, such as the corporate headquarters date/time, Greenwich Mean
Time (GMT), or Coordinated Universal Time (UTC), also known as Zulu time in the
aviation world? To fully satisfy users’ requirements, the correct answer is probably
both. The standard time enables you to see the simultaneous nature of transactions
across the business, whereas the local time enables you to understand transaction
timing relative to the time of day.
Contrary to popular belief, there are more than 24 time zones (corresponding
to the 24 hours of the day) in the world, and their boundaries do not follow
longitude neatly. For example, there is a single time zone in China despite its
longitudinal span. Likewise, there is a single time zone in India, offset from UTC
by 5.5 hours. In Australia, there are three time zones, with the Central time zone
offset by one-half hour. Meanwhile, Nepal and some other nations use one-quarter
hour offsets. The situation gets even more unpleasant when you account
for switches to and from daylight saving time.
Given the complexities, it’s unreasonable to think that merely providing a UTC
offset in a fact table can support equivalized dates and times. Likewise, the offset
can’t reside in a time or airport dimension table because the offset depends on both
location and date. The recommended approach for expressing dates and times in
multiple time zones is to include separate date and time-of-day dimensions corre-
sponding to the local and equivalized dates, as shown in Figure 12-9. The time-of-day
dimensions, as discussed in Chapter 3: Retail Sales, support time period groupings
such as shift numbers or rush period time block designations.
Flight Activity Fact
    Departure Date Key (FK)
    GMT Departure Date Key (FK)
    Departure Time-of-Day Key (FK)
    GMT Departure Time-of-Day Key (FK)
    More FKs ...
    Degenerate Dimensions ...
    Facts ...
Dimensions joined to the fact table
    Date Dimension (2 views for roles)
    Time-of-Day Dimension (2 views for roles)
Figure 12-9: Local and equivalized date/time across time zones.
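The four keys in the figure could be derived during the load roughly as follows. This Python sketch assumes the UTC offset in effect at the departure airport on the departure date has already been looked up and is passed in, that date keys are integers of the form YYYYMMDD, and that time-of-day keys count minutes since midnight; all of those choices, and the function names, are illustrative.

```python
from datetime import datetime, timedelta, timezone

def date_key(dt):
    """Assumed date key format: integer YYYYMMDD."""
    return dt.year * 10000 + dt.month * 100 + dt.day

def time_of_day_key(dt):
    """Assumed time-of-day key: minutes since midnight (0 to 1439)."""
    return dt.hour * 60 + dt.minute

def departure_keys(local_departure, utc_offset):
    """Return the local and equivalized (UTC) keys for one departure.

    utc_offset is the offset in effect at that place on that date; it is
    supplied by the caller because it depends on both location and date.
    """
    local = local_departure.replace(tzinfo=timezone(utc_offset))
    utc = local.astimezone(timezone.utc)
    return {
        "departure_date_key": date_key(local),
        "gmt_departure_date_key": date_key(utc),
        "departure_time_of_day_key": time_of_day_key(local),
        "gmt_departure_time_of_day_key": time_of_day_key(utc),
    }

# Hypothetical departure at 01:30 local time in a zone 5.5 hours ahead of UTC.
keys = departure_keys(datetime(2013, 3, 10, 1, 30),
                      timedelta(hours=5, minutes=30))
print(keys)
```

In this example the local and equivalized dates differ by a day, which is exactly the case a single stored offset cannot express for date-based rollups.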
Localization Recap
We have discussed the challenges of international DW/BI systems in several chapters
of the book. In addition to the international time zones and calendars discussed in the
previous two sections, we have also talked about multi-currency reporting in Chapter
6 and multi-language support in Chapter 8: Customer Relationship Management.
All these database-centric techniques fall under the general theme of localiza-
tion. Localization in the larger sense also includes the translation of user interface
text embedded in BI tools. BI tool vendors implement this form of localization with text
databases containing all the text prompts and labels needed by the tool, which can
then be configured for each local environment. Of course, this can become quite
complicated because text translated from English to most European languages results
in text strings that are longer than their English equivalents, which may force a
redesign of the BI application. Also, Arabic text reads from right to left, and many
Asian languages are completely different.
A serious international DW/BI system built to serve business users in many
countries needs to be thoughtfully designed to account for a selected set of these
localization issues. But perhaps it is worth thinking about how airport control tow-
ers and airplane pilots around the world deal with language incompatibilities when
communicating critical messages about flight directions and altitudes. They all use
one language (English) and unit of measure (feet).
Summary
In this chapter we turned our attention to airline trips or routes; we briefly touched
on similar scenarios drawn from the shipping and travel services industries. We
examined the situation in which we have multiple fact tables at multiple granularities
with multiple grain-specific facts. We also discussed the possibility of combining
dimensions into a single dimension table for cases in which the row count volumes
are extremely small or when there are additional attributes that depend on the com-
bined dimensions. Again, combining correlated dimensions should be viewed as the
exception rather than the rule.
We wrapped up this chapter by discussing several date and time dimension tech-
niques, including country-specific calendar outriggers and the handling of absolute
and relative dates and times.
13 Education
We step into the world of an educational institution in this chapter, looking first
at the applicant pipeline as an accumulating snapshot. When accumulating
snapshot fact tables were introduced in Chapter 4: Inventory, a product movement
pipeline illustrated the concept; order fulfillment workflows were captured in an
accumulating snapshot in Chapter 6: Order Management. In this chapter, rather than
watching products or orders move through various states, an accumulating snapshot
is used to monitor prospective student applicants as they progress through admis-
sions milestones.
The other primary concept discussed in this chapter is the factless fact table. We’ll
explore several case study illustrations drawn from higher education to further elabo-
rate on these special fact tables and discuss the analysis of events that didn’t occur.
Chapter 13 discusses the following concepts:
■ Example bus matrix snippet for a university or college
■ Applicant tracking and research grant proposals as accumulating snapshot
fact tables
■ Factless fact table for admission events, course registration, facilities
management, and student attendance
■ Handling of nonexistent events
University Case Study and Bus Matrix
In this chapter you’re working for a university, college, or other type of educational
institution. Someone at a higher education client once remarked that running a
university is akin to operating all the businesses needed to support a small vil-
lage. Universities are simultaneously a real estate property management company
(residential student housing), restaurant with multiple outlets (dining halls), retailer
(bookstore), events management and ticketing agency (athletics and speaker events),
police department (campus security), professional fundraiser (alumni development),
consumer financial services company (financial aid), investment firm (endowment
management), venture capitalist (research and development), job placement firm
(career planning), construction company (buildings and facilities maintenance), and
medical services provider (health clinic). In addition to these varied functions, higher
education institutions are obviously also focused on attracting high caliber students
and talented faculty to create a robust educational environment.
The bus matrix snippet in Figure 13-1 covers several core processes within an
educational institution. Traditionally, there has been less focus on revenue and profit
in higher education, but with ever-escalating costs and competition, universities
and colleges cannot ignore these financial metrics. They want to attract and retain
students who align with their academic and other institutional objectives. There’s
a strong interest in analyzing what students are “buying” in terms of courses each
term and the associated academic outcomes. Colleges and universities want to
understand many aspects of the student’s experience, along with maintaining an
ongoing relationship well beyond graduation.
Accumulating Snapshot Fact Tables
Chapter 4 used an accumulating snapshot fact table to track products identified by
serial or lot numbers as they move through various inventory stages in a warehouse.
Take a moment to recall the distinguishing characteristics of an accumulating snap-
shot fact table:
■ A single row represents the complete history of a workflow or pipeline
instance.
■ Multiple dates represent the standard pipeline milestone events.
■ The accumulating snapshot facts often include metrics corresponding to
each milestone, plus status counts and elapsed durations.
■ Each row is revisited and updated whenever the pipeline instance changes;
both foreign keys and measured facts may be changed during the fact row
updates.
Applicant Pipeline
Now envision these same accumulating snapshot characteristics as applied to the
prospective student admissions pipeline. For those who work in other industries,
there are obvious similarities to tracking job applicants through the hiring process
or sales prospects as they are qualified and become customers.
Common dimensions (matrix columns, left to right): Date/Term; Applicant-Student-Alum; Employee (Faculty, Staff); Course; Department; Facility; Account
Student Lifecycle Processes
    Admission Events XXX
    Applicant Pipeline XXX X
    Financial Aid Awards XXX X
    Student Enrollment/Profile Snapshot XXX XX
    Student Residential Housing XX XX
    Student Course Registration & Outcomes X X X X X X
    Student Course Instructor Evaluations XXXXXX
    Student Activities XX X
    Career Placement Activities XX X
    Advancement Contacts XXX
    Advancement Pledges & Gifts XXX X
Financial Processes
    Budgeting XXXX
    Endowment Tracking X XX
    GL Transactions X XX
    Payroll XXXX
    Procurement XXXX
Employee Management Processes
    Employee Headcount Snapshot X X XX
    Employee Hiring & Separations XXX
    Employee Benefits & Compensation X X X
    Staff Performance Management XXX
    Faculty Appointment Management X X X
    Research Proposal Pipeline XXX
    Research Expenditures XXXX
    Faculty Publications XXX
Administrative Processes
    Facilities Utilization X X XX
    Energy Consumption & Waste Management X XX
    Work Orders XXX XXX
Figure 13-1: Subset of bus matrix rows for educational institution.
In the case of applicant tracking, prospective students progress through a stan-
dard set of admissions hurdles or milestones. Perhaps you’re interested in tracking
activities around key dates, such as initial inquiry, campus visit, application submit-
ted, application file completed, admissions decision notification, and enrolled or
withdrawn. At any point in time, admissions and enrollment management analysts
are interested in how many applicants are at each stage in the pipeline. The process
is much like a funnel, where many inquiries enter the pipeline, but far fewer
progress through to the final stage. Admission personnel also would like to analyze the
applicant pool by a variety of characteristics.
The grain of the applicant pipeline accumulating snapshot is one row per prospec-
tive student; this granularity represents the lowest level of detail captured when the
prospect enters the pipeline. As more information is collected while the prospective
student progresses toward application, acceptance, and enrollment, you continue
to revisit and update the fact table row, as illustrated in Figure 13-2.
Applicant Pipeline Fact
    Initial Inquiry Date Key (FK)
    Campus Visit Date Key (FK)
    Application Submitted Date Key (FK)
    Application File Completed Date Key (FK)
    Admission Decision Notification Date Key (FK)
    Applicant Enroll-Withdraw Date Key (FK)
    Applicant Key (FK)
    Application Status Key (FK)
    Application ID (DD)
    Inquiry Count
    Campus Visit Count
    Application Submitted Count
    Application Completed Count
    Admit Early Decision Count
    Admit Regular Decision Count
    Waitlist Count
    Defer to Regular Decision Count
    Deny Count
    Enroll Early Decision Count
    Enroll Regular Decision Count
    Admit Withdraw Count
Date Dimension (views for 6 roles)
    Date Key (PK)
    ...
    Term
    Academic Year-Term
    Academic Year
Application Status Dimension
    Application Status Key
    Application Status Code
    Application Status Description
    Application Status Category
Applicant Dimension
    Applicant Key (PK)
    Applicant Name
    Applicant Address Attributes ...
    High School
    High School GPA
    High School Type
    SAT Math Score
    SAT Verbal Score
    SAT Writing Score
    ACT Composite Score
    Number of AP Credits
    Gender
    Date of Birth
    Ethnicity
    Full time-Part time Indicator
    Application Source
    Intended Major
    ...
Figure 13-2: Student applicant pipeline as an accumulating snapshot.
Like earlier accumulating snapshots, there are multiple dates in the fact table
corresponding to the standard milestone events. You want to analyze the prospect’s
progress by these dates to determine the pace of movement through the pipeline and
spot bottlenecks. This is especially important if you see a significant lag involving
a candidate whom you’re interested in recruiting. Each of these dates is treated as a
role-playing dimension, with a default surrogate key to handle the unknown dates
for new and in-process rows.
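The update pattern for an accumulating snapshot row can be sketched in Python as follows. The sketch assumes integer date keys of the form YYYYMMDD and uses 0 as the default surrogate key for milestones that have not happened yet; both conventions, along with the function names, are assumptions for illustration only.

```python
from datetime import date

UNKNOWN_DATE_KEY = 0  # assumed default surrogate key for unknown dates

MILESTONES = (
    "initial_inquiry",
    "campus_visit",
    "application_submitted",
    "application_file_completed",
    "admission_decision_notification",
    "enroll_withdraw",
)

def to_date_key(d):
    return d.year * 10000 + d.month * 100 + d.day

def from_date_key(key):
    return date(key // 10000, key // 100 % 100, key % 100)

def new_pipeline_row(applicant_key, inquiry_date):
    """Create the single fact row when the prospect enters the pipeline."""
    row = {"applicant_key": applicant_key, "inquiry_count": 1}
    for milestone in MILESTONES:
        row[milestone + "_date_key"] = UNKNOWN_DATE_KEY
    row["initial_inquiry_date_key"] = to_date_key(inquiry_date)
    return row

def record_milestone(row, milestone, when):
    """Revisit the existing row and overwrite one milestone date key."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row[milestone + "_date_key"] = to_date_key(when)

def elapsed_days(row, start, end):
    """Days between two milestones, or None if either is still unknown."""
    start_key = row[start + "_date_key"]
    end_key = row[end + "_date_key"]
    if UNKNOWN_DATE_KEY in (start_key, end_key):
        return None
    return (from_date_key(end_key) - from_date_key(start_key)).days

row = new_pipeline_row(applicant_key=1001, inquiry_date=date(2013, 9, 15))
record_milestone(row, "application_submitted", date(2013, 11, 1))
print(elapsed_days(row, "initial_inquiry", "application_submitted"))
print(elapsed_days(row, "initial_inquiry", "campus_visit"))
```

Because every milestone always holds a valid key, joins to the date dimension views never drop in-process rows, and lag calculations simply skip milestones that have not occurred.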
The applicant dimension contains many interesting attributes about prospective
students. Analysts are interested in slicing and dicing by applicant characteristics
such as geography, incoming credentials (grade point average, college admissions test
scores, advanced placement credits, and high school), gender, date of birth, ethnicity,
preliminary major, application source, and a multitude of others. Analyzing these char-
acteristics at various stages of the pipeline can help admissions personnel adjust their
strategies to encourage more (or fewer) students to proceed to the next mile marker.
The facts in the applicant pipeline fact table include a variety of counts that are
closely monitored by admissions personnel. If available, this table could include esti-
mated probabilities that the prospect will apply and subsequently enroll if accepted
to predict admission yields.
Alternative Applicant Pipeline Schemas
Accumulating snapshots are appropriate for short-lived processes that have a defined
beginning and end, with standard intermediate milestones. This type of fact table
enables you to see an updated status and ultimately final disposition of each appli-
cant. However, because accumulating snapshot rows are updated, they do not pre-
serve applicant counts and statuses at critical points in the admissions calendar,
such as the early decision notification date. Given the close scrutiny of these num-
bers, analysts might also want to retain snapshots at several important cut-off dates.
Alternatively, you could build an admission transaction fact table with one row per
transaction per applicant for counting and period-to-period comparisons.
Research Grant Proposal Pipeline
The research proposal pipeline is another education-based example of an accumu-
lating snapshot. Faculty and administration are interested in viewing the lifecycle
of a grant proposal as it progresses through the pipeline from preliminary proposal
to grant approval and award receipt. This would support analysis of the number of
outstanding proposals in each stage of the pipeline by faculty, department, research
topic area, or research funding source. Likewise, you could see success rates by
various attributes. Having this information in a common repository would allow
it to be leveraged by a broader university population.
Factless Fact Tables
So far we’ve largely designed fact tables with very similar structures. Each fact table
typically has 5 to approximately 20 foreign key columns, followed by one to poten-
tially several dozen numeric, continuously valued, preferably additive facts. The
facts can be regarded as measurements taken at the intersection of the dimension
key values. From this perspective, the facts are the justification for the fact table,
and the key values are simply administrative structure to identify the facts.
There are, however, a number of business processes whose fact tables are simi-
lar to those we’ve been designing with one major distinction. There are no mea-
sured facts! We introduced factless fact tables while discussing promotion events
in Chapter 3: Retail Sales, as well as in Chapter 6 to describe sales rep/customer
assignments. There are numerous examples of factless events in higher education.
Admissions Events
You can envision a factless fact table to track each prospective student’s attendance
at an admission event, such as a high school visit, college fair, alumni interview, or
campus overnight, as illustrated in Figure 13-3.
Admissions Event Attendance Fact
  Admissions Event Date Key (FK)
  Planned Enroll Term Key (FK)
  Applicant Key (FK)
  Applicant Status Key (FK)
  Admissions Officer Key (FK)
  Admission Event Key (FK)
  Admissions Event Attendance Count (=1)
Joined dimensions: Admissions Event Date Dimension, Planned Enroll Term Dimension,
Applicant Dimension, Application Status Dimension, Admission Event Dimension,
Admissions Officer Dimension
Figure 13-3: Admission event attendance as a factless fact table.
Course Registrations
Similarly, you can track student course registrations by term using a factless fact
table. The grain would be one row for each registered course by student and term,
as illustrated in Figure 13-4.
Term Dimension
In this fact table, the data is at the term level rather than at the more typical cal-
endar day, week, or month granularity. The term dimension still should conform
to the calendar date dimension. In other words, each date in the daily calendar
dimension should identify the term (for example, Fall), term and academic year
(for example, Fall 2013), and academic year (for example, 2013-2014). The column
labels and values must be identical for the attributes common to both the calendar
date and term dimensions.
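One way to keep the term dimension conformed is to derive it directly from the calendar date dimension, so the shared attributes cannot drift apart. The sketch below assumes hypothetical table names (date_dim, term_dim) and sample dates; the attribute names follow the Term, Academic Year-Term, and Academic Year columns described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
  date_key INTEGER PRIMARY KEY, full_date TEXT,
  term TEXT, academic_year_term TEXT, academic_year TEXT);
INSERT INTO date_dim VALUES
  (20130903, '2013-09-03', 'Fall',   'Fall 2013',   '2013-2014'),
  (20130904, '2013-09-04', 'Fall',   'Fall 2013',   '2013-2014'),
  (20140114, '2014-01-14', 'Spring', 'Spring 2014', '2013-2014');

-- The term dimension reuses the same column labels and values as the date dimension.
CREATE TABLE term_dim (
  term_key INTEGER PRIMARY KEY,
  term TEXT, academic_year_term TEXT, academic_year TEXT);
INSERT INTO term_dim (term, academic_year_term, academic_year)
  SELECT term, academic_year_term, academic_year
  FROM date_dim
  GROUP BY term, academic_year_term, academic_year;
""")

terms = conn.execute(
    "SELECT term, academic_year_term, academic_year FROM term_dim"
).fetchall()

# Conformance check: every term row must match attribute values in the date dimension.
mismatches = conn.execute("""
SELECT COUNT(*) FROM term_dim t
WHERE NOT EXISTS (
  SELECT 1 FROM date_dim d
  WHERE d.term = t.term
    AND d.academic_year_term = t.academic_year_term
    AND d.academic_year = t.academic_year)
""").fetchone()[0]
print(sorted(terms), mismatches)
```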
Student Dimension and Change Tracking
The student dimension is an expanded version of the applicant dimension discussed
in the first scenario. You still want to retain some information garnered from the
application process (for example, geography, credentials, and intended major) but
supplement it with on-campus information, such as part-time or full-time status,
residence, athletic involvement indicator, declared major, and class level status (for
example, sophomore).
Course Registration Event Fact
  Term Key (FK)
  Student Key (FK)
  Course Key (FK)
  Instructor Key (FK)
  Course Registration Count (=1)
Term Dimension
  Term Key (PK)
  Term
  Academic Year-Term
  Academic Year
Student Dimension
  Student Key (PK)
  Student ID (NK)
  ...
Course Dimension
  Course Key (PK)
  Course Name
  Course Department
  Course Format
  Course Credit Hours
Instructor Dimension
  Instructor Key (PK)
  Instructor Employee ID (NK)
  Instructor Name
  Instructor Address Attributes...
  Instructor Type
  Instructor Tenure Indicator
  Instructor Original Hire Date
  Instructor Years of Service
Figure 13-4: Course registration events as a factless fact table.
As discussed in Chapter 5: Procurement, you could imagine placing some of
these attributes in a type 4 mini-dimension because factions throughout the uni-
versity are interested in tracking changes to them, especially for declared major,
class level, and graduation attainment. People in administration and academia are
keenly interested in academic progress and retention rates by class, school, depart-
ment, and major. Alternatively, if there’s a strong demand to preserve the students’
profiles at the time of course registration, plus filter and group by the students’
current characteristics, you should consider handling the student information as
a slowly changing dimension type 7 with dual student dimension keys in the fact
table, as also described in Chapter 5. The surrogate student key would link to a
dimension table with type 2 attributes; the student’s durable identifier would link
to a view of the complete student dimension containing only the current row for
each student.
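The dual-key arrangement can be sketched as follows. This is an illustrative example only: the column names (student_durable_key, is_current), the view name, and the sample majors are assumptions, and a production type 2 dimension would normally also carry row effective and expiration dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Type 2 student dimension: each profile change adds a row with a new surrogate key.
CREATE TABLE student_dim (
  student_key INTEGER PRIMARY KEY,
  student_durable_key INTEGER,
  declared_major TEXT,
  is_current INTEGER);
INSERT INTO student_dim VALUES
  (1, 100, 'Undeclared',  0),
  (2, 100, 'Finance',     1),
  (3, 200, 'Engineering', 1);

-- View containing only the current row for each student, keyed by the durable identifier.
CREATE VIEW current_student_dim AS
  SELECT student_durable_key, declared_major AS current_declared_major
  FROM student_dim
  WHERE is_current = 1;

-- The factless fact table carries both student keys.
CREATE TABLE course_registration_fact (
  term_key INTEGER, student_key INTEGER,
  student_durable_key INTEGER, course_key INTEGER);
INSERT INTO course_registration_fact VALUES
  (1, 1, 100, 10),
  (2, 2, 100, 11),
  (2, 3, 200, 10);
""")

# Registrations grouped by the major in effect when the student registered.
as_was = conn.execute("""
SELECT s.declared_major, COUNT(*)
FROM course_registration_fact f
JOIN student_dim s ON s.student_key = f.student_key
GROUP BY s.declared_major ORDER BY s.declared_major
""").fetchall()

# The same registrations grouped by each student's current major.
as_is = conn.execute("""
SELECT c.current_declared_major, COUNT(*)
FROM course_registration_fact f
JOIN current_student_dim c ON c.student_durable_key = f.student_durable_key
GROUP BY c.current_declared_major ORDER BY c.current_declared_major
""").fetchall()
print(as_was, as_is)
```

The first query reports history as it was recorded; the second restates all history under today's profile, which is the pairing the type 7 technique is meant to provide.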
Artificial Count Metric
A fact table represents the robust set of many-to-many relationships among dimen-
sions; it records the collision of dimensions at a point in time and space. This course
registration fact table could be queried to answer a number of interesting questions
regarding registration for the college’s academic offerings, such as: Which students
registered for which courses? How many declared engineering majors are taking an
out-of-major finance course? How many students have registered for a given faculty
member’s courses during the last three years? How many students have registered
for more than one course from a given faculty member? The only peculiarity in
these examples is that you don’t have a numeric fact tied to this registration data.
As such, analyses of this data will be based largely on counts.
NOTE Events are modeled as fact tables containing a series of keys, each
representing a participating dimension in the event. Event tables sometimes have
no variable measurement facts associated with them and hence are called factless
fact tables.
The SQL for performing counts in this factless fact table is asymmetric because of
the absence of any facts. When counting the number of registrations for a faculty
member, any key can be used as the argument to the COUNT function. For example:
select faculty, count(term_key)... group by faculty
This gives the simple count of the number of student registrations by faculty,
subject to any constraints that may exist in the WHERE clause. An oddity of SQL is
that you can count any key and still get the same answer because you are counting
the number of keys that fly by the query, not their distinct values. You would need
to use a COUNT DISTINCT if you want to count the unique instances of a key rather
than the number of keys encountered.
The inevitable confusion surrounding the SQL statement, although not a serious
semantic problem, causes some designers to create an artificial implied fact, perhaps
called course registration count (as opposed to “dummy”), that is always populated
by the value 1. Although this fact does not add any information to the fact table, it
makes the SQL more readable, such as:
select faculty, sum(registration_count)... group by faculty
At this point the table is no longer strictly factless, but the “1” is nothing more
than an artifact. The SQL will be a bit cleaner and more expressive with the regis-
tration count. Some BI query tools have an easier time constructing this query with
a few simple user gestures. More important, if you build a summarized aggregate
table above this fact table, you need a real column to roll up to meaningful aggre-
gate registration counts. And finally, if deploying to an OLAP cube, you typically
include an explicit count column (always equal to 1) for complex counts because
the dimension join keys are not explicitly revealed in a cube.
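The counting behavior described above can be checked with a small experiment. The sketch below uses a hypothetical, abbreviated registration fact table (instructor_key stands in for the faculty grouping, and the sample rows are invented) to compare COUNT on two different keys, COUNT DISTINCT, and a SUM over the artificial count column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_registration_fact (
  term_key INTEGER, student_key INTEGER, course_key INTEGER,
  instructor_key INTEGER,
  registration_count INTEGER DEFAULT 1);  -- artificial fact, always 1
INSERT INTO course_registration_fact
  (term_key, student_key, course_key, instructor_key) VALUES
  (1, 101, 10, 7),
  (1, 102, 10, 7),
  (2, 101, 11, 7),
  (2, 103, 12, 8);
""")

by_instructor = conn.execute("""
SELECT instructor_key,
       COUNT(term_key)             AS count_of_term_keys,
       COUNT(student_key)          AS count_of_student_keys,
       COUNT(DISTINCT student_key) AS distinct_students,
       SUM(registration_count)     AS registrations
FROM course_registration_fact
GROUP BY instructor_key
ORDER BY instructor_key
""").fetchall()
for row in by_instructor:
    print(row)
```

For instructor 7, counting either key and summing the count column all return 3 registrations, whereas COUNT DISTINCT returns 2 because student 101 registered twice.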
If a measurable fact does surface during the design, it can be added to the schema,
assuming it is consistent with the grain of student course registrations by term. For
example, you could add tuition revenue, earned credit hours, and grade scores to
this fact table, but then it’s no longer a factless fact table.
Multiple Course Instructors
If courses are taught by a single instructor, you can associate an instructor key to
the course registration events, as shown in Figure 13-4. However, if some courses
are co-taught, then the instructor is a dimension attribute that takes on multiple
values for the fact table’s declared grain. You have several options:
■ Alter the grain of the fact table to be one row per instructor per course reg-
istration per student per term. Although this would address the multiple
instructors associated with a course, it’s an unnatural granularity that would
be extremely prone to overstated registration count errors.
■ Add a bridge table with an instructor group key in either the fact table or as
an outrigger on the course dimension, as introduced in Chapter 8: Customer
Relationship Management. There would be one row in this table for each instruc-
tor who teaches courses alone. In addition, there would be one row per member of
each instructor team (two rows for a two-person team); these rows would associate
the same group key with the individual instructor keys. The concatenation of the group key and instructor
key would uniquely identify each bridge table row. As described in Chapter 10:
Financial Services, you could assign a weighting factor to each row in the bridge
if the teaching workload allocation is clearly defined. This approach would
be susceptible to the potential overstatement issues surrounding the bridge
table usage described in Chapter 10.
■ Concatenate the instructor names into a single, delimited attribute on the
course dimension, as discussed in Chapter 9: Human Resources Management.
This option enables users to easily label reports with a single dimension attri-
bute, but it would not support analysis of registration events by instructor
characteristics.
■ If one of the instructors is identified as the primary instructor, then her
instructor key could be handled as a single foreign key in the fact table,
joined to a dimension where the attributes were prefaced with “primary” for
differentiation.
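The bridge table option can be sketched as follows. All table names, column names, and the 50/50 weighting are illustrative assumptions; here the instructor group key is placed in the fact table, which is one of the two placements mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instructor_dim (instructor_key INTEGER PRIMARY KEY, instructor_name TEXT);
INSERT INTO instructor_dim VALUES (1, 'Instructor A'), (2, 'Instructor B');

-- Group key plus instructor key uniquely identifies each bridge row.
CREATE TABLE instructor_group_bridge (
  instructor_group_key INTEGER,
  instructor_key INTEGER,
  weighting_factor REAL,
  PRIMARY KEY (instructor_group_key, instructor_key));
INSERT INTO instructor_group_bridge VALUES
  (1, 1, 1.0),  -- Instructor A teaching alone
  (2, 2, 1.0),  -- Instructor B teaching alone
  (3, 1, 0.5),  -- team of A and B, workload split evenly (assumed)
  (3, 2, 0.5);

CREATE TABLE course_registration_fact (
  student_key INTEGER, course_key INTEGER,
  instructor_group_key INTEGER,
  registration_count INTEGER DEFAULT 1);
INSERT INTO course_registration_fact
  (student_key, course_key, instructor_group_key) VALUES
  (101, 10, 1), (102, 10, 1),   -- course taught by A alone
  (101, 20, 3), (103, 20, 3);   -- co-taught course
""")

per_instructor = conn.execute("""
SELECT i.instructor_name,
       SUM(f.registration_count)                      AS unweighted_count,
       SUM(f.registration_count * b.weighting_factor) AS weighted_count
FROM course_registration_fact f
JOIN instructor_group_bridge b ON b.instructor_group_key = f.instructor_group_key
JOIN instructor_dim i ON i.instructor_key = b.instructor_key
GROUP BY i.instructor_name
ORDER BY i.instructor_name
""").fetchall()

actual_registrations = conn.execute(
    "SELECT SUM(registration_count) FROM course_registration_fact").fetchone()[0]
print(per_instructor, actual_registrations)
```

Summing the unweighted column across instructors gives 6, which overstates the 4 actual registrations because the co-taught rows are counted once per instructor; the weighted column sums back to 4.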
Course Registration Periodic Snapshots
The grain of the fact table illustrated in Figure 13-4 is one row for each regis-
tered course by student and term. Some users at the college or university might be
interested in periodic snapshots of the course registration events at key academic
calendar dates, such as preregistration, start of the term, course drop/add deadline,
and end of the term. In this case, the fact table’s grain would be one row for each
student’s registered courses for a term per snapshot date.
Facility Utilization
The second type of factless fact table deals with coverage, which can be illustrated
with a facilities management scenario. Universities invest a tremendous amount
of capital in their physical plant and facilities. It would be helpful to understand
which facilities were being used for what purpose during every hour of the day
during each term. For example, which facilities were used most heavily? What
was the average occupancy rate of the facilities as a function of time of day?
Does utilization drop off significantly on Fridays when no one wants to attend
(or teach) classes?
Again, the factless fact table comes to the rescue. In this case you’d insert one
row in the fact table for each facility for standard hourly time blocks during each
day of the week during a term regardless of whether the facility is being used.
Figure 13-5 illustrates the schema.
Facility Utilization Fact
  Term Key (FK)
  Day of Week Key (FK)
  Time-of-Day Hour Key (FK)
  Facility Key (FK)
  Owner Department Key (FK)
  Assigned Department Key (FK)
  Utilization Status Key (FK)
  Facility Count (=1)
Term Dimension
Day of Week Dimension
Time-of-Day Hour Dimension
  Time-of-Day Hour Key (PK)
  Time-of-Day Hour
  Day Part Indicator
Facility Dimension
  Facility Key (PK)
  Facility Building Name - Room
  Facility Building Name
  Facility Building Address attributes...
  Facility Type
  Facility Floor
  Facility Square Footage
  Facility Capacity
  Projector Indicator
  Vent Indicator
  ...
Department Dimension (2 views for roles)
Utilization Status Dimension
Figure 13-5: Facilities utilization as a coverage factless fact table.
The facility dimension would include all types of descriptive attributes about the
facility, such as the building, facility type (for example, classroom, lab, or office),
square footage, capacity, and amenities (for example, whiteboard or built-in projec-
tor). The utilization status dimension would include a text descriptor with values
of Available or Utilized. Meanwhile, multiple organizations may be involved in
facilities utilization. For example, one organization might own the facility during
a time block, but the same or a different organization might be assigned as the
facility user.
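To illustrate the coverage pattern, the sketch below loads one row for every facility, day, and hourly block whether or not the room is in use, then derives an occupancy rate. For brevity it stores day of week and hour as plain columns instead of dimension keys and omits the department roles; those simplifications, the sample schedule, and all names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE utilization_status_dim (
  utilization_status_key INTEGER PRIMARY KEY, utilization_status TEXT);
INSERT INTO utilization_status_dim VALUES (1, 'Available'), (2, 'Utilized');

CREATE TABLE facility_utilization_fact (
  term_key INTEGER, day_of_week TEXT, hour INTEGER, facility_key INTEGER,
  utilization_status_key INTEGER,
  facility_count INTEGER DEFAULT 1);
""")

# Hypothetical schedule of (day, hour, facility) blocks that are actually in use.
scheduled = {("Mon", 9, 1), ("Mon", 9, 2), ("Mon", 10, 1), ("Fri", 9, 1)}

# Coverage: insert a row for every combination, used or not.
rows = [
    (1, day, hour, facility, 2 if (day, hour, facility) in scheduled else 1)
    for day in ("Mon", "Fri")
    for hour in (9, 10)
    for facility in (1, 2)
]
conn.executemany(
    "INSERT INTO facility_utilization_fact "
    "(term_key, day_of_week, hour, facility_key, utilization_status_key) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

occupancy = conn.execute("""
SELECT f.day_of_week,
       SUM(CASE WHEN s.utilization_status = 'Utilized'
                THEN f.facility_count ELSE 0 END) * 1.0
         / SUM(f.facility_count) AS occupancy_rate
FROM facility_utilization_fact f
JOIN utilization_status_dim s
  ON s.utilization_status_key = f.utilization_status_key
GROUP BY f.day_of_week
ORDER BY f.day_of_week
""").fetchall()
print(occupancy)
```

Because the unused blocks are present as Available rows, the denominator of the occupancy rate comes straight from the fact table rather than from a separate calendar of possible slots.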
Student Attendance
You can visualize a similar schema to track student attendance in a course. In this
case, the grain would be one row for each student who walks through the course’s
classroom door each day. This factless fact table would share a number of the same
dimensions discussed with registration events. The primary difference is that
the granularity is by calendar date in this schema rather than merely by term. This
dimensional model, illustrated in Figure 13-6, allows business users to answer
questions such as: Which courses were the most heavily attended? Which courses
suffered the least attendance attrition over the term? Which students attended which
courses? Which faculty member taught the most students?
Student Attendance Fact
  Date Key (FK)
  Student Key (FK)
  Course Key (FK)
  Instructor Key (FK)
  Facility Key (FK)
  Attendance Count
Joined dimensions: Date Dimension, Student Dimension, Course Dimension,
Instructor Dimension, Facility Dimension
Figure 13-6: Student attendance fact table.
Explicit Rows for What Didn’t Happen
Perhaps people are interested in monitoring students who were registered for a
course but didn’t show up. In this example you can envision adding explicit rows to
the fact table for attendance events that didn’t occur. The fact table would no longer
be factless as there is an attendance metric equal to either 1 or 0.
Adding rows is viable in this scenario because the non-attendance events have the
exactly the same dimensionality as the attendance events. Likewise, the fact table won’t
grow at an alarming rate, presuming (or perhaps hoping) the no-shows are a small
percentage of the total students registered for a course. Although this approach is
reasonable in this scenario, creating rows for events that didn’t happen is ridiculous
in many other situations, such as adding rows to a customer’s sales transaction for
promoted products that weren’t purchased by the customer.
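One possible way to generate the explicit non-attendance rows is to start from the registrations and class meeting dates and left join the observed attendance events. The source tables class_session and attendance_event, and all column names, are hypothetical; the instructor and facility keys are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_registration_fact (student_key INTEGER, course_key INTEGER);
INSERT INTO course_registration_fact VALUES (101, 10), (102, 10), (103, 10);

-- Hypothetical list of dates on which each course met.
CREATE TABLE class_session (date_key INTEGER, course_key INTEGER);
INSERT INTO class_session VALUES (20130903, 10), (20130905, 10);

-- Hypothetical record of students who actually walked through the door.
CREATE TABLE attendance_event (date_key INTEGER, student_key INTEGER, course_key INTEGER);
INSERT INTO attendance_event VALUES
  (20130903, 101, 10), (20130903, 102, 10), (20130903, 103, 10),
  (20130905, 101, 10), (20130905, 103, 10);

-- One row per registered student per class meeting; attendance_count is 1 or 0.
CREATE TABLE student_attendance_fact AS
SELECT cs.date_key, r.student_key, r.course_key,
       CASE WHEN a.student_key IS NULL THEN 0 ELSE 1 END AS attendance_count
FROM course_registration_fact r
JOIN class_session cs ON cs.course_key = r.course_key
LEFT JOIN attendance_event a
  ON a.date_key = cs.date_key
 AND a.student_key = r.student_key
 AND a.course_key = r.course_key;
""")

no_shows = conn.execute("""
SELECT date_key, student_key FROM student_attendance_fact
WHERE attendance_count = 0
""").fetchall()

by_date = conn.execute("""
SELECT date_key, SUM(attendance_count), COUNT(*)
FROM student_attendance_fact
GROUP BY date_key ORDER BY date_key
""").fetchall()
print(no_shows, by_date)
```

With the zero rows in place, absences become a simple filter, and SUM over the attendance count divided by the row count gives an attendance rate per class meeting.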
What Didn’t Happen with Multidimensional OLAP
Multidimensional OLAP databases do an excellent job of helping users understand
what didn’t happen. When the cube is constructed, multidimensional databases
handle the sparsity of the transaction data while minimizing the overhead burden
of storing explicit zeroes. As such, at least for fact cubes that are not too sparse, the
event and nonevent data is available for user analysis while reducing some of the
complexities just discussed in the relational star schema world.
More Educational Analytic Opportunities
Many of the business processes described in earlier chapters, such as procurement
and human resources, are obviously applicable to the university environment given
the desire to better monitor and manage costs. Research grants and alumni contri-
butions are key sources of revenue, in addition to the tuition revenue.
Research grant analysis is often a variation of financial analysis, as discussed in
Chapter 7: Accounting, but at a lower level of detail, much like a subledger. The grain
would include additional dimensions to further describe the research grant, such as
the corporate or governmental funding source, research topic, grant duration, and
faculty investigator. There is a strong need to better understand and manage the
budgeted and actual spending associated with each research project. The objective
is to optimize the spending so a surplus or deficit situation is avoided, and funds
are deployed where they will be most productive. Likewise, understanding research
spending rolled up by various dimensions is necessary to ensure proper institutional
control of such monies.
Better understanding the university’s alumni is much like better understanding
a customer base, as described in Chapter 8. Obviously, there are many interesting
characteristics that would be helpful in maintaining a relationship with your alumni,
such as geographic, demographic, employment, interests, and behavioral information,
in addition to the data you collected about them as students (for example, affiliations,
residential housing, school, major, length of time to graduate, and honors designa-
tions). Improved access to a broad range of attributes about the alumni population
would allow the institution to better target messages and allocate resources. In addi-
tion to alumni contributions, alumni relationships can be leveraged for potential
recruiting, job placement, and research opportunities. To this end, a robust CRM
operational system should track all the touch points with alumni to capture mean-
ingful data for the DW/BI analytic platform.
Summary
In this chapter we focused on two primary concepts. First, we looked at the accu-
mulating snapshot fact table to track application or research grant pipelines. Even
though the accumulating snapshot is used much less frequently than the more com-
mon transaction and periodic snapshot fact tables, it is very useful for tracking the
current status of a short-lived process with standard milestones. As we described,
accumulating snapshots are often complemented with transactional or periodic
snapshot tables.
Second, we explored several examples of factless fact tables. These fact tables
capture the relationship between dimensions in the case of an event or coverage,
but are unique in that no measurements are collected to serve as actual facts. We
also discussed the handling of situations in which you want to track events that
didn’t occur.
14 Healthcare
The healthcare industry is undergoing tremendous change as it seeks to improve
patient outcomes while simultaneously improving operational efficiencies.
The challenges are plentiful as organizations attempt to integrate their
clinical and administrative information. Healthcare data presents several interesting
dimensional design patterns that we’ll explore in this chapter.
Chapter 14 discusses the following concepts:
■ Example bus matrix snippet for a healthcare organization
■ Accumulating snapshot fact table to handle the claims billing and payment
pipeline
■ Dimension role playing for multiple dates and physicians
■ Multivalued dimensions, such as patient diagnoses
■ Supertype and subtype handling of healthcare charges
■ Treatment of textual comments
■ Measurement type dimension for sparse, heterogeneous measurements
■ Handling of images with dimensional schemas
■ Facility/equipment inventory utilization as transactions and periodic snapshots
Healthcare Case Study and Bus Matrix
In the face of unprecedented consumer focus and governmental policy regulations,
coupled with internal pressures, healthcare organizations need to leverage informa-
tion more effectively to impact both patient outcomes and operational efficiencies.
Healthcare organizations typically wrestle with many disparate systems to collect
their clinical, financial, and operational performance metrics. This information
needs to be better integrated to deliver more effective patient care, while concur-
rently managing costs and risks. Healthcare analysts want to better understand
which procedures deliver the best outcomes, while identifying opportunities to
impact resource utilization, including labor, facilities, and associated equipment
and supplies. Large healthcare consortiums with networks of physicians, clinics,
hospitals, pharmacies, and laboratories are focused on these requirements, espe-
cially as both the federal government and private payers are encouraging providers
to assume more responsibility for the quality and cost of their healthcare services.
Figure 14-1 illustrates a sample snippet of a healthcare organization’s bus matrix.
Conformed dimensions (matrix columns): Date, Patient, Physician, Employee,
Facility, Diagnosis, Procedure, Payer
Business processes (matrix rows):
Clinical Events
  Patient Encounter Workflow
  Procedures
  Physician Orders
  Medications
  Lab Test Results
  Disease/Case Management Participation
  Patient Reported Outcomes
  Patient Satisfaction Surveys
Billing/Revenue Events
  Inpatient Facility Charges
  Outpatient Professional Charges
  Claims Billing
  Claims Payments
  Collections and Write-Offs
Operational Events
  Bed Inventory Utilization
  Facilities Utilization
  Supply Procurement
  Supply Utilization
  Workforce Scheduling
Figure 14-1: Subset of bus matrix rows for a healthcare consortium.
Traditionally, healthcare insurance payers have leveraged claims information to
better understand their risk, improve underwriting policies, and detect potential
fraudulent activity. Payers have historically been more sophisticated than health-
care provider organizations in leveraging data analytically, perhaps in part because
their prime data source, claims, was more reliably captured and structured than
providers’ data. However, claims data is both a benefit and curse for payers’ analytic
efforts because it historically hasn’t provided the robust, granular clinical picture.
Increasingly, healthcare payers are partnering with providers to leverage detailed
patient information to support more predictive analysis. In many ways, the needs
and objectives of the providers and payers are converging, especially with the push
for shared-risk delivery models.
Every patient’s episode of care with a healthcare organization generates mounds
of information. Patient-centric transactional data falls into two prime categories:
administrative and clinical. The claims billing data provides detail on a patient bill
from a physician’s office, clinic, hospital, or laboratory. The clinical medical record,
on the other hand, is more comprehensive and includes not only the services result-
ing in charges, but also the laboratory test results, prescriptions, physician’s notes
or orders, and sometimes outcomes.
The issues of conforming common dimensions remain exactly the same for
healthcare as in other industries. Obviously, the most important conformed dimen-
sion is the patient. In Chapter 8: Customer Relationship Management, we described
the need for a 360-degree view of customers. It’s easy to argue that a 360-degree
view of patients is even more critical given the stakes; adoption of patient electronic
medical record (EMR) and electronic health record (EHR) systems clearly focuses on
this objective.
Other dimensions that must be conformed include:
■ Date
■ Responsible party
■ Employer
■ Health plan
■ Payer (primary and secondary)
■ Physician
■ Procedure
■ Equipment
■ Lab test
■ Medication
■ Diagnosis
■ Facility (office, clinic, outpatient facility, and hospital)
In the healthcare arena, some of these dimensions are hard to conform, whereas
others are easier than they look at first glance. The patient dimension has historically
been challenging, at least in the United States, because of the lack of a reliable national
identity number and/or consistent patient identifier across facilities and physicians.
To further complicate matters, the Health Insurance Portability and Accountability Act
(HIPAA) includes strict privacy and security requirements to protect the confidential
nature of patient information. Operational process improvements, like electronic
medical records, are ensuring more consistent master patient identification.
The diagnosis and treatment dimensions are considerably more structured and
predictable than you might expect because the insurance industry and government
have mandated their content. For example, diagnosis and disease classifications fol-
low the International Classification of Diseases (ICD) standard for consistent reporting.
Similarly, the Healthcare Common Procedure Coding System (HCPCS) is based on the
American Medical Association’s Current Procedural Terminology (CPT) to describe
medical, surgical, and diagnostic services, along with supplies and devices. Dentists
use the Current Dental Terminology (CDT) code set, which is updated and distributed
by the American Dental Association.
Finally, beyond integrated patient-centric clinical and financial information,
healthcare organizations also want to analyze operational information regarding
the utilization of their workforce, facilities, and supplies. Much of the discussion
from earlier chapters about human resources, inventory management, and procure-
ment processes is also applicable to healthcare organizations.
Claims Billing and Payments
Imagine you work in the healthcare consortium’s billing organization. You receive
the primary charges from the physicians and facilities, prepare bills for the respon-
sible payers, and track the progress of the claims payments received.
The dimensional model for the claims billing process must address a number of
business objectives. You want to analyze the billed dollar amounts by every avail-
able dimension, including patient, physician, facility, diagnosis, procedure, and
date. You want to see how these claims have been paid and what percentage of the
claims have not been collected. You want to see how long it takes to get paid, and
the current status of all unpaid claims.
As we discussed in Chapter 4: Inventory, whenever a source business process is consid-
ered for inclusion in the DW/BI system, there are three essential grain choices. Remember
the fact table’s granularity determines what constitutes a fact table row. In other words,
what is the measurement event being recorded?
The transaction grain is the most fundamental. In the healthcare billing example,
the transaction grain would include every billing transaction from the physicians
and facilities, as well as every claim payment transaction received. We’ll talk more
about these fact tables in a moment.
The periodic snapshot is the grain of choice for long-running time series, such
as bank accounts and insurance policies. However, the periodic snapshot doesn’t
do a good job of capturing the behavior of relatively short-lived processes, such as
orders or medical claims billing.
The accumulating snapshot grain is chosen to analyze the claims billing and pay-
ment workflow. A single fact table row represents a single line on a medical claim.
Furthermore, the row represents the accumulated history of the line item from the
moment of creation to the current state. When anything about the line changes, the row
is revisited and modified appropriately. From the point of view of the billing organiza-
tion, let’s assume the standard scenario of a claim includes:
■ Treatment date
■ Primary insurance billing date
■ Secondary insurance billing date
■ Responsible party billing date
■ Last primary insurance payment date
■ Last secondary insurance payment date
■ Last responsible party payment date
■ Zero balance date
These dates describe the normal claim workflow. An accumulating snapshot
does not attempt to fully describe unusual situations. Business users undoubt-
edly need to see all the details of messy claim payment scenarios because multiple
payments are sometimes received for a single line, or conversely, a single payment
sometimes applies to multiple claims. Companion transaction schemas inevitably
will be needed. In the meantime, the purpose of the accumulating snapshot grain
is to place every claim into a standard framework so that the analytic objectives
described earlier can be satisfied easily.
With a clear understanding that an individual fact table row represents the accu-
mulated history of a line item on a claim bill, you can identify the dimensions by
carefully listing everything known to be true in the context of this row. In this
hypothetical scenario, you know the patient, responsible party, physician, physi-
cian organization, procedure, facility, diagnosis, primary insurance organization,
secondary insurance organization, and master patient bill ID number, as shown in
Figure 14-2.
The interesting facts accumulated over the claim line’s history include the billed
amount, primary insurance paid amount, secondary insurance paid amount, respon-
sible party paid amount, total paid amount (calculated), amount sent to collections,
amount written off, amount remaining to be paid (calculated), length of stay, number
of days from billing to initial primary insurance, secondary insurance, and respon-
sible party payments, and finally, number of days to zero balance.
Claims Billing and Payment Workflow Fact
  Treatment Date Key (FK)
  Primary Insurance Billing Date Key (FK)
  Secondary Insurance Billing Date Key (FK)
  Responsible Party Billing Date Key (FK)
  Last Primary Insurance Payment Date Key (FK)
  Last Secondary Insurance Payment Date Key (FK)
  Last Responsible Party Payment Date Key (FK)
  Zero Balance Date Key (FK)
  Patient Key (FK)
  Physician Key (FK)
  Physician Organization Key (FK)
  Procedure Key (FK)
  Facility Key (FK)
  Primary Diagnosis Key (FK)
  Primary Insurance Organization Key (FK)
  Secondary Insurance Organization Key (FK)
  Responsible Party Key (FK)
  Employer Key (FK)
  Master Bill ID (DD)
  Billed Amount
  Primary Insurance Paid Amount
  Secondary Insurance Paid Amount
  Responsible Party Paid Amount
  Total Paid Amount
  Sent to Collections Amount
  Written Off Amount
  Unpaid Balance Amount
  Length of Stay
  Bill to Initial Primary Insurance Payment Lag
  Bill to Initial Secondary Insurance Payment Lag
  Bill to Initial Responsible Party Payment Lag
  Bill to Zero Balance Lag
Joined dimensions: Date Dimension (views for 8 roles), Patient Dimension,
Physician Dimension, Physician Organization Dimension, Procedure Dimension,
Facility Dimension, Primary Diagnosis Dimension, Insurance Organization Dimension
(views for 2 roles), Responsible Party Dimension, Employer Dimension
Figure 14-2: Accumulating snapshot fact table for medical claim billing and payment
workflow.
A row is initially created in this fact table when the charge transactions are
received from the physicians or facilities and the initial bills are generated. On a
given bill, perhaps the primary insurance company is billed, but the secondary
insurance and responsible party are not billed, pending a response from the pri-
mary insurance company. For a period of time after the row is first entered into
the fact table, the last seven dates are not applicable. Because the surrogate date
keys in the fact table must not be null, they will point to a date dimension row
reserved for a To Be Determined date.
In the weeks after creation of the row, some payments are received. Bills are then
sent to the secondary insurance company and responsible party. Each time these
events take place, the same fact table row is revisited, and the appropriate keys and
facts are destructively updated. This destructive updating poses some challenges
for the database administrator. If most of the accumulating rows stabilize and stop
changing within a given timeframe, a physical reorganization of the database at
that time can recover disk storage and improve performance. If the fact table is
partitioned on the treatment date key, the physical clustering or partitioning prob-
ably will be well preserved throughout these changes because the treatment date
is not revisited and changed.
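The lifecycle described above, where a row starts with To Be Determined date keys and is later overwritten in place, can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module; the table and column names, the choice of 0 as the reserved date key, and the sample amounts are all assumptions, and only a handful of the Figure 14-2 columns are included.

```python
import sqlite3

# Assumption: surrogate key 0 is the date dimension row reserved for
# "To Be Determined"; any reserved key would serve the same purpose.
TBD_DATE_KEY = 0

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
INSERT INTO date_dim VALUES (0, 'To Be Determined'),
                            (20130105, '2013-01-05'),
                            (20130212, '2013-02-12');

-- Hypothetical, heavily trimmed version of the Figure 14-2 fact table.
CREATE TABLE claim_workflow_fact (
    master_bill_id TEXT PRIMARY KEY,
    treatment_date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    last_primary_insurance_payment_date_key INTEGER NOT NULL
        REFERENCES date_dim(date_key),
    billed_amount REAL NOT NULL,
    primary_insurance_paid_amount REAL NOT NULL
);
""")

# Initial load: the bill exists, but no payment has arrived yet, so the
# payment date key points at the To Be Determined row instead of being null.
conn.execute(
    "INSERT INTO claim_workflow_fact VALUES (?, ?, ?, ?, ?)",
    ("BILL-001", 20130105, TBD_DATE_KEY, 500.0, 0.0),
)

# Weeks later a payment arrives: revisit the same row and overwrite it.
conn.execute(
    """
    UPDATE claim_workflow_fact
       SET last_primary_insurance_payment_date_key = ?,
           primary_insurance_paid_amount = primary_insurance_paid_amount + ?
     WHERE master_bill_id = ?
    """,
    (20130212, 350.0, "BILL-001"),
)

row = conn.execute(
    """
    SELECT d.full_date,
           f.primary_insurance_paid_amount,
           f.billed_amount - f.primary_insurance_paid_amount
      FROM claim_workflow_fact f
      JOIN date_dim d
        ON d.date_key = f.last_primary_insurance_payment_date_key
    """
).fetchone()
print(row)
```

Because the treatment date key is never touched by the update, a table partitioned on that key would keep the row in the same partition.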
Date Dimension Role Playing
Accumulating snapshot fact tables always involve multiple date stamps, like the eight
foreign keys pointing to the date dimension in Figure 14-2. The eight date foreign
keys should not join to a single instance of the date dimension table. Instead, create
eight views on the single underlying date dimension table, and join the fact table
separately to these eight views, as if they were eight independent date dimension
tables. The eight view definitions should cosmetically relabel the column names to
be distinguishable, so BI tools accessing the views present understandable column
names to the business users.
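As a rough sketch of this technique, the following example builds two of the eight role-playing views over a single date table and joins the fact table to each. It uses Python's built-in sqlite3 module; the view names, column names, and sample row are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT,
    month_name TEXT
);
INSERT INTO date_dim VALUES (20130105, '2013-01-05', 'January'),
                            (20130212, '2013-02-12', 'February');

-- One view per role; each relabels the columns so they stay
-- distinguishable when several roles appear in the same report.
CREATE VIEW treatment_date AS
    SELECT date_key   AS treatment_date_key,
           full_date  AS treatment_date,
           month_name AS treatment_month
      FROM date_dim;

CREATE VIEW zero_balance_date AS
    SELECT date_key   AS zero_balance_date_key,
           full_date  AS zero_balance_date,
           month_name AS zero_balance_month
      FROM date_dim;

CREATE TABLE claim_workflow_fact (
    master_bill_id TEXT,
    treatment_date_key INTEGER,
    zero_balance_date_key INTEGER,
    billed_amount REAL
);
INSERT INTO claim_workflow_fact VALUES ('BILL-001', 20130105, 20130212, 500.0);
""")

# The fact table joins to each view as if it were an independent dimension.
rows = conn.execute("""
    SELECT t.treatment_month, z.zero_balance_month, f.billed_amount
      FROM claim_workflow_fact f
      JOIN treatment_date t
        ON t.treatment_date_key = f.treatment_date_key
      JOIN zero_balance_date z
        ON z.zero_balance_date_key = f.zero_balance_date_key
""").fetchall()
print(rows)
```

The remaining six views would follow the same pattern, each with its own column prefix.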
Although the role-playing behavior of the date dimension is a common charac-
teristic of accumulating snapshot fact tables, other dimensions in Figure 14-2 play
roles in similar ways, such as the payer dimension. In the section “Supertypes and
Subtypes for Charges,” the physician dimension will play multiple roles depending
on whether the physician is the referring physician, attending physician, or working
in a consulting or assisting capacity.
Multivalued Diagnoses
Normally the dimensions surrounding a fact table take on a single value in the
context of the fact event. However, there are situations where multivaluedness is
natural and unavoidable. The diagnosis dimension in healthcare fact tables is a
good example. At the moment of a procedure or lab test, the patient has one or more
diagnoses. Electronic medical record applications facilitate the physician’s selection
of multiple diagnoses well beyond the historical practice of providing the minimal
coding needed for reimbursement; the result is a richer, more complete picture of
the severity of the patient’s medical condition. There is strong analytic incentive to
retain the multivalued diagnoses, along with the other financial performance data,
especially as organizations do more comparative utilization and cost benchmarking.
If there were always a maximum of three diagnoses, for instance, you might be
tempted to create three diagnosis foreign keys in the fact table with correspond-
ing dimensions, almost as if they were roles. However, diagnoses don’t behave like
independent roles. And unfortunately, there are often more than three diagnoses,
especially for hospitalized elderly patients who may present 20 simultaneous diag-
noses! Diagnoses don’t fit into well-defined roles other than potentially the primary
admitting and discharging diagnoses. Finally, a design with multiple diagnosis
foreign keys would make for very inefficient BI applications because the query
doesn’t know which dimensional slot to constrain for a particular diagnosis.
The design shown in Figure 14-3 handles the open-ended nature of multiple diag-
noses. The diagnosis foreign key in the fact table is replaced with a diagnosis group
key. This diagnosis group key is connected by a many-to-many join to a diagnosis
group bridge table, which contains a separate row for each individual diagnosis in
a particular group.
Claim Billing Line Item Fact
    More FKs ...
    Diagnosis Group Key (FK)
    Master Bill ID (DD)
    Facts ...

Diagnosis Group Bridge
    Diagnosis Group Key (FK)
    Diagnosis Key (FK)

Diagnosis Dimension
    Diagnosis Key (PK)
    Diagnosis Code (NK)
    Diagnosis Description
    Diagnosis Section Code
    Diagnosis Section Description
    Diagnosis Category Code
    Diagnosis Category Description
Figure 14-3: Bridge table to handle multivalued diagnoses.
If a patient has three diagnoses, he is assigned a diagnosis group with three cor-
responding rows in the bridge table. In Chapter 10: Financial Services, we described
the use of a weighting factor on each bridge table row to allocate the fact table’s
metrics accordingly. However, in the case of multiple patient diagnoses, it’s virtu-
ally impossible to weight their impact on a patient’s treatment or bill, beyond the
potential determination of a primary diagnosis. Without a realistic way of assigning
weighting factors, the analysis of diagnosis codes must largely focus on impact ques-
tions like “What is the total billed amount for procedures involving the diagnosis of
congestive heart failure?” Most healthcare analysts understand that impact analysis may
result in overcounting because the same metrics are associated with multiple diagnoses.
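To make the overcounting concrete, here is a small sketch of an impact query through an unweighted bridge table, using Python's built-in sqlite3 module. The table names, keys, and amounts are invented for illustration; the structure loosely follows Figure 14-3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis_dim (
    diagnosis_key INTEGER PRIMARY KEY,
    diagnosis_description TEXT
);
INSERT INTO diagnosis_dim VALUES (1, 'Congestive heart failure'),
                                 (2, 'Hypertension'),
                                 (3, 'Diabetes');

-- One bridge row per diagnosis in a group; no weighting factor.
CREATE TABLE diagnosis_group_bridge (
    diagnosis_group_key INTEGER,
    diagnosis_key INTEGER
);
INSERT INTO diagnosis_group_bridge VALUES (10, 1), (10, 2), (10, 3), (11, 2);

CREATE TABLE claim_line_fact (
    master_bill_id TEXT,
    diagnosis_group_key INTEGER,
    billed_amount REAL
);
INSERT INTO claim_line_fact VALUES ('BILL-001', 10, 500.0),
                                   ('BILL-002', 11, 200.0);
""")

# Impact report: each billed amount is counted once per diagnosis in its group.
impact = dict(conn.execute("""
    SELECT d.diagnosis_description, SUM(f.billed_amount)
      FROM claim_line_fact f
      JOIN diagnosis_group_bridge b
        ON b.diagnosis_group_key = f.diagnosis_group_key
      JOIN diagnosis_dim d
        ON d.diagnosis_key = b.diagnosis_key
     GROUP BY d.diagnosis_description
""").fetchall())

actual_total = conn.execute(
    "SELECT SUM(billed_amount) FROM claim_line_fact"
).fetchone()[0]

print(impact)                              # per-diagnosis impact totals
print(sum(impact.values()), actual_total)  # impact rows sum to more than billed
```

Each per-diagnosis figure is a legitimate answer to an impact question, but the rows of the report cannot be added together to recover the true billed total.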
NOTE Weighting factors in multivalued bridge tables provide an elegant way
to prorate numeric facts to produce correctly weighted reports. However, these
weighting factors are by no means required in a dimensional design. If there is no
agreement or enthusiasm within the business community for the weighting factors,
they should be left out. Also, in a schema with more than one multivalued dimen-
sion, it is not worth trying to decide how multiple weighting factors would interact.
If the many-to-many join in Figure 14-3 causes problems for a modeling tool that
insists on proper foreign-key-to-primary-key relationships, the equivalent design
of Figure 14-4 can be used. In this case an extra table whose primary key is a diag-
nosis group is inserted between the fact and bridge tables. There is likely no new
information in this extra table, unless there were labels for a cluster of diagnoses,
such as the Kimball Syndrome, but now both the fact table and bridge table have
conventional many-to-one joins in all directions.
Claim Billing Line Item Fact
    Foreign Keys ...
    Diagnosis Group Key (FK)
    Master Bill ID (DD)
    Facts ...

Diagnosis Group Dimension
    Diagnosis Group Key (PK)

Diagnosis Group Bridge
    Diagnosis Group Key (FK)
    Diagnosis Key (FK)

Diagnosis Dimension
    Diagnosis Key (PK)
    Diagnosis Code (NK)
    Diagnosis Description
    Diagnosis Section Code
    Diagnosis Section Description
    Diagnosis Category Code
    Diagnosis Category Description
Figure 14-4: Diagnosis group dimension to create a primary key relationship.
If a unique diagnosis group is created for every patient encounter, the number
of rows could become astronomical and many of the groups would be identical.
Probably a better approach is to have a portfolio of diagnosis groups that are repeat-
edly used. Each set of diagnoses would be looked up in the master diagnosis group
table during the ETL. If the existing group is found, it is used; if not found, a new
diagnosis group is created. Chapter 19: ETL Subsystems and Techniques provides
guidance for creating and administering bridge tables.
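The lookup-or-create step can be sketched in a few lines of Python. This is an in-memory illustration only, not the ETL subsystem described in Chapter 19; the class name, method name, and sample diagnosis keys are hypothetical, and a production version would persist the groups in the master diagnosis group table.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


class DiagnosisGroupRegistry:
    """Reuses one diagnosis group key per distinct set of diagnosis keys."""

    def __init__(self) -> None:
        self._group_by_members: Dict[FrozenSet[int], int] = {}
        self.bridge_rows: List[Tuple[int, int]] = []  # (group key, diagnosis key)
        self._next_key = 1

    def get_or_create(self, diagnosis_keys: Iterable[int]) -> int:
        # A frozenset ignores ordering and duplicates, so the same set of
        # diagnoses always maps to the same group.
        members = frozenset(diagnosis_keys)
        existing = self._group_by_members.get(members)
        if existing is not None:
            return existing

        # Not found: allocate a new group and emit its bridge rows.
        group_key = self._next_key
        self._next_key += 1
        self._group_by_members[members] = group_key
        self.bridge_rows.extend((group_key, key) for key in sorted(members))
        return group_key


registry = DiagnosisGroupRegistry()
g1 = registry.get_or_create([428, 401, 250])
g2 = registry.get_or_create([250, 428, 401])  # same set, different order
g3 = registry.get_or_create([401])
print(g1, g2, g3, len(registry.bridge_rows))
```

Two encounters with the same set of diagnoses share one group key, so bridge rows are added only when a genuinely new combination appears.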
In an inpatient hospital stay scenario, the diagnosis group may be unique to each
patient if it evolves over time during the patient’s stay. In this case you would supple-
ment the bridge table with two date stamps to capture begin and end dates. Although
the twin date stamps complicate updates to the diagnosis group bridge table, they
are useful for change tracking, as described more fully in Chapter 7: Accounting.
Supertypes and Subtypes for Charges
We’ve described a design for billed healthcare treatments to cover both inpatient and
outpatient claims. In reality, healthcare charges resemble the supertype and subtype
pattern described in Chapter 10. Facility charges for inpatient hospital stays differ
from professional charges for outpatient treatments in clinics and doctor offices.
If you were focused exclusively on hospital stays, it would be reasonable to
tweak the Figure 14-2 dimensional structure to incorporate more hospital-specific
information. Figure 14-5 shows a revised set of dimensions specialized for hospital
stays, with the new dimensions bolded.
Inpatient Hospital Claim Billing and Payment Workflow Fact
Treatment Date Key (FK)
Primary Insurance Billing Date Key (FK)
Secondary Insurance Billing Date Key (FK)
Responsible Party Billing Date Key (FK)
Last Primary Insurance Payment Date Key (FK)
Last Secondary Insurance Payment Date Key (FK)
Last Responsible Party Payment Date Key (FK)
Zero Balance Date Key (FK)
Patient Key (FK)
Admitting Physician Key (FK)
Admitting Physician Organization Key (FK)
Attending Physician Key (FK)
Attending Physician Organization Key (FK)
Procedure Key (FK)
Facility Key (FK)
Admitting Diagnosis Group Key (FK)
Discharge Diagnosis Group Key (FK)
Primary Insurance Organization Key (FK)
Secondary Insurance Organization Key (FK)
Responsible Party Key (FK)
Employer Key (FK)
Master Bill ID (DD)
Facts...
Figure 14-5: Accumulating snapshot for hospital stay charges.
Referring to Figure 14-5, you can see two roles for the physician: admitting physi-
cian and attending physician. The figure shows physician organizations for both roles
because physicians may represent different organizations in a hospital setting. With
more complex surgical events, such as a heart transplant operation, whole teams of
specialists and assistants are assembled. In this case, you could include a key in the
fact table for the primary responsible physician; the other physicians and medical
staff would be linked to the fact row via a group key to a multivalued bridge table.
You also have two multivalued diagnosis dimensions on each fact table row. The
admitting diagnosis group is determined at the beginning of the hospital stay and
should be the same for every treatment row that is part of the same hospital stay.
The discharge diagnosis group is not known until the patient is discharged.
Electronic Medical Records
Many healthcare organizations are moving from paper-based processes to elec-
tronic medical records. In the United States, federally mandated quality goals to
support improved population health management may be achievable only with
their adoption. Healthcare providers are aggressively implementing electronic
health record systems; the movement is significantly impacting healthcare
DW/BI initiatives.
Electronic medical records can present challenges for data warehouse environ-
ments because of their extreme variability and potentially extreme volumes. Patients’
medical record data comes in many different forms, ranging from numeric data to
freeform text comments entered by a healthcare professional to images and photo-
graphs. We’ll further discuss unstructured data in Chapter 21: Big Data Analytics;
electronic medical and/or health records may become a classic use case for big data.
One thing is certain. The amount and variability of electronic data in the healthcare
industry will continue to grow.
Measure Type Dimension for Sparse Facts
As designers, it is tempting to strive for a more standardized framework that could
be extended to handle data variability. For example, you could potentially handle the
variability of lab test results with a measurement type dimension describing what
the fact row means, or in other words, what the generic fact represents. The unit
of measure for a given numeric entry is found in the associated measurement type
dimension row, along with any additivity restrictions, as shown in Figure 14-6.
Lab Test Result Facts
    Order Date Key (FK)
    Test Date Key (FK)
    Patient Key (FK)
    Physician Key (FK)
    Lab Test Key (FK)
    Lab Test Measurement Type Key (FK)
    Observed Test Result Value

Lab Test Measurement Type Dimension
    Lab Test Measurement Type Key (PK)
    Lab Test Measurement Type Description
    Lab Test Measurement Type Unit of Measure
Figure 14-6: Lab test observations with measurement type dimension.
This approach is superbly flexible; you can add new measurement types simply by
adding new rows in the measurement type dimension, not by altering the structure
of the fact table. This approach also eliminates the nulls in the classic positional fact
table design because a row exists only if the measurement exists. However, there
are trade-offs. Using a measurement type dimension may generate lots of new fact
table rows because the grain is “one row per measurement per event” rather than the
more typical “one row per event.” If a lab test results in 10 numeric measurements,
there are now 10 rows in the fact table rather than a single row in the classic design.
For extremely sparse situations, such as clinical laboratory or manufacturing test
environments, this is a reasonable compromise. However, as the density of the facts
grows, you end up spewing out too many fact rows. At this point you no longer have
sparse facts and should return to the classic fact table design with fixed columns.
Moreover, this measurement type approach may complicate BI data access appli-
cations. In the relational star schema, combining two numbers that were captured
as part of a single event is more difficult with this approach because now you must
fetch two rows from the fact table. SQL likes to perform arithmetic functions within
a row, not across rows. In addition, you must be careful not to mix incompatible
amounts in a calculation because all the numeric measures reside in a single amount
column. It’s worth noting that multidimensional OLAP cubes are more tolerant of
performing calculations across measurement types.
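The cross-row arithmetic problem can be illustrated with a small sketch. Here two measurements from the same lab event sit in separate fact rows, and a conditional aggregation pivots them back onto one row before dividing. The example uses Python's built-in sqlite3 module; the table names, the cholesterol measurements, and the values are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement_type_dim (
    measurement_type_key INTEGER PRIMARY KEY,
    description TEXT,
    unit_of_measure TEXT
);
INSERT INTO measurement_type_dim VALUES (1, 'Total cholesterol', 'mg/dL'),
                                        (2, 'HDL cholesterol', 'mg/dL');

-- Grain: one row per measurement per lab test event.
CREATE TABLE lab_test_result_fact (
    patient_key INTEGER,
    test_date_key INTEGER,
    measurement_type_key INTEGER,
    observed_value REAL
);
INSERT INTO lab_test_result_fact VALUES (77, 20130105, 1, 200.0),
                                        (77, 20130105, 2, 50.0);
""")

# Each CASE expression isolates one measurement type, so the two values
# from separate rows end up in the same result row and can be divided.
rows = conn.execute("""
    SELECT f.patient_key,
           f.test_date_key,
           MAX(CASE WHEN m.description = 'Total cholesterol'
                    THEN f.observed_value END)
           /
           MAX(CASE WHEN m.description = 'HDL cholesterol'
                    THEN f.observed_value END) AS cholesterol_ratio
      FROM lab_test_result_fact f
      JOIN measurement_type_dim m
        ON m.measurement_type_key = f.measurement_type_key
     GROUP BY f.patient_key, f.test_date_key
""").fetchall()
print(rows)
```

In the classic positional design, the same ratio would be a simple expression over two columns of one row. The filter on measurement type is also what prevents incompatible units from being mixed in the single value column.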
Freeform Text Comments
Freeform text comments, such as clinical notes, are sometimes associated with fact
table events. Although text comments are not very analytically potent unless they’re
parsed into well-behaved dimension attributes, business users are often unwilling
to part with them given the embedded nuggets of information.
Textual comments should not be stored in a fact table directly because they waste
space and rarely participate in queries. Some designers think it’s permissible to store
textual fields in the fact table, as long as they’re referred to as degenerate dimensions.
Degenerate dimensions are most typically used for operational transaction control
numbers and identifiers; it’s not an acceptable approach or pattern for contending
with bulky text fields. Storing freeform comments in the fact table adds clutter that
may negatively impact the performance of analysts’ more typical quantitative queries.
The unbounded text comments should either be stored in a separate comments
dimension or treated as attributes in a transaction event dimension. A key consider-
ation when evaluating these two approaches is the text field’s cardinality. If there’s
nearly a unique comment for every fact table event, storing the textual field in a trans-
action dimension makes the most sense. However, in many cases, No Comment is
associated with numerous fact rows. Because the number of unique text comments in
this scenario is much smaller than the number of unique transactions, it would make
more sense to store the textual data in a comments dimension with an associated
foreign key in the fact table. In either case, queries involving both the text comments
and fact metrics will perform relatively poorly given the need to resolve joins between
two voluminous tables. Often business users want to drill into text comments for
further investigation after highly selective fact table query filters have been applied.
Images
Sometimes the data captured in a patient’s electronic medical record is an image,
in addition to either quantitative numbers or qualitative notes. There are trade-offs
between capturing a JPEG filename in the fact table to refer to an associated image
versus embedding the image as a blob directly in the database. The advantage of
using a JPEG filename is that other image creation, viewing, and editing programs
can freely access the image. The disadvantage is that a separate database of graphic
files must be maintained in synchrony with the fact table.
Facility/Equipment Inventory Utilization
In addition to financial and clinical data, healthcare organizations are also keenly
interested in more operationally oriented metrics, such as utilization and availability
of their assets, whether referring to patient beds or surgical operating theatres. In
Chapter 4, we discussed product inventory data as transaction events as well as
periodic snapshots. Facility or equipment inventories in a healthcare organization
can be handled similarly.
For example, you can envision a bed utilization periodic snapshot with every bed’s
status at regularly recurring points in time, perhaps at midnight, the start of every
shift, or even more frequently throughout the day. In addition to a snapshot date and
potentially time-of-day, this factless fact table would include foreign keys to identify
the patient, attending physician, and perhaps an assigned nurse on duty.
Conversely, you can imagine treating the bed inventory data as a transaction
fact table with one row per movement into and out of a hospital bed. This may be a
simplistic transaction fact table with transaction date and time dimension foreign
keys, along with dimensions to describe the type of movement, such as filled or
vacated. In the case of operating room utilization and availability, you can envision
a lengthier list of statuses, such as pre-operation, post-operation, or downtime,
along with time durations.
If the inventory changes are not terribly volatile, such as the beds in a rehabilita-
tion or eldercare inpatient environment, you should consider a timespan fact table,
as discussed in Chapter 8, with row effective and expiration dates and times to
represent the various states of a bed over a period of time.
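A timespan fact table of this kind might be queried as in the following sketch, which looks up the state of a bed at an arbitrary moment. It uses Python's built-in sqlite3 module; the table layout, the far-future expiration value for the current row, and the half-open interval convention are assumptions for illustration rather than a prescribed design.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each row describes one state of a bed over a span of time.
-- Timestamps are ISO-formatted text, so string comparison orders them.
CREATE TABLE bed_status_timespan_fact (
    bed_key INTEGER,
    status TEXT,
    row_effective TEXT,
    row_expiration TEXT
);
INSERT INTO bed_status_timespan_fact VALUES
    (1, 'Filled',  '2013-01-01 00:00', '2013-01-20 09:00'),
    (1, 'Vacated', '2013-01-20 09:00', '2013-01-22 14:00'),
    (1, 'Filled',  '2013-01-22 14:00', '9999-12-31 00:00');
""")


def bed_status_at(bed_key: int, moment: str) -> Optional[str]:
    """Return the bed's status at the given moment, or None if unknown."""
    # Half-open interval: effective <= moment < expiration, so a moment on
    # a boundary matches exactly one row.
    row = conn.execute(
        """
        SELECT status
          FROM bed_status_timespan_fact
         WHERE bed_key = ?
           AND row_effective <= ?
           AND ? < row_expiration
        """,
        (bed_key, moment, moment),
    ).fetchone()
    return row[0] if row else None


print(bed_status_at(1, "2013-01-21 12:00"))
print(bed_status_at(1, "2013-01-20 09:00"))
```

Three rows cover the whole month here, where a midnight periodic snapshot would need one row per bed per day.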
Dealing with Retroactive Changes
As DW/BI practitioners, we have well-developed techniques for accurately capturing
the historical flow of data from our enterprise’s source applications. Numeric mea-
surements go into fact tables, which are surrounded with contemporary descriptions
of what you know is true at the time of the measurements, packaged as dimension
tables. The descriptions of patient, physician, facility, and payer evolve as slowly
changing dimensions whenever these entities change their descriptions.
However, in the healthcare industry, especially with legacy operational systems,
you often need to contend with late arriving data that should have been loaded into
the data warehouse weeks or months ago. For example, you might receive data
regarding patient procedures that occurred several weeks ago, or updates to patient
profiles that were back-dated as effective several months ago. The more delayed the
incoming records are, the more challenging the DW/BI system’s ETL processing
becomes. We’ll discuss these late arriving fact and dimension scenarios in Chapter
19. Unfortunately, these patterns are common in healthcare DW/BI environments;
in fact, they may be the dominant modes of processing rather than specialized
techniques for outlier cases. Eventually, more effective source data capture systems
should reduce the frequency of these late arriving data anomalies.
Summary
Healthcare provides a wealth of dimensional design examples. In this chapter, the
enterprise data warehouse bus matrix illustrated the critical linkages between a
healthcare organization’s administrative and clinical data. We used an accumulating
snapshot grain fact table with role-playing date dimensions for the healthcare claim
billing and payment pipeline. We also saw role playing used for the physician and
payer dimensions in other fact tables of this chapter.
Healthcare schemas are littered with multivalued dimensions, especially the
diagnosis dimension. Complex surgical events might also use multivalued bridge
tables to represent the teams of involved physicians and other staff members. The
bridge tables used with healthcare data seldom contain weighting factors, as dis-
cussed in earlier chapters, because it is extremely difficult to establish weighting
business rules, beyond the designation of a “primary” relationship.
We discussed medical records and test results, suggesting a measurement type
dimension to organize sparse, heterogeneous measurements into a single, uniform
framework. We also discussed the handling of text comments and linked images.
Transaction and periodic snapshot fact tables were used to represent facility or
equipment inventory utilization and availability. In closing, we touched upon ret-
roactive fact and dimension changes that are often all too common with healthcare
performance data.
15 Electronic Commerce
A web-intensive business’s clickstream data records the gestures of every web
visitor. In its most elemental form, the clickstream is every page event recorded
by each of the company’s web servers. The clickstream contains a number of new
dimensions, such as page, session, and referrer, which are not found in other data
sources. The clickstream is a torrent of data; it can be difficult and exasperating for
DW/BI professionals. Does it connect to the rest of the DW/BI system? Can its dimen-
sions and facts be conformed in the enterprise data warehouse bus architecture?
We start this chapter by describing the raw clickstream data source and designing
its relevant dimensional models. We discuss the impact of Google Analytics, which
can be thought of as an external data warehouse delivering information about your
website. We then integrate clickstream data into a larger matrix of more conven-
tional processes for a web retailer, and argue that the profitability of the web sales
channel can be measured if you allocate the right costs back to the individual sales.
Chapter 15 discusses the following concepts:
■ Clickstream data and its unique dimensionality
■ Role of external services such as Google Analytics
■ Integrating clickstream data with the other business processes on the bus
matrix
■ Assembling a complete view of profitability for a web enterprise
Clickstream Source Data
The clickstream is not just another data source that is extracted, cleaned, and
dumped into the DW/BI environment. The clickstream is an evolving collection of
data sources. There are a number of server log file formats for capturing clickstream
data. These log file formats have optional data components that, if used, can be very
helpful in identifying visitors, sessions, and the true meaning of behavior.
Because of the distributed nature of the web, clickstream data often is collected
simultaneously by different physical servers, even when the visitor thinks they are
interacting with a single website. Even if the log files collected by these separate
servers are compatible, a very interesting problem arises in synchronizing the log
files after the fact. Remember that a busy web server may be processing hundreds
of page events per second. It is unlikely the clocks on separate servers will be in
synchrony to one-hundredth of a second.
You also obtain clickstream data from different parties. Besides your own log
files, you may get clickstream data from referring partners or from internet service
providers (ISPs). Another important form of clickstream data is the search specifica-
tion given to a search engine that then directs the visitor to the website.
Finally, if you are an ISP providing web access to directly connected customers,
you have a unique perspective because you see every click of your captive
customers, which may allow more powerful and invasive analyses of the customers’ sessions.
The most basic form of clickstream data from a normal website is stateless. That
is, the log shows an isolated page retrieval event but does not provide a clear tie to
other page events elsewhere in the log. Without some kind of contextual help, it is
difficult or impossible to reliably identify a complete visitor session.
The other big frustration with basic clickstream data is the anonymity of the
session. Unless visitors agree to reveal their identity in some way, you often cannot
be sure who they are, or if you have ever seen them before. In certain situations,
you may not distinguish the clicks of two visitors who are simultaneously brows-
ing the website.
Clickstream Data Challenges
Clickstream data contains many ambiguities. Identifying visitor origins, visitor
sessions, and visitor identities is something of an interpretive art. Browser caches
and proxy servers make these identifications more challenging.
Identifying the Visitor Origin
If you are very lucky, your site is the default home page for the visitor’s browser.
Every time he opens his browser, your home page is the first thing he sees. This is
pretty unlikely unless you are the webmaster for a portal site or an intranet home
page, but many sites have buttons which, when clicked, prompt visitors to set their
URL as the browser’s home page. Unfortunately there is no easy way to determine
from a log whether your site is set as a browser’s home page.
A visitor may be directed to your site from a search at a portal such as Yahoo! or
Google. Such referrals can come either from the portal’s index, for which you may
have paid a placement fee, or from a word or content search.
For some websites, the most common source of visitors is from a browser book-
mark. For this to happen, the visitor must have previously bookmarked your site,
and this can occur only after the site’s interest and trust levels cross the visitor’s
bookmark threshold.
Finally, your site may be reached as a result of a clickthrough—a deliberate click
on a text or graphical link from another site. This may be a paid-for referral via a
banner ad, or a free referral from an individual or cooperating site. In the case of
clickthroughs, the referring site will almost always be identifiable as a field in the
web event record. Capturing this crucial clickstream data is important to verify the
efficacy of marketing programs. It also provides crucial data for auditing invoices
you may receive from clickthrough advertising charges.
Identifying the Session
Most web-centric analyses require every visitor session (visit) to have its own unique
identity tag, similar to a supermarket receipt number. This is the session ID. Records
for every individual visitor action in a session, whether they are derived from the
clickstream or an application interaction, must contain this tag. But keep in mind
that the operational application, such as an order entry system, generates this session
ID, not the web server.
The basic protocol for the web, Hypertext Transfer Protocol (HTTP), is stateless;
that is, it lacks the concept of a session. There are no intrinsic login or logout actions
built into HTTP, so session identity must be established in some other
way. There are several ways to do this:
1. In many cases, the individual hits comprising a session can be consolidated by
collating time-contiguous log entries from the same host (IP address). If the
log contains a number of entries with the same host ID in a short period of
time (for example, one hour), you can reasonably assume the entries are for
the same session. This method breaks down for websites with large numbers
of visitors because dynamically assigned IP addresses may be reused immedi-
ately by different visitors over a brief time period. Also, different IP addresses
may be used within the same session for the same visitor. This approach also
presents problems when dealing with browsers that are behind some firewalls.
Notwithstanding these problems, many commercial log analysis products use
this method of session tracking, and it requires no cookies or special web
server features.
2. Another much more satisfactory method is to let the web server place a
session-level cookie into the visitor’s web browser. This cookie will last as
long as the browser is open and in general won’t be available in subsequent
browser sessions. The cookie value can serve as a temporary session ID not
only to the browser, but also to any application that requests the session
cookie from the browser. But using a transient cookie has the disadvantage
that you can’t tell when the visitor returns to the site at a later time in a new
session.
3. HTTP’s secure sockets layer (SSL) offers an opportunity to track a visitor
session because it may include a login action by the visitor and the exchange
of encryption keys. The downside to using this method is that to track the
session, the entire information exchange needs to be in high-overhead SSL,
and the visitor may be put off by security advisories that can pop up using
certain browsers. Also, each host must have its own unique security certificate.
4. If page generation is dynamic, you can try to maintain visitor state by plac-
ing a session ID in a hidden field of each page returned to the visitor. This
session ID can be returned to the web server as a query string appended to
a subsequent URL. This method of session tracking requires a great deal of
control over the website’s page generation methods to ensure the thread of
a session ID is not broken. If the visitor clicks links that don’t support this
session ID ping-pong, a single session may appear to be multiple sessions.
This approach also breaks down if multiple vendors supply content in a single
session unless those vendors are closely collaborating.
5. Finally, the website may establish a persistent cookie in the visitor’s machine
that is not deleted by the browser when the session ends. Of course, it’s pos-
sible the visitor will have his browser set to refuse cookies, or may manually
clean out his cookie file, so there is no absolute guarantee that even a per-
sistent cookie will survive. Although any given cookie can be read only by
the website that caused it to be created, certain groups of websites can agree
to store a common ID tag that would let these sites combine their separate
notions of a visitor session into a “super session.”
In summary, the most reliable method of session tracking from web server log
records is obtained by setting a persistent cookie in the visitor’s browser. Less
reliable, but still good, results can be obtained by setting a session-level, nonpersistent
cookie or by associating time-contiguous log entries from the same host. The latter
method requires a robust algorithm in the log postprocessor to ensure satisfactory
results and to decide when not to take the results seriously.
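The host-based method can be sketched as a simple postprocessing pass over log entries. The following Python function is only a minimal illustration: the function name, the (host, timestamp) input shape, and the one-hour gap are assumptions, and it makes no attempt to handle the reused IP addresses, address changes within a session, or firewall problems described in method 1.

```python
from typing import Dict, List, Tuple

# Assumption: a gap of more than one hour between hits from the same host
# starts a new session. The threshold is a tunable guess, not a standard.
SESSION_GAP_SECONDS = 60 * 60


def assign_sessions(
    hits: List[Tuple[str, int]], gap: int = SESSION_GAP_SECONDS
) -> List[Tuple[str, int, int]]:
    """Tag each (host, unix_timestamp) hit with a session number.

    Hits are processed in time order; hits from the same host stay in the
    same session as long as consecutive hits are no more than `gap` apart.
    """
    last_seen: Dict[str, int] = {}        # host -> timestamp of its last hit
    current_session: Dict[str, int] = {}  # host -> its open session number
    next_session = 1
    tagged: List[Tuple[str, int, int]] = []

    for host, ts in sorted(hits, key=lambda hit: (hit[1], hit[0])):
        if host not in last_seen or ts - last_seen[host] > gap:
            current_session[host] = next_session
            next_session += 1
        last_seen[host] = ts
        tagged.append((host, ts, current_session[host]))
    return tagged


hits = [
    ("10.0.0.1", 0),
    ("10.0.0.1", 600),
    ("10.0.0.2", 700),
    ("10.0.0.1", 5000),  # more than an hour after this host's previous hit
]
sessions = assign_sessions(hits)
for entry in sessions:
    print(entry)
```

A more robust postprocessor would also need rules for deciding when the inferred sessions are too unreliable to use, as noted above.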
Identifying the Visitor
Identifying a specific visitor who logs into your site presents some of the most
challenging problems facing a site designer, webmaster, or manager of the web
analytics group.
■ Web visitors want to be anonymous. They may have no reason to trust you,
the internet, or their computer with personal identification or credit card
information.
■ If you request visitors’ identity, they may not provide accurate information.
■ You can’t be sure which family member is visiting your site. If you obtain
an identity by association, for instance from a persistent cookie left during a
previous visit, the identification is only for the computer, not for the specific
visitor. Any family member or company employee may have been using that
particular computer at that moment in time.
■ You can’t assume an individual is always at the same computer. Server-
provided cookies identify a computer, not an individual. If someone accesses
the same website from an office computer, home computer, and mobile device,
a different website cookie is probably put into each machine.
Clickstream Dimensional Models
Before designing clickstream dimensional models, let’s consider all the dimensions
that may have relevance in a clickstream environment. Any single dimensional
model will not use all the dimensions at once, but it is nice to have a portfolio
of dimensions waiting to be used. The list of dimensions for a web retailer could
include:
■ Date
■ Time of day
■ Part
■ Vendor
■ Status
■ Carrier
■ Facilities location
■ Product
■ Customer
■ Media
■ Promotion
■ Internal organization
■ Employee
■ Page
■ Event
■ Session
■ Referral
All the dimensions in the list, except for the last four (page, event, session, and referral), are familiar
dimensions, most of which we have already used in earlier chapters of this book.
But the last four are the unique dimensions of the clickstream and warrant some
careful attention.
Page Dimension
The page dimension describes the page context for a web page event, as illustrated
in Figure 15-1. The grain of this dimension is the individual page. The definition
of page must be flexible enough to handle the evolution of web pages from static
page delivery to highly dynamic page delivery in which the exact page the customer
sees is unique at that instant in time. We assume even in the case of the dynamic
page that there is a well-defined function that characterizes the page, and we will
use that to describe the page. We will not create a page row for every instance of a
dynamic page because that would yield a dimension with an astronomical number
of rows. These rows also would not differ in interesting ways. You want a row in this
dimension for each interesting distinguishable type of page. Static pages probably get
their own row, but dynamic pages would be grouped by similar function and type.
Page Dimension Attribute   Sample Data Values/Definitions
Page Key                   Surrogate values (1..N)
Page Source                Static, Dynamic, Unknown, Corrupted, Inapplicable, ...
Page Function              Portal, Search, Product description, Corporate information, ...
Page Template              Sparse, Dense, ...
Item Type                  Product SKU, Book ISBN number, Telco rate type, ...
Graphics Type              GIF, JPG, Progressive disclosure, Size pre-declared, ...
Animation Type             Similar to graphics type
Sound Type                 Similar to graphics type
Page File Name             Optional application dependent name
Figure 15-1: Page dimension attributes and sample data values.
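As a rough sketch of the grouping rule described above, the following code maps a requested URL to page dimension attributes so that every instance of a dynamic page shares one row while each static page gets its own. The URL prefixes, the "Unknown" function label for static pages, and the class and function names are hypothetical; in practice the descriptive codes would be supplied by the page developers.

```python
from urllib.parse import urlsplit

# Hypothetical URL conventions: path prefix -> page function.
DYNAMIC_PREFIXES = {
    "/search": "Search",
    "/product": "Product description",
}


def page_attributes(url):
    """Return (page_source, page_function, page_file_name) for a URL."""
    path = urlsplit(url).path or "/"
    for prefix, function in DYNAMIC_PREFIXES.items():
        if path == prefix or path.startswith(prefix + "/"):
            # Every instance of a dynamic page collapses to one row.
            return ("Dynamic", function, prefix)
    # Each static page is distinguished by its own file name.
    return ("Static", "Unknown", path)


class PageDimension:
    """Assigns surrogate keys (1..N), one per distinguishable type of page."""

    def __init__(self):
        self._keys = {}

    def key_for(self, url):
        attributes = page_attributes(url)
        return self._keys.setdefault(attributes, len(self._keys) + 1)


pages = PageDimension()
print(pages.key_for("https://shop.example.com/product/123?color=red"))
print(pages.key_for("https://shop.example.com/product/456"))
print(pages.key_for("https://shop.example.com/about.html"))
```

The two product URLs receive the same page key, while the static page receives a new one, so the dimension grows with the number of page types rather than the number of page instances.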
When the definition of a static page changes because it is altered by the web-
master, the page dimension row can either be type 1 overwritten or treated with
an alternative slowly changing technique. This decision is a matter of policy for
the data warehouse and depends on whether the old and new descriptions of the
page differ materially, and whether the old definition should be kept for historical
analysis purposes.
Website designers, data governance representatives from the business, and the
DW/BI architects need to collaborate to assign descriptive codes and attributes to
each page served by the web server, whether the page is dynamic or static. Ideally,
the web page developers supply descriptive codes and attributes with each page
they create and embed these codes and attributes into the optional fields of the
web log files. This crucial step is at the foundation of the implementation of this
page dimension.
Before leaving the page dimension, we want to point out that some internet companies
track the more granular individual elements on each page of their websites,
including graphical elements and links. Each element generates its own row for each
visitor for each page request. A single complex web page can generate hundreds of
rows each time the page is served to a visitor. Obviously, this extreme granularity
generates astronomical amounts of data, often exceeding 10 terabytes per day!
Similarly, gaming companies may generate a row for every gesture made by every
online game player, which again can result in hundreds of millions of rows per day.
In both cases, the most atomic fact table will have extra dimensions describing the
graphical element, link, or game situation.
Event Dimension
The event dimension describes what happened on a particular page at a particular
point in time. The main interesting events are Open Page, Refresh Page, Click Link,
and Enter Data. You want to capture that information in this small event dimension,
as illustrated in Figure 15-2.
Event Dimension Attribute  Sample Data Values/Definitions
Event Key                  Surrogate values (1..N)
Event Type                 Open page, Refresh page, Click link, Unknown, Inapplicable
Event Content              Application-dependent fields eventually driven by XML tags
Figure 15-2: Event dimension attributes and sample data values.
Session Dimension
The session dimension provides one or more levels of diagnosis for the visitor’s
session as a whole, as shown in Figure 15-3. For example, the local context of the
session might be Requesting Product Information, but the overall session context
might be Ordering a Product. The success status would diagnose whether the mis-
sion was completed. The local context may be decidable from just the identity of
the current page, but the overall session context probably can be judged only by
processing the visitor’s complete session at data extract time. The customer status
attribute is a convenient place to label the customer for periods of time, with labels
that are not clear either from the page or immediate session. These statuses may be
derived from auxiliary business processes in the DW/BI system, but by placing these
labels deep within the clickstream, you can directly study the behavior of certain
types of customers. Do not put these labels in the customer dimension because they
may change over very short periods of time. If there are a large number of these
statuses, consider creating a separate customer status mini-dimension rather than
embedding this information in the session dimension.
Session Dimension Attribute  Sample Data Values/Definitions
Session Key                  Surrogate values (1..N)
Session Type                 Classified, Unclassified, Corrupted, Inapplicable
Local Context                Page-derived context like Requesting Product Information
Session Context              Trajectory-derived context like Ordering a Product
Action Sequence              Summary label for overall sequence of actions during session
Success Status               Identifies whether overall session mission was accomplished
Customer Status              New customer, High value customer, About to cancel, In default
Figure 15-3: Session dimension attributes and sample data values.
This dimension groups sessions for analysis, such as:
■ How many customers consulted your product information before ordering?
■ How many customers looked at your product information and never ordered?
■ How many customers did not finish ordering? Where did they stop?
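Questions like these reduce to grouping fact rows by session dimension attributes. The sketch below uses an in-memory SQLite database; the table names, column names, and sample labels are assumptions made for illustration and are not prescribed by the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE session_dim (
    session_key     INTEGER PRIMARY KEY,
    session_context TEXT,
    action_sequence TEXT,
    success_status  TEXT
);
CREATE TABLE session_fact (
    customer_key  INTEGER,
    session_key   INTEGER REFERENCES session_dim (session_key),
    orders_placed INTEGER
);
-- Hypothetical sample rows.
INSERT INTO session_dim VALUES
    (1, 'Ordering a Product', 'Product info then order', 'Completed'),
    (2, 'Ordering a Product', 'Product info only', 'Abandoned'),
    (3, 'Ordering a Product', 'Order without product info', 'Completed');
INSERT INTO session_fact VALUES
    (101, 1, 1), (102, 2, 0), (103, 2, 0), (104, 3, 1), (101, 2, 0);
""")


def customers_by_action_sequence(connection):
    """Count distinct customers for each action sequence label."""
    rows = connection.execute("""
        SELECT d.action_sequence, COUNT(DISTINCT f.customer_key)
        FROM session_fact AS f
        JOIN session_dim AS d ON d.session_key = f.session_key
        GROUP BY d.action_sequence
        ORDER BY d.action_sequence
    """).fetchall()
    return dict(rows)


print(customers_by_action_sequence(conn))
```

Because the diagnosis is stored as a dimension attribute, the query needs only a simple join and a constraint or grouping on that attribute; no page-by-page analysis is required at query time.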
Referral Dimension
The referral dimension, illustrated in Figure 15-4, describes how the customer
arrived at the current page. The web server logs usually provide this information.
The URL of the previous page is identified, and in some cases additional information
is present. If the referrer was a search engine, usually the search string is specified.
It may not be worthwhile to put the raw search specification into your database
because the search specifications are so complicated and idiosyncratic that an ana-
lyst may not be able to query them usefully. You can assume some kind of simplified
and cleaned specification is placed in the specification attribute.
Referral Dimension Attribute  Sample Data Values/Definitions
Referral Key                  Surrogate values (1..N)
Referral Type                 Intra site, Remote site, Search engine, Corrupted, Inapplicable
Referring URL                 www.organization-site.com/linkspage
Referring Site                www.organization-site.com
Referring Domain              www.organization-site.com
Search Type                   Simple text match, Complex logical match
Specification                 Actual spec used (useful if simple text, otherwise questionable)
Target                        Meta tags, Body text, Title (where search found its match)
Figure 15-4: Referral dimension attributes and sample data values.
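A few of these attributes can be derived from the raw referrer string in the log. The following is a minimal sketch, assuming a hypothetical site host name and a hypothetical table of search engine hosts with the query parameter that carries the search string; the cleaning rule (lowercase, collapsed whitespace) is likewise only one possible simplification.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical values for illustration only.
OWN_SITE = "www.example-retailer.com"
SEARCH_ENGINES = {"search.example.com": "q"}  # host -> search string parameter


def referral_row(referrer):
    """Derive referral dimension attributes from a raw referrer string."""
    row = {"referral_type": "Inapplicable", "referring_url": "",
           "referring_site": "", "specification": ""}
    if not referrer or referrer == "-":
        return row  # no referrer was logged
    try:
        parts = urlsplit(referrer)
        host = (parts.hostname or "").lower()
    except ValueError:
        host = ""
    if not host:
        row["referral_type"] = "Corrupted"
        return row
    row["referring_site"] = host
    row["referring_url"] = host + parts.path
    if host == OWN_SITE:
        row["referral_type"] = "Intra site"
    elif host in SEARCH_ENGINES:
        row["referral_type"] = "Search engine"
        raw = parse_qs(parts.query).get(SEARCH_ENGINES[host], [""])[0]
        # Simplified, cleaned specification: lowercase, single spaces.
        row["specification"] = " ".join(raw.lower().split())
    else:
        row["referral_type"] = "Remote site"
    return row


print(referral_row("https://search.example.com/find?q=Red++Shoes"))
print(referral_row("https://www.organization-site.com/linkspage"))
```

Storing only the cleaned specification keeps the attribute queryable, in line with the caution above about raw search strings.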
Clickstream Session Fact Table
Now that you have a portfolio of useful clickstream dimensions, you can design
the primary clickstream dimensional models based on the web server log data.
This business process can then be integrated into the family of other web retailing
subject areas.
With an eye toward keeping the first fact table from growing astronomically,
you should choose the grain to be one row for each completed customer session.
This grain is significantly coarser than that of the underlying web server logs, which record
each individual page event, including individual pages as well as each graphical
element on each page. While we typically encourage designers to start with the
most granular data available in the source system, this is a purposeful deviation
from our standard practices. Perhaps you have a big site recording more than 100
million page fetches per day, and 1 billion micro page events (graphical elements),
but you want to start with a more manageable number of rows to be loaded each
day. We assume for the sake of argument that the 100 million page fetches boil
down to 20 million complete visitor sessions. This could arise if an average visitor
session touched 5 pages.
The dimensions that are appropriate for this first fact table are calendar date, time
of day, customer, page, session, and referrer. Finally, you can add a set of measured
facts for this session including session seconds, pages visited, orders placed, units
ordered, and order dollars. The completed design is shown in Figure 15-5.
Clickstream Session Fact
    Universal Date Key (FK)
    Universal Date/Time
    Local Date Key (FK)
    Local Date/Time
    Customer Key (FK)
    Entry Page Key (FK)
    Session Key (FK)
    Referrer Key (FK)
    Session ID (DD)
    Session Seconds
    Pages Visited
    Orders Placed
    Order Quantity
    Order Dollar Amount
Dimensions joined to the fact table
    Date Dimension (2 views for roles)
    Customer Dimension
    Session Dimension
    Entry Page Dimension
    Referrer Dimension
Figure 15-5: Clickstream fact table design for complete sessions.
There are a number of interesting aspects to this design. You may wonder why
there are two connections from the calendar date dimension to the fact table and
two date/time stamps. This is a case in which both the calendar date and the time
of day must play two different roles. Because you are interested in measuring the
precise times of sessions, you must meet two conflicting requirements. First, you
want to make sure you can synchronize all session dates and times internationally
across multiple time zones. Perhaps you have other date and time stamps from
other web servers or nonweb systems elsewhere in the DW/BI environment. To
achieve true synchronization of events across multiple servers and processes, you
must record all session dates and times, uniformly, in a single time zone such as
Greenwich Mean Time (GMT) or Coordinated Universal Time (UTC). You should
interpret the session date and time combinations as the beginning of the session.
Because you have the dwell time of the session as a numeric fact, you can tell when
the session ended, if that is of interest.
The other requirement you meet with this design is to record the date and time of
the session relative to the visitor’s wall clock. The best way to represent this informa-
tion is with a second calendar date foreign key and date/time stamp. Theoretically,
you could represent the time zone of the customer in the customer dimension table,
but constraints to determine the correct wall clock time would be horrendously
complicated. The time difference between two cities (such as London and Sydney)
can change by as much as two hours at different times of the year depending on
when these cities go on and off daylight saving time. This is not the business of
the BI reporting application to work out. It is the business of the database to store
this information, so it can be constrained in a simple and direct way.
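The two requirements can be met at load time by deriving both sets of columns from a single UTC timestamp. The sketch below assumes the ETL process already knows the visitor's UTC offset at the moment of the session and that date keys take the form YYYYMMDD; both are assumptions for illustration, as is the function name.

```python
from datetime import datetime, timedelta, timezone


def date_time_columns(session_start_utc, visitor_utc_offset_hours):
    """Return the universal and local date/time columns for one fact row.

    `session_start_utc` must be a timezone-aware datetime in UTC.
    """
    local_zone = timezone(timedelta(hours=visitor_utc_offset_hours))
    local = session_start_utc.astimezone(local_zone)
    return {
        "universal_date_key": int(session_start_utc.strftime("%Y%m%d")),
        "universal_date_time": session_start_utc.replace(tzinfo=None),
        "local_date_key": int(local.strftime("%Y%m%d")),
        "local_date_time": local.replace(tzinfo=None),
    }


start = datetime(2024, 1, 15, 20, 30, tzinfo=timezone.utc)
# In January, London is at UTC+0 and Sydney is at UTC+11.
print(date_time_columns(start, 0))
print(date_time_columns(start, 11))
```

In this example the Sydney visitor's local date is already January 16 while the universal date is still January 15. In July, London is at UTC+1 and Sydney at UTC+10, so the gap between the two cities shrinks from 11 hours to 9, which is the two-hour swing mentioned above. Storing the local values in the fact row spares the BI application from reproducing that logic.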
The two role-playing calendar date dimension tables are views on a single under-
lying table. The column names are massaged in the view definition, so they are
slightly different when they show up in the user interface pick lists of BI tools.
Note that the use of views makes the two instances of each table semantically
independent.
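A minimal sketch of the role-playing views follows, using an in-memory SQLite database. The table, view, and column names are assumptions for illustration; only the technique of defining two renamed views over one physical date table comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical calendar date table (hypothetical columns).
CREATE TABLE date_dim (
    date_key    INTEGER PRIMARY KEY,
    full_date   TEXT,
    day_of_week TEXT
);
INSERT INTO date_dim VALUES
    (20240115, '2024-01-15', 'Monday'),
    (20240116, '2024-01-16', 'Tuesday');

-- Two views, one per role, with role-specific column names.
CREATE VIEW universal_date_dim AS
    SELECT date_key    AS universal_date_key,
           full_date   AS universal_date,
           day_of_week AS universal_day_of_week
    FROM date_dim;
CREATE VIEW local_date_dim AS
    SELECT date_key    AS local_date_key,
           full_date   AS local_date,
           day_of_week AS local_day_of_week
    FROM date_dim;

CREATE TABLE session_fact (
    universal_date_key INTEGER,
    local_date_key     INTEGER,
    session_seconds    INTEGER
);
INSERT INTO session_fact VALUES (20240115, 20240116, 420);
""")

row = conn.execute("""
    SELECT u.universal_day_of_week, l.local_day_of_week, f.session_seconds
    FROM session_fact AS f
    JOIN universal_date_dim AS u ON u.universal_date_key = f.universal_date_key
    JOIN local_date_dim AS l ON l.local_date_key = f.local_date_key
""").fetchone()
print(row)
```

Because each view exposes distinct column names, a query can constrain the universal date and the local date independently without aliasing the same table twice.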
We modeled the exact instant in time with a full date/time stamp rather than a
time-of-day dimension. Unlike the calendar date dimension, a time-of-day dimen-
sion would contain few if any meaningful attributes. You don’t have labels for each
hour, minute, or second. Such a time-of-day dimension could be ridiculously large
if its grain were the individual second or millisecond. Also, the use of an explicit
date/time stamp allows direct arithmetic between different date/time stamps to
calculate precise time gaps between sessions, even those crossing days. Calculating
time gaps using a time-of-day dimension would be awkward.
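The arithmetic is straightforward when full date/time stamps are stored. This short sketch, with a hypothetical function name, computes the gap between the end of one session and the start of the next, using the session seconds fact to find the end of the earlier session.

```python
from datetime import datetime, timedelta


def gap_seconds(previous_start, previous_session_seconds, next_start):
    """Seconds between the end of one session and the start of the next."""
    previous_end = previous_start + timedelta(seconds=previous_session_seconds)
    return (next_start - previous_end).total_seconds()


# A session that starts late on one day, followed by one early the next day.
first_start = datetime(2024, 1, 15, 23, 50, 0)
second_start = datetime(2024, 1, 16, 0, 12, 0)
print(gap_seconds(first_start, 420, second_start))
```

The first session ends at 23:57 and the second begins at 00:12 the next day, so the gap is 15 minutes even though the sessions fall on different calendar dates.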
The inclusion of the page dimension in Figure 15-5 may seem surprising given
the grain of the design is the customer session. However, in a given session, a very
interesting page is the entry page. The page dimension in this design is the page the
session started with. In other words, how did the customer hop onto your bus just
now? Coupled with the referrer dimension, you now have an interesting ability to
analyze how and why the customer accessed your website. A more elaborate design
would also add an exit page dimension.
You may be tempted to add the causal dimension to this design, but if the causal
dimension focuses on individual products, it would be inappropriate to add it to
this design. The symptom that the causal dimension does not mesh with this design
is the multivalued nature of the causal factors for a given complete session. If you
run ad campaigns or special deals for several products, how do you represent this
multivalued situation if the customer’s session involves several products? The right
place for a product-oriented causal dimension will be in the more fine-grained table
described in the next fact table example. Conversely, a more broadly focused mar-
ket conditions dimension that describes conditions affecting all products would be
appropriate for a session-grained fact table.
The session seconds fact is the total number of seconds the customer spent on the
site during this session. There will be many cases in which you can’t tell when the
customer left. Perhaps the customer typed in a new URL. This won’t be detected by
conventional web server logs. (If the data is collected by an ISP who can see every
click across sessions, this particular issue goes away.) Or perhaps the customer
got up out of the chair and didn’t return for 1 hour. Or perhaps the customer just
closed the browser without making any more clicks. In all these cases, your extract
software needs to assign a small and nominal number of seconds to this last session
step, so the analysis is not unrealistically distorted.
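One way the extract software might apply this rule is sketched below. The 30-second nominal value and the function name are assumptions for illustration; the text does not specify a figure.

```python
from datetime import datetime

# Assumption: nominal dwell time credited to the final page of a session.
NOMINAL_LAST_PAGE_SECONDS = 30


def session_seconds(page_request_times, nominal_last=NOMINAL_LAST_PAGE_SECONDS):
    """Total session seconds from the page request times of one session.

    The time on every page but the last is the gap to the next request.
    The last page has no following request, so it gets a nominal value.
    """
    if not page_request_times:
        return 0
    ordered = sorted(page_request_times)
    observed = int((ordered[-1] - ordered[0]).total_seconds())
    return observed + nominal_last


requests = [
    datetime(2024, 1, 15, 10, 0, 0),
    datetime(2024, 1, 15, 10, 1, 30),
    datetime(2024, 1, 15, 10, 4, 0),
]
print(session_seconds(requests))
```

Here 240 seconds are observed between the first and last requests, and the nominal 30 seconds are added for the final page. A single-page session is credited with the nominal value alone.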
We purposely designed this first clickstream fact table to focus on complete visitor
sessions while keeping the size under control. The next schema drops down to the
lowest practical granularity you can support in the data warehouse: the individual
page event.
Clickstream Page Event Fact Table
The granularity of the second clickstream fact table is the individual page event in
each customer session; the underlying micro events recording graphical elements
such as JPGs and GIFs are discarded (unless you are Yahoo! or eBay as described
previously). With simple static HTML pages, you can record only one interesting
event per page view, namely the page view. As websites employ dynamically created
XML-based pages, with the ability to establish an on-going dialogue through the
page, the number and type of events will grow.
This fact table could become astronomical in size. You should resist the urge
to aggregate the table up to a coarser granularity because that inevitably involves
dropping dimensions. Actually, the first clickstream fact table represents just such
an aggregation; although it is a worthwhile fact table, analysts cannot ask questions
about visitor behavior or individual pages.
Having chosen the grain, you can choose the appropriate dimensions. The list of
dimensions includes calendar date, time of day, customer, page, event, session, ses-
sion ID, step (three roles), product, referrer, and promotion. The completed design
is shown in Figure 15-6.
Clickstream Page Event Fact
    Universal Date Key (FK)
    Universal Date/Time
    Local Date Key (FK)
    Local Date/Time
    Customer Key (FK)
    Page Key (FK)
    Event Key (FK)
    Session Key (FK)
    Session ID (DD)
    Session Step Key (FK)
    Purchase Step Key (FK)
    Abandonment Step Key (FK)
    Product Key (FK)
    Referrer Key (FK)
    Promotion Key (FK)
    Page Seconds
    Order Quantity
    Order Dollar Amount
Dimensions joined to the fact table
    Date Dimension (2 views for roles)
    Customer Dimension
    Event Dimension
    Page Dimension
    Session Dimension
    Step Dimension (3 views for roles): Step Key (PK), Step Number, Steps Until End
    Product Dimension: Product Key (PK), Product Attributes ...
    Referrer Dimension
    Promotion Dimension
Figure 15-6: Clickstream fact table design for individual page use.
Figure 15-6 looks similar to the first design, except for the addition of the event,
product, promotion, and step dimensions, and the page dimension now describes every
page rather than only the entry page. This similarity between fact tables is typical
of dimensional models. One of the charms of dimensional modeling is the “boring”
similarity of the designs. But that is where they get their power. When the designs
have a predictable structure, all the software up and down the DW/BI chain, from
extraction, to database querying, to the BI tools, can exploit this similarity to great
advantage.
The two roles played by the calendar date and date/time stamps have the same
interpretation as in the first design. One role is the universal synchronized time,
and the other role is the local wall clock time as measured by the customer. In this
fact table, these dates and times refer to the individual page event.