
Data Warehousing
& Data Mining

Wolf-Tilo Balke
Silviu Homoceanu
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

7. Building the DW

7. Building the DW

7.1 The DW Project
7.2 Data Extract/Transform/Load (ETL)
7.3 Metadata


7.1 The DW Project

• Building a DW is a complex IT project

– A mid-sized DW project comprises 500-1,000 activities

• DW-Project organization

– Project roles and corresponding tasks, e.g.:

Roles                 Tasks
DW-PM                 Project management
DW-Architect          Methods, concepts, modeling
DW-Miner              Concepts, analysis (non-standard)
Domain Expert         Domain knowledge
DW-System Developer   System and metadata management, ETL
DW User               Analysis (standard)


7.1 The DW Project

• Usual tasks in a DW-Project

– Communication, as the process of information exchange
between team members

– Conflict management

• The magical triangle: a compromise between time, cost
and quality

– Quality assurance

• Performance, reliability, scalability, robustness, etc.

– Documentation


7.1 The DW Project

• Software choice

– Database system for the DW

• Usually the choice is to use the same technology provider as for
the operational data

• MDB vs. RDB

– ETL tools

• Differentiated by the data cleansing needs they address

– Analysis tools

• Varying from data mining to OLAP products, with a focus on
reporting functionality

– Repository

• Not used very often
• Helpful for metadata management


7.1 The DW Project

• Hardware choice

– Data storage

• RAID systems, SANs, NAS devices

– Processing

• Multi-CPU systems, SMP, Clusters

– Failure tolerance

• Data replication, mirroring RAID, backup strategies

– Other factors

• Data access times, transfer rates, memory bandwidth,
network throughput and latency


7.1 The DW Project

• The project timeline depends on the development
methodology, but usually comprises:

– Phase I – Proof of Concept

• Establish Technical Infrastructure
• Prototype Data Extraction/Transformation/Load
• Prototype Analysis & Reporting

– Phase II – Controlled Release

• Iterative Process of Building Subject Oriented Data Marts

– Phase III – General Availability

• Ongoing operations, support and training, maintenance and
growth

• The most important part of the DW building project
is defining the ETL process


7.2 ETL

• What is ETL?

– Short for extract, transform and load
– Three database functions that are combined into one
tool to pull data out of production databases and
place it into the DW

• Migrate data from one database to another, to form data
marts and data warehouses

• Convert databases from one format or type to another


7.2 ETL

• When should we ETL?

– Periodically (e.g., every night, every week) or after
significant events

– Refresh policy set by administrator based on user
needs and traffic

– Possibly different policies for different sources
– Rarely, on every update (real-time DW)

• Not warranted unless users require up-to-the-minute data
(e.g., current stock quotes)


7.2 ETL

• ETL is used to integrate heterogeneous systems

– With different DBMS, operating system, hardware,
communication protocols

• ETL challenges

– Getting the data from the source to target as fast as
possible

– Allowing recovery from failure without restarting the
whole process

• This leads to a trade-off between writing data to
staging tables and keeping it in memory


7.2 ETL

• Staging area, basic rules

– Data in the staging area is owned by
the ETL team

• Users are not allowed in the staging
area at any time

– Reports cannot access data from the staging area
– Only ETL processes can write to and read from the
staging area


7.2 Staging Area

• Staging area structures for holding data

– Flat files
– XML data sets
– Relational tables


7.2 Data Structures

• Flat files

– ETL tools based on scripts, such as
Perl, VBScript or JavaScript

– Advantages

• No overhead of maintaining metadata as a DBMS does
• Sorting, merging, deleting, replacing and other data-migration
functions are much faster outside the DBMS

– Disadvantages

• No concept of updating
• Queries and random-access lookups are not well supported
by the operating system
• Flat files cannot be indexed for fast lookups


7.2 Data Structures

• When should flat files be used?

– Staging source data for safekeeping and recovery

• The best approach to restarting a failed process is to have the
data dumped in a flat file

– Sorting data

• Sorting data in the file system may be more efficient than
performing it in a DBMS with an ORDER BY clause (see the sketch below)

• Sorting is important: a huge portion of the ETL
processing cycles goes to sorting
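
As a rough illustration of these two uses (a hypothetical sketch; the file name
and row layout are made up), extracted rows can be dumped to a flat file for
restart safety and then sorted at the file level before loading:

import csv

# Hypothetical extracted rows: (customer_id, city, sales_value)
extracted_rows = [(17, "BS", 120.0), (3, "WOB", 75.5), (42, "HAN", 310.2)]

# Stage the extraction in a flat file so a failed run can restart from it
with open("stage_customers.csv", "w", newline="") as f:
    csv.writer(f).writerows(extracted_rows)

# Sort the staged file outside the DBMS (here by customer_id)
with open("stage_customers.csv", newline="") as f:
    rows = sorted(csv.reader(f), key=lambda r: int(r[0]))

with open("stage_customers_sorted.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)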


7.2 Data Structures

• When should flat files be used?

– Filtering

• Using grep-like functionality

– Replacing text strings

• Sequential file processing is much
faster at the system-level than it
is with a database


7.2 Data Structures

• XML Data sets

– Used as common format for both
input and output from the ETL
system

– Generally, not used for persistent
staging

– Useful mechanisms

• XML schema (successor of DTD)
• XQuery, XPath
• XSLT
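
A minimal sketch (assumed element names, inline sample document) of reading an
XML data set in the staging area into plain tuples with Python's standard
library; real ETL tools would typically rely on XML Schema validation, XPath
and XSLT instead:

import xml.etree.ElementTree as ET

xml_text = """
<orders>
  <order id="1"><city>BS</city><amount>120.00</amount></order>
  <order id="2"><city>WOB</city><amount>75.50</amount></order>
</orders>
"""

root = ET.fromstring(xml_text)
rows = [(o.get("id"), o.findtext("city"), float(o.findtext("amount")))
        for o in root.findall("order")]
print(rows)   # [('1', 'BS', 120.0), ('2', 'WOB', 75.5)]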


7.2 Data Structures

• Relational tables

– Using tables is most appropriate, especially when there
are no dedicated ETL tools

– Advantages

• Apparent metadata: column names, data types and lengths,
cardinality, etc.

• Relational abilities: data integrity as well as normalized
staging

• Open repository/SQL interface: easy to access by any SQL
compliant tool

– Disadvantages

• Sometimes slower than the operating system’s file system


7.2 Staging Area - Storage

• How is the staging area designed?

– The staging database, file system, and directory structures
are set up by the DB and OS administrators based on the
ETL architect’s estimations, e.g., a table volumetric
worksheet

Table Name | ETL Job | Update strategy | Load frequency | Initial row count | Avg row length | Grows with  | Expected rows/mo | Expected bytes/mo | Initial table size | Table size 6 mo. (MB)
S_ACC      | SAcc    | Truncate/Reload | Daily          | 39,933            | 27             | New account | 9,983            | 269,548           | 1,078,191          | 2.57
S_ASSETS   | SAssets | Insert/Delete   | Daily          | 771,500           | 75             | New assets  | 192,875          | 15,044,250        | 60,177,000         | 143.47
S_DEFECT   | SDefect | Truncate/Reload | On demand      | 84                | 27             | New defect  | 21               | 567               | 2,268              | 0.01

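The size columns in such a worksheet follow directly from the row counts and
average row lengths; a small sketch (using the S_ACC figures from the table
above, and taking 1 MB as 2^20 bytes) reproduces the kind of estimate the ETL
architect fills in:

def table_size_estimate(initial_rows, avg_row_len, rows_per_month, months=6):
    initial_bytes = initial_rows * avg_row_len
    monthly_bytes = rows_per_month * avg_row_len
    size_after = initial_bytes + months * monthly_bytes
    return initial_bytes, round(size_after / 2**20, 2)   # bytes, MB

# S_ACC from the worksheet: 39,933 rows of 27 bytes, ~9,983 new rows per month
print(table_size_estimate(39_933, 27, 9_983))   # (1078191, 2.57)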

7.2 ETL

• ETL

– Data extraction
– Data transformation
– Data loading


7.2 Data Extraction

• Data Extraction

– Data needs to be taken from a data
source so that it can be put into the
DW

• Internal scripts/tools at the data source,
which export the data to be used

• External programs, which extract the data from the source

– If the data is exported, it is typically exported into a text
file that can then be brought into an intermediary
database

– If the data is extracted from the source, it is typically
transferred directly into an intermediary database


7.2 Data Extraction

• Steps in data extraction

– Initial extraction

• Preparing the logical map
• First time data extraction

– Ongoing extraction

• Just new data
• Changed data
• Or even deleted data


7.2 Data Extraction

• Logical map connects the original source data
to the final data

• Building the logical map: first identify the data
sources

– Data discovery phase

• Collecting and documenting source systems: databases,
tables, relations, cardinality, keys, data types, etc.

– Anomaly detection phase

• NULL values can destroy any ETL process, e.g., if a foreign
key is NULL, joining tables on a NULL column results in
data loss, because in RDB NULL ≠ NULL


7.2 Data Extraction

• Example of solving NULL anomalies

– If NULL occurs on a foreign key, use outer joins
(see the sketch below)

– If NULL occurs on other columns, create a business
rule to replace NULLs while loading data into the DW
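
A minimal sqlite3 sketch (hypothetical orders/customers tables, not from the
lecture) showing why an inner join silently drops rows whose foreign key is
NULL, while a LEFT OUTER JOIN keeps them:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(cust_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders(order_id INTEGER, cust_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'BS'), (2, 'WOB');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, NULL, 75.5);
""")

inner = con.execute("SELECT count(*) FROM orders o JOIN customers c "
                    "ON o.cust_id = c.cust_id").fetchone()[0]
outer = con.execute("SELECT count(*) FROM orders o LEFT OUTER JOIN customers c "
                    "ON o.cust_id = c.cust_id").fetchone()[0]
print(inner, outer)   # 1 2 -- the inner join loses the order with a NULL key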

• Ongoing Extraction

– Data must also be maintained in the DW after
the initial load

• Extraction is performed on a regular basis
• Only changes are extracted after the first time

• How do we detect changes?


7.2 Ongoing Extraction

• Detecting changes (new/changed data)

– Using audit columns
– Database log scraping or sniffing
– Process of elimination

• Using audit columns

– Store the date and time at which a record was added
or modified

– Detect changes by selecting records whose date stamp is
later than the last extraction date (sketched below)
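
As a sketch of this idea (table and column names are assumptions, not from the
lecture), the ongoing extraction reduces to a parameterized query that compares
the audit column with the timestamp stored by the previous run:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s_account(acc_id INTEGER, balance REAL, last_modified TEXT);
    INSERT INTO s_account VALUES
        (1, 100.0, '2010-01-05 10:00:00'),
        (2, 250.0, '2010-01-12 08:30:00');
""")

last_extraction = '2010-01-10 00:00:00'   # remembered from the previous ETL run
changed = con.execute(
    "SELECT acc_id, balance FROM s_account WHERE last_modified > ?",
    (last_extraction,)).fetchall()
print(changed)   # [(2, 250.0)] -- only the account modified after the last run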


7.2 Detecting Changes

• Log scraping

– Takes a snapshot of the database redo log at a
certain time (e.g., midnight) and finds the transactions
affecting the tables ETL is interested in

– It can be problematic when the redo log
gets full and is emptied by the DBA

• Log sniffing

– Polling the redo log at a small
time granularity, capturing the
transactions on the fly

– The better choice: suitable also for real-time ETL


7.2 Detecting Changes

• Process of elimination

– Preserves exactly one copy of each previous
extraction

– During the next run, it compares the entire source table
against the extraction copy

– Only differences are sent to the DW

– Advantages

• Because the process makes row by row comparisons, it is
impossible to miss data

• It can also detect deleted rows
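
A tiny sketch of the idea (in-memory dictionaries stand in for the previous
extraction copy and the current source table, keyed by primary key); set
differences yield inserts and deletes, and a key-wise comparison yields updates:

previous = {1: ("BS", 120.0), 2: ("WOB", 75.5), 3: ("HAN", 310.2)}
current  = {1: ("BS", 120.0), 2: ("WOB", 99.0), 4: ("H", 42.0)}

inserted = {k: current[k] for k in current.keys() - previous.keys()}
deleted  = {k: previous[k] for k in previous.keys() - current.keys()}
updated  = {k: current[k] for k in current.keys() & previous.keys()
            if current[k] != previous[k]}

print(inserted)   # {4: ('H', 42.0)}
print(deleted)    # {3: ('HAN', 310.2)}
print(updated)    # {2: ('WOB', 99.0)}

Real implementations often compare per-row checksums instead of full rows to
keep this affordable on large tables.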


7.2 Data Transformation

• Data transformation

– Uses rules, lookup tables, or combinations with
other data, to convert source data to the desired
state

• Two major steps

– Data Cleaning

• May involve manual work
• Assisted by artificial intelligence
algorithms and pattern recognition

– Data Integration

• May also involve manual work


7.2 Data Cleaning

• Extracted data can be dirty. What does clean data
look like?

• Data Quality characteristics:

– Correct: values and descriptions in data represent
their associated objects truthfully

• E.g., if the city in which store A is located is Braunschweig,
then the address should not report Paris

– Unambiguous: the values and descriptions in data
can be taken to have only one meaning


7.2 Data Cleaning

– Consistent: values and descriptions in data use one
constant notational convention

• E.g., Braunschweig may be written as BS, or as Brunswick
by our employees in the USA. Consistency means using just BS
in all our data

– Complete

• Individual values and descriptors in data have a value (not
null)
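
A rough sketch of how such quality rules could be screened mechanically (the
rules and the record layout are illustrative assumptions); each failed check
would later feed the error event table described on the following slides:

# Illustrative consistency and completeness screens for store records
CITY_SYNONYMS = {"Brunswick": "BS", "Braunschweig": "BS"}   # one notation only

def screen_record(record):
    errors = []
    # Completeness: every attribute must have a value
    for field, value in record.items():
        if value in (None, ""):
            errors.append(f"missing value for {field}")
    # Consistency: map known synonyms to the canonical city code
    city = record.get("city")
    if city in CITY_SYNONYMS:
        errors.append(f"inconsistent city notation: {city} -> {CITY_SYNONYMS[city]}")
    return errors

print(screen_record({"store": "A", "city": "Brunswick", "address": ""}))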


7.2 Data Cleaning

• The data cleaning engine produces three main
deliverables:

– Data-profiling results:

• Meta-data repository describing schema definitions, business
objects, domains, data sources, table definitions, data rules,
value rules, etc.

• Represents a quantitative assessment of original data
sources


7.2 Cleaning Deliverables

– Error event table

• Structured as a dimensional star schema
• Each data quality error identified by the cleaning subsystem
is inserted as a row in the error event fact table

Data Quality Screen:
- Status report on data quality
- Gateway which lets only clean data go through


7.2 Cleaning Deliverables

– Audit dimension

• Describes the data-quality context of a fact table record
being loaded into the DW

• Attached to each fact record
• Aggregates the information from the error event table on a
per-record basis

Audit dimension attributes:
– Audit key (PK)
– Completeness category (text)
– Completeness score (integer)
– Number of screens failed
– Max severity score
– Extract timestamp
– Clean timestamp

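As a sketch (attribute names taken from the list above; the scoring logic is an
assumption), an audit dimension row can be derived by aggregating the error
events recorded for one fact record:

# Hypothetical error events produced by the cleaning screens for one fact record
error_events = [
    {"screen": "null_check",  "severity": 2},
    {"screen": "range_check", "severity": 4},
]

audit_row = {
    "audit_key": 1001,                              # surrogate key
    "completeness_category": "partial" if error_events else "complete",
    "completeness_score": max(0, 100 - 10 * len(error_events)),
    "number_screens_failed": len(error_events),
    "max_severity_score": max((e["severity"] for e in error_events), default=0),
}
print(audit_row)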

7.2 Data Cleaning

• Overall process flow

(figure: data transformation deliverables within the overall ETL process flow)

• Is data really dirty?

7.2 Data Quality

• Sometimes data is
just garbage

– We shouldn’t load
garbage into the DW

• Cleaning data manually
takes just…forever!!!


7.2 Data Quality

• Use tools to clean data semi-automatically

– Commercial software

• SAP Business Objects
• IBM InfoSphere Data Stage
• Oracle Data Quality and Oracle Data Profiling

– Open source tools

• E.g., Eobjects DataCleaner, Talend Open Profiler


7.2 Data Quality

• Data cleaning process: use of a thesaurus, regular
expressions, geographical information, etc.


7.2 Data Quality

• Regular expressions for date/time data
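
The original slide shows the patterns as a figure; as a stand-in sketch (the
concrete patterns below are assumptions), a few regular expressions of the kind
used to validate date/time fields during cleaning:

import re

# ISO date (YYYY-MM-DD), German-style date (DD.MM.YYYY) and 24h time (HH:MM:SS)
ISO_DATE    = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")
GERMAN_DATE = re.compile(r"^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.\d{4}$")
TIME_24H    = re.compile(r"^([01]\d|2[0-3]):[0-5]\d:[0-5]\d$")

# Regexes catch format errors only; calendar validity (e.g., Feb 30) needs an
# extra check on top
for value in ["2010-13-01", "2010-02-28", "28.02.2010", "25:00:00"]:
    ok = bool(ISO_DATE.match(value) or GERMAN_DATE.match(value)
              or TIME_24H.match(value))
    print(value, "passes" if ok else "fails")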


7.2 Cleaning Engine

• Core of the data cleaning engine: anomaly
detection phase

– A data anomaly is a piece of
data which doesn’t fit into the
domain of the rest of the data
it is stored with

• E.g., Male named Mary???


7.2 Anomaly Detection

• Anomaly detection

– Count the rows in a table while grouping on the
column in question e.g.,

• SELECT city, count(*) FROM order_detail GROUP BY city

City     Count(*)
Bremen   2
Berlin   3
WOB      4,500
BS       12,000
HAN      46,000

(values with suspiciously low counts, like Bremen and Berlin here, are
candidates for data entry errors)



7.2 Anomaly Detection

• What if our table has 100 million rows with
250,000 distinct values?

– Use data sampling e.g.,

• Divide the whole data into 1000 pieces, and choose 1
record from each

• Add a random number column to the data, sort it and take
the first 1,000 records (see the sketch below)

• Etc.

– A common mistake is to select a range of dates as the sample

• Most anomalies happen only temporarily, so a date-based
sample can easily miss them
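
A minimal sketch of the random-number approach from the list above (pure
Python, with a made-up row list standing in for the large source table):

import random

rows = [{"order_id": i, "city": random.choice(["BS", "WOB", "HAN"])}
        for i in range(100_000)]          # stand-in for the big source table

# Attach a random number to every row, sort on it, keep the first 1,000
keyed = [(random.random(), row) for row in rows]
keyed.sort(key=lambda pair: pair[0])
sample = [row for _, row in keyed[:1_000]]

print(len(sample), sample[0])

In practice, random.sample(rows, 1000) or the DBMS’s own sampling clause
achieves the same in one step.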


7.2 Data Quality

• Data profiling

– Take a closer look at strange values (outlier detection)
– Observe data distribution patterns

• E.g. Gaussian distribution

SELECT AVG(sales_value) - 3 * STDDEV(sales_value),
AVG(sales_value) + 3 * STDDEV(sales_value)
INTO Min_reasonable, Max_reasonable FROM …
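
The same bound computation written out as a runnable sketch (the value list is
made up; it assumes the measure is roughly normally distributed), flagging
everything outside mean ± 3 · stddev as suspicious:

import statistics

# Made-up measure: 50 plausible sales values plus one data-entry error
sales_values = [20.0 + 0.1 * i for i in range(50)] + [950.0]

mean = statistics.mean(sales_values)
std  = statistics.pstdev(sales_values)
min_reasonable, max_reasonable = mean - 3 * std, mean + 3 * std

outliers = [v for v in sales_values if not (min_reasonable <= v <= max_reasonable)]
print(outliers)   # [950.0]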


7.2 Data Quality

• Data distribution

– Flat distribution

• Identifier distributions (keys)

– Zipfian distribution

• Some values appear more often
than others

– In sales, more cheap goods are sold
than expensive ones


7.2 Data Quality

• Data distribution

– Pareto, Poisson, S distribution

• Distribution discovery

– Statistical software: SPSS, StatSoft, R, etc.


7.2 Data Integration

• Data integration

– Several conceptual schemas need to be combined into
a unified global schema

– All differences in perspective and
terminology have to be resolved

– All redundancy has to be removed

(figure: schemas 1-4 being combined into a unified global schema)


7.2 Schema Integration

• There are four basic steps needed for
conceptual schema integration

1. Preintegration analysis
2. Comparison of schemas
3. Conformation of schemas
4. Merging and restructuring of schemas

• The integration process needs
continual refinement and
reevaluation


7.2 Schema Integration

(figure: integration process flow)
schemas with conflicts
→ identify conflicts (preintegration analysis, comparison of schemas)
→ list of conflicts
→ resolve conflicts (conformation of schemas)
→ modified schemas
→ integrate schemas (merging and restructuring)
→ integrated schema


7.2 Schema Integration

• Preintegration analysis needs a close look at
the individual conceptual schemas to decide on
an adequate integration strategy

– The larger the number of constructs, the
more important modularization becomes

– Is it really sensible/possible
to integrate all schemas?


7.2 Schema Integration

Integration strategies:

– Binary approach: sequential integration or balanced integration
– n-ary approach: one-shot integration or iterative integration


7.2 Schema Integration

• Schema conflicts

– Table/Table conflicts

• Table naming e.g., synonyms, homonyms
• Structure conflicts e.g., missing attributes
• Conflicting integrity conditions

– Attribute/Attribute conflicts

• Naming e.g., synonyms, homonyms
• Default value conflicts
• Conflicting integrity conditions, e.g., different data types or
boundary limitations

– Table/Attribute conflicts


7.2 Schema Integration

• The basic goal is to make schemas compatible
for integration

• Conformation usually needs manual interaction

– Conflicts need to be resolved semantically

– Rename entities/attributes

– Convert differing types, e.g., convert an entity to an
attribute or a relationship

– Align cardinalities/functionalities

– Align different datatypes


