7.2 Schema Integration
• Schema integration is a semantic process
– This usually means a lot of manual work
– Computers can support the process by matching
some (parts of) schemas
• There have been some approaches towards
(semi-)automatic matching of schemas
– Matching is a complex process and usually only
focuses on simple constructs like ‘Are two entities
semantically equivalent?’
– The result is still rather error-prone…
DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51
7.2 Schema Integration
• Schema Matching
– Label-based matching
• For each label in one schema, consider all labels of the other
schema and gauge their semantic similarity each time (e.g., Price
vs. Cost)
– Instance-based matching
• E.g., find correlations between attributes: ‘Are there duplicate
tuples?’ or ‘Are the data distributions in their respective
domains similar?’
– Structure-based matching
• Abstracting from the actual labels, only the structure of the
schema is evaluated, e.g., regarding element types, depths in
hierarchies, number and type of relationships, etc.
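The label-based matching idea above can be sketched in a few lines; the threshold value and the use of plain string similarity are illustrative assumptions (a real matcher would also consult synonym lists so that Price can match Cost, which pure string similarity cannot capture):

```python
# Hypothetical sketch of label-based schema matching: compare every
# attribute label of one schema with every label of the other and
# keep the pairs whose string similarity exceeds a threshold.
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Crude syntactic similarity in [0, 1] between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_labels(schema_a, schema_b, threshold=0.6):
    """Return candidate attribute correspondences above the threshold."""
    candidates = []
    for a in schema_a:
        for b in schema_b:
            score = label_similarity(a, b)
            if score >= threshold:
                candidates.append((a, b, round(score, 2)))
    return candidates

print(match_labels(["ID", "Product", "GroupID"],
                   ["ProdID", "Product", "Group", "Categ"]))
```

Note how such a purely syntactic matcher is error-prone, exactly as the slide warns: identical and near-identical labels are found, but semantically equivalent labels with different spellings would be missed.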
7.2 Schema Integration
• If integration is query-driven, only Schema
Mapping is needed
– Mapping from one or more source schemas to a
target schema
[Figure: correspondences between source schema S and target schema T form a high-level mapping; a mapping compiler derives from it the low-level mapping that operates on the actual data.]
7.2 Schema Integration
• Schema Mapping
– Relates the elements of the source schema to the
corresponding elements of the target schema, as in the
following example
Source: Product(ID: Decimal, Product: VARCHAR(50), GroupID: Decimal)
Target: Product(ProdID: Decimal, Product: VARCHAR(50), Group: VARCHAR(50), Categ: VARCHAR(50))
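A low-level mapping for the Product example could be sketched as below; the group-name lookup table and the handling of the extra Categ attribute are illustrative assumptions, since the source schema does not determine them:

```python
# Hypothetical low-level mapping for the Product example: the source
# stores a numeric GroupID, while the target stores the group name
# and an additional category attribute not present in the source.
GROUP_NAMES = {1: "Beverages", 2: "Snacks"}  # assumed lookup table

def map_product(src: dict) -> dict:
    """Transform one source Product row into the target schema."""
    return {
        "ProdID": src["ID"],        # ID -> ProdID (renamed attribute)
        "Product": src["Product"],  # identical label in both schemas
        "Group": GROUP_NAMES.get(src["GroupID"], "unknown"),
        "Categ": None,  # not derivable from the source -> left empty
    }

row = {"ID": 7, "Product": "Cola", "GroupID": 1}
print(map_product(row))
```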
7.2 Schema Integration
• Schema integration in practice
– BEA AquaLogic Data Services www.bea.com
• Special feature: easy-to-use modeling: “Mappings and
transformations can be designed in an easy-to-use GUI
tool using a library of over 200 functions. For complex
mappings and transformations, architects and developers
can bypass the GUI tool and use an XQuery source code
editor to define or edit services.”
7.2 Schema Integration
• What tools are actually provided to support
integration?
– Data Translation Tool
• Transforms binary data into XML
• Transforms XML to binary data
– Data Transformation Tool
• Transforms one XML document into another XML document
– Base Idea
• Transform data to application-specific XML → transform to
XML specific to the other application / a general schema →
transform back to binary
• Note: the integration work still has to be done manually
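A minimal sketch of such a data transformation tool is shown below; real products express the rules in XSLT, while here plain ElementTree and an assumed tag-rename table keep the example short:

```python
# Sketch of an XML-to-XML data transformation: rewrite an
# application-specific XML document into a (hypothetical) general
# schema by renaming elements.  The RENAME table encodes the manual
# integration work that, as noted above, still has to be done.
import xml.etree.ElementTree as ET

RENAME = {"prod": "product", "prodid": "id", "categ": "category"}

def transform(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = RENAME.get(elem.tag, elem.tag)  # rename known tags
    return ET.tostring(root, encoding="unicode")

src = "<prod><prodid>7</prodid><categ>Drinks</categ></prod>"
print(transform(src))
```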
7.2 Schema Integration
• “I can’t afford expensive BEA consultants and the
AquaLogic Integration Suite, what now??”
– Do it completely yourself
• Most of the technologies used can be found as open source
projects (data mappers, XSL engines, XSL editors, etc.)
– Do it yourself with specialized tools
• Many companies and open source projects are specialized in
developing data integration and transformation tools
– CloverETL
– Altova MapForce
– BusinessObjects Data Integrator
– Informatica PowerCenter, etc.
7.2 Schema Integration
• Altova MapForce
– Same idea as the BEA Integrator
• Also based on XSLT and a data description language
– Editors for binary/DB-to-XML mapping
– Editor for XSL transformation
– Automatic generation of data sources, web services, and
transformation modules in Java, C#, C++
7.2 Schema Integration
• Google Refine
– Watch the videos at
http://code.google.com/p/google-refine/
7.2 Loading
• The loading process can be broken down into two
different types:
– Initial load
– Continuous load (loading over time)
7.2 Loading
• Issues
– Huge volumes of data to be loaded
– Small time window when the warehouse can be
taken offline (usually nights)
– When to build indexes and aggregated tables
– Allow system administrators to monitor, cancel,
resume, and change load rates
– Recover gracefully: restart after a failure from
where you were, without loss of data integrity
7.2 Loading
• Initial Load
– Deliver dimension tables
• Create and assign surrogate keys each time a new cleaned and
conformed dimension record has to be loaded
• Write dimensions to disk as physical tables, in the proper
dimensional format
– Deliver fact tables
• Utilize bulk-load utilities
• Load in parallel
– Tools
• DTS – Data Transformation Services (set of tools)
• bcp utility – bulk copy
• SQL*Loader
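The surrogate key step of the dimension delivery can be sketched as below; the function and attribute names are illustrative, not from any particular tool:

```python
# Sketch of surrogate key assignment during the initial dimension
# load: every new cleaned/conformed dimension record receives the
# next surrogate key, while an already-seen natural key reuses the
# surrogate key assigned earlier.
def load_dimension(records, key_attr):
    """Assign surrogate keys; returns rows ready to be written."""
    surrogate_of = {}   # natural key -> surrogate key
    next_key = 1
    table = []
    for rec in records:
        natural_key = rec[key_attr]
        if natural_key not in surrogate_of:
            surrogate_of[natural_key] = next_key
            next_key += 1
        table.append({"sk": surrogate_of[natural_key], **rec})
    return table

rows = load_dimension(
    [{"cust": "C1"}, {"cust": "C2"}, {"cust": "C1"}], key_attr="cust")
print(rows)
```

Facts can then reference the compact surrogate keys instead of the natural keys, which is what makes the parallel bulk load of the fact tables independent of the source systems.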
7.2 Loading
• Continuous load (loading over time)
– Must be scheduled and processed in a specific order
to maintain integrity, completeness, and a satisfactory
level of trust (if done only once a year, the data is
obsolete)
– Should be the most carefully planned step in data
warehousing, or it can lead to:
• Error duplication
• Exaggeration of inconsistencies in data
7.2 Loading
• Continuous load of facts
– Separate updates from inserts
– Drop any indexes not required to support updates
– Load updates
– Drop all remaining indexes
– Load inserts through bulk loaders
– Rebuild indexes
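The step order above can be illustrated with SQLite as a stand-in for the warehouse DBMS (table and index names are made up; a real warehouse would use its native bulk loader instead of `executemany`):

```python
# Sketch of the continuous fact-load order: apply updates first,
# then drop the remaining indexes, bulk-insert the new facts, and
# rebuild the indexes only once all facts are in place.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (prod_sk INTEGER, amount REAL)")
con.execute("CREATE INDEX idx_prod ON sales (prod_sk)")
con.execute("INSERT INTO sales VALUES (1, 10.0)")

# 1) load updates while the index needed to find the rows exists
con.execute("UPDATE sales SET amount = 12.0 WHERE prod_sk = 1")

# 2) drop all remaining indexes so the bulk insert is not slowed
#    down by per-row index maintenance
con.execute("DROP INDEX idx_prod")

# 3) load inserts through the bulk interface
new_facts = [(2, 5.0), (3, 7.5)]
con.executemany("INSERT INTO sales VALUES (?, ?)", new_facts)

# 4) rebuild the indexes in one pass over the finished table
con.execute("CREATE INDEX idx_prod ON sales (prod_sk)")
print(con.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```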
7.3 Metadata
• Metadata – data about data
– In a DW, metadata describes the contents of the data
warehouse and how to use it
• What information exists in a data warehouse, what the
information means, how it was derived, from what source
systems it comes, when it was created, what pre-built
reports and analyses exist for manipulating the information,
etc.
7.3 Metadata
• Types of metadata in DW
– Source system metadata
– Data staging metadata
– DBMS metadata
7.3 Metadata
• Source system metadata
– Source specifications
• E.g., repositories, and source logical schemas
– Source descriptive information
• E.g., ownership descriptions, update frequencies and access
methods
– Process information
• E.g., job schedules and extraction code
7.3 Metadata
• Data staging metadata
– Data acquisition information, such as data
transmission scheduling and results, and file usage
– Dimension table management, such as definitions of
dimensions, and surrogate key assignments
– Transformation and aggregation, such as data
enhancement and mapping, DBMS load scripts, and
aggregate definitions
– Auditing, job logs, and documentation, such as data
lineage records and data transformation logs
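As a purely illustrative example of what such a staging metadata record might look like (all field names and values are assumptions; real tools define their own metadata schemas):

```python
# Hypothetical job-log entry combining audit and lineage metadata
# of the kinds listed above.
job_log_entry = {
    "job": "load_sales_facts",               # job identifier
    "started": "2011-01-15T02:00:00Z",       # scheduling result
    "source_files": ["sales_20110114.csv"],  # data lineage record
    "rows_read": 125000,
    "rows_loaded": 124987,
    "rows_rejected": 13,                     # audit information
    "transform_log": "13 rows rejected: invalid product key",
}
print(job_log_entry["job"])
```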
7.3 Metadata
• DW schema: e.g., Cube description metadata
Summary
• How to build a DW
– The DW Project: usual tasks, hardware, software, timeline
(phases)
– Data Extract/Transform/Load (ETL):
• Data storage structures, extraction strategies (e.g., scraping,
sniffing)
• Transformation: data quality, integration
• Loading: issues and strategies (bulk loading for fact data is a
must)
– Metadata:
• Describes the contents of a DW and comprises all the
intermediate products of ETL
• Helps in understanding how to use the DW
Next lecture
• Real-Time Data Warehouses
– Real-Time Requirements
– Real-Time ETL
– In-Memory Data Warehouses