Published by soedito, 2017-10-02 20:13:34

C7_ETL_71

7.2 Schema Integration

• Schema integration is a semantic process

– This usually means a lot of manual work

– Computers can support the process by matching
some (parts of) schemas

• There have been some approaches towards
(semi-)automatic matching of schemas

– Matching is a complex process and usually only
focuses on simple constructs like ‘Are two entities
semantically equivalent?’

– The result is still rather error-prone…

DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

7.2 Schema Integration

• Schema Matching

– Label-based matching

• For each label in one schema, consider all labels of the other
schema and gauge their semantic similarity (e.g., Price vs. Cost)

– Instance-based matching

• E.g., find correlations between attributes: ‘Are there duplicate
tuples?’ or ‘Are the data distributions in their respective
domains similar?’

– Structure-based matching

• Abstracting from the actual labels, only the structure of the
schema is evaluated, e.g., regarding element types, depths in
hierarchies, number and type of relationships, etc.
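The label-based approach can be illustrated with a minimal sketch: plain string similarity plus a tiny hand-made synonym table (all labels, the synonym pairs, and the 0.5 threshold are illustrative assumptions, not part of the lecture). Note that purely lexical matching would miss semantically equivalent labels such as Price vs. Cost, which is exactly why the result stays error-prone.

```python
from difflib import SequenceMatcher

# Tiny hand-made synonym table; a real matcher would use a thesaurus
# or ontology. Purely lexical similarity scores Price vs. Cost as ~0.
SYNONYMS = {("price", "cost"), ("cost", "price")}

def label_similarity(a: str, b: str) -> float:
    """Crude semantic similarity between two attribute labels."""
    a, b = a.lower(), b.lower()
    if (a, b) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

def match_labels(schema_a, schema_b, threshold=0.5):
    """Return candidate correspondences (label_a, label_b, score)."""
    matches = []
    for a in schema_a:
        for b in schema_b:
            score = label_similarity(a, b)
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches

print(match_labels(["Price", "ProdID"], ["Cost", "ID"]))
```

The output is only a list of *candidate* correspondences; in line with the slides, a human still has to confirm each one.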


7.2 Schema Integration

• If integration is query-driven, only Schema
Mapping is needed

– Mapping from one or more source schemas to a
target schema

[Figure: correspondences between a source schema S and a target
schema T form a high-level mapping; a mapping compiler turns it into
a low-level mapping that is executed on the data.]

7.2 Schema Integration

• Schema Mapping

– Abstracting from the actual labels, regarding element
types, depths in hierarchies, number and type of
relationships, etc.

Source: Product (ID: Decimal, Product: VARCHAR(50), GroupID: Decimal)
Target: Product (ProdID: Decimal, Product: VARCHAR(50),
        Group: VARCHAR(50), Categ: VARCHAR(50))
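A low-level mapping between the two Product schemas might look like the following sketch. The rename ID → ProdID comes from the example; the lookup resolving a source GroupID into the target's textual Group and Categ values is a hypothetical assumption (in practice it would come from a reference table in the source system).

```python
# Hypothetical lookup: GroupID -> (group name, category name).
GROUP_LOOKUP = {
    7: ("Laptops", "Electronics"),
}

def map_product(src: dict) -> dict:
    """Map one source Product tuple onto the target schema."""
    group, categ = GROUP_LOOKUP.get(src["GroupID"], (None, None))
    return {
        "ProdID": src["ID"],        # ID -> ProdID (simple rename)
        "Product": src["Product"],  # identical attribute
        "Group": group,             # GroupID resolved to a name
        "Categ": categ,             # derived via the same lookup
    }

print(map_product({"ID": 1, "Product": "ThinkPad", "GroupID": 7}))
```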


7.2 Schema Integration

• Schema integration in practice

– BEA AquaLogic Data Services www.bea.com

• Special feature, easy-to-use modeling: “Mappings and
transformations can be designed in an easy-to-use GUI tool
using a library of over 200 functions. For complex mappings
and transformations, architects and developers can bypass the
GUI tool and use an Xquery source code editor to define or
edit services.”


7.2 Schema Integration

• What tools are actually provided to support
integration?

– Data Translation Tool

• Transforms binary data into XML
• Transforms XML to binary data

– Data Transformation Tool

• Transforms one XML document into another

– Base Idea

• Transform data to application-specific XML → transform to the
XML of the other application / a general schema →
transform back to binary

• Note: the integration work still has to be done manually
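The XML-to-XML transformation step can be sketched with the standard library; real tools use XSLT, which needs an external engine (e.g. lxml), so this is a hand-written transform, and all element names (`order`, `item`, `purchase`, `product`) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Application-specific source XML (illustrative).
SOURCE_XML = '<order><item id="1"><name>Pen</name></item></order>'

def transform(source_xml: str) -> str:
    """Transform application-specific XML into a hypothetical
    general schema: <order>/<item id> becomes <purchase>/<product sku>."""
    src = ET.fromstring(source_xml)
    out = ET.Element("purchase")
    for item in src.findall("item"):
        prod = ET.SubElement(out, "product", sku=item.get("id"))
        prod.text = item.findtext("name")
    return ET.tostring(out, encoding="unicode")

print(transform(SOURCE_XML))
# <purchase><product sku="1">Pen</product></purchase>
```

The integration decision itself — that `item/@id` corresponds to `product/@sku` — is still made manually, exactly as the note above says.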


7.2 Schema Integration

• “I can’t afford expensive BEA consultants and the
AquaLogic Integration Suite, what now?”

– Do it completely yourself

• The most commonly used technologies are available as open-source
projects (data mappers, XSL engines, XSL editors, etc.)

– Do it yourself with specialized tools

• Many companies and open-source projects specialize in
developing data integration and transformation tools

– CloverETL
– Altova MapForce
– BusinessObjects Data Integrator
– Informatica PowerCenter, etc.


7.2 Schema Integration

• Altova MapForce

– Same idea as the BEA Integrator

• Also based on XSLT and a data description language

– Editors for binary/DB-to-XML mapping

– Editor for XSL transformations

– Automatic generation of data sources, web services, and
transformation modules in Java, C#, C++


7.2 Schema Integration

• Google Refine

– Watch the videos at
http://code.google.com/p/google-refine/


7.2 Loading

• The loading process can be broken down into two
types:

– Initial load
– Continuous load (loading over time)


7.2 Loading

• Issues

– Huge volumes of data to be loaded
– Small time window in which the warehouse can be taken
offline (usually nights)
– When to build indexes and aggregated tables
– Allowing system administrators to monitor, cancel, resume,
and change load rates
– Recovering gracefully: restarting after a failure from where
you left off, without loss of data integrity


7.2 Loading

• Initial Load

– Deliver dimensions tables

• Create and assign a surrogate key each time a new cleaned and
conformed dimension record is loaded

• Write dimensions to disk as physical tables, in the proper
dimensional format

– Deliver fact tables

• Utilize bulk-load utilities
• Load in parallel

– Tools

• DTS – Data Transformation Services (set of tools)
• bcp utility – bulk copy program
• SQL*Loader
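The surrogate-key step can be sketched as follows, with in-memory Python dicts standing in for the dimension table and key map; a real initial load would use the DBMS and the bulk utilities listed above, and all record fields are illustrative.

```python
# Sketch of surrogate key assignment during the initial dimension
# load: each previously unseen natural key gets the next integer key.
def load_dimension(records, key_attr):
    surrogates = {}   # natural key -> surrogate key
    table = []        # dimension table in proper dimensional format
    for rec in records:
        nk = rec[key_attr]
        if nk not in surrogates:
            surrogates[nk] = len(surrogates) + 1
        table.append({"sk": surrogates[nk], **rec})
    return table, surrogates

rows = [{"cust": "A"}, {"cust": "B"}, {"cust": "A"}]
table, keys = load_dimension(rows, "cust")
print(keys)   # {'A': 1, 'B': 2}
```

Keeping the natural-key-to-surrogate map around matters: the fact load later needs it to replace source keys with surrogate keys.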


7.2 Loading

• Continuous load (loading over time)

– Must be scheduled and processed in a specific order
to maintain integrity, completeness, and a satisfactory
level of trust (if done once a year… the data is
obsolete)

– Should be the most carefully planned step in data
warehousing; otherwise it can lead to:

• Error duplication
• Exaggeration of inconsistencies in data


7.2 Loading

• Continuous load of facts

– Separate updates from inserts
– Drop any indexes not required to support updates
– Load updates
– Drop all remaining indexes
– Load inserts through bulk loaders
– Rebuild indexes
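The sequence above can be sketched against SQLite as a stand-in for the warehouse DBMS; a real warehouse would use its dedicated bulk loader for the insert step, and the table and index names here are illustrative.

```python
import sqlite3

# Toy fact table with an existing index and some loaded facts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, amount REAL)")
con.execute("CREATE INDEX idx_amount ON fact (amount)")
con.executemany("INSERT INTO fact VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

updates = [(15.0, 1)]               # separated updates...
inserts = [(3, 30.0), (4, 40.0)]    # ...from inserts

# 1) Load the updates (indexes not needed for them would be dropped first)
con.executemany("UPDATE fact SET amount = ? WHERE id = ?", updates)
# 2) Drop the remaining indexes so they do not slow the bulk insert
con.execute("DROP INDEX idx_amount")
# 3) Load the inserts (bulk loader in a real system)
con.executemany("INSERT INTO fact VALUES (?, ?)", inserts)
# 4) Rebuild the indexes once
con.execute("CREATE INDEX idx_amount ON fact (amount)")

print(con.execute("SELECT COUNT(*), SUM(amount) FROM fact").fetchone())
# (4, 105.0)
```

Rebuilding an index once over the full table is far cheaper than maintaining it row by row during a large insert, which is the point of steps 2-4.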


7.3 Metadata

• Metadata - data about data

– In a DW, metadata describes the contents of the data
warehouse and how to use it

• What information exists in a data warehouse, what the
information means, how it was derived, from what source
systems it comes, when it was created, what pre-built
reports and analyses exist for manipulating the information,
etc.


7.3 Metadata

• Types of metadata in DW

– Source system metadata
– Data staging metadata
– DBMS metadata


7.3 Metadata

• Source system metadata

– Source specifications

• E.g., repositories and source logical schemas

– Source descriptive information

• E.g., ownership descriptions, update frequencies and access
methods

– Process information

• E.g., job schedules and extraction code


7.3 Metadata

• Data staging metadata

– Data acquisition information, such as data
transmission scheduling and results, and file usage

– Dimension table management, such as definitions of
dimensions, and surrogate key assignments

– Transformation and aggregation, such as data
enhancement and mapping, DBMS load scripts, and
aggregate definitions

– Audit, job logs, and documentation, such as data
lineage records and data transformation logs
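The audit and lineage category can be illustrated by a per-job log record; every field name in this sketch is an illustrative assumption, not a standard.

```python
from datetime import datetime, timezone

# One staging-metadata (audit) record per load job: where the data
# came from, how much arrived, and how much survived cleaning.
def audit_record(job, source, rows_in, rows_out):
    return {
        "job": job,
        "source": source,                     # data lineage: origin
        "rows_read": rows_in,
        "rows_loaded": rows_out,
        "rows_rejected": rows_in - rows_out,  # cleaning losses
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }

rec = audit_record("load_sales", "erp.orders", 1000, 990)
print(rec["rows_rejected"])   # 10
```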


7.3 Metadata

• DW schema: e.g., Cube description metadata


Summary

• How to build a DW

– The DW Project: usual tasks, hardware, software, timeline
(phases)

– Data Extract/Transform/Load (ETL):

• Data storage structures, extraction strategies (e.g., scraping,
sniffing)
• Transformation: data quality, integration
• Loading: issues and strategies (bulk loading for fact data is a
must)

– Metadata:

• Describes the contents of a DW and comprises all the
intermediate products of ETL
• Helps in understanding how to use the DW


Next lecture

• Real-Time Data Warehouses

– Real-Time Requirements
– Real-Time ETL
– In-Memory Data Warehouses


