ETL and Data Warehouse
Extract, transform, and load (ETL) is a process in data
warehousing that involves
* extracting data from outside sources,
* transforming it to fit business needs, and ultimately
* loading it into the data warehouse.
ETL is important, as it is the way data actually gets loaded into the
warehouse. This article assumes that data is always loaded into a data
warehouse, whereas the term ETL can in fact refer to a process that
loads any database.
Extract
The first part of an ETL process is to extract the data from the source
systems. Most data warehousing projects consolidate data from different
source systems. Each separate system may also use a different data
organization / format. Common data source formats are relational
databases and flat files, but may include non-relational database
structures such as IMS or other data structures such as VSAM or ISAM.
Extraction converts the data into a format for transformation
processing.
Transform
The transform phase applies a series of rules or functions to the
extracted data to derive the data to be loaded. Some data sources will
require very little manipulation of data. However, in other cases any
combination of the following transformations types may be required:
* Selecting only certain columns to load (or if you prefer, null columns
not to load)
* Translating coded values (e.g. If the source system stores M for male
and F for female but the warehouse stores 1 for male and 2 for female)
* Encoding free-form values (e.g. Mapping "Male" and "M" and "Mr" onto
1)
* Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
* Joining together data from multiple sources (e.g. lookup, merge, etc)
* Summarizing multiple rows of data (e.g. total sales for each region)
* Generating surrogate key values
* Transposing or pivotting (turning multiple columns into multiple rows
or vice versa)
Load
The load phase loads the data into the data warehouse. Depending on the
requirements of the organization, this process ranges widely. Some data
warehouses merely overwrite old information with new data. More complex
systems can maintain a history and audit trail of all changes to the
data.
ETL and Data Warehouse
Challenges
ETL processes can be quite complex, and significant operational problems can
occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may be outside
the expectations of designers at the time validation and transformation rules
are specified. Data profiling of a source during data analysis is recommended to
identify the data conditions that will need to be managed by transform rules
specifications.
The scalability of an ETL system across the lifetime of its usage needs to be
established during analysis. This includes understanding the volumes of data
that will have to be processed within Service Level Agreements. The time
available to extract from source systems may change, which may mean the same
amount of data may have to be processed in less time. Some ETL systems have to
scale to process terabytes of data to update data warehouses with tens of
terabytes of data. Increasing volumes of data may require designs that can scale
from daily batch to intra-day micro-batch to integration with message queues for
continuous transformation and update.
Recent developments in ETL software has been the implementation of parallel
processing. This has enabled a number of methods to improve overall performance
of ETL processes when dealing with large volumes of data.
There are 3 main types of parallelisms as implemented in ETL applications:
Data: By splitting a single sequential file into smaller data files to provide
parallel access.
Pipeline: Allowing the simultaneous running of several components on the same
data stream. E.g. performing step 2: lookup a value on record 1 at the same time
as step 1: add two fields together is performed on record 2.
Component: The simultaneous running of multiple processes on different data
streams in the same job. E.g. doing a sort on input file 1 at the same time that
the contents of input file 2 are deduped.
All three types of parallelisms are usually combined in a single job.
An additional difficulty is making sure the data being uploaded is relatively
consistent. Since multiple source databases all have different update cycles
(some may be updated every few minutes, while others may take days or weeks), an
ETL system may be required to hold back certain data until all sources are
synchronized. Likewise, where a warehouse may have to be reconciled to the
contents in a source system or with the general ledger establishing
synchronization and reconciliation points is necessary.
Tools
While an ETL process can be created using almost any programming language,
creating them from scratch is quite complex. Increasingly, companies are buying
ETL tools to help in the creation of ETL processes.
A good ETL tool must be able to communicate with the many different relational
databases and read the various file formats used throughout an organization. ETL
tools have started to migrate into Enterprise Application Integration, or even
Enterprise Service Bus, systems that now cover much more than just the
extraction, transformation and loading of data. Many ETL vendors now have data
profiling, data quality and metadata capabilities.
Some ETL tools
* Ab Initio
* Barracuda Integrator
* BusinessObjects Data Integrator
* Cast Iron Systems
* CloverETL OpenSource Project‘s Home Page
* Cognos Decisionstream
* Corporater Transformer
* CoSORT CoSORT ETL tools
* Crossflo Systems DataExchange
* Data Transformation Services (Included in Microsoft SQL Server 2000)
* Embarcadero DT Studio
* Group 1 Software's DataFlow
* Hummingbird GENIO
* ETL4ALL Ikan Software Engine Based ETL Tool
* IBM Websphere DataStage (Previously Ascential DataStage)
* Informatica PowerCenter
* KETTLE (Pentaho ETL)
* MetaSuite Ikan Software Code Generator Based ETL Tool
* Microsoft Biztalk Server
* Netik Interview7
* Oracle Warehouse Builder
* Pervasive Data Integrator
* SAS Data Integration Studio (formerly ETL Studio)
* Solonde ETL (Solonde is now part of Sybase, Inc)
* SQL Server Integration Services (Included in Microsoft SQL Server 2005)
* Sunopsis
* Syncsort's DMExpress
* Talend - Open Data Solutions based on Eclipse
* Transformation Manager
* WisdomForce FastReader
A Data warehouse is an application with a computer database that collects,
integrates and stores an organization's data with the aim of producing accurate
and timely management of information and support for analysis techniques, such
as data mining.
Definition of a Data Warehouse
Today's definition of "what is a data warehouse"? goes beyond the traditional,
as the concept has evolved. The previous definition, "a data warehouse is a
collection of computerised data that is organised to most optimally support
reporting and analysis activity," ought to be expanded.
A data warehouse is a repository of an organization's data, where the
informational assets of the organization are stored and managed, to support
various activities such as reporting, analysis, decision-making, as well as
other activities such as support for optimization of organizational operational
processes. An enterprise data warehouse is further described as the "single
point of truth", the "corporate memory", the sole historical register of
virtually all transactions and important operational events that occur in the
life of an organization. This data is ultimately stored and cataloged for
immediate and future utilization in various forms, such as deployment into some
application. It is through these various uses and deployments that this data
becomes information, to be potentially exploited for benefit.
The practice of data warehousing includes the storage of virtually all
transactional data, master data (customer, material), and meta data at a very
detailed level. Data warehousing as a topic is linked closely with Business
Intelligence (BI), but supports a wide range of activities as it essentially
becomes the basis for what Bill Inmon has termed the "Information Factory", as
in the "Corporate Information Factory", or the "Government Information Factory".
As the concepts demonstrate, a data warehouse is actually foundational for the
realization of significant overall benefits from the utilization of technology
within the context of organizational operations. Data warehousing represents a
natural evolution from the traditional concept of information technology, which
has been mainly grounded in the automation of operations; data warehousing seeks
to optimize data management and further processes data, to support its
deployment and exploitation as information.
Bill Inmon's formal systems definition of a data warehouse is a computer
database and its supporting components that is:
* Subject-oriented, meaning that the data in the database is organised so that
all the data elements relating to the same real-world event or object are linked
together;
* Time-variant, meaning that the changes to the data in the database are tracked
and recorded so that reports can be produced showing changes over time;
* Non-volatile, meaning that data in the database is never over-written or
deleted, but retained for future reporting; and,
* Integrated, meaning that the database contains data from most or all of an
organisation's operational applications, and that this data is made consistent.
History of data warehousing
Data Warehouses became a distinct type of computer database during the late
1980's and early 1990's. They developed to meet a growing demand for management
information and analysis that could not be met by operational systems.ɟ
Operational systems were unable to meet this need for a range of reasons:
* The processing load of reporting reduced the response time of the operational
systems,
* The database designs of operational systems were not optimised for information
analysis and reporting,
* Most organizations had more than one operational system, so company-wide
reporting could not be supported from a single system, and
* Development of reports in operational systems often required writing specific
computer programs which was slow and expensive
As a result, separate computer databases began to be built that were
specifically designed to support management information and analysis purposes.
These data warehouses were able to bring in data from a range of different data
sources, such as mainframe computers, minicomputers, as well as personal
computers and office automation software such as spreadsheets, and integrate
this information in a single place. This capability, coupled with user-friendly
reporting tools and freedom from operational impacts, has led to a growth of
this type of computer system.
As technology improved (lower cost for more performance) and user requirements
increased (faster data load cycle times and more features), data warehouses have
evolved through several fundamental stages:
* Offline Operational Databases - Data warehouses in this initial stage are
developed by simply copying the database of an operational system to an off-line
server where the processing load of reporting does not impact on the operational
system's performance.
* Offline Data Warehouse - Data warehouses in this stage of evolution are
updated on a regular time cycle (usually daily, weekly or monthly) from the
operational systems and the data is stored in an integrated reporting-oriented
data structure
* Real Time Data Warehouse - Data warehouses at this stage are updated on a
transaction or event basis, every time an operational system performs a
transaction (e.g. an order or a delivery or a booking etc.)
* Integrated Data Warehouse - Data warehouses at this stage are used to generate
activity or transactions that are passed back into the operational systems for
use in the daily activity of the organization.sdsds
Components of a data warehouse
The primary components of the majority of data warehouses are shown in the
attached diagram and described in more detail below:
Data Sources
Data sources refers to any electronic repository of information that contains
data of interest for management use or analytics. This definition covers
mainframe databases (eg IBM DB2, ISAM, Adabas, Teradata, etc.), client-server
databases (e.g. Teradata, IBM DB2, Oracle database, Informix, Microsoft SQL
Server etc.), PC databases (eg Microsoft Access), spreadsheets (eg Microsoft
Excel) and any other electronic store of data. Data needs to be passed from
these systems to the data warehouse either on a transaction-by-transaction basis
for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for
offline data warehouses.
Data Transformation
The Data Transformation layer receives data from the data sources, cleans and
standardises it, and loads it into the data repository. This is often called
"staging" data as data often passes through a temporary database whilst it is
being transformed. This activity of transforming data can be performed either by
manually created code or a specific type of software could be used called an ETL
tool. Regardless of the nature of the software used, the following types of
activities occur during data transformation:
* comparing data from different systems to improve data quality (e.g. Date of
birth for a customer may be blank in one system but contain valid data in a
second system. In this instance, the data warehouse would retain the date of
birth field from the second system)
* standardising data and codes (e.g. If one system refers to "Male" and
"Female", but a second refers to only "M" and "F", these codes sets would need
to be standardised)
* integrating data from different systems (e.g. if one system keeps orders and
another stores customers, these data elements need to be linked)
* performing other system housekeeping functions such as determining change (or
"delta") files to reduce data load times, generating or finding surrogate keys
for data etc.
Data Warehouse
The data warehouse need not to be a relational database, as it must be organised
to hold information in a structure that best supports not only query and
reporting, but also advanced analysis techniques, like data mining. Most data
warehouses hold information for at least 1 year and sometimes can reach half
century, depending on the business/operations data retention requirement. As a
result these databases can become very large.
Reporting
The data in the data warehouse must be available to the organisation's staff if
the data warehouse is to be useful. There are a very large number of software
applications that perform this function, or reporting can be custom-developed.
Examples of types of reporting tools include:
* Business intelligence tools: These are software applications that simplify the
process of development and production of business reports based on data
warehouse data.
* Executive information systems (known more widely as Dashboard (business):
These are software applications that are used to display complex business
metrics and information in a graphical way to allow rapid understanding.
* OLAP Tools: OLAP tools form data into logical multi-dimensional structures and
allow users to select which dimensions to view data by.
* Data Mining: Data mining tools are software that allow users to perform
detailed mathematical and statistical calculations on detailed data warehouse
data to detect trends, identify patterns and analyse data.
Metadata
Metadata, or "data about data", is used not only to inform operators and users
of the data warehouse about its status and the information held within the data
warehouse, but also as a means of integration of incoming data and a tool to
update and refine the underlying DW model.
Examples of data warehouse metadata include table and column names, their
detailed descriptions, their connection to business meaninful names, the most
recent data load date, the business meaning of a data item and the number of
users that are logged in currently.
Operations
Data warehouse operations comprises of the processes of loading, manipulating
and extracting data from the data warehouse. Operations also cover user
management, security, capacity management and related functions
Optional Components
In addition, the following components also exist in some data warehouses:
1. Dependent Data Marts: A dependent data mart is a physical database (either on
the same hardware as the data warehouse or on a separate hardware platform) that
receives all its information from the data warehouse. The purpose of a Data Mart
is to provide a sub-set of the data warehouse's data for a specific purpose or
to a specific sub-group of the organisation.
2. Logical Data Marts: A logical data mart is a filtered view of the main data
warehouse but does not physically exist as a separate data copy. This approach
to data marts delivers the same benefits but has the additional advantages of
not requiring additional (costly) disk space and it is always as current with
data as the main data warehouse.
3. Operational Data Store: An ODS is an integrated database of operational data.
Its sources include legacy systems and it contains current or near term data. An
ODS may contain 30 to 60 days of information, while a data warehouse typically
contains years of data. ODS's are used in some data warehouse architectures to
provide near real time reporting capability in the event that the Data
Warehouse's loading time or architecture prevents it being able to provide near
real time reporting capability.
Different methods of storing data in a data warehouse
All data warehouses store their data grouped together by subject areas that
reflect the general usage of the data (Customer, Product, Finance etc.). The
general principle used in the majority of data warehouses is that data is stored
at its most elemental level for use in reporting and information analysis.
Within this generic intent, there are two primary approaches to organising the
data in a data warehouse.
The first is using a "dimensional" approach. In this style, information is
stored as "facts" which are numeric or text data that capture specific data
about a single transaction or event, and "dimensions" which contain reference
information that allows each transaction or event to be classified in various
ways. As an example, a sales transaction would be broken up into facts such as
the number of products ordered, and the price paid, and dimensions such as date,
customer, product, geographical location and sales person. The main advantages
of a dimensional approach is that the Data Warehouse is easy for business staff
with limited information technology experience to understand and use. Also,
because the data is pre-processed into the dimensional form, the Data Warehouse
tends to operate very quickly. The main disadvantage of the dimensional approach
is that it is quite difficult to add or change later if the company changes the
way in which it does business.
The second approach uses database normalisation. In this style, the data in the
data warehouse is stored in third normal form. The main advantage of this
approach is that it is quite straightforward to add new information into the
database, whilst the primary disadvantage of this approach is that it can be
quite slow to produce information and reports.
Advantages of using data warehouse
There are many advantages to using a data warehouse, some of them are:
* Enhances end-user access to a wide variety of data.
* Business decision makers can get the trend reports e.g. The item with the most
sales in a particular area/ country for the last two years. This may be helpful
for future investments in a particular item.
* Increases data consistency.
* Increases productivity and decreases computing costs.
* Is able to combine data from different sources, in one place.
* It provides an infrastructure that could support changes to data and
replication of the changed data back into the operational systems.
Concerns in using data warehouse
* Extracting, cleaning and loading data could be time consuming.
* Data warehousing project scope might increase.
* Problems with compatibility with systems already in place e.g. transaction
processing system.
* Providing training to end-users, who end up not using the data warehouse.
* Security could develop into a serious issue, especially if the data warehouse
is web accessible.
.