Learn More

Home
About
Jobs
Links
Contact
Bookmark this Site

Project Topics

Project Management Defined

Agile Implementation

Business Case Creation

Business Intelligence Implementation

Data Collection Implementation

Extreme Programming

Project Estimation

Enterprise Tracking Systems (ETS)

ETL and Data Warehouse

ERP Implementation

Federal Grants Projects

Implementation Issues

Lab Assets Management

Medical Project Management

Offshoring

Ofice System Automation

Open Source Project Management

Oracle Database Recovery

Outsourcing

Portfolio Management Systems

Project Methodologies

Project Management Body of Knowledge

Rapid Application Development (RAD)

Rational Unified Process (RUP)

Risk Management

Reporting Projects

Requirements Elicitation

SAP BW

SAP Implementation

SCRUM

Selection and Evaluation

Six Sigma

User Documentation

Wireless Implementation

Work Breakdown Structures

More -->

Technology News

Tech policy news
 


Coming Soon:
Project Management job postings

>

Sponsored Links

Coming Soon: Certification Options



ETL and Data Warehouse








 

Extract, transform, and load (ETL) is a process in data warehousing that involves

* extracting data from outside sources,
* transforming it to fit business needs, and ultimately
* loading it into the data warehouse.

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.


Extract

The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.


Transform

The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:

* Selecting only certain columns to load (or if you prefer, null columns not to load)
* Translating coded values (e.g. If the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
* Encoding free-form values (e.g. Mapping "Male" and "M" and "Mr" onto 1)
* Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
* Joining together data from multiple sources (e.g. lookup, merge, etc)
* Summarizing multiple rows of data (e.g. total sales for each region)
* Generating surrogate key values
* Transposing or pivotting (turning multiple columns into multiple rows or vice versa)



Load

The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.



ETL and Data Warehouse



Challenges

ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications.

The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within Service Level Agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues for continuous transformation and update.

Recent developments in ETL software has been the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.

There are 3 main types of parallelisms as implemented in ETL applications:

Data: By splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: Allowing the simultaneous running of several components on the same data stream. E.g. performing step 2: lookup a value on record 1 at the same time as step 1: add two fields together is performed on record 2.

Component: The simultaneous running of multiple processes on different data streams in the same job. E.g. doing a sort on input file 1 at the same time that the contents of input file 2 are deduped.

All three types of parallelisms are usually combined in a single job.

An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger establishing synchronization and reconciliation points is necessary.


Tools

While an ETL process can be created using almost any programming language, creating them from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.

A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.


Some ETL tools

* Ab Initio
* Barracuda Integrator
* BusinessObjects Data Integrator
* Cast Iron Systems
* CloverETL OpenSource Project‘s Home Page
* Cognos Decisionstream
* Corporater Transformer
* CoSORT CoSORT ETL tools
* Crossflo Systems DataExchange
* Data Transformation Services (Included in Microsoft SQL Server 2000)
* Embarcadero DT Studio
* Group 1 Software's DataFlow
* Hummingbird GENIO
* ETL4ALL Ikan Software Engine Based ETL Tool
* IBM Websphere DataStage (Previously Ascential DataStage)
* Informatica PowerCenter
* KETTLE (Pentaho ETL)
* MetaSuite Ikan Software Code Generator Based ETL Tool
* Microsoft Biztalk Server
* Netik Interview7
* Oracle Warehouse Builder
* Pervasive Data Integrator
* SAS Data Integration Studio (formerly ETL Studio)
* Solonde ETL (Solonde is now part of Sybase, Inc)
* SQL Server Integration Services (Included in Microsoft SQL Server 2005)
* Sunopsis
* Syncsort's DMExpress
* Talend - Open Data Solutions based on Eclipse
* Transformation Manager
* WisdomForce FastReader



A Data warehouse is an application with a computer database that collects, integrates and stores an organization's data with the aim of producing accurate and timely management of information and support for analysis techniques, such as data mining.



Definition of a Data Warehouse

Today's definition of "what is a data warehouse"? goes beyond the traditional, as the concept has evolved. The previous definition, "a data warehouse is a collection of computerised data that is organised to most optimally support reporting and analysis activity," ought to be expanded.

A data warehouse is a repository of an organization's data, where the informational assets of the organization are stored and managed, to support various activities such as reporting, analysis, decision-making, as well as other activities such as support for optimization of organizational operational processes. An enterprise data warehouse is further described as the "single point of truth", the "corporate memory", the sole historical register of virtually all transactions and important operational events that occur in the life of an organization. This data is ultimately stored and cataloged for immediate and future utilization in various forms, such as deployment into some application. It is through these various uses and deployments that this data becomes information, to be potentially exploited for benefit.

The practice of data warehousing includes the storage of virtually all transactional data, master data (customer, material), and meta data at a very detailed level. Data warehousing as a topic is linked closely with Business Intelligence (BI), but supports a wide range of activities as it essentially becomes the basis for what Bill Inmon has termed the "Information Factory", as in the "Corporate Information Factory", or the "Government Information Factory". As the concepts demonstrate, a data warehouse is actually foundational for the realization of significant overall benefits from the utilization of technology within the context of organizational operations. Data warehousing represents a natural evolution from the traditional concept of information technology, which has been mainly grounded in the automation of operations; data warehousing seeks to optimize data management and further processes data, to support its deployment and exploitation as information.

Bill Inmon's formal systems definition of a data warehouse is a computer database and its supporting components that is:

* Subject-oriented, meaning that the data in the database is organised so that all the data elements relating to the same real-world event or object are linked together;
* Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
* Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting; and,
* Integrated, meaning that the database contains data from most or all of an organisation's operational applications, and that this data is made consistent.



History of data warehousing

Data Warehouses became a distinct type of computer database during the late 1980's and early 1990's. They developed to meet a growing demand for management information and analysis that could not be met by operational systems.ɟ Operational systems were unable to meet this need for a range of reasons:

* The processing load of reporting reduced the response time of the operational systems,
* The database designs of operational systems were not optimised for information analysis and reporting,
* Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system, and
* Development of reports in operational systems often required writing specific computer programs which was slow and expensive

As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheets, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

* Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
* Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure
* Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
* Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.sdsds



Components of a data warehouse

The primary components of the majority of data warehouses are shown in the attached diagram and described in more detail below:


Data Sources

Data sources refers to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (eg IBM DB2, ISAM, Adabas, Teradata, etc.), client-server databases (e.g. Teradata, IBM DB2, Oracle database, Informix, Microsoft SQL Server etc.), PC databases (eg Microsoft Access), spreadsheets (eg Microsoft Excel) and any other electronic store of data. Data needs to be passed from these systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses.


Data Transformation

The Data Transformation layer receives data from the data sources, cleans and standardises it, and loads it into the data repository. This is often called "staging" data as data often passes through a temporary database whilst it is being transformed. This activity of transforming data can be performed either by manually created code or a specific type of software could be used called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation:

* comparing data from different systems to improve data quality (e.g. Date of birth for a customer may be blank in one system but contain valid data in a second system. In this instance, the data warehouse would retain the date of birth field from the second system)
* standardising data and codes (e.g. If one system refers to "Male" and "Female", but a second refers to only "M" and "F", these codes sets would need to be standardised)
* integrating data from different systems (e.g. if one system keeps orders and another stores customers, these data elements need to be linked)
* performing other system housekeeping functions such as determining change (or "delta") files to reduce data load times, generating or finding surrogate keys for data etc.



Data Warehouse

The data warehouse need not to be a relational database, as it must be organised to hold information in a structure that best supports not only query and reporting, but also advanced analysis techniques, like data mining. Most data warehouses hold information for at least 1 year and sometimes can reach half century, depending on the business/operations data retention requirement. As a result these databases can become very large.


Reporting

The data in the data warehouse must be available to the organisation's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:

* Business intelligence tools: These are software applications that simplify the process of development and production of business reports based on data warehouse data.
* Executive information systems (known more widely as Dashboard (business): These are software applications that are used to display complex business metrics and information in a graphical way to allow rapid understanding.
* OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow users to select which dimensions to view data by.
* Data Mining: Data mining tools are software that allow users to perform detailed mathematical and statistical calculations on detailed data warehouse data to detect trends, identify patterns and analyse data.



Metadata

Metadata, or "data about data", is used not only to inform operators and users of the data warehouse about its status and the information held within the data warehouse, but also as a means of integration of incoming data and a tool to update and refine the underlying DW model.

Examples of data warehouse metadata include table and column names, their detailed descriptions, their connection to business meaninful names, the most recent data load date, the business meaning of a data item and the number of users that are logged in currently.


Operations

Data warehouse operations comprises of the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions


Optional Components

In addition, the following components also exist in some data warehouses:

1. Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse. The purpose of a Data Mart is to provide a sub-set of the data warehouse's data for a specific purpose or to a specific sub-group of the organisation.
2. Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but does not physically exist as a separate data copy. This approach to data marts delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and it is always as current with data as the main data warehouse.
3. Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. ODS's are used in some data warehouse architectures to provide near real time reporting capability in the event that the Data Warehouse's loading time or architecture prevents it being able to provide near real time reporting capability.


Different methods of storing data in a data warehouse

All data warehouses store their data grouped together by subject areas that reflect the general usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of data warehouses is that data is stored at its most elemental level for use in reporting and information analysis.

Within this generic intent, there are two primary approaches to organising the data in a data warehouse.

The first is using a "dimensional" approach. In this style, information is stored as "facts" which are numeric or text data that capture specific data about a single transaction or event, and "dimensions" which contain reference information that allows each transaction or event to be classified in various ways. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and sales person. The main advantages of a dimensional approach is that the Data Warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-processed into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

The second approach uses database normalisation. In this style, the data in the data warehouse is stored in third normal form. The main advantage of this approach is that it is quite straightforward to add new information into the database, whilst the primary disadvantage of this approach is that it can be quite slow to produce information and reports.


Advantages of using data warehouse

There are many advantages to using a data warehouse, some of them are:

* Enhances end-user access to a wide variety of data.
* Business decision makers can get the trend reports e.g. The item with the most sales in a particular area/ country for the last two years. This may be helpful for future investments in a particular item.
* Increases data consistency.
* Increases productivity and decreases computing costs.
* Is able to combine data from different sources, in one place.
* It provides an infrastructure that could support changes to data and replication of the changed data back into the operational systems.



Concerns in using data warehouse

* Extracting, cleaning and loading data could be time consuming.
* Data warehousing project scope might increase.
* Problems with compatibility with systems already in place e.g. transaction processing system.
* Providing training to end-users, who end up not using the data warehouse.
* Security could develop into a serious issue, especially if the data warehouse is web accessible.





 
 
 

 


 
Web Projectmanagementacademy.com