DataStage – Overview

Reading Time: 3 minutes

DataStage is an ETL tool and a component of the IBM Information Platforms Solutions suite and IBM InfoSphere, hence the name IBM InfoSphere DataStage. The tool can extract data from dissimilar sources, carry out transformations as per a business’s requirements, and load the data into chosen data warehouses. It is widely used for the development and maintenance of data warehouses and data marts.

It was first launched by VMark in the mid-90s. After IBM acquired DataStage in 2005, it was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage. DataStage has been released in various editions over the years, such as Enterprise Edition (PX), Server Edition, MVS Edition, DataStage for PeopleSoft and so on. The latest edition is IBM InfoSphere DataStage.

DataStage plays a major role in information management within the Business Intelligence (BI) stream. It provides a GUI (Graphical User Interface) driven interface to carry out the Extract, Transform, Load work.

The ETL work is carried out through jobs. A DataStage job can be thought of as an executable unit of work that can be compiled and executed individually or as a component of a streaming data flow.

A job is made up of various stages that are connected via links.

A stage serves a specific purpose: for example, database stages connect to source and target systems, processing stages carry out data transformations, file stages connect to various file systems, and so on.

Links are used to connect the stages in a job and describe the flow of data between them.
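To make the idea of stages and links concrete, here is a minimal sketch in plain Python (not DataStage's actual engine or API) that models a job as three stages connected by links and pushes rows through them:

```python
# Illustrative only: a toy model of "stages connected by links",
# not IBM DataStage's real engine or APIs.

def sequential_file_stage(path):
    """Source stage: read comma-delimited rows from a file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

def transformer_stage(rows):
    """Processing stage: apply a simple business rule to each row."""
    for name, amount in rows:
        yield name.strip().upper(), float(amount) * 1.1  # hypothetical uplift

def db_stage(rows):
    """Target stage: 'load' rows (printed here instead of a real database)."""
    for row in rows:
        print("INSERT", row)

# The links are simply the hand-off of the row stream from one stage to the next.
db_stage(transformer_stage(sequential_file_stage("orders.csv")))  # orders.csv is a made-up input
```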

What is DataStage?

DataStage is an ETL tool which extracts data from sources, transforms it, and loads it into targets. The data sources might include sequential files, indexed files, relational databases, external data sources, archives, enterprise applications, etc. DataStage facilitates business analysis by providing quality data that helps in gaining business intelligence.

DataStage Features 

  • DataStage can extract data from any source and load it into any target.
  • A job developed on one platform can run on any other platform; for example, a job designed for uniprocessor (single-node) execution can also run on an SMP machine.
  • Node configuration is a technique for creating logical CPUs; a node is a logical CPU.
  • Partition parallelism is a technique for distributing data across the nodes based on partitioning techniques (a rough sketch of node configuration and partitioning follows this list).
  • It can be used to build and load data warehouses that operate in batch, in real time, or as a Web service.
  • It can handle complex transformations and manage multiple integration processes.
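As a rough illustration of node configuration and partition parallelism, the sketch below (plain Python, not DataStage's configuration file format or engine) hash-partitions rows across a configurable number of logical nodes so that each node can process its share independently:

```python
# Illustrative sketch: logical "nodes" and hash partitioning,
# approximating the idea behind DataStage's partition parallelism.
from collections import defaultdict

NODE_COUNT = 4  # hypothetical node configuration: four logical CPUs

rows = [("C001", 250.0), ("C002", 75.5), ("C001", 10.0), ("C003", 99.9)]

# Hash partitioning: rows with the same key always land on the same node.
partitions = defaultdict(list)
for key, amount in rows:
    node = hash(key) % NODE_COUNT
    partitions[node].append((key, amount))

# Each node can now process its own partition independently (and in parallel).
for node, part in sorted(partitions.items()):
    total = sum(amount for _, amount in part)
    print(f"node{node}: {len(part)} rows, total={total:.2f}")
```

Real DataStage reads the node layout from a configuration file and offers several partitioning methods (hash, round-robin, range, and others); the snippet only mimics the hash case.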

DataStage ETL work is carried out through jobs. There are mainly three different types of jobs:

  • Server jobs
  • Parallel jobs
  • Sequence jobs

DataStage server jobs run on a single node on the DataStage server engine. Server jobs handle smaller volumes of data and have slower processing capabilities. They contain fewer components and are compiled into the BASIC language.

DataStage parallel jobs run on multiple nodes on the DataStage parallel engine. Parallel jobs can handle huge volumes of data, and their processing speed is high. They contain more components and are compiled into OSH (Orchestrate Shell script), except the Transformer stage, which compiles into C++.

Sequence jobs – For more complex designs, you can build sequence jobs to run multiple jobs in conjunction with other jobs. By using sequence jobs, you can integrate programming controls into your job workflow, such as branching and looping. You specify the control information, such as the different courses of action to take depending on whether a job in the sequence succeeds or fails. After you create a sequence job, you schedule it to run using the InfoSphere DataStage Director client, just like you would with a parallel job or server job.
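To illustrate the kind of control flow a sequence job adds (branching on success or failure, and looping), here is a minimal plain-Python sketch; the job names and the run_job helper are hypothetical stand-ins, not DataStage commands:

```python
import random

def run_job(name):
    """Hypothetical stand-in for triggering a DataStage job; returns True on success."""
    ok = random.random() > 0.2  # simulate an occasional failure
    print(f"running {name}: {'OK' if ok else 'FAILED'}")
    return ok

# Looping: retry the extract job up to three times before giving up.
for attempt in range(3):
    if run_job("extract_orders"):
        break
else:
    raise SystemExit("extract_orders failed after 3 attempts")

# Branching: choose the next course of action based on success or failure.
if run_job("load_warehouse"):
    run_job("notify_success")
else:
    run_job("notify_failure")
```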