DataStage Front-End Applications


The following are the various DataStage client applications:

DataStage Designer            –         Used to design jobs and job sequences

DataStage Director              –         Used to monitor jobs at run time

DataStage Manager            –         Used to import and export jobs

DATASTAGE DESIGNER

  • It is the place to design jobs and job sequences.
  • Create and use parameters within jobs.
  • Save, compile, and run jobs.

DataStage Designer is used to:

  • Create DataStage jobs that are compiled into executable programs.
  • Design jobs that extract, integrate, aggregate, load, and transform data.
  • Create and reuse metadata and job components.
  • Use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating, and loading data.

Use Designer to:

  • Specify how data is extracted.
  • Specify data transformations.
  • Decode data going into the target tables using reference lookups.
  • Aggregate data.
  • Split data into multiple outputs on the basis of defined constraints.
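
As a hedged illustration of the last bullet (the link and column names LNK_IN and Amount are made up, not from the original post), splitting data into multiple outputs is normally done by putting a constraint expression on each output link of a Transformer stage:

    LNK_IN.Amount >= 1000
    LNK_IN.Amount < 1000

Each expression is attached to one output link, and a row flows down every link whose constraint evaluates to true.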

The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links. Then, still working in the Designer, you define the required actions and processes for each stage and link.

A job created with the Designer is easily scalable: you can create a simple job, get it working, then insert further processing, additional data sources, and so on.

DataStage Expressions

You can enter expressions at various places in a job sequence to set values. An expression can contain any of the following:

  • Literal strings enclosed in double quotes or single quotes.
  • Numeric constants (integer and floating point).
  • The sequence’s own job parameters.
  • Prior activity variables (for example, job exit status).
  • All built-in BASIC functions as available in a server job.
  • Certain macros and constants as available in a server or parallel job: DSHostName, DSJobController, DSJobInvocationId, DSJobName, DSJobStartDate, DSJobStartTime, DSJobStartTimestamp, DSJobWaveNo, DSProjectName.
  • DS constants as available in server jobs.
  • Arithmetic operators: + - * / ** ^
  • Relational operators: > < = # <> >= =< and so on.
  • Logical operators (AND OR NOT) plus the usual bracketing conventions.
  • The ternary IF … THEN … ELSE operator.
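
As a hedged illustration (the activity name Job_Extract and the date value are hypothetical, not from the original post), an expression combining a prior activity variable, a DS constant, a macro, and the ternary operator might look like this:

    IF Job_Extract.$JobStatus = DSJS.RUNOK THEN DSJobStartDate ELSE "1900-01-01"

Such an expression could, for example, set a date parameter on a downstream job depending on whether the extract job finished cleanly.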

What is IBM WebSphere DataStage?


DataStage is a product from IBM, used as the strategic ETL tool within many organizations.

It can be used for multiple purposes:

  • Interfacing between multiple databases.
  • Converting data from one format to another, for example from database tables to flat files or XML files.
  • Providing fast access to data that does not change often.
  • Interacting with WebSphere MQ to provide real-time processing capabilities triggered by external messages.

DataStage is a client-server technology; it is a GUI-based ETL tool from IBM Corporation.

DataStage lets you:

  • Design jobs for extraction, transformation, and loading (ETL); it is an ideal tool for data integration projects such as data warehouses, data marts, and system migrations.
  • Import, export, create, and manage metadata for use within jobs.
  • Schedule, run, and monitor jobs, all within DataStage.
  • Administer your DataStage development and execution environments.
  • Create batch (controlling) jobs.

Usage of DataStage within organizations

  • DataStage has Windows clients that connect to the server on a Unix, Windows, or mainframe platform.
  • The clients can be used to develop, deploy, and run DataStage jobs.
  • In a deployment environment, jobs can be kicked off through scripts directly on Unix servers, for example with the dsjob command-line interface shown below.
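
For example, a job might be started from a Unix shell script with the dsjob command line; the project, job, and parameter names below are hypothetical:

    dsjob -run -jobstatus -param LoadDate=2020-01-01 MyProject MyJob

The -jobstatus option makes dsjob wait for the job to finish and return an exit code based on the job status, which the calling script can test.
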
DataStage Architecture

DataStage Triggers

There are three types of trigger:

  • Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types:
      OK. Activity succeeds.
      Failed. Activity fails.
      Warnings. Activity produced warnings.
      ReturnValue. A routine or command has returned a value.
      Custom. Allows you to define a custom expression.
      User status. Allows you to define a custom status message to write to the log.
  • Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
  • Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.
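
As a hedged illustration (the activity name Job_Load is hypothetical, not from the original text), a Custom trigger fires when its expression evaluates to true, for example:

    Job_Load.$JobStatus = DSJS.RUNOK OR Job_Load.$JobStatus = DSJS.RUNWARN

This would fire the target activity when the Job_Load activity finishes either cleanly or with warnings.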

DataStage Activity Stages


The job sequence supports the following types of activity:

Job. Specifies a server or parallel job.

Routine. Specifies a routine. This routine can be any routine in the InfoSphere® DataStage® Repository (but not transforms).

ExecCommand. Specifies an operating system command to execute.

Email Notification. Specifies that an email notification is sent at this point of the sequence (uses SMTP).

Wait-for-file. Waits for a specified file to appear or disappear.

Exception Handler. There can only be one Exception Handler in a job sequence. The Exception Handler is executed if a job in the sequence fails to run (other exceptions are handled by triggers) or if a job aborts and the Automatically handle activities that fail option is set for the sequence. Use a Terminator stage to ensure that the sequence is stopped cleanly if certain situations arise.

Nested Conditions. Allows you to branch the execution of a sequence depending on a condition.

Sequencer. Allows you to synchronize the control flow of multiple activities in a job sequence.

Terminator. Allows you to specify that, if certain situations occur, the jobs being run by a sequence are shut down cleanly.

Start Loop and End Loop. Use these two stages together to implement a For…Next or a For…Each loop within your sequence.

User Variable. Allows you to define variables within a sequence. These variables can then be used later on in the sequence, for example to set job parameters.
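
As a hedged sketch (the activity names UV and Loop1 are hypothetical), values produced by these stages are referenced in later expressions by qualifying them with the activity name, for example:

    UV.RunMode
    Loop1.$Counter

The first could be passed into a job parameter of a downstream Job activity, and the second gives the current iteration of the loop opened by the Start Loop activity named Loop1.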

Data Modelling with Erwin


Definition:

Data modeling is a process used to define and analyze data requirements needed to support the business processes. The process of data modeling involves professional data modelers working with business stakeholders, as well as potential users of the information system.

Types:

Conceptual data model:

It is a set of technology-independent specifications about the data and is used to discuss initial requirements with the business stakeholders.

Logical data model:

The conceptual data model is translated into a logical data model, which documents structures of the data that can be implemented in databases. Implementation of one conceptual model may require multiple logical data models.

Physical data model:

This is the last step in data modeling, where the logical data model is transformed into a physical data model that organizes the data into tables and accounts for access, performance, and storage details.

Data modeling not only defines data elements, but also their structures and relationships between them.

Usage:

  • To assist business analysts, programmers, testers, manual writers, IT package selectors, engineers, managers, related organizations, and clients to understand and use an agreed semi-formal model of the concepts of the organization and how they relate to each other.
  • To manage data as a resource
  • For the integration of information systems
  • For designing databases/data warehouses (aka data repositories)

Erwin Data Modeler

CA ERwin Data Modeler is a software tool for data modeling (data requirements analysis, database design, etc.) of custom-developed information systems, including databases of transactional systems and data marts.

Erwin’s data modeling engine is based upon the IDEF1X method, although it now supports diagrams displayed with information engineering notation as well.

Features:

Conceptual data modeling

Logical data modeling

Physical data modeling

Logical to physical transformation

Includes an abbreviation/naming dictionary called the “Naming standards editor” and a logical-to-RDBMS data type mapping facility called the “Data type standards editor”, both of which are customizable with entries and basic rule enforcement.

Forward engineering

Once the database designer is satisfied with the physical data model, the tool can automatically generate a SQL data definition language (DDL) script that can be directly executed in an RDBMS environment or saved to a file.

Reverse engineering

If an analyst needs to examine and understand an existing data structure, ERwin will depict the physical database objects in an ERwin model file.

Model to model comparison

The “Complete Compare” facility allows an analyst or a designer to view the differences between two model files (including real-time reverse-engineered files), for instance to understand the changes between two versions of a model.

An UNDO feature is available from version 7.

Bulk editor: CA ERwin Data Modeler has a unique bulk editor feature that enables users to edit a collection of objects in a single editing session.

Active Model Templates: Derive new models from template models. Changes in a template model can be automatically synchronized into all derived models.

API (application programming interface) support: ERwin provides a programming interface to automate tasks that involve reading and writing data models.

Domains (column names): ERwin has a concept of domains that define the characteristics of attributes/columns in a model. A domain acts as a library of attributes/columns so they can be reused across entities/tables and centrally managed through a flexible property inheritance/override mechanism.

Model repository and version management: ERwin has a very powerful integrated model repository for change management (Workgroup Edition) that supports metadata governance, versioning, multi-user modeling, etc.

A trial version of the ERwin software can be downloaded at erwin.com.

DataStage – Grid vs Cluster

Grid

  • Grid software auto-generates the APT configuration file, using a subset of resources chosen according to current compute-node utilisation.
  • The grid software looks at the utilisation of the compute nodes on the grid, picks the machines that are under-used, and dispatches the job to those nodes.

Cluster

  • A cluster is a group of DataStage servers where the administrator or designer writes the APT configuration file to use the compute-node resources available.
  • For example, with a group of 25 compute nodes, the designer might write the APT file to use 4 of them. If those nodes are busy, DataStage will still try to use them.
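
As a hedged sketch (the host names and paths are hypothetical), the APT file mentioned above is the parallel engine configuration file. A hand-written cluster configuration that pins jobs to two specific compute nodes might look roughly like this:

    {
        node "node1"
        {
            fastname "compute01"
            pools ""
            resource disk "/data/ds/datasets" {pools ""}
            resource scratchdisk "/data/ds/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "compute02"
            pools ""
            resource disk "/data/ds/datasets" {pools ""}
            resource scratchdisk "/data/ds/scratch" {pools ""}
        }
    }

In a grid deployment, the grid software generates a file of this form dynamically at run time, choosing whichever compute nodes are currently under-used.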