DataStage File Stages - Data Warehousing

File stages are used to read data from and write data to files.
The following are some common file stages used in DataStage.

Sequential File Stage

The Sequential File stage is a file stage. It is the most
common I/O stage used in a DataStage job. It is used to read data from or write
data to one or more flat files. It can have only one input link or one output
link. It can also have one reject link.

While handling huge volumes of data, this stage can itself
become one of the major bottlenecks, as reading from and writing to it is
slow.

Sequential files should be used in the following conditions: 1) when
we are reading a flat file (fixed width or delimited) from a UNIX environment
that was FTPed from some external system; 2) when some UNIX operations have to
be done on the file.

Don't use a sequential file for intermediate storage between
jobs. It causes performance overhead, as data conversion is needed before
writing to and reading from a UNIX file.
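To make the two flat-file layouts mentioned above concrete, here is a minimal Python sketch of parsing a delimited record and a fixed-width record. It is only an illustration of the file formats, not of how DataStage itself reads them; the column layout, sample data, and function names are all assumptions made for the example.

```python
import csv
import io

# Hypothetical fixed-width layout: (column name, start offset, end offset).
FIXED_LAYOUT = [("id", 0, 5), ("name", 5, 15), ("amount", 15, 23)]


def parse_delimited(text, delimiter=","):
    """Split delimited text into rows of string fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def parse_fixed_width(line, layout=FIXED_LAYOUT):
    """Slice one fixed-width record into named, whitespace-trimmed fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Sample records invented for this sketch.
delimited_rows = parse_delimited("1|Alice|10.50\n2|Bob|7.25\n", delimiter="|")
fixed_row = parse_fixed_width("00001Alice     00010.50")
```

In a delimited file the field boundaries come from the data; in a fixed-width file they come from the layout, which is why the layout has to be supplied separately.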

To read faster from the stage, the number of
readers per node can be increased (the default value is one).
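As a rough analogy for what several readers on one file can look like, the following Python sketch splits a file into byte ranges and lets each reader take the lines that start inside its range. This is an assumption-laden illustration of the general idea, not a description of DataStage internals; `read_range` and `read_with_readers` are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so a
            # line that begins exactly at `start` is kept by this reader.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines


def read_with_readers(path, readers=2):
    """Read a file with several readers, each owning one byte range."""
    size = os.path.getsize(path)
    bounds = [size * i // readers for i in range(readers + 1)]
    with ThreadPoolExecutor(max_workers=readers) as pool:
        parts = pool.map(read_range, [path] * readers, bounds[:-1], bounds[1:])
        return [line for part in parts for line in part]


# Small demonstration file, created only for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "input.txt")
    with open(sample, "w", newline="\n") as out:
        out.write("".join(f"row{i}\n" for i in range(10)))
    rows = read_with_readers(sample, readers=3)
```

Each line is read by exactly one reader because ownership is decided by where the line starts, not where it ends.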

Dataset Stage

The Data Set stage is a file stage that allows reading data from
or writing data to a dataset. This stage can have a single input link or a single
output link. It does not support a reject link.

It can be configured to operate in sequential mode or
parallel mode. DataStage Parallel Extender jobs use datasets to store data being
operated on in a persistent form. Datasets are operating system files, each
referred to by a control file, which by convention has the suffix .ds. Datasets
are much faster than sequential files. The data is spread across multiple nodes.

Datasets are not ordinary flat UNIX files, and standard UNIX file operations
cannot be performed on them directly. Using datasets gives good performance in a set of
linked jobs. They help in achieving end-to-end parallelism by writing data in
partitioned form and maintaining the sort order. They also preserve partitioning.

A dataset has the following parts:

Descriptor file: contains metadata, data location, but NOT
the data itself

Data file(s): contain the data in native format
C:/IBM/Information Server/Server/data set/file.ds

Control file (or header file): resides in the operating system.
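The split between a descriptor and separate data files can be pictured with a small Python sketch. It writes rows into one file per partition and a descriptor that records only column names and data file locations. The JSON format, the `.partN` file names, and the modulo partitioning are assumptions chosen for illustration; they are not the real on-disk dataset format.

```python
import json
import os
import tempfile


def write_partitioned(rows, key, partitions, directory, name):
    """Partition rows into data files and write a descriptor listing them."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        # Assumes an integer key; modulo stands in for a partitioning method.
        buckets[row[key] % partitions].append(row)
    data_files = []
    for index, bucket in enumerate(buckets):
        path = os.path.join(directory, f"{name}.part{index}")
        with open(path, "w") as out:
            for row in bucket:
                out.write(json.dumps(row) + "\n")
        data_files.append(path)
    # The descriptor holds metadata and locations, but none of the data.
    descriptor = os.path.join(directory, f"{name}.ds")
    with open(descriptor, "w") as out:
        json.dump({"columns": sorted(rows[0]), "data_files": data_files}, out)
    return descriptor


def read_partition(descriptor, index):
    """Follow the descriptor to one data file and load its rows."""
    with open(descriptor) as f:
        meta = json.load(f)
    with open(meta["data_files"][index]) as f:
        return [json.loads(line) for line in f]


sample_rows = [{"id": i, "value": i * 10} for i in range(6)]
with tempfile.TemporaryDirectory() as tmp:
    ds_path = write_partitioned(sample_rows, "id", 2, tmp, "sales")
    with open(ds_path) as f:
        meta = json.load(f)
    part0 = read_partition(ds_path, 0)
```

Because each partition lives in its own file, a later job can read the partitions in parallel without repartitioning, which is the property the text attributes to datasets.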

File set stage

The File Set stage is a file stage. It allows you to read
data from or write data to a file set. The stage can have a single input link,
a single output link, and a single rejects link.

It executes only in parallel mode. The advantage of using a
file set over a sequential file is that it preserves the partitioning scheme. The
amount of data that can be stored in each destination data file is limited by
the characteristics of the file system and the amount of free disk space
available. The number of files created by a file set depends on: 1) the number
of processing nodes in the default node pool; 2) the number of disks in
the export or default disk pool connected to each processing node in the
default node pool; 3) the size of the partitions of the data set.

Lookup file set stage

The Lookup File Set stage is a file stage. It allows you to
create a lookup file set or reference one for a lookup. The stage can have
a single input link or a single output link. The output link must be a
reference link. The stage can be configured to execute in parallel or
sequential mode when used with an input link.
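The idea behind a lookup file set, building a keyed reference structure once and then consulting it for each incoming row, can be sketched in Python as follows. This is a conceptual sketch only; the function names, the `ref` column, and the sample data are hypothetical and do not reflect the stage's actual storage format.

```python
def build_lookup(reference_rows, key):
    """Index reference rows by key so each lookup is a single dict access."""
    return {row[key]: row for row in reference_rows}


def enrich(stream_rows, lookup, key, default=None):
    """Attach the matching reference row (or a default) to each stream row."""
    return [{**row, "ref": lookup.get(row[key], default)} for row in stream_rows]


# Sample reference and stream data invented for this sketch.
countries = [{"code": "DE", "name": "Germany"}, {"code": "FR", "name": "France"}]
orders = [{"order": 1, "code": "FR"}, {"order": 2, "code": "XX"}]

lookup = build_lookup(countries, "code")
enriched = enrich(orders, lookup, "code")
```

Rows with no match keep a `None` reference here; a real job would decide separately whether to continue, drop, fail, or reject such rows.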

External source stage

The External Source stage is a file stage. It allows you to
read data that is output from one or more source programs. The stage calls
the program and passes appropriate arguments. The stage can have a single
output link, and a single rejects link. It can be configured to execute in
parallel or sequential mode.
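Reading the output of a source program can be illustrated outside DataStage with Python's `subprocess` module. The sketch below runs a stand-in program (a Python one-liner, chosen only so the example is self-contained) and treats each line of its standard output as a record; `read_from_program` is a hypothetical helper name.

```python
import subprocess
import sys


def read_from_program(argv):
    """Run a program and return its standard output as a list of lines."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Stand-in source program: prints two comma-delimited records.
source_cmd = [sys.executable, "-c", "print('1,Alice'); print('2,Bob')"]
records = [line.split(",") for line in read_from_program(source_cmd)]
```

With `check=True`, a non-zero exit status from the program raises an error instead of silently producing an empty result.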

External Target stage

The External Target stage is a file stage. It allows you to
write data to one or more target programs. The stage can have a single
input link and a single rejects link. It can be configured to execute in
parallel or sequential mode.
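Writing rows to a program rather than to a file can be sketched in the same way: the rows are serialized and fed to the program's standard input. The receiving program below is a stand-in Python one-liner that just counts the lines it is given, and `write_to_program` is a hypothetical helper; neither is part of DataStage.

```python
import subprocess
import sys


def write_to_program(argv, rows):
    """Send rows as comma-delimited lines to a program's standard input."""
    payload = "".join(",".join(row) + "\n" for row in rows)
    result = subprocess.run(
        argv, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in target program: counts the records it receives on stdin.
target_cmd = [
    sys.executable,
    "-c",
    "import sys; print(sum(1 for _ in sys.stdin))",
]
summary = write_to_program(target_cmd, [["1", "Alice"], ["2", "Bob"]])
```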

Complex Flat File stage

The Complex Flat File (CFF) stage is a file stage. You can
use the stage to read a file or write to a file, but you cannot use the same
stage to do both. As a source, the CFF stage can have multiple output
links and a single reject link. You can read data from one or more complex flat
files, including MVS™ data sets with QSAM and VSAM files. You can also read
data from files that contain multiple record types. The source data can contain
one or more of the following clauses:

- GROUP
- REDEFINES
- OCCURS
- OCCURS DEPENDING ON

CFF source stages run in parallel mode when they are used to read multiple
files, but you can configure the stage to run sequentially if it is reading
only one file with a single reader.

By using CFF, we can read ASCII or EBCDIC (Extended Binary
Coded Decimal Interchange Code) data. We can select the required columns and
omit the rest. We can collect the rejects (badly formatted records) by
setting the rejects property to “save” (other options: continue,
fail). We can also flatten the arrays in COBOL files.
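Two of the ideas above, decoding EBCDIC data and flattening an array (an OCCURS field), can be sketched in Python. The example assumes code page `cp037`, one common EBCDIC variant; the actual code page of a given file may differ. The `FIELD_n` naming of flattened columns and the sample record are also assumptions made for illustration.

```python
def decode_ebcdic(raw, codec="cp037"):
    """Decode EBCDIC bytes to text; cp037 is an assumed code page."""
    return raw.decode(codec)


def flatten_occurs(record, array_field):
    """Turn one array field into numbered scalar columns."""
    flat = {k: v for k, v in record.items() if k != array_field}
    for index, value in enumerate(record[array_field], start=1):
        flat[f"{array_field}_{index}"] = value
    return flat


# Build EBCDIC sample bytes by encoding, so the round trip is verifiable.
raw = "ACME".encode("cp037")
name = decode_ebcdic(raw)

# Hypothetical record with a PHONE field that occurs twice.
record = {"CUST": name, "PHONE": ["111", "222"]}
flat = flatten_occurs(record, "PHONE")
```

The encoded bytes differ from their ASCII counterparts, which is why a file has to be read with the right character set before its columns can be interpreted.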

As a target, the CFF stage can have a single input link and
a single reject link. You can write data to one or more complex flat files. You
cannot write to MVS data sets or to files that contain multiple record types.
