Understanding Table Definitions and Schemas in Data Warehousing

When working with data, it's crucial to define the data you are handling. This definition is achieved by importing or creating table definitions, which can be used in your job designs.

Table definitions serve as the foundation for your DataStage project, specifying the data to be utilized at each stage of a job. These definitions are stored in the repository and are shared among all projects. A minimum of table definitions should be available for each data source and one for each data target in the data warehouse.

When developing a job, you'll usually populate your stages with column definitions from table definitions found in the repository. This is done on the Columns tab of the stage editor. If you select specific options in the Grid Properties dialog box, the Columns tab will display two additional fields: Table Definition Reference and Column Definition Reference. These show the table definition and individual columns that the columns on the tab were derived from.

You can import, create, or edit a table definition using the Designer.

Importing a Table Definition

The simplest way to specify a table definition is by directly importing it from the source or target database.

A new table definition will be created, and its properties will automatically fill in with the details of your data source or data target.

Table definitions can be imported from various sources, including:

To import table definitions:

Procedure:

  1. Select Import > Table Definitions > Data Source Type from the main menu. For most data source types, a dialog box appears enabling you to connect to the data source (for some sources, a wizard appears and guides you through the process).
  2. Enter the required connection details and click OK. Once a successful connection to the data source has been established, the updated dialog box displays details of the table definitions available for import.
  3. Select the desired table definitions and click OK. The table definition metadata is imported into the repository.

In DataStage, Schemas offer an alternative method for specifying column definitions for data used by parallel jobs.

Schema format:

The following schema format is utilized to read a fixed width file:

//Schema FileΒ  is used to read Input data with out specifyinh metadata in the Sequential File stage
record{final_delim=end,delim=none}(CUSTOMER_SSN:STRING[11];CUSTOMER_NAME:STRING[30];CUSTOMER_CITY::STRING[40];CUSTOMER_ZIPCODE:STRING[10];)

Now we use the sequential file stage to read data from the datafile using schemafile.

To utilize schema file concept, enable runtime propagation in the job properties.

In the sequential file stage, add the Schema File option and provide the schema file name including path as demonstrated below.