Thursday, December 16, 2010

Simple things about Lookup Transformation

Q Define lookup transformation?

A Lookup transformation is used to look up data in a 'data pool'. This data pool may be a flat file, relational table, view, or synonym. You can also create a lookup definition from a source qualifier. The Integration Service queries the lookup source based on the lookup ports in the transformation and a lookup condition, and returns the result of the lookup to the target or another transformation.

Lookups are generally used to get a related value, to perform a calculation using the derived related value or to update a slowly changing dimension. 

When you configure a flat file Lookup transformation for sorted input, the condition columns must be grouped. If the condition columns are not grouped, the Integration Service cannot cache the lookup and fails the session. For optimal caching performance, sort the condition columns. The Integration Service always caches flat file and pipeline lookups. If you configure a Lookup transformation to use a dynamic cache, you can use only the equality operator (=) in the lookup condition.

Q What are the differences between connected and unconnected lookups?

1. A connected Lookup uses a dynamic or static cache, while an unconnected lookup uses only a static cache.
2. A connected lookup can return multiple columns from the same row or insert into the dynamic lookup cache, while an unconnected lookup returns one column from each row.
3. A connected lookup supports user-defined default values, while an unconnected lookup does not support user-defined default values.

Q How can you return multiple ports from an unconnected lookup transformation?

An unconnected lookup transformation returns only one port. To return multiple ports, concatenate those columns in the overridden lookup query and return the concatenated port, then separate the columns out again in an Expression transformation.
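
As a minimal sketch (the table, lookup, and port names here are hypothetical), the override query could concatenate the columns with a delimiter, and the split could be done with SUBSTR and INSTR:

    -- Lookup SQL override: return two columns as one delimited port
    SELECT FIRST_NAME || '|' || LAST_NAME AS EMP_INFO, EMP_ID
    FROM EMPLOYEES

    -- Ports in the downstream Expression transformation:
    -- v_EMP_INFO   = :LKP.lkp_employees(EMP_ID)
    -- o_FIRST_NAME = SUBSTR(v_EMP_INFO, 1, INSTR(v_EMP_INFO, '|') - 1)
    -- o_LAST_NAME  = SUBSTR(v_EMP_INFO, INSTR(v_EMP_INFO, '|') + 1)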

Q How can you optimize a lookup transformation?

1. If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
2. You can improve performance by indexing the columns in the lookup ORDER BY.
3. By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes '--' as a comment notation after the ORDER BY clause to suppress the ORDER BY clause that the Integration Service generates (see the sketch after this list).
4. If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:
- Equal to (=)
- Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=)
- Not equal to (!=)
5. Improve session performance by caching small lookup tables.
6. If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
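
For instance, a sketch of points 1-3 with a hypothetical lookup table SALES_LKP and condition column PRODUCT_ID:

    -- Index the condition column used in the lookup ORDER BY
    CREATE INDEX IDX_SALES_LKP ON SALES_LKP (PRODUCT_ID);

    -- Lookup SQL override: shorter ORDER BY, with a trailing '--' so that
    -- the ORDER BY generated by the Integration Service is commented out
    SELECT SALES_AMOUNT, PRODUCT_ID
    FROM SALES_LKP
    ORDER BY PRODUCT_ID --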

Informatica Basics and Interview FAQs

Q1 What is Informatica Powercenter?
 
Ans PowerCenter is data integration software from Informatica Corporation. It provides an environment for loading data into a centralized location, such as a data warehouse: data can be extracted from multiple sources, transformed according to business logic, and loaded into files and relational targets. It has the following components:
PowerCenter Domain
PowerCenter Repository
Administration Console
PowerCenter Client
Repository Service
Integration Service
Web Services Hub
Data Analyzer
Metadata Manager
PowerCenter Repository Reports



Q2 What is Data Integration?
 
Ans Data Integration is the process of combining data residing in different sources and providing the user with a unified view of this data.


Q3 Explain PowerCenter Repository?
 
Ans The repository consists of database tables that store metadata. Metadata describes the different types of objects, such as mappings and transformations, that you can create using the PowerCenter Client tools. The Integration Service uses repository objects to extract, transform, and load data. The repository also stores administrative information such as user names, passwords, permissions, and privileges. When any task is performed through a PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows, metadata is added to the repository tables.

Q4. What is a Mapping?
Ans A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. When the Integration Service runs a session, it uses the instructions configured in the mapping to read, transform, and write data.

Q5. What is a mapplet?
Ans A mapplet is a reusable object that contains a set of transformations and enables you to reuse that transformation logic in multiple mappings.

Q6. What is a Transformation?
Ans A transformation is a repository object that generates, modifies, or passes data. Transformations in a mapping represent the operations the Integration Service performs on the data. Data passes through transformation ports that are linked in a mapping or mapplet.

Q7. What are Mapping Parameters and Variables? Whats the difference between them?

Mapping parameters and variables are used to make mappings more flexible.
A mapping parameter represents a constant value that can be defined before running a session. It retains the same value throughout the session. Using a parameter file, this value can be changed for subsequent sessions.

A mapping variable represents a value that can change during and across session runs. The Integration Service saves the value of a mapping variable to the repository at the end of each successful run and uses that value in the next run of the session.
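
As a minimal parameter file sketch (the folder, workflow, session, and parameter names are hypothetical):

    [MyFolder.WF:wf_daily_load.ST:s_m_load_sales]
    $$CountryCode=US
    $$LastRunDate=12/15/2010 00:00:00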

Q8 What happens when you do not group values in an aggregator transformation?
When the values are not grouped in an Aggregator transformation, the Integration Service returns one row for all input rows. It typically returns the last row received, with the result of the aggregation. However, if you specify a particular row to be returned (e.g., through the FIRST function), then that row is returned.
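
A small worked example (hypothetical ports): given input rows (DEPT, SALES) = (10, 100), (20, 200), (30, 300) and an output port TOTAL_SALES = SUM(SALES) with no group-by port checked, the Integration Service returns a single row carrying DEPT = 30 (from the last row received) and TOTAL_SALES = 600.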

Q9 How does using sorted input improve the performance of the Aggregator Transformation?

When sorted input is used, the Integration Service assumes that all data is sorted by group, and it performs aggregate calculations as it reads the rows of each group. It does not wait for the whole data set, which reduces the amount of data cached during the session and improves session performance. With unsorted input, the Integration Service must read all of the data before it performs the aggregation.

Sorted input should not be used when either of the following conditions is true:
  •  The aggregate expression uses nested aggregate functions
  •  The session uses incremental aggregation
Q10 How is a join in the Joiner transformation different from a normal SQL join?

The Joiner transformation can join heterogeneous sources, whereas a SQL join can be performed only on relational tables.

Q11 What are the criteria for deciding Master and Detail sources for a joiner transformation?
  • The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target; decide which source is master and which is detail accordingly.
  • For optimal performance with an unsorted Joiner transformation, designate the source with fewer rows as the master source. During a session run, the Joiner transformation caches the master rows and compares each detail row against that cache.
  • For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master.

Processing of incremental aggregation

The first time you run an incremental aggregation session, the PowerCenter Server processes the entire source. At the end of the session, it stores the aggregate data from the session run in two files, the index file and the data file, which it creates in a local directory.

 

Guidelines to work with Informatica Power Center


  • Repository: This is where all the metadata information is stored in the Informatica suite. The Power Center Client and the Repository Server would access this repository to retrieve, store and manage metadata.

  • Power Center Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, and running workflows.

  • Repository Server: This repository server takes care of all the connections between the repository and the Power Center Client.

  • Power Center Server: Power Center server does the extraction from source and then loading data into targets.

  • Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer. Source Analyzer is used for extracting metadata from source systems.
    Mapping Designer is used to create mappings between sources and targets. A mapping is a pictorial representation of the flow of data from source to target.
    Warehouse Designer is used for extracting metadata from target systems; metadata can also be created in the Designer itself.

  • Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. A person's address may not be the same in all source systems because of typos, or the postal code and city name may not match the rest of the address. These errors can be corrected through the data cleansing process, and standardized data can then be loaded into target systems (the data warehouse).

  • Transformation: Transformations help to transform the source data according to the requirements of target system. Sorting, Filtering, Aggregation, Joining are some of the examples of transformation. Transformations ensure the quality of the data being loaded into target and this is done during the mapping process from source to target.

  • Workflow Manager: Workflow helps to load the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, then the target system will pop up an error message since the fact table is violating the foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.

  • Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each Power Center Server.

  • Power Center Connect: This component helps to extract data and metadata from ERP systems like PeopleSoft, SAP, and Siebel, from messaging systems like IBM's MQSeries, and from other third-party applications.

  • Power Center Exchange (PowerExchange): This component provides access to data and metadata in sources that PowerCenter cannot read directly, such as mainframe and other legacy systems.

Informatica Transformations 2

1.Joiner Transformation

Joiner Transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location: for example, to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.

In order to join two sources, there must be at least one matching port. While joining two sources, it is mandatory to specify one source as master and the other as detail.


The Joiner transformation supports the following types of joins:


1)Normal

2)Master Outer

3)Detail Outer

4)Full Outer


Normal join discards all the rows of data from the master and detail source that do not match, based on the condition.


Master outer join discards all the unmatched rows from the master source and keeps all the rows from the detail source and the matching rows from the master source.


Detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.


Full outer join keeps all rows of data from both the master and detail sources.
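
In rough SQL terms (with hypothetical tables MASTER_SRC and DETAIL_SRC joined on KEY_COL), the four joins correspond approximately to:

    -- Normal:       DETAIL_SRC d INNER JOIN MASTER_SRC m ON d.KEY_COL = m.KEY_COL
    -- Master outer: DETAIL_SRC d LEFT OUTER JOIN MASTER_SRC m ...  (all detail rows kept)
    -- Detail outer: DETAIL_SRC d RIGHT OUTER JOIN MASTER_SRC m ... (all master rows kept)
    -- Full outer:   DETAIL_SRC d FULL OUTER JOIN MASTER_SRC m ...  (all rows from both)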


2.Lookup Transformation

Lookup transformation is a Passive transformation, and it can be either Connected or Unconnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from source or from target tables.

For example, suppose we want to retrieve all the sales of a product with ID 10, and the sales data resides in another table. Instead of adding the sales table as one more source, use a Lookup transformation to look up the sales data for the product with ID 10.


Connected lookup receives input values directly from mapping pipeline whereas

UnConnected lookup receives values from a :LKP expression within another transformation.


Connected lookup returns multiple columns from the same row whereas

UnConnected lookup has one return port and returns one column from each row.


Connected lookup supports user-defined default values whereas

UnConnected lookup does not support user-defined default values.
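
A sketch of calling an unconnected lookup from an Expression transformation (the lookup and port names are hypothetical):

    -- returns the single return port of the unconnected lookup lkp_sales_total
    IIF(ISNULL(IN_PRODUCT_ID), 0, :LKP.lkp_sales_total(IN_PRODUCT_ID))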

3.XML Source Qualifier Transformation

XML Source Qualifier is an Active and Connected transformation.

XML Source Qualifier is used only with an XML source definition.

It represents the data elements that the Informatica Server reads when it executes a session with XML sources.

4.Stored Procedure Transformation

Stored Procedure transformation is a Passive transformation and can be used in Connected or Unconnected mode. It is useful for automating time-consuming tasks; it is also used for error handling, to drop and recreate indexes, to determine free space in the database, to perform a specialized calculation, etc.


The stored procedure must exist in the database before you create a Stored Procedure transformation, and it can exist in a source, target, or any database with a valid connection to the Informatica Server. A stored procedure is an executable script containing SQL statements, control statements, user-defined variables, and conditional statements.


Note:
Differences between Advanced External Procedure and External Procedure Transformations:

1) External Procedure returns a single value,

whereas

Advanced External Procedure returns multiple values.


2) External Procedure supports COM and Informatica procedures

whereas

Advanced External Procedure supports only Informatica Procedures.

Source Qualifier Transformation

Source Qualifier transformation is an Active and Connected transformation. When adding a relational or flat file source definition to a mapping, you must connect it to a Source Qualifier transformation.


The Source Qualifier performs various tasks, such as:

overriding the default SQL query,

filtering records, and

joining data from two or more tables.
 

Informatica - Transformations


In Informatica, transformations help to transform the source data according to the requirements of the target system, and they ensure the quality of the data being loaded into the target.
Transformations are of two types: Active and Passive.


Active Transformation

An active transformation can change the number of rows that pass through it from source to target; for example, it can eliminate rows that do not meet the transformation condition.


Passive Transformation

A passive transformation does not change the number of rows that pass through it, i.e., it passes all rows through the transformation.

Transformations can be Connected or UnConnected.


Connected Transformation

A connected transformation is connected to other transformations or directly to the target table in the mapping.


UnConnected Transformation

An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.


Following is the list of transformations available in Informatica:
Aggregator Transformation

Expression Transformation

Filter Transformation

Joiner Transformation

Lookup Transformation

Normalizer Transformation

Rank Transformation

Router Transformation


Sequence Generator Transformation

Stored Procedure Transformation

Sorter Transformation

Update Strategy Transformation

XML Source Qualifier Transformation

Advanced External Procedure Transformation

External Procedure Transformation


In the following pages, we will explain all the above Informatica Transformations and their significances in the ETL process in detail.

Informatica Transformations 1

Aggregator Transformation

Aggregator transformation is an Active and Connected transformation.

This transformation is useful to perform calculations such as averages and sums (mainly to perform calculations on multiple rows or groups).

For example, to calculate total of daily sales or to calculate average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM etc. can be used in aggregate transformation.
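
As a sketch (assuming hypothetical ports QUANTITY and UNIT_PRICE, with PRODUCT_ID checked as the group-by port), the aggregate output ports might be:

    TOTAL_SALES = SUM(QUANTITY * UNIT_PRICE)
    AVG_PRICE   = AVG(UNIT_PRICE)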

Expression Transformation

Expression transformation is a Passive and Connected transformation.

This can be used to calculate values in a single row before writing to the target.

For example, to calculate the discount on each product

or to concatenate first and last names

or to convert date to a string field.
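
For instance (the port names are hypothetical), the corresponding expressions might look like:

    DISCOUNT_AMT = PRICE * DISCOUNT_PCT
    FULL_NAME    = FIRST_NAME || ' ' || LAST_NAME
    DATE_STR     = TO_CHAR(ORDER_DATE, 'YYYY-MM-DD')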


Filter Transformation

Filter transformation is an Active and Connected transformation.

This can be used to filter rows in a mapping that do not meet the condition.

For example,

To find all the employees who are working in Department 10, or

To find the products that fall within the price range of $500 to $1000.
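
The filter conditions for these two examples might be written as (hypothetical port names):

    DEPTNO = 10
    PRICE >= 500 AND PRICE <= 1000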


Rank Transformation

Rank transformation is an Active and Connected transformation.

It is used to select the top or bottom rank of data.

For example,

To select top 10 Regions where the sales volume was very high

or

To select 10 lowest priced products.

Router Transformation

Router is an Active and Connected transformation. It is similar to filter transformation.

The only difference is that the Filter transformation drops the rows that do not meet the condition, whereas the Router has an option to capture the rows that do not meet the condition. It is useful for testing multiple conditions.

It has input, output and default groups.

For example, to route rows where State = Michigan, State = California, State = New York, and all other states to different tables.
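
A sketch of the user-defined groups and their filter conditions (the port name STATE is hypothetical):

    MICHIGAN_GROUP:   STATE = 'Michigan'
    CALIFORNIA_GROUP: STATE = 'California'
    NEWYORK_GROUP:    STATE = 'New York'
    -- rows matching none of the conditions go to the default group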


Sorter Transformation

Sorter transformation is an Active and Connected transformation.

It allows you to sort data in either ascending or descending order according to a specified field.

It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.


Sequence Generator Transformation

Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values or cycle through a sequential range of numbers or to replace missing keys.


It has two output ports, NEXTVAL and CURRVAL, to connect to other transformations (you cannot add ports to this transformation).


The NEXTVAL port generates a sequence of numbers when connected to a transformation or target. CURRVAL is NEXTVAL plus the Increment By value.
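
For example, with Start Value = 1 and Increment By = 1, NEXTVAL produces 1, 2, 3, ... and for the row that receives NEXTVAL = 1, CURRVAL is 2.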

Normalizer Transformation


Normalizer Transformation is an Active and Connected transformation.

It is used mainly with COBOL sources where most of the time data is stored in de-normalized format.

Also, Normalizer transformation can be used to create multiple rows from a single row of data.

Informatica concepts

Difference between Filter and Router transformations.
The Filter transformation filters data on only one condition and drops the rows that do not meet the condition.

Dropped rows are not stored anywhere, not even in the session log file.

The Router transformation filters data based on multiple conditions and gives you the option to route rows that do not match any condition to a default group.


Explain bulk loading?


You can use bulk loading to improve the performance of a session that inserts a large amount of data into a DB2, Sybase, Oracle, or Microsoft SQL Server database.

When bulk loading, the PowerCenter Server bypasses the database log, which speeds performance.

Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery.


What is constraint-based loading?

When you select this option, the PowerCenter Server orders the target load on a row-by-row basis only.

Edit Tasks -> Properties -> set Treat Source Rows As to Insert.

Edit Tasks -> Config Object tab -> select Constraint Based Load Ordering.

If a session is configured for constraint-based loading and a target table receives rows from different sources, the PowerCenter Server reverts to normal loading for those tables, but loads all other targets in the session using constraint-based loading when possible, loading the primary key table first and then the foreign key table.

Use constraint-based loading only when the session option Treat Source Rows As is set to Insert.

Constraint-based load ordering allows developers to read the source once and populate parent and child tables in a single process.


What is a transformation?

A transformation is a repository object that generates, modifies, or passes data.


What are the types of transformations?

2 types:
1) active
2) passive.

Explain active and passive transformations?

An active transformation can change the number of rows that pass through it; the number of output rows may differ from the number of input rows.
A passive transformation does not change the number of rows; the number of output rows always equals the number of input rows.

Explain incremental aggregation?

When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture those changes, you can configure the session to process only the changes. This allows the PowerCenter Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same data each time you run the session.

Use incremental aggregation when you can capture new source data each time you run the session, for example with a stored procedure or a Filter transformation that passes only new data (see the sketch below).
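
As a sketch of one common way to capture only new rows (assuming a hypothetical Date/Time mapping variable $$LastRunDate and a source column UPDATED_AT):

    -- Filter transformation condition: pass only rows newer than the last run
    UPDATED_AT > $$LastRunDate

    -- In an Expression transformation, advance the variable for the next run
    SETMAXVARIABLE($$LastRunDate, UPDATED_AT)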

Use incremental aggregation when the incremental changes do not significantly change the target. If processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation; in this case, drop the table and recreate the target with the complete source data.