Wednesday, November 12, 2014

Informatica - 2

Source Qualifier transformation

Purpose

Source Qualifier (SQ) Transformation selects records from various sources, such as flat files, relational tables or Informatica PowerExchange services. It provides an efficient way to filter input columns (or fields) and to perform joins.

Source Qualifier converts source datatypes to native (most compatible) PowerCenter datatypes. It is a mandatory component for relational table and flat file sources. Source Qualifier is an Active transformation (which means that the number of output rows may differ from the number of input rows). It generates a SQL query for relational sources and can perform joins between relational tables on the same database.

Converting datatypes

Data can be converted from one type to another by:
  • Passing data between ports with different datatypes. Datatypes must be compatible or explicitly converted.
  • Passing data from an expression to a port
  • Using transformation functions or arithmetic operators (a brief example follows below this list)
    More detailed information on converting datatypes can be found in the Informatica Client Help under the 'Port-to-Port Data Conversion' topic.
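
    As a minimal illustration of an explicit conversion (assuming a hypothetical string port ORDER_DATE_STR holding dates in YYYYMMDD format), an output port of the date/time datatype could use:

      TO_DATE(ORDER_DATE_STR, 'YYYYMMDD')

    Implicit conversion happens automatically when ports with compatible datatypes are linked port-to-port.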

    Example

    A sample use of Source Qualifier for a Flat File source (top) and ODBC database source (bottom). Note the datatype change for some columns. 

    Useful tips

  • For relational data sources, include only the columns (ports) which are actually used in the mapping (based on the connected output ports).
  • Use the 'SQL Query' option in the Properties tab to override the default generated SQL (a sample override is sketched below).
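
    A minimal sketch of such an override, assuming a hypothetical CUSTOMERS source table; the statement entered in the 'SQL Query' property replaces the default generated query, so it should select exactly the ports defined in the Source Qualifier, in the same order:

      SELECT CUSTOMER_ID, CUSTOMER_NAME, COUNTRY_CODE
      FROM   CUSTOMERS
      WHERE  STATUS = 'ACTIVE'

    Keep in mind (as noted in the best practices section further down) that overriding the SQL may prevent the use of partitioning and pushdown optimization.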

    Expression transformation

    Purpose

    Expression is a passive transformation (the number of input and output rows is the same) which lets you modify individual ports of a single row, or add and suppress ports. It helps implement complicated data transformations, apply business logic and perform checks and validations. 
    Data can be modified using logical and numeric operators or built-in functions. Sample transformations handled by the expression transformer:
  • data manipulation - concatenation, truncation, rounding (CONCAT, LTRIM, UPPER, INITCAP)
  • datatype conversion (TO_DECIMAL, TO_CHAR, TO_DATE)
  • data cleansing - check nulls, replace chars, test for spaces, test for number (ISNULL, REPLACESTR)
  • manipulate dates - convert, add, test (GET_DATE_PART, IS_DATE, DIFF_DATES)
  • scientific calculations and numerical operations - exponential, power, log, modulus (LOG, POWER, SQRT)
  • ETL specific - if, lookup, decode (IIF, DECODE)

    Examples / useful tips

  • If possible, use numeric operations, which are faster than string operations.
  • Operators are faster than functions (e.g. || vs. CONCAT).
  • Use transformation variables to break down complex transformation logic into smaller parts. A variable port is defined by checking the V check-box for the port (the I for Input and O for Output check-boxes are then greyed out).
  • It is highly recommended to define a naming convention for the input and output ports of expressions. For example, all input ports get an _in suffix, output ports _out and variables _var (see the sketch after this list).
  • PowerCenter Designer client comes with a good help system with examples. Highlighting a function and pressing F1 launches it.
  • Ports are evaluated in the following order: input ports first, then variable ports in the display order (from top to bottom), then output ports.
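
    A minimal sketch of an Expression port layout following this convention (the customer name ports used here are an illustrative assumption, not part of the original example):

      FIRST_NAME_in   (I)  string
      LAST_NAME_in    (I)  string
      FULL_NAME_var   (V)  string  =  LTRIM(RTRIM(FIRST_NAME_in)) || ' ' || LTRIM(RTRIM(LAST_NAME_in))
      FULL_NAME_out   (O)  string  =  INITCAP(FULL_NAME_var)

    The variable port is evaluated before the output port, so the trimming and concatenation logic is written (and maintained) only once.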
    Looking up data in Informatica

    Purpose and overview

    Lookup transformation - returns values from a database table or a flat file associated with a given input value. 
    In SQL terminology a lookup may be thought of as a sub-query.
      There are the following Lookup transformation types:
    • Dynamic Lookup (connected) - used when the lookup table is also the target, so the data may go out of sync with the target table loaded in memory. The Dynamic Lookup transformation allows for the synchronization of the in-memory lookup cache with its physical table in the database. 
      Dynamic cache properties worth looking at: Dynamic Lookup Cache (boolean, use only with the lookup cache enabled), 'Insert Else Update' or 'Update Else Insert'.
    • Unconnected lookup - used when a lookup is not needed for every record; the lookup data is fetched at the point the mapping requires it. It doesn't contain links from and to other transformations. Data lookup is performed only for those rows which require it, which can significantly improve performance of the ETL process. 
      Unconnected lookup function can be invoked within any transformation that supports expressions.

    Example - dynamic lookup

    Let's consider a daily-updated master table with products. Within a day a product may change its status or an error may be corrected. A new product record may be added first and a change to that record may arrive later. The change (an insert followed by an update) needs to be detected automatically.
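
    With a dynamic cache, the Lookup exposes a NewLookupRow output port that indicates what happened to the cache for each row (0 - no change, 1 - row inserted into the cache, 2 - row updated in the cache). A minimal sketch of how this port is typically fed into an Update Strategy expression (the port routing itself is an assumption here, not shown in the original example; the DD_ flags are described in the Update Strategy section below):

      IIF(NewLookupRow = 1, DD_INSERT, IIF(NewLookupRow = 2, DD_UPDATE, DD_REJECT))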

    Example - unconnected lookup

    An unconnected lookup is usually called within a conditional statement in an Expression transformation. For instance, to look up a product color only for rows which don't have it filled, the following expression can be used:
    IIF( ISNULL(color), :LKP.COLORLOOKUP(product_id), color )
    

    Lookup cache

    Basically there are two types of cache memory: the index cache and the data cache. 
    The index cache contains the values of all lookup table ports that are specified in the lookup condition. 
    The data cache contains the values of the lookup table ports that are not part of the lookup condition and are defined as output ports. 
    After the cache is loaded, values from the Lookup input ports that are part of the lookup condition are compared against the index cache. 
    Upon a match, the corresponding rows from the data cache are returned to the stream.
    Key cache related properties in the Lookup transform:
  • Lookup Caching Enabled - Indicates whether the Integration Service caches lookup values during the session.
  • Lookup Cache Persistent - Indicates whether the Integration Service uses a persistent lookup cache.
  • Recache From Lookup Source - can be used only with the lookup cache enabled. When selected, the Integration Service rebuilds the lookup cache from the lookup source when it first calls the Lookup transformation instance. If a persistent lookup cache is used, it rebuilds the persistent cache files before using the cache, otherwise, it rebuilds the lookup cache in memory before using the cache.

    Useful tips / best practices

  • When the source is large, cache lookup tables of 500,000 rows or less; the standard rule of thumb is not to cache lookup tables over 500,000 rows.
  • Use equality (=) conditions if possible in the Condition tab of the Lookup.
  • Use IIF or DECODE functions when lookup returns small row sets.
  • Avoid date comparisons in the lookup condition; it is more efficient to convert the dates to strings and compare those.
  • A large lookup table may require more memory resources than available. SQL override in the lookup transformation can be used to reduce the amount of memory used by the Lookup cache.
  • The unconnected lookup function should normally be called only when a condition (IIF) is evaluated.
  • An unconnected lookup transformation can return only one port, however multiple input ports may be passed to it.
  • One Expression transformation can call more than one lookup.
  • Connected lookups use default values, whereas in unconnected lookups the default values are ignored.
  • Lookup cache - cache when the number (and size) of records in the lookup table is relatively small compared to the number of rows requiring a lookup.
  • Lookup caching typically improves performance if the time taken to load the lookup cache is less than the time that would be taken to perform the external read requests. To reduce the amount of cache required: turn off or delete any unused output ports, index the lookup table to speed up retrieval, or use a WHERE clause in the SQL override to minimize the amount of data written to cache (see the sketch below).
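
    A minimal sketch of a lookup SQL override that limits the cached data, assuming a hypothetical PRODUCTS lookup table from which only active rows are needed:

      SELECT PRODUCT_ID, COLOR
      FROM   PRODUCTS
      WHERE  STATUS = 'ACTIVE'

    Only the ports defined in the Lookup transformation need to be selected, which keeps both the index and data caches small.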

    Sorter transformation

    Purpose

    Sorter is an active transformation which sorts the incoming data based on one or more key values (in ascending, descending or mixed order). The Sorter transformation attribute 'Distinct' provides a facility to remove duplicates from the input rows.
    Key properties worth looking at:
  • Sorter Cache Size - the Integration Service passes all rows into the Sorter transformation before it performs the sort operation. Any amount between 1 MB and 4 GB can be specified for the Sorter cache size.

  • Case Sensitive and Null Treated Low - options used to fine-tune the output order
  • Work Directory - the directory used to create temporary files while it sorts data. After the rows are sorted, the temporary files get deleted automatically.
  • Distinct - if distinct output rows is checked, the Mapping Designer configures all ports as part of the sort key.

    Aggregator transformation

    Purpose

    The Aggregator transformation calculates aggregates such as sums, minimum or maximum values across multiple groups of rows. 
    The Aggregator transformation can apply expressions to its ports; however, those expressions are applied to a group of rows, unlike the Expression transformation, which applies calculations on a row-by-row basis only. 

    Aggregate functions are created in output ports only. Function grouping requirements are set using the Aggregator Group By port. 

    Available aggregate expressions: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM 
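
    Conditional logic can be combined with the aggregate functions. A minimal sketch of an output port expression, assuming hypothetical AMOUNT and STATUS ports and a Group By on CUSTOMER_ID:

      SUM(IIF(STATUS = 'PAID', AMOUNT, 0))

    This sums only the paid amounts within each customer group.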

    Key properties:
  • Sorted Input - should be checked when the input data is already sorted by groups.
  • Aggregator Data Cache Size
  • Aggregator Index Cache Size - default is "Auto", index cache contains all group by ports.
    Examples / useful tips

  • Sorting the records prior to passing them on to an Aggregator transformation may improve the overall performance of the aggregation task.
  • Using a Sorter transformation may improve performance over an 'Order By' clause in a SQL override statement in aggregate session when the source is a database because the source database may not be tuned with the buffer sizes needed for a database sort.
  • In the Aggregator, checking the Sorted Input attribute will bypass caching.

    Update Strategy transformation

    Purpose

    Specifies how each row updates target tables (insert, update, delete, or reject) based on an expression.
    The Update Strategy transformation "tags" a row with the appropriate DML (data manipulation language) operation for the PowerCenter writer to apply to a relational target. Each row can be "tagged" with one of the following flags (the DD label stands for Data Driven), and the appropriate SQL is submitted to the target database:
  • DD_INSERT for insert
  • DD_UPDATE for update
  • DD_DELETE for delete
  • DD_REJECT means the row will not have SQL written for it

    Examples / useful tips

  • IIF(row_exists_flag = 1, DD_UPDATE, DD_INSERT)
  • For the row tags DD_DELETE and DD_UPDATE, the table definition in the mapping must have a key identified, otherwise the session created from that mapping will fail.
  • Rows tagged with DD_REJECT are passed on to the next transformation or target and subsequently placed in the appropriate "bad file" if the "Forward Rejected Rows" attribute is checked (the default). If the attribute is un-checked, rejected rows are skipped.
  • In some cases there may be a performance benefit in splitting a mapping with updates and inserts into two mappings and sessions: one with inserts and the other with updates.

    Transaction Control

    Purpose

    The Transaction Control transformation is Active and Connected; it lets you control commit and rollback transactions based on a set of rows passing through it. A commit or rollback can be defined based on a varying number of input rows, for example a common product group key (to ensure that the entire group gets loaded) or an invoice date (to avoid loading an incomplete set of invoices for a given date into a data warehouse).

    Use

    A transaction is the row or set of rows bound by commit or roll back rows. The number of rows may vary for each transaction. 
    Transaction control can be defined at two levels:
    • Within a mapping - a transaction is defined using an expression in a Transaction Control transformation. Based on the return value of the expression, the decision can be made whether to commit, roll back, or continue without any transaction changes.
    • Within a session - when a session is configured for user-defined commit, the administrator can choose to commit or roll back a transaction if the Integration Service fails to transform or write any row to the target.
    The Transaction Control transformation can be used to define conditions to commit and roll back transactions from transactional targets, such as: relational databases, XML, and dynamic MQSeries targets.
    The following built-in variables can be used in the Expression Editor to create a transaction control expression:
  • TC_CONTINUE_TRANSACTION - does not perform any transaction change for this row (the default value).
  • TC_COMMIT_BEFORE - commits the transaction (write rows to the target) and begins a new transaction (current row is the new transaction).
  • TC_COMMIT_AFTER - commits the transaction (write rows to the target) and begins a new transaction (current row is the committed transaction).
  • TC_ROLLBACK_BEFORE - rolls back the current transaction, begins a new transaction, and writes the current row to the target. The current row is in the new transaction.
  • TC_ROLLBACK_AFTER - writes the current row to the target, rolls back the transaction, and begins a new transaction. The current row is in the rolled back transaction.

    Examples / useful tips

  • Let's consider source data with invoices sorted by date (INV_DATE column). 
    To catch a date change in the invoice flow, a NEW_DATE indicator is calculated within an Expression transformation using this expression: IIF(INV_DATE_CURRENT = INV_DATE_PREV, 0, 1).
    To commit all invoices for a given day, the following transaction control expression might be used: IIF(NEW_DATE = 1, TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)
  • If the mapping includes an XML target, append or create a new document can be selected on commit. The input groups must receive data from the same transaction control point.
  • If the transaction control expression evaluates to a value other than commit, roll back, or continue, the Integration Service fails the session.
  • Each target instance must be connected to a Transaction Control transformation.
  • Multiple targets can be connected to a single Transaction Control transformation.
  • If Dynamic Lookup and Transaction Control transformations are used in the same mapping, a rolled-back transaction might result in unsynchronized target data.

    JAVA transformation

    Purpose

    The Java transformation in Informatica PowerCenter uses the Java programming language to transform data. 
    It is not necessary to write the entire Java code or use an external Java development environment, as the Java code can be entered as snippets in the PowerCenter Designer client. 
    The code in a Java transformation can invoke Informatica's custom expressions, user-defined functions, unconnected transformations and mapping variables. Java methods, variables, third-party APIs, built-in Java packages and static code can be invoked as well. 
    A Java transformation can be reusable and it can be defined as either an active or a passive Informatica object.
    The Java transformation has four self-explanatory tabs: Transformation (general options), Ports (inputs and outputs in separate groups), Properties (active/passive, deterministic), and Java Code. Once the ports and properties are set, the Java code can be entered and compiled from within the Designer window. The code window is divided into tab windows which include:
    • Import Packages - import 3rd party java packages, built-in or custom Java packages
    • Helper code - declare user-defined variables and methods for the Java transformation class.
    • On Input Row - the Java code is executed once for each input row. The input row can be accessed only on this tab.
    • On End of Data - defines the behavior after processing all the input data
    • On Receiving transaction - code which is executed when a transaction is received by the transformation
    • Java expressions - used for defining and calling Java expressions
    The code snippets are then compiled into byte code behind the scenes, and the Integration Service starts a JVM that executes it to process the data.

    Examples / useful tips

    Sample Java transformation window (Designer screenshot).

    Informatica ETL process implementation

    Parameters and variables in Informatica PowerCenter

    Purpose

    A mapping can utilize parameters and variables to store information during the execution. Each parameter and variable is defined with a specific data type and their main purpose is to provide increased development flexibility. 
    Parameters differ from variables in that:
  • The value of a parameter is fixed during the run of the mapping
  • Variables can change in value during run-time 

    Both parameters and variables can be accessed from any component in the mapping which supports them. 

    To create a parameter or variable, go to Mapping -> Parameters and Variables from within the Mapping Designer in the Designer client.
    The format is $$VariableName or $$ParameterName
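
    A minimal illustration of referencing a mapping parameter, assuming a hypothetical $$LastRunDate parameter used in the Source Filter of a Source Qualifier against an Oracle source (the parameter is expanded as plain text, hence the quotes):

      LAST_UPDATE_DATE > TO_DATE('$$LastRunDate', 'YYYY-MM-DD HH24:MI:SS')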

    Changing values of Variables

    To change the value of a variable, one of the following functions can be used within an expression: SETMAXVARIABLE($$Variable, value) , SETMINVARIABLE($$Variable, value), SETVARIABLE($$Variable, value) , SETCOUNTVARIABLE($$Variable), where:

  • SETVARIABLE sets the variable to a value that you specify (it executes only if a row is marked as insert or update). At the end of a successful session, the Integration Service saves either the MAX or MIN of (start value, final value) to the repository, depending on the aggregation type of the variable. Unless overridden, it uses the saved value as the start value of the variable for the next session run.
  • SETCOUNTVARIABLE - increments a counter variable: +1 if the Row Type is Insert, -1 if the Row Type is Delete, and 0 (no change) for Update and Reject rows.
  • SETMAXVARIABLE - compare current value to value passed into the function. Returns the higher value and sets the current value to the higher value.
  • SETMINVARIABLE - compare current value to the value passed into the function. Returns the lower value and sets the current value to the lower value.
    At the end of a successful session, the values of variables are saved to the repository. The SetVariable function writes the final value of a variable to the repository based on the Aggregation Type selected when the variable was defined.
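
    A minimal sketch of tracking the latest processed date, assuming a hypothetical $$MAX_LOAD_DATE mapping variable defined with the MAX aggregation type and a LOAD_DATE input port; the call is placed in an Expression output or variable port:

      SETMAXVARIABLE($$MAX_LOAD_DATE, LOAD_DATE)

    After a successful session, the highest LOAD_DATE seen is persisted in the repository and becomes the start value for the next run.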

    Parameter files

    A parameter file is an ASCII file which is used to set values of mapping parameters and variables. Parameters can be set at workflow, worklet or session level. The physical location of a parameter file is set in the Workflow Manager in Workflows -> Edit. It can also be specified using the pmcmd command when starting a session task.

    Parameter file structure

    Parameters can be grouped into the following sections:
    • [Global]
    • [Service: service name]
    • [folder name.WF:workflow name]
    • [folder name.WF:workflow name.WT:worklet name]
    • [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
    • [folder name.WF:workflow name.ST:session name]
    • [folder name.session name]
    • [session name]

    Examples / useful tips

    • The value is initialized by the specification that defines it, however it can be set to a different value in a parameter file, specified for the session task
    • Initialization priority of Parameters: Parameter file, Declared initial value, Default value
    • Initialization priority of Variables: Parameter file, Repository value, Declared initial value, Default value
    • Parameters and variables can only be utilized inside of the object that they are created in.
    • Parameters and variables can be used in pre and post-SQL
    • Sample parameter file:
      [Service:IntegrationSvc_01]
      $$SuccessEmail=dwhadmin@etl-tools.info
      $$FailureEmail=helpdesk@etl-tools.info
      [DWH_PROJECT.WF:wkf_daily_loading]
      $$platform=hpux
      $$DBC_ORA=oracle_dwh
      [DWH_PROJECT.WF:wkf_daily_loading.ST:s_src_sa_sapbw]
      $$DBC_SAP=sapbw.etl-tools.info
      $$DBC_ORA=oracle_sap_staging
      
     Workflow tasks

    Purpose

    Tasks are designed in the Workflow Manager and represent an executable set of actions, functions or commands which define the runtime behavior of the whole ETL process.
    Some of the most often used tasks:
  • Session task - it is needed to run a mapping
  • Assignment task - can establish the value of a Workflow Variable whose value can be used at a later point in the workflow, as testing criteria to determine if (or when) other workflow tasks/pipelines should be run.
  • Decision task - enables the workflow designer to set criteria by which the workflow will or will not proceed to the next set of tasks, depending on whether the set criteria is true or false. It tests for a condition during the workflow and sets a flag based on the condition
  • Email task - sends an email from within a workflow. Email addresses, a subject line and the email message text can be defined. When called from within a Session task, the message text can contain variable session-related metadata.
  • Event wait task - pauses processing of the pipeline until a specified event occurs. Events can be Pre-defined (file watch) or User-defined (created by an Event Raise task elsewhere in the workflow)
  • Command task - specifies one or more UNIX commands or shell scripts, or DOS commands or batch files for Windows servers, to run during a workflow. Command task status (success or failure) is held in the task-specific variable $command_task_name.STATUS (see the sketch after this list). 
    UNIX and DOS commands can also be run pre- or post- session in a Session task.
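
    A minimal sketch of a Decision task condition built on such task status variables (the task names are hypothetical):

      $s_load_customers.Status = SUCCEEDED AND $cmd_archive_files.Status = SUCCEEDED

    A link leaving the Decision task can then test $dec_loads_ok.Condition = 1 (again, a hypothetical task name) to decide whether the downstream tasks run.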

    Examples / useful tips

  • Session, Email and Command tasks can be reusable.
  • Reusable tasks are created within the Task Developer

    Workflow

    Purpose

    A Workflow, designed in the Workflow Manager, is a collection of tasks that describe runtime ETL processes. In IBM InfoSphere DataStage terms, Workflows are Job Sequences; they correspond to Flows in Ab Initio and Jobs in Pentaho Data Integration. 

    Examples / useful tips
  • Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created with any text editor such as Notepad or Vi.
  • When developing a sequential workflow, it is a good idea to use the Workflow Wizard to create Sessions in sequence. Dependencies between the sessions can be created.
  • Session parameters must be defined in a parameter file. Since session parameters do not have default values, when the Integration Service cannot locate the value of a session parameter in the parameter file, the session initialization fails.
  • On under-utilized hardware systems, it may be possible to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the Integration Service node. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity
  • Incremental aggregation is useful for applying captured changes in the source to aggregate calculations in a session.
  • From the Workflow Manager Tools menu, select Options and select the option to 'Show full names of task'. This will show the entire name of all tasks in the workflow.

    Design and development best practices

    Mapping design tips

    • Standards - sticking to consistent standards is beneficial in a long-run. This includes naming conventions, descriptions, environment settings, parameter files, documentation, etc.
    • Reusability - in order to be able to react quickly to potential changes, use where possible such Informatica components as mapplets, worklets, reusable transformations.
    • Scalability - when designing and developing mappings, it is a good practice to keep volumes in mind. This concerns caching, queries, partitioning, and initial vs. incremental loads.
    • Simplicity - it is recommended to create multiple mappings instead of few complex ones. Use Staging Area and try to keep the processing logic as clear and simple as possible.
    • Modularity - use modular design technique (common error handling, reprocessing).

    Mapping development best practices

    • Source Qualifier - use shortcuts, extract only the necessary data, limit the read of columns and rows on the source. Try to use the default query options (User Defined Join, Filter) instead of the SQL Query override, which may impact database resources and prevent the use of partitioning and pushdown optimization.
    • Expressions - use local variables to limit the amount of redundant calculations, avoid datatype conversions, reduce invoking external scripts (coding outside of Informatica), provide comments, use operators (||, +, /) instead of functions. Keep in mind that numeric operations are generally faster than string operations.
    • Filter - use the Filter transformation as close to the source as possible. If multiple filters need to be applied, usually it's more efficient to replace them with Router.
    • Aggregator - use sorted input, also use as early (close to the source) as possible and filter the data before aggregating.
    • Joiner - try to join the data in Source Qualifier if possible, avoid outer joins. It is a good practice to use a source with fewer rows as a Master source.
    • Lookup - a relational lookup should only return ports that meet the condition. Call unconnected Lookups in an expression (IIF). Replace large lookup tables with joins when possible. Review the database objects and add indexes to database columns when possible. Use the Cache Calculator in the session to eliminate paging in the lookup cache.