~Note~

Please note that you can always click on an image in my postings and it will render a clear, full-sized version in a separate browser page! Also please note that this blog is best viewed with Firefox and Chrome.
 

Friday, February 19, 2010

Oracle Data Integrator Tutorial - ODI for beginners

I am going to write up a few posts and tutorials on the basics of the architecture (see the previous post) and a few beginner topics, so that readers are not intimidated by this tool. The previous post described various aspects of ODI and its architecture, and we saw the various GUIs that ODI provides. This post (and possibly another one or two) will look at the DESIGNER GUI in more detail.

One of the first things I will say is that the tool can feel a bit foreign if you come from an Informatica/OWB type background like I do. The emphasis is on building smaller, reusable pieces of code, if you will. Tools that are more ETL based (Informatica), or more GUI driven and less declarative (OWB), do not make it especially easy to break things into small pieces. Yes, many people say that breaking mappings into smaller pieces is a best practice, but those tools don't really revolve around that practice. There are mapplets and pluggable mappings, but these pale in comparison to the design methodology that ODI offers. ODI thrives on code re-use.

For example, the first diagram below shows a typical ETL/ELT mapping of the kind that might come out of OWB or Informatica (a bit bigger, more fully defined, and more manual).
The following is the type of development that ODI supports and emphasizes for code reuse:


If you are a beginner to ODI who has used other ETL/ELT tools, you'll notice as you begin building out mappings that many of the mappings you build will be more discrete, if you will. There are also many steps that are done for you; you can actually build mappings quite fast in ODI. However, do not be too worried that the tool does quite a bit for you. Like Oracle's JDeveloper, you can get into most of the gory details if you'd like to fine-tune certain aspects, even though the tool does much of the work for you.

You will do the bulk of your mapping work in the following interface:



A brief description of what we see in the diagram above. First, notice that the Diagram tab is selected at the top. This opens the diagramming tool where 'mappings' are built. In the middle are the data sources. These sources are dragged and dropped from the far left pane of the GUI into this 'sources' section and are then joined, filtered, and so forth. The right side of the GUI is labeled as the TARGET DATASTORE; this is your target. For a beginner, this can be a bit confusing if you're used to OWB, because here the sources and the target are separated into different parts of the GUI. You drag your target table into this target datastore pane and not into the same area as the sources. So, a bit different in that respect.

Below is a zoom-in on only the mapping portions of the tool. Again, on the left are the source portions of the mapping diagram, and on the right is the target datastore. You might not use the terminology 'datastore' too often, but in this tool you should get used to seeing that term a lot!
Here is another screenshot of the far left portion of the main GUI, which I tend to call the 'explorer' since it is built like the explorer-style interfaces found in many software tools.
Again, one final screenshot to tie together everything we have seen so far:

In the next posting, I will go into a bit more detail on the 'explorer' and the many tabs and options available within it. As you might have imagined, these 'mappings' have been created and opened from the explorer on the left, and they are technically referred to as 'interfaces'. I'll talk a bit more about the concept of interfaces, and go into more detail on working with your interfaces (a.k.a. mappings to most of us).

Monday, February 8, 2010

ODI Oracle Data Integrator Architecture Part 1


The following series will be a high-level look at the pieces, components, and basic workings of ODI, which should help any reader get comfortable with the finer points of this data integration and ELT tool.

The ODI graphical modules (GUIs) are shown in the image above.


The four ODI GUIs (Designer, Operator, Topology Manager, and Security Manager) are Java based. They can be installed on any platform that supports Java Virtual Machine 1.4, including Windows, Linux, HP-UX, Solaris, pSeries, and so on.

Designer is the GUI for defining metadata and rules for transformation and data quality. It uses these to generate scenarios for production, and it is where all project development takes place. It is the core module for developers and metadata administrators.

Operator is used to manage and monitor ODI in production. It is designed for production operators and shows the execution logs with error counts, the number of rows processed, execution statistics, etc. At design time, developers also use Operator for debugging purposes.

Topology Manager manages the physical and logical architecture of the infrastructure. Servers, schemas and agents are registered here in the ODI Master Repository. This module is usually used by the administrators of the infrastructure.

Security Manager manages users and their privileges in ODI. It can be used to give profiles and users access rights to ODI objects and features. This module is usually used by security administrators.

All ODI modules store their information in the centralized ODI repository.

ETL Misconceptions and Mistakes Made While Working With Large Databases - 1TB or Greater

I'd first like to give some due credit to a fellow blogger who has pushed out so much information on data warehouse topics it is staggering. He is Dan Lindstedt, and this next piece on my blog covers some common ETL mistakes on large systems, with my commentary. His work appears frequently on the Beyenetwork blogs.

Please note that some of these issues listed below might not necessarily make a difference if you are dealing with a smaller system with less data / rows. If you have never worked on a larger size database, some of these concepts might even seem unfamiliar or odd.

Top ETL mistakes


1) Incorporating Inserts, Updates, and Deletes in to the same data flow / same process.

This is a great point, if for nothing else than splitting the individual tasks into discrete pieces. For a target table load, try a separate insert job, an update job, and other jobs for deletions or massive bulk loads. These will all vary in time and load: inserts are in many cases simple and easy, while updates and especially deletes can become more taxing.
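As a rough sketch of that split (the table and column names here are made up, not from any particular system), the single do-everything flow becomes a few discrete, set-based steps:

-- Insert-only job: brand new keys go straight in (direct path if you like)
INSERT /*+ APPEND */ INTO sales_fact
SELECT s.sale_id, s.cust_id, s.amount
FROM   sales_stg s
WHERE  NOT EXISTS (SELECT 1 FROM sales_fact f WHERE f.sale_id = s.sale_id);

-- Update-only job: existing keys that changed, run as its own process
MERGE INTO sales_fact f
USING sales_stg s
ON (f.sale_id = s.sale_id)
WHEN MATCHED THEN UPDATE SET f.cust_id = s.cust_id, f.amount = s.amount;

-- Deletes get their own job (and usually their own window) as well
DELETE FROM sales_fact f
WHERE EXISTS (SELECT 1 FROM sales_del_stg d WHERE d.sale_id = f.sale_id);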

2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.

This can be cause for great concern if you try it, unless the data volume is smaller; again, though, we are referring to larger loads and larger systems here.

3) Targeting more than 1 or 2 target tables

You might consider two tables at once if they have a relationship (hierarchical, parent-child, or otherwise), but this should not happen often! That should be the exception; most large-volume ETL jobs should have one target table.

4) moving rows that are too wide through a single process

Any ways to get around this issue or mistake? Comments?

5) loading very large data sets to targets WITH INDEXES ON

Many tools allow you to disable indexes before a job and then rebuild and re-enable them afterwards (DataStage, OWB, etc.), and you can code it custom if you need to. Again, you don't need indexes and keys during the load if you have other ways to check your referential integrity.
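In plain Oracle SQL the pattern looks roughly like this (the index name is just an example); most tools can generate equivalent pre- and post-load steps for you:

-- Before the big load: mark the index unusable rather than dropping it
ALTER INDEX sales_fact_cust_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ... run the large insert / direct-path load here ...

-- After the load: rebuild the index (optionally in parallel)
ALTER INDEX sales_fact_cust_idx REBUILD PARALLEL 4 NOLOGGING;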

6) not running a cost-based optimizer in the database

Big mistake here. You should always consider a routine, after all loads are finished, to analyze or gather statistics (or the equivalent in a non-Oracle database) so that the cost-based optimizer has all the necessary information about the new structures, indexes, and so forth.
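On Oracle, for example, the post-load step can be as simple as a DBMS_STATS call (the schema and table names below are placeholders):

-- Hypothetical post-load step: refresh optimizer statistics on the freshly loaded table
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'DW',            -- placeholder schema name
    tabname => 'SALES_FACT',    -- placeholder table name
    cascade => TRUE             -- gather index statistics too
  );
END;
/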

7) not keeping statistics up to date in the database

Again, this is almost exactly the same as #6. You want current statistics so the cost-based optimizer takes advantage of the current structure and uses the right joins, indexes, etc.

8) not producing the correct indexes on the sources / lookups that need to be accessed

You should know well enough where the majority of your lookups and/or joins are going to happen... indexes... use them! But, again, see the mistake above about keeping them enabled during a load.
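A trivial sketch with made-up names: if the lookup always probes the source system's customer key, give it an index (built or rebuilt outside the big load windows, per #5):

-- Index the column your lookups and joins actually hit
CREATE INDEX customer_dim_srckey_idx ON customer_dim (source_customer_id);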

9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.

Not necessarily applicable to OWB; this is more for engine-based tools like Informatica and DataStage.

10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur

The original author considered #10 such a major issue that he repeated it for #11 and #12!

The original author points out the following concerning #10-#12:

"Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM, now - take away the RAM needed for the threads or the executable engine code... usually around 300M or more, leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have....

Now, you are left with possibly 200MB, maybe 100MB of RAM left to allocate for "data reading/data writing"... and it drops from there.

Let me explain this other piece as well: Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there's additional Limitations there, Windows 32 Bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 Bit - RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put 1/2 in the pagefile.sys swap area.

By the way, here's a tid-bit for you that you have to try to believe:
If you are running Windows 32 bit, SQLServer (either version)....
1. Change your ETL "block size" to 8k to match the Database (4k I think in SQLServer2000)
2. Change your row size to fit as many rows as possible into an 8k block, the more rows per block the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot)
4. Check in on NETWORK PACKET SIZES and increasing those between the ETL engine and the Database, increase them from 2k/4k to 8k - again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except in SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow slow - DB-LIB is 12 year old technology, and is NOT multi-threaded, nor parallel, where CT-LIB is.

Windows 64 Bit does NOT have these limitations, but requires 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware.

Linux, and Unix do not have these limitations either.

So, architecture is everything."

13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance

This does make the job harder to debug, and sometimes you get those random out-of-resource errors... beware. Split work where appropriate. If a mapping or process looks too complicated, it probably is.

14) believing that "I need to process all the data in one pass because it's the fastest way to do it."

"breaking-up" a single flow into multiple flows , and changing the width of the data sets and possibly the parallelism along the way would help the performance.

15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.

comments?

16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).

A bit hard to avoid in a row-by-row ETL tool! The parallel architecture, caching, memory sharing, and a bunch of other things are what make those tools fast.

17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).

Be careful here no matter which tool you are using. Looping, going WAY back in basic programming, can be a blessing or a disaster. If you need to loop, there is probably another, more efficient way to do what you're attempting. This does not necessarily mean using a LOOP in an OWB process flow, but rather trying to loop somehow in the ETL itself, in a called stored function or procedure, etc.
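To make #16 and #17 concrete, here is the same (hypothetical) update written both ways; on a large table, the single set-based statement is almost always the one you want:

-- Transactional thinking: a row-by-row cursor loop (slow at volume)
BEGIN
  FOR r IN (SELECT sale_id, amount FROM sales_stg) LOOP
    UPDATE sales_fact SET amount = r.amount WHERE sale_id = r.sale_id;
  END LOOP;
  COMMIT;
END;
/

-- Batch thinking: one set-based statement that does the same work
MERGE INTO sales_fact f
USING sales_stg s
ON (f.sale_id = s.sale_id)
WHEN MATCHED THEN UPDATE SET f.amount = s.amount;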

Friday, February 5, 2010

GO New Orleans Saints!



I hope they win the super bowl and knock off those dumb Colts. They beat my Bears in 2006-2007, so GO SAINTS! Who DAT?!

Date Dimensions anybody?

Here is some sample code for creating a date dimension. It's been out there before; my colleague Dave posted it on his blog at one point. I'd like to take a moment and maybe look at how Oracle implements their date dimension with the OWB date dimension wizard... hmmmm!

SELECT TO_CHAR (dt, 'YYYYMMDD') date_id,
dt date_value,
TO_CHAR (dt, 'MM/DD/YYYY') date_formatted_name,
TO_CHAR (dt, 'Mon DD, YYYY') date_name,
TO_CHAR (dt, 'YYYY') year_num,
TRUNC (dt, 'YYYY') year_start_date,
ADD_MONTHS (TRUNC (dt, 'YYYY'), 12) - 1 year_end_date,
TO_NUMBER (TO_CHAR (dt, 'Q')) quarter_num,
'Q' || TO_CHAR (dt, 'Q') quarter_name,
TRUNC (dt, 'Q') quarter_start_date,
ADD_MONTHS (TRUNC (dt, 'Q'), 3) - 1 quarter_end_date,
TO_CHAR (dt, 'YYYY') || '-' || 'Q' || TO_CHAR (dt, 'Q') year_quarter_name,
TO_CHAR (dt, 'MM') month_num,
TO_CHAR (dt, 'Mon') month_short_name,
TO_CHAR (dt, 'Month') month_long_name,
TRUNC (dt, 'MM') month_start_date,
LAST_DAY (dt) month_end_date,
TO_CHAR (dt, 'YYYY') || '-' || TO_CHAR (dt, 'Mon') year_month_name_1,
TO_CHAR (dt, 'YYYY') || '-' || TO_CHAR (dt, 'MM') year_month_num_1,
TO_CHAR (dt, 'Mon') || ' ' || TO_CHAR (dt, 'YYYY') year_month_name_2,
TO_CHAR (dt, 'YYYY') || TO_CHAR (dt, 'MM') year_month_num_2,
TO_CHAR (dt, 'WW') week_of_year_num,
TO_CHAR (dt, 'W') week_of_month_num,
TRUNC (dt, 'W') week_start_date,
TRUNC (dt, 'W') + 6 week_end_date,
TO_CHAR (dt, 'YYYY') || TO_CHAR (dt, 'WW') year_week_num,
TO_CHAR (dt, 'YYYY') || '-' || TO_CHAR (dt, 'WW') year_week_name,
TO_CHAR (dt, 'D') weekday_num,
TO_CHAR (dt, 'Day') weekday_name
FROM (SELECT TO_DATE (:start_date) + ROWNUM dt
FROM DUAL
CONNECT BY TO_DATE (:start_date) + ROWNUM <= TO_DATE(:end_date));
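
If you'd rather materialize that query than run it with binds, one option (the table name and date range here are just examples) is to wrap it in a CREATE TABLE ... AS with literal dates:

CREATE TABLE date_dim AS
SELECT TO_CHAR (dt, 'YYYYMMDD') date_id,
       dt date_value
       -- ...plus the rest of the columns from the query above...
FROM  (SELECT DATE '2009-12-31' + ROWNUM dt
       FROM   DUAL
       CONNECT BY DATE '2009-12-31' + ROWNUM <= DATE '2015-12-31');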

Metadata reporting on the OWB 11gR1 repository- options?

There are quite a few clients, coworkers, and managers I've met who feel the documentation capabilities of OWB are, shall we say... lacking a bit. Sure, there is the OWB online repository browser, which is quite handy and gives you a lot of information. There is the metadata lineage and impact analysis diagramming, which again is quite handy. There are various tricks one can do with OMB*PLUS or experts to whip up some simple documentation. Of course, when you want to get specific, it becomes a challenge.

Say you want a report, with minimal work, in a CSV or Excel spreadsheet, showing all sorts of values that you figure you can grab from the repository somehow. Again, you can accomplish some great things with OMB and experts, and I have done this, but many times it is too time-intensive and overkill. Sometimes you might consider UDOs (User Defined Objects), which can be created with OMB scripting and give you additional fields and properties to use on your objects. This can make things easy when you are at the beginning of a project and can document things as you go along.

What if you're 50% or even 90% done with your project, and someone demands something - fast? I say: look at the repository quickly, write some SQL, and get it out there fast.

Well, in OWB 11gR1 the repository is broken down into things called workspaces. These workspaces have separate owners, and they have their own users. So if you want to access the many views in the OWBSYS schema that hold important metadata, you'll need to set some context for the user accessing the data, namely the various workspace owners. Below are a few steps, published in the forums and on the OWB blog as well, that outline what you should do if you cannot see any data in the OWBSYS views...

First read this.

Then....
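
For reference, and strictly from memory of those forum/blog steps (so treat the package name as an assumption and verify it against the links above), the workspace context call looks something like the following, with placeholder workspace name and owner:

-- Assumed OWB 11gR1 call to point the OWBSYS public views at one workspace; verify the exact package/procedure name
BEGIN
  owbsys.wb_workspace_management.set_workspace('MY_WORKSPACE', 'WORKSPACE_OWNER');
END;
/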


THEN, here are 2 sample scripts that are in the Oracle forums as well. These should get you started, and get your mind thinking in terms of what other views you'd like to look at and what other columns you are interested in... as well as the various ways to join the views!


--This script focuses more on mapping contents...
select
distinct 'TARGET',
comp.map_name,
comp.data_entity_name,
comp.operator_type
from
all_iv_xform_map_components comp,
all_iv_xform_map_parameters param
where
lower(operator_type)
in ('table', 'view', 'dimension', 'cube')
and param.map_component_id = comp.map_component_id
and param.source_parameter_id is not null
UNION
select
distinct 'SOURCE',
t1.c1,
t1.c2,
t1.c3
from
(select
comp.map_name c1,
comp.data_entity_name c2,
comp.operator_type c3,
max(param.source_parameter_id) c4
from
all_iv_xform_map_components comp,
all_iv_xform_map_parameters param
where
lower(operator_type) in
('table', 'view', 'dimension', 'cube')
and param.map_component_id = comp.map_component_id
group by
comp.map_name, comp.data_entity_name, comp.operator_type) t1
where t1.c4 is null
order by 2,1

--This script focuses more on source to target tables and column mappings

SELECT
S_TBL_LOC ,
S_TBL_name,
S_FLD_name,
T_TBL_LOC ,
T_TBL_name,
T_FLD_name
FROM
(
SELECT
parms.map_component_id S_COMP_id ,
parms.parameter_id S_PARM_id ,
colms.entity_name S_TBL_name ,
colms.column_name S_FLD_name ,
tbls.schema_name S_TBL_LOC
FROM
all_iv_xform_map_parameters PARMS,
all_iv_xform_map_components COMPS,
all_iv_columns COLMS ,
all_iv_tables TBLS
WHERE
source_parameter_id IS NULL
AND parms.map_component_id = comps.map_component_id
AND UPPER(comps.operator_type) IN ('TABLE','DIMENSION')
AND parms.data_item_id =colms.column_id
AND colms.entity_id =tbls.table_id
)
,
(
SELECT
parms.map_component_id T_COMP_id ,
parms.parameter_id T_PARM_id ,
parms.source_parameter_id T_Sid ,
colms.entity_name T_TBL_name ,
colms.column_name T_FLD_name ,
tbls.schema_name T_TBL_LOC
FROM
all_iv_xform_map_parameters PARMS,
all_iv_xform_map_components COMPS,
all_iv_columns COLMS ,
all_iv_tables TBLS
WHERE
source_parameter_id IS NOT NULL
AND parms.map_component_id = comps.map_component_id
AND UPPER(comps.operator_type) IN ('TABLE','DIMENSION')
AND parms.data_item_id =colms.column_id
AND colms.entity_id =tbls.table_id
)
WHERE
S_PARM_id=T_Sid

Monday, February 1, 2010