Monday, February 8, 2010

ETL Misconceptions and Mistakes Made While Working With Large Databases - 1TB or Greater

I'd first like to give due credit to a fellow blogger who has pushed out a staggering amount of information on data warehouse topics: Dan Linstedt. This piece covers some common ETL mistakes on large systems, with my commentary. His work appears frequently on the Beyenetwork blogs.

Please note that some of the issues listed below may not make much difference if you are dealing with a smaller system with fewer rows. If you have never worked on a larger database, some of these concepts may even seem unfamiliar or odd.

Top ETL mistakes

1) Incorporating Inserts, Updates, and Deletes into the same data flow / same process.

This is a great point, if for nothing else than splitting the individual tasks into discrete pieces. For a target table load, try a separate insert job, a separate update job, and other jobs such as deletes or massive bulk loads. These will all vary in run time and load: inserts are in many cases simple and fast, while updates and especially deletes can become far more taxing.
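To make the idea concrete, here is a minimal sketch (the `id` key field and row shapes are my own hypothetical example, not anything from a specific tool) of partitioning incoming change records into separate insert, update, and delete batches, so each DML type can run as its own job:

```python
def split_changes(source_rows, target_keys):
    """Partition source rows into insert/update batches, plus the set
    of target keys to delete, so each DML type runs as its own job
    instead of one mixed data flow."""
    inserts, updates = [], []
    source_keys = set()
    for row in source_rows:
        key = row["id"]
        source_keys.add(key)
        if key in target_keys:
            updates.append(row)   # key already in target: update job
        else:
            inserts.append(row)   # new key: insert job
    deletes = target_keys - source_keys  # keys no longer in the source
    return inserts, updates, deletes
```

Each batch can then feed a dedicated job, and the three jobs can be scheduled, tuned, and restarted independently.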

2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.

This can be cause for great concern unless the data volume is small; again, though, we are referring to larger loads and larger systems here. Every extra source system is another dependency that can stall or fail the whole load.

3) Targeting more than 1 or 2 target tables

You might consider two tables at once if they have a relationship (hierarchical, parent-child, or otherwise), but this should not happen often! That should be the exception; most large-volume ETL jobs should have one target table.

4) moving rows that are too wide through a single process

Any way to get around this issue? One approach is to carry only the columns you actually need through the flow, or to split very wide rows into parallel flows that rejoin on a key. Comments?

5) loading very large data sets to targets WITH INDEXES ON

Many tools (DataStage, OWB, etc.) let you disable indexes before a load and rebuild them after the job finishes; you can also code it by hand if you need to. Again, you don't need indexes and keys during the load if you have other ways to check your referential integrity.
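As a sketch of what the hand-coded version generates, here is a small helper that emits the before/after DDL for a list of indexes. The syntax is Oracle's, and the index names in the example are hypothetical; adjust both for your own database:

```python
def index_maintenance_ddl(indexes):
    """Build the DDL to disable indexes before a bulk load (Oracle
    syntax) and rebuild them afterwards. With the indexes UNUSABLE,
    the load skips index maintenance entirely."""
    before = [f"ALTER INDEX {ix} UNUSABLE" for ix in indexes]
    after = [f"ALTER INDEX {ix} REBUILD NOLOGGING" for ix in indexes]
    return before, after
```

In Oracle you would also set `SKIP_UNUSABLE_INDEXES` so DML ignores the disabled indexes during the load, then run the rebuild statements once the load completes.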

6) not running a cost-based optimizer in the database

Big mistake here. After all loads finish, you should always run a routine to analyze tables / gather statistics so that your cost-based optimizer (or the equivalent in a non-Oracle database) has current information about the new data, indexes, and so forth.

7) not keeping statistics up to date in the database

Again, this is almost exactly like #6. You want current statistics so the cost-based optimizer sees the current structure and uses the right joins, indexes, etc.
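For #6 and #7 together, a post-load step can be as simple as generating a `DBMS_STATS` call per loaded table. This is Oracle-specific, and the schema/table names in the example are hypothetical; other databases have their own equivalents (`ANALYZE`, `UPDATE STATISTICS`):

```python
def gather_stats_call(owner, table):
    """Build the PL/SQL block that refreshes optimizer statistics on a
    table after a load (Oracle). cascade => TRUE also gathers stats on
    the table's indexes."""
    return ("BEGIN DBMS_STATS.GATHER_TABLE_STATS("
            f"ownname => '{owner}', tabname => '{table}', "
            "estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, "
            "cascade => TRUE); END;")
```

Run the generated block as the last step of each load, so the next job's query plans are built against fresh statistics rather than yesterday's row counts.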

8) not producing the correct indexes on the sources / lookups that need to be accessed

You should know well enough where the majority of your lookups and/or joins are going to happen... indexes: use them! But, again, see mistake #5 about disabling them during large loads.

9) not purchasing enough RAM for the ETL server to hold its caches in memory.

Not necessarily applicable to OWB, which pushes its work into the database; this is more of an issue for engine-based tools like Informatica and DataStage.

10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur

The original author had such a major issue with #10 that he repeated it as #11 and #12!!!

The original author points out the following concerning #10-#12:

"Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM, now - take away the RAM needed for the threads or the executable engine code... usually around 300M or more, leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have....

Now, you are left with possibly 200MB, maybe 100MB of RAM left to allocate for "data reading/data writing"... and it drops from there.

Let me explain this other piece as well: Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there's additional Limitations there, Windows 32 Bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 Bit - RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put 1/2 in the pagefile.sys swap area.

By the way, here's a tid-bit for you that you have to try to believe:
If you are running Windows 32 bit, SQLServer (either version)....
1. Change your ETL "block size" to 8k to match the Database (4k I think in SQLServer2000)
2. Change your row size to fit as many rows as possible into an 8k block, the more rows per block the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot)
4. Check in on NETWORK PACKET SIZES and increasing those between the ETL engine and the Database, increase them from 2k/4k to 8k - again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except in SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow slow - DB-LIB is 12 year old technology, and is NOT multi-threaded, nor parallel, where CT-LIB is.

Windows 64 Bit does NOT have these limitations, but requires 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware.

Linux, and Unix do not have these limitations either.

So, architecture is everything."

13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance

This makes the job harder to debug, and sometimes you get those random out-of-resource errors... beware. Split work where appropriate. If a mapping or process looks too complicated, it probably is.

14) believing that "I need to process all the data in one pass because it's the fastest way to do it."

"breaking-up" a single flow into multiple flows , and changing the width of the data sets and possibly the parallelism along the way would help the performance.

15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.

Rejected rows force the tool into row-by-row error handling. Where possible, screen the data for constraint problems inside the flow, so the database never has to reject anything.

16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).

A bit hard to avoid in a row-by-row ETL tool! The parallel architecture, caching, memory sharing, and a number of other things are what make those tools fast; the point is to think in sets and commit in batches, not one row at a time.
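One concrete piece of batch thinking is the commit strategy: group rows into fixed-size batches so the target applies them with array/bulk binds and one commit per batch, rather than a commit per row. A minimal sketch (the batch size is whatever your testing shows works best for your target):

```python
def batches(rows, size):
    """Yield rows in fixed-size batches, so the loader can use bulk
    binds (e.g. a driver's executemany) and commit once per batch
    instead of once per row, cursor-style."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch
```

A loader would then do one bulk insert plus one commit per yielded batch, turning thousands of round trips into a handful.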

17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).

Be careful here no matter which tool you are using. Looping goes way back in basic programming and can be a blessing or a disaster. If you think you need to loop, there is probably a more efficient, set-based way to do what you're attempting. This does not necessarily mean a LOOP in an OWB process flow; it also means looping inside the ETL itself, in a called stored function or procedure, etc.
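To illustrate the loop-versus-set distinction with a toy example of my own (the fact/dimension row shapes are hypothetical): the first function scans the dimension once per fact row, the way a cursor loop would; the second builds the lookup once and makes a single pass, which is the set-based shape a database turns into one hash join.

```python
def enrich_rows_loop(facts, dims):
    """Transactional thinking: re-scan the dimension for every fact
    row, like a nested cursor loop."""
    out = []
    for f in facts:
        for d in dims:
            if d["key"] == f["dim_key"]:
                out.append({**f, "dim_name": d["name"]})
                break
    return out

def enrich_rows_set(facts, dims):
    """Set thinking: build the lookup once, then one pass over the
    facts - the in-database equivalent of a single hash join."""
    lookup = {d["key"]: d["name"] for d in dims}
    return [{**f, "dim_name": lookup[f["dim_key"]]}
            for f in facts if f["dim_key"] in lookup]
```

Both return the same matched rows, but the loop version does work proportional to facts times dimensions, while the set version is a single pass over each.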

1 comment:

  1. Hi Greg,

    Thanks for blogging on my topics. In regards to teaching scalable systems design, I've started a new blog (yes, it's based on the Data Vault model and methodology) - but even if you don't want/use the DV, there's a ton of BI/EDW information I'm pushing into how-to, hands-on lessons.

    My coaching area covers scalability, process design, repeatability, 64-bit architecture, all the good stuff - including OWB, SQL, Oracle tips and tricks, and of course Informatica.

    You can also get good information about project management, Six Sigma, TQM, PMP, KPAs, and KPIs - I've got a full methodology section with 50+ downloadable templates.

    Anyhow, it's free to register for the blog, and it's my new resource.

    Dan Linstedt