Throughput post followup question

Hello folks. SSIS - Lovin it!

I have a couple questions/comments:

Nov 7th is a long ways away. Ideally, I'd like to convert my current ETL over to SSIS immediately and begin running in production and work around the current bugs. Yeah, I know its unsupported, but it would be nice to be ahead of the technology. Currently using stored procedures and they have grown too complex to manage. I find myself adding code (2 minutes) and then spending a great deal of time helping the optimizer make sense out of the 30 joins of large tables.

#1 -Is it safe to say that SSIS RTM will most likely work with CTP versions of packages?

I saw Jamie's post regarding throughput, and having the capability to log # rows processed and time taken for each component within a data flow. I would love to see this in a future version.

I had just sent this email to Donald asking the same thing. so, Donald, if you're reading this first, disregard my email.

#2 - Can you explain how the engine works to the following question.
"Donald, I’ve seen some ETL tools where tasks within the data flow run in parallel, where chunks of data are continually sent through the pipeline. When watching the SSIS window in debug mode, the data moves as 1 chunk through the dataflow for ease of debugging. But how does it actually work during execution on the server?

In particular, I’m looking to log information on how many rows have been processed by the task. For example, in DTS, every 1000 rows , I’d have script inside the transform data task to connect to the database and update a custom log table to reflect the # of rows processed so far. How would I accomplish this in SSIS? Preferably, I’d like to use a slowly changing dimension wizard to create the insert & update data flow items and to set those to have an insert batch size of some set # - e.g. 1000 rows. How can I log how many rows have any passed through the entire transform AND have been written based on the batchsize?"

[2399 byte] By [GaryCabana] at [2008-2-15]
# 1
Gary,
A few points here, some of which may be relevant.

You can monitor the number of rows passed to a destination (i.e. any destination) using the "SSIS Pipeline" object in performance monitor. This is outside of SSIS though so you can't log it. Perhaps it would be a good idea to make this same statistic available to the event handling mechanism.

A batch of 1000 is not relevant for updates because these occur a row at a time (using the OLE DB Command component).

I would not assume that RTM will work packages you build now. You never know what is round the corner.

I think what you're talking about when you say "the data flow run in parallel" is pipeline parallelisation (where componentA can operate on record1 then pass it onto componentB. ComponentB then operates on record1 as componentA operates on record2). In terms of what actually happens, think of the data conceptually as sitting in memory with different operations continually being applied to it rather than data being passed about to different operations.

I also would love a tool that automatically monitors throughput as a package is running - also perhaps presenting historical executions in a nice GUI. (A rival product has just such a tool). I guess this would have to interrogate the SSIS runtime in order to do this.

Just some random thoughts...

-Jamie

JamieThomson at 2007-9-8 > top of Msdn Tech,SQL Server,SQL Server Integration Services...
# 2

Hi, another option to explore is the Log Entry OnPipeLineRowsSent that you can turn on for a dataflow task. This log entry has been modified to include rows passed down a data path as buffers are handed off. You cannot control increments but you can aggregate and store the data. I say the log entry has been modified and you will now see this data with IDW 15, but not prior IDW builds. This will be documented in BOL (sorry not in IDW15) but here some info to start.

OnPipelineRowsSent Rows were provided to a data flow component as input. : : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960

The above is and acutally entry. You need to parse the data out...but it is there. The last number is the rows that were sent down the pipline.

Assuming you have a SQL Log provider and have turned this event on for the dataflow task

you can see the events using the following, I sorted descending to get the most recent and avoid old versions of the event without the new data.

select top 1 event,Message
from sysdtslog90
where [event] like 'onpipelinerowssent%'
order by starttime desc
sample output
event Message

OnPipelineRowsSent Rows were provided to a data flow component as input. : : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960

so whats new is this

: : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960

which a user can parse out for

PathID
PathIDName
ComponentID
ComponentIDName
InputID
InputIDName
rowssent

You can buid quieries/reports on this data to do things like store historical rows processed by a package for comparison over time and build a system to compare values for error detection...e.g. avg is usually 10,000 today we only prcessed 200, fire an alert. It can also be useful during package exection, refreshing a report/query and watch/monitor an aggregate of the rowssent as a way of eyeing package progress.

How you want to parese it up to you of course...many ways an including SSIS pacakge itself. I created some tsql (function and view) which parses it out and then I call that from a reporting services report.

Assuming the XLS paste below comes through...its a sample of the results of a large package I ran and I mined the onpipeline rows sent...the key is the path id, name, and the rows sent...which I aggregated based on the pathid. The numbers shown here, match the numbers you will visually see when you execute the package inside deisgner. so, again there are many cool things you could do with this data...historic run comaparisons (mine it regularly, save off and dump all the other eventts) or a re-freshing report showing you the pipeline progress. Very cool I think! Though I admit its a bit hard to get out....we will work on the for the future. We are working on several sample RS reports against the log data, and this should be in there...not sure yet when we will have that ready for download/distribution.

Hope this helps
-Craig
PathID BuffersToInputCount PathIDName OnPipeLineRowsSent
7758 3091 Derived Column Output 13454939
7949 3091 Flat File Source Output 13454939
7776 3091 Error Output 9669833
8858 3091 Multicast Output 2 3785106
8843 3091 Output 1 3785106
8844 3091 Multicast Output 1 3785106
1241 3091 Type N 3522930
9179 3091 Derived Column Output 3522930
11147 3091 RowCount Output 1 3522930
9193 3091 Multicast Output 2 262176
6308 3091 Multicast Output 3 262176
2805 3091 Type A and L 262176
11476 3091 RowCount Output 1 262176
5565 3091 Derived Column Output 262176
6399 2099 Long and Short Company Names 5682
5527 1 Aggregate Output 1 1173
5528 1 Sort Output 1173
8888 1 Aggregate Output 1 183
8909 1 Sort Output 183
6988 1 company names 100

CraigGuyer at 2007-9-8 > top of Msdn Tech,SQL Server,SQL Server Integration Services...

SQL Server

Site Classified