Throughput post followup question
Hello folks. SSIS - Lovin it!
I have a couple questions/comments:
Nov 7th is a long ways away. Ideally, I'd like to convert my current ETL over to SSIS immediately and begin running in production and work around the current bugs. Yeah, I know its unsupported, but it would be nice to be ahead of the technology. Currently using stored procedures and they have grown too complex to manage. I find myself adding code (2 minutes) and then spending a great deal of time helping the optimizer make sense out of the 30 joins of large tables.
#1 -Is it safe to say that SSIS RTM will most likely work with CTP versions of packages?
I saw Jamie's post regarding throughput, and having the capability to log # rows processed and time taken for each component within a data flow. I would love to see this in a future version.
I had just sent this email to Donald asking the same thing. so, Donald, if you're reading this first, disregard my email.
#2 - Can you explain how the engine works to the following question.
"Donald, I’ve seen some ETL tools where tasks within the data flow run in parallel, where chunks of data are continually sent through the pipeline. When watching the SSIS window in debug mode, the data moves as 1 chunk through the dataflow for ease of debugging. But how does it actually work during execution on the server?
In particular, I’m looking to log information on how many rows have been processed by the task. For example, in DTS, every 1000 rows , I’d have script inside the transform data task to connect to the database and update a custom log table to reflect the # of rows processed so far. How would I accomplish this in SSIS? Preferably, I’d like to use a slowly changing dimension wizard to create the insert & update data flow items and to set those to have an insert batch size of some set # - e.g. 1000 rows. How can I log how many rows have any passed through the entire transform AND have been written based on the batchsize?"
Gary,
A few points here, some of which may be relevant.
You can monitor the number of rows passed to a destination (i.e. any destination) using the "SSIS Pipeline" object in performance monitor. This is outside of SSIS though so you can't log it. Perhaps it would be a good idea to make this same statistic available to the event handling mechanism.
A batch of 1000 is not relevant for updates because these occur a row at a time (using the OLE DB Command component).
I would not assume that RTM will work packages you build now. You never know what is round the corner.
I think what you're talking about when you say "the data flow run in parallel" is pipeline parallelisation (where componentA can operate on record1 then pass it onto componentB. ComponentB then operates on record1 as componentA operates on record2). In terms of what actually happens, think of the data conceptually as sitting in memory with different operations continually being applied to it rather than data being passed about to different operations.
I also would love a tool that automatically monitors throughput as a package is running - also perhaps presenting historical executions in a nice GUI. (A rival product has just such a tool). I guess this would have to interrogate the SSIS runtime in order to do this.
Just some random thoughts...
-Jamie
Hi, another option to explore is the Log Entry OnPipeLineRowsSent that you can turn on for a dataflow task. This log entry has been modified to include rows passed down a data path as buffers are handed off. You cannot control increments but you can aggregate and store the data. I say the log entry has been modified and you will now see this data with IDW 15, but not prior IDW builds. This will be documented in BOL (sorry not in IDW15) but here some info to start.OnPipelineRowsSent Rows were provided to a data flow component as input. : : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960
The above is and acutally entry. You need to parse the data out...but it is there. The last number is the rows that were sent down the pipline.Assuming you have a SQL Log provider and have turned this event on for the dataflow task
you can see the events using the following, I sorted descending to get the most recent and avoid old versions of the event without the new data.
select top 1 event,Message
from sysdtslog90
where [event] like 'onpipelinerowssent%'
order by starttime desc
sample output
event Message
OnPipelineRowsSent Rows were provided to a data flow component as input. : : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960
so whats new is this
: : 1801 : Audit Output 1 : 1796 : Flat File Destination : 1797 : Flat File Destination Input : 9960
which a user can parse out for
PathID
PathIDName
ComponentID
ComponentIDName
InputID
InputIDName
rowssent
You can buid quieries/reports on this data to do things like store historical rows processed by a package for comparison over time and build a system to compare values for error detection...e.g. avg is usually 10,000 today we only prcessed 200, fire an alert. It can also be useful during package exection, refreshing a report/query and watch/monitor an aggregate of the rowssent as a way of eyeing package progress.
How you want to parese it up to you of course...many ways an including SSIS pacakge itself. I created some tsql (function and view) which parses it out and then I call that from a reporting services report.
Assuming the XLS paste below comes through...its a sample of the results of a large package I ran and I mined the onpipeline rows sent...the key is the path id, name, and the rows sent...which I aggregated based on the pathid. The numbers shown here, match the numbers you will visually see when you execute the package inside deisgner. so, again there are many cool things you could do with this data...historic run comaparisons (mine it regularly, save off and dump all the other eventts) or a re-freshing report showing you the pipeline progress. Very cool I think! Though I admit its a bit hard to get out....we will work on the for the future. We are working on several sample RS reports against the log data, and this should be in there...not sure yet when we will have that ready for download/distribution.
Hope this helps
-Craig
| PathID | BuffersToInputCount | PathIDName | OnPipeLineRowsSent |
| 7758 | 3091 | Derived Column Output | 13454939 |
| 7949 | 3091 | Flat File Source Output | 13454939 |
| 7776 | 3091 | Error Output | 9669833 |
| 8858 | 3091 | Multicast Output 2 | 3785106 |
| 8843 | 3091 | Output 1 | 3785106 |
| 8844 | 3091 | Multicast Output 1 | 3785106 |
| 1241 | 3091 | Type N | 3522930 |
| 9179 | 3091 | Derived Column Output | 3522930 |
| 11147 | 3091 | RowCount Output 1 | 3522930 |
| 9193 | 3091 | Multicast Output 2 | 262176 |
| 6308 | 3091 | Multicast Output 3 | 262176 |
| 2805 | 3091 | Type A and L | 262176 |
| 11476 | 3091 | RowCount Output 1 | 262176 |
| 5565 | 3091 | Derived Column Output | 262176 |
| 6399 | 2099 | Long and Short Company Names | 5682 |
| 5527 | 1 | Aggregate Output 1 | 1173 |
| 5528 | 1 | Sort Output | 1173 |
| 8888 | 1 | Aggregate Output 1 | 183 |
| 8909 | 1 | Sort Output | 183 |
| 6988 | 1 | company names | 100 |