splitting data

I've read that we need to split data into a training and testing data set. How do I go about that?
[99 byte] By [RogerMcKinney] at [2007-12-25]
# 1

You can use the sampling transforms in Integration Services: Percentage Sampling and Row Sampling. Percentage Sampling simply divides an input set by a percentage; a typical division is 70% training and 30% test. Row Sampling lets you choose exactly how many rows to select from an input of unknown size.

In general, to split a dataset you would use percentage sampling. Row sampling is good if you have an arbitrarily large dataset that you want to reduce to a reasonable size before mining, e.g. from millions of rows to hundreds or tens of thousands (not that you need to, it's just faster).
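To make the difference between the two transforms concrete, here is a rough Python sketch of the two ideas. It is only an illustration, not how Integration Services implements them: the function names `percentage_split` and `row_sample` are made up for this example, and reservoir sampling is just one common way to draw a fixed number of rows from an input whose size is not known in advance.

```python
import random


def percentage_split(rows, percent, seed=None):
    """Send each row to the 'selected' output with probability percent/100.

    The split is approximate: about `percent` % of rows end up selected.
    """
    rng = random.Random(seed)
    selected, unselected = [], []
    for row in rows:
        if rng.random() * 100 < percent:
            selected.append(row)
        else:
            unselected.append(row)
    return selected, unselected


def row_sample(rows, n, seed=None):
    """Return exactly n rows (or all rows if fewer exist) from an iterable
    of unknown length, using reservoir sampling in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    training, testing = percentage_split(range(1000), 70, seed=42)
    print(len(training), len(testing))
    print(len(row_sample(iter(range(1_000_000)), 10_000, seed=42)))
```

The percentage version gives you both halves of the split (training and test), while the row version is the one to reach for when you only want to cut a huge input down to a fixed size.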

There are some tips on advanced sampling at

http://www.sqlserverdatamining.com/DMCommunity/TipsNTricks/2615.aspx

http://www.sqlserverdatamining.com/DMCommunity/TipsNTricks/4048.aspx

JamieMacLennan at 2007-9-3 > top of Msdn Tech,SQL Server,Data Mining...
# 2
I understand how to split the data in SSIS, but how do I tell the DM tool, using a tree algorithm, that one table is training data and the other is testing/validation data?
RogerMcKinney at 2007-9-3
# 3
You don't need to. Any internal validation done by the algorithm is opaque to the user and is based on the dataset that you provide. The testing/validation data is used for the accuracy charts, and you specify it when you go to the accuracy chart tab.
JamieMacLennan at 2007-9-3
# 4
Great! Thanks! Any documentation on how the algorithms choose the training and validation cases?
RogerMcKinney at 2007-9-3
# 5
Microsoft Neural Networks and Microsoft Logistic Regression will pick a random sample of the data to hold out for validation. The size of the sample is determined by the value of the HOLDOUT_PERCENTAGE parameter, which defaults to 30% of the training set.
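As a rough picture of what a random holdout looks like, here is a small Python sketch. It is not the algorithms' actual internal code, which isn't shown in this thread; the function name `holdout_split` is hypothetical, and only the 30% default is taken from the HOLDOUT_PERCENTAGE description above.

```python
import random


def holdout_split(cases, holdout_percentage=30, seed=None):
    """Shuffle the cases and set aside holdout_percentage % of them.

    Returns (fitting_cases, holdout_cases).
    """
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_percentage / 100)
    return shuffled[n_holdout:], shuffled[:n_holdout]


if __name__ == "__main__":
    fitting, holdout = holdout_split(range(200), seed=7)
    print(len(fitting), len(holdout))
```

Unlike a per-row percentage split, shuffling first gives a holdout set of a fixed, predictable size.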
BogdanCrivat at 2007-9-3
