Time Series - Key Time and Missing Value Substitution
1. I am using datetime as the Key Time column and if I leave the parameter Missing_Value_Substitution (MVS) blank, it complains there are missing values. I have stripped the data down to a single sample series which *definitely* has exactly 1 row with each month (e.g. from 2003-01-01, 2003-02-01 ... 2005-07-01) and it still gives this error.
2. What happens if there are multiple Key Time values, e.g.
2003-01-01 Bread 100 Shop1
2003-01-01 Bread 200 Shop2
2003-01-01 Milk 100 Shop1
2003-02-01 Bread 100 Shop1
Where Shop1/2 are treated as Inputs but Bread/Milk are in the Case Key column. I do not want to add the Shop ID to the Case Key (i.e. predict Shop1:Bread etc.) as there would be far too many. Does this make sense or am I misunderstanding how it works?
What I am looking for is any seasonal variations in sales (which months it always peaks in) and the reason I do not want to aggregate the sales per shop is that there are separate patterns in different groups of shops. I get the feeling there may be a chicken and egg situation here, as I really need to group the shops with similar Time Series patterns first and THEN do the Time Series on each group to find the pattern?! Any ideas?
Thanks
Here's a response from our Time Series expert:
When both Key Time and Case key are used, then each input column generates as many time series as there are case key values. All these time series must have aligned time stamps - otherwise you must use MISSING_VALUE SUBSTITUTION
Here are a couple of simple examples
| Date (Key Time) | Location (Case Key) | Sales |
| 2/15/2006 | Redmond | 100 |
| 2/16/2006 | Bellevue | 200 |
| 2/16/2006 | Redmond | 300 |
| 2/17/2006 | Bellevue | 400 |
| 2/17/2006 | Redmond | 500 |
| Date (Key Time) | Location (Case Key) | Sales |
| 2/15/2006 | Redmond | 100 |
| 2/15/2006 | Bellevue | 200 |
| 2/16/2006 | Redmond | 300 |
| 2/16/2006 | Bellevue | 400 |
| 2/17/2006 | Redmond | 500 |
Each of these tables generates two time series: Sales.Bellevue and Sales.Redmond
The second table has some missing data that requires the use of MISSING_VALUE_SUBSITUTION
Indeed, Sales.Redmond has the value for 2/17 but Sales.Bellevue does not.
The user must indicate what that means (e.g. - no sales or M_V_S = 0 ; or record lost, but the data can be extrapolated from the previous day , i.e. M_V_S = Previous).
The first table has some missing data too, namely Sales.Bellevue for 2/15, but the lack of this kind of data is forgiven, as it is the piece in far past that is missing. The algorithm may simply truncate the Redmond curve and use the aligned values of 2/16 and 2/17 for the two series.
This set of examples is not exhaustive.
Consider the third table
| Date (Key Time) | Location (Case Key) | Sales |
| 2/15/2006 | Redmond | 100 |
| 2/15/2006 | Bellevue | 200 |
| 2/16/2006 | Redmond | 300 |
| 2/17/2006 | Bellevue | 400 |
| 2/17/2006 | Redmond | 500 |
Here the data for Sales.Bellevue is missing in the middle of the aligned time segment , i.e. for 2/16/2006.
Again the M_V_S must be defined or the missing row must be supplied.
The bottom line is, the user should be mindful that they may well get (many) more series than there are columns in the data table and all those series need to be aligned (at minimum - right-aligned and without gaps). I believe that when TS complains about missing data, that is for good reason, so either the data needs to be cleaned or MISSING_VALUE_SUBSTITUTION needs to be defined.
Thank you for the reply, but it is not quite what I mean.
Essentially I have 2 case keys (though one is not being defined as such, because the algorithm doesn't allow it). To use your example, it would be like having Location as the actual Case key, but an additional Store ID column, with 2 stores in Redmond and another 3 in Bellevue. Each of these stores would (usually) have monthly statistics, so there will be multiple rows with the same date for Richmond. Is this permissable or is it going to produce nonsense results?
As I said, I want to look for patterns that may be different in each store.
In relation to the error message, I am sure the series has no missing dates, so perhaps it is a caching issue (I have noticed odd things when I make changes to a db but they seem to be ignored, even if I refresh the view). I will try a very simple sequence to see if it is reproduceable.
Thanks,
Neil.
In this case, you can make the case key composite by either
a) binding it to multiple KeyColumns (you can do this in the in the Mining Structure Editor in BI Development Studio by going to KeyColumns in the Properties pane for a mining structure column, clicking on the "..." button and adding multiple source columns.
or
b) creating a computed column in your DataSourceView that combines the two source columns and binding to the computed column in your mining structure
In both cases, the case key is a single column from the mining structure's perspective but it's bound to multiple source columns.