Time-Series Cross Validation (tscv)

One of the most challenging parts of my GSoC project is to create the functionality for time-series cross-validation. Model evaluation metrics are critical for checking how a model performs on real-life, unseen data. Model fit metrics (like AIC and BIC) are not enough to tell how a model will forecast on new data, since these metrics only use the data the model was fit on.

Cross-validation is one of the most widely accepted techniques for judging how a model responds to new data.

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it.

For classification problems, one typically uses stratified k-fold cross-validation, in which the folds are selected so that each fold contains roughly the same proportions of class labels.
In repeated cross-validation, the cross-validation procedure is repeated n times, yielding n random partitions of the original sample. The n results are then averaged (or otherwise combined) to produce a single estimate.
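
For contrast with the time-series scheme described next, here is a brief illustration of stratified k-fold splitting using scikit-learn (scikit-learn is just one convenient way to demonstrate the idea; it is not part of the project code):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(24).reshape(12, 2)   # 12 samples, 2 features
    y = np.array([0] * 8 + [1] * 4)    # imbalanced class labels (2:1)

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        # Each test fold keeps roughly the same 2:1 class proportion.
        print(test_idx, y[test_idx])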

In Time Series Cross Validation, there is a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast.

For time series forecasting, a cross-validation statistic is obtained as follows (a minimal Python sketch appears after the list):

  1. Fit the model to the data $y_1, \dots, y_t$ and let $\hat{y}_{t+1}$ denote the forecast of the next observation. Then compute the forecast error $e_{t+1} = y_{t+1} - \hat{y}_{t+1}$.
  2. Repeat step 1 for $t = m, \dots, n-1$, where $m$ is the minimum number of observations needed for fitting the model.
  3. Compute the MSE from $e_{m+1}, \dots, e_n$.
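
A minimal sketch of this rolling-origin procedure in Python, assuming only NumPy. The fit_and_forecast helper is a hypothetical placeholder for any one-step-ahead forecaster (a trivial mean forecast here); in practice one would refit, say, a SARIMAX model at each step:

    import numpy as np

    def fit_and_forecast(history):
        # Toy forecaster: predict the next value as the mean of the history.
        # (Placeholder; substitute a real model fit in practice.)
        return np.mean(history)

    def ts_cross_val_mse(y, m):
        """One-step-ahead time-series cross-validation MSE.

        y : observations y_1, ..., y_n
        m : minimum number of observations needed to fit the model
        """
        y = np.asarray(y, dtype=float)
        errors = []
        # For t = m, ..., n-1: fit on y_1..y_t, then forecast y_{t+1}.
        for t in range(m, len(y)):
            forecast = fit_and_forecast(y[:t])
            errors.append(y[t] - forecast)  # e_{t+1} = y_{t+1} - yhat_{t+1}
        return np.mean(np.square(errors))   # MSE over e_{m+1}, ..., e_n

    # Example: CV MSE of the mean forecaster on a short series.
    print(ts_cross_val_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], m=3))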

Reference:

https://robjhyndman.com/hyndsight/crossvalidation/

Reaching the first functional testing of Automatic Forecasting

I’m glad to post that the project I’ve worked on has now reached its first functional testing and is passing a good number of unit tests written to check it against the auto.arima and ets functions of the forecast package in R.

In my last post, I mentioned:

Apart from these, one of my tasks is to figure out a way to connect the various modules and classes that I built during the first month with the ES modules, so that they all work together.

Over the last few weeks, I’ve worked on this by using the Forecast class wrappers to create a completely automatic workflow for forecasting time-series data with the SARIMAX and ES models.

The SARIMAX models use the auto_order() function that I created during my first month of coding, which has held up against a large number of unit tests comparing it with the auto.arima function in R. The SARIMAX workflow is also more flexible in model selection now: earlier it was limited to using only AIC values, but it can now use any of the information criteria available in statsmodels. This is a great deal because it gives our end users real options for model selection.
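
To illustrate the idea of selecting an order by an arbitrary information criterion (this is only a hypothetical sketch, not the actual auto_order() implementation; select_order and its parameters are made-up names):

    import itertools
    import warnings
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def select_order(y, max_p=2, max_d=1, max_q=2, ic="aic"):
        # Hypothetical sketch: pick the (p, d, q) whose fitted SARIMAX
        # minimizes the chosen criterion ("aic", "bic", or "hqic").
        best_ic, best_order = np.inf, None
        for p, d, q in itertools.product(range(max_p + 1),
                                         range(max_d + 1),
                                         range(max_q + 1)):
            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    res = SARIMAX(y, order=(p, d, q)).fit(disp=False)
            except Exception:
                continue  # skip orders that fail to estimate
            crit = getattr(res, ic)
            if crit < best_ic:
                best_ic, best_order = crit, (p, d, q)
        return best_order, best_ic

A call like select_order(data, ic="bic") would then return the best order under BIC rather than AIC.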

The ES models use the auto_es() function that I created during the first half of my second coding phase, and it is now working well against a few unit tests comparing it with the ets function in R. However, it is currently limited to additive-error models and does not yet include the multiplicative-error models, both of which are supported in the forecast package of R. I am working with my mentor to check whether it is possible to add this flexibility to the ES models. Apart from that, the ES models are also flexible in their use of information criteria for model selection, which is a good sign.
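
In the same hedged spirit, the kind of additive-only search auto_es() performs could be sketched as follows (select_es is a hypothetical name; the real implementation in the branch linked below differs in details):

    import numpy as np
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    def select_es(y, seasonal_periods=None, ic="aic"):
        # Hypothetical sketch: try the additive trend/seasonal
        # combinations and keep the fit with the lowest criterion.
        best_ic, best_cfg, best_res = np.inf, None, None
        trends = [None, "add"]
        seasonals = [None, "add"] if seasonal_periods else [None]
        for trend in trends:
            for seasonal in seasonals:
                try:
                    res = ExponentialSmoothing(
                        y, trend=trend, seasonal=seasonal,
                        seasonal_periods=seasonal_periods if seasonal else None,
                    ).fit()
                except Exception:
                    continue
                crit = getattr(res, ic)
                if crit < best_ic:
                    best_ic, best_cfg, best_res = crit, (trend, seasonal), res
        return best_cfg, best_res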

Unlike at my last month’s evaluation, this time I have also updated the documentation for all the classes and modules built so far. I have written quite a few smoke tests and unit tests for them, and I plan to write a few more to test the automatic forecasting models thoroughly.

The example notebook showing this work is available as a GitHub gist:
https://gist.github.com/abhijeetpanda12/fb1fc40e560f5f6d390159488c0d1e4a

All my code contributions can be found at the following branch:
https://github.com/abhijeetpanda12/statsmodels/tree/auto-forecast-1

or in the corresponding pull request:
https://github.com/statsmodels/statsmodels/pull/4621

Please comment on my work and give me feedback on how I can improve the project.