Reaching the Final submission date

Weeks of hard work and research have come to their final evaluation. As the deadlines for the final code submission and the evaluations approach, I have been working hard to give the project its last few touch-ups and some structure so that I can finally submit it.

I am glad to say that the project has met most of its functional requirements, with only a few tests left to consider. The project should benefit a variety of people working on time-series forecasting models: automatic prediction of model parameters will save users much of the time they would otherwise spend on trial and error to find the best-fitting model.

The major components of the project (as of now) include:

  1. Automatic model selection for SARIMAX models.
  2. Automatic model selection for Exponential Smoothing models.
  3. Automatic Box-Cox transformation (parameter prediction).
  4. Forecast and ForecastSet classes to hold and compare different time-series models.
  5. Time-series cross-validation module.

Over the next few days, I’ll be working on robust testing of the different modules that I have created to strengthen this project.

I am highly excited to see this project merged into the statsmodels repository and become part of a release.

The project files can be found at:

https://github.com/statsmodels/statsmodels/pull/4621/files

Time-Series Cross Validation (tscv)

One of the most challenging parts of my GSoC project is creating the time series cross-validation functionality. Model evaluation metrics are critical for checking how your model performs on real-life, unseen data. In-sample fit metrics (like AIC and BIC) are not enough to tell you how your model will forecast new data, as these metrics only use the data the model was fit on.

Cross-validation is one of the most widely accepted techniques for judging how a model responds to new data.

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it.

For classification problems, one typically uses stratified k-fold cross-validation, in which the folds are selected so that each fold contains roughly the same proportions of class labels.
In repeated cross-validation, the cross-validation procedure is repeated n times, yielding n random partitions of the original sample. The n results are then averaged (or otherwise combined) to produce a single estimate.

In Time Series Cross Validation, there is a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast.

For time series forecasting, a cross-validation statistic is obtained as follows:

  1. Fit the model to the data y_1, …, y_t and let ŷ_{t+1} denote the forecast of the next observation. Then compute the forecast error e_{t+1} = y_{t+1} - ŷ_{t+1}.
  2. Repeat step 1 for t = m, …, n-1, where m is the minimum number of observations needed to fit the model.
  3. Compute the MSE from e_{m+1}, …, e_n (see the code sketch below).
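
To make the procedure concrete, here is a minimal sketch of rolling-origin cross-validation in Python, using statsmodels' SARIMAX as the example model. The function name rolling_origin_mse and its signature are mine for illustration; they are not the API of the tscv module in the pull request.

```python
# A minimal sketch of rolling-origin time series cross-validation.
# The helper name and signature are illustrative, not the tscv API.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def rolling_origin_mse(y, order=(1, 0, 0), m=20):
    """Fit on y[:t] and forecast y[t] for t = m, ..., n-1; return the MSE."""
    errors = []
    for t in range(m, len(y)):
        res = SARIMAX(y[:t], order=order).fit(disp=False)
        errors.append(y[t] - res.forecast(steps=1)[0])  # one-step-ahead error
    return np.mean(np.square(errors))
```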

Reference:

https://robjhyndman.com/hyndsight/crossvalidation/

Reaching the first functional testing of Automatic Forecasting

I’m glad to report that the project I’ve been working on has now reached its first functional testing, and it’s passing quite a number of unit tests written to check it against the auto.arima and ets functions of the forecast package in R.

In my last post, I mentioned:

Apart from these, one of my tasks is to figure out a way to connect the various modules and classes that I built during the first month with the ES modules so that they all work together.

Over the last few weeks, I’ve worked on exactly this, using the Forecast class wrappers to create a completely automatic workflow for forecasting time series data with SARIMAX and ES models.

The SARIMAX models use the auto_order() function that I created during my first month of coding, and it has stood firm against a lot of unit tests comparing it with the auto.arima function in R. SARIMAX model selection is also more flexible now: earlier it was limited to using only AIC values, but it can now use all the information criteria available in statsmodels. This is a big deal, because the SARIMAX models now give our end users real options for model selection.
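
To give a flavor of what this does, here is a simplified, self-contained sketch of a brute-force order search that minimizes a chosen information criterion. It illustrates the idea behind auto_order(); the helper name select_order and its arguments are mine, not the function's actual signature.

```python
# A simplified sketch of automatic (p, q) selection by information
# criterion; illustrative only, not the actual auto_order() code.
import itertools
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def select_order(y, max_p=2, max_q=2, ic="aic"):
    best_value, best_order = np.inf, None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            res = SARIMAX(y, order=(p, 0, q)).fit(disp=False)
        except Exception:
            continue  # skip orders that fail to estimate
        value = getattr(res, ic)  # res.aic, res.bic, res.hqic, ...
        if value < best_value:
            best_value, best_order = value, (p, q)
    return best_order, best_value
```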

The ES models use the auto_es() function that I created during the first half of my second coding phase, and they are now working well against a few unit tests comparing them with the ets function in R. However, auto_es() is limited to additive error models and does not include multiplicative error models, both of which are supported in the forecast package in R. I am working with my mentor to check whether it is possible to add this flexibility to the ES models. Apart from that, the ES models are flexible in the choice of information criterion for model selection, which is a good sign.

Unlike at last month’s evaluation, this time I have also updated the documentation for all the classes and modules built so far. I have written quite a few smoke tests and unit tests for them, and I plan to write a few more to test the automatic forecasting module thoroughly.

The example notebook showing the work is present here in a Github gist:
https://gist.github.com/abhijeetpanda12/fb1fc40e560f5f6d390159488c0d1e4a

All my code contributions can be found at the following branch:
https://github.com/abhijeetpanda12/statsmodels/tree/auto-forecast-1

or at the corresponding pull request:
https://github.com/statsmodels/statsmodels/pull/4621

Please comment on my work and give me feedback on how I can improve my project.

Moving forward with Exponential Smoothing models

The first part of GSoC 2018 is now over with the completion of the first evaluations, and I am really thankful to my mentor for passing me.

I planned my project in an organized manner while writing the proposal, and I am happy to be on track with only a few parts remaining. The first month, according to the plan, focused on the SARIMAX models and their model selection, while the second month, which is now, focuses on the Exponential Smoothing models and their model selection. Last week was spent deciding which parameters are valuable to keep when selecting an ES model for automatically forecasting time-series data.

To get into the details, I referred to the ets function of the forecast package in R and ran a few unit tests to check whether our results match it. We are following a brute-force approach: fit the candidate Exponential Smoothing models, compare their in-sample information criteria (like AIC), and choose the best model. The best model is then returned and used for the forecasts.
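
As a hedged sketch of this brute-force idea, the loop below fits each additive trend/seasonal combination with statsmodels' ExponentialSmoothing and keeps the lowest-AIC model. The helper name select_es is mine for illustration; the actual auto_es() code lives in the pull request.

```python
# A minimal sketch of brute-force ES model selection over the additive
# component combinations; illustrative only, not the auto_es() code.
import itertools
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def select_es(y, seasonal_periods=12):
    best_aic, best_res = float("inf"), None
    for trend, seasonal in itertools.product([None, "add"], repeat=2):
        model = ExponentialSmoothing(
            y,
            trend=trend,
            seasonal=seasonal,
            seasonal_periods=seasonal_periods if seasonal else None,
        )
        res = model.fit()
        if res.aic < best_aic:
            best_aic, best_res = res.aic, res
    return best_res  # results object for the best-fitting model
```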

Apart from these, one of my tasks is to figure out a way to connect the various modules and classes that I built during the first month with the ES modules so that they all work together.

I’ll keep posting more on this project as I complete the module and make it fully functional.

Project status after a month of coding

It’s been almost one month since I officially started coding for Statsmodels as part of Google Summer of Code. The journey so far has been challenging and thrilling. The milestones I cover every week have taught me a lot about coding practice, statistics, and open source. Here I am sharing some of the work from the last two weeks, which I feel covered the most challenging milestones of my first month of contribution.

The third week of my code contribution targeted expanding my auto_order function (created during the first week) to support computing seasonal orders and intercepts. This included developing code to check all the different combinations of AR and MA parameters, along with the seasonal parameters, to find the one giving the lowest AIC for a particular input time series.

The fourth week focused on building an auto-transformation module to help automatically transform a time series into a stationary one. Since statsmodels already includes the Box-Cox transformation itself, my focus was on a module that predicts the parameter for this transformation. The book by Draper and Smith, “Applied Regression Analysis”, provided some useful techniques for doing that: the Box-Cox parameter (lambda) is predicted by finding the value of lambda that maximizes the likelihood of a linear regression.
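
For readers who want to experiment, scipy ships a maximum-likelihood estimator for the Box-Cox lambda, which makes a convenient stand-in for trying the idea out. The module in my PR predicts lambda via the likelihood of a linear regression, so treat this as an analogy rather than my actual code.

```python
# Estimating the Box-Cox lambda by maximum likelihood with scipy's
# built-in estimator; an analogy for the PR's module, not its code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # a positive-valued series

y_transformed, lam = stats.boxcox(y)  # lambda that maximizes the log-likelihood
print(f"estimated lambda: {lam:.3f}")  # near 0 here, i.e., a log transform
```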

The functions and modules that I have developed are now to be tested on real-life examples against other packages (like the forecast package in R).


Two weeks into Google Summer of Code

Now that I have completed two full weeks as a Google Summer of Code student, things have gotten much better. I have learned quite a lot about open-source communities and how good software is written. Most importantly, I have learned a lot about the Python programming language.

As I mentioned in my previous posts, my project is about building an automatic forecasting module for the Statsmodels package to help set up time series models automatically. I have been able to meet my targets for this. During the first week, my objective was to complete a simple module that takes a given range of parameters for SARIMAX models and selects the best combination of parameters (p and q, i.e., the autoregressive and moving-average orders) based on AIC (Akaike Information Criterion) values.

The target for my second week was to design the classes and supporting functions that end users would need to use the models. For this part, we split our dataset into two sets: the training set (on which the models are built) and the testing set (on which the models are validated). Part of this also included calculating different accuracy measures like MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and MAPE (Mean Absolute Percentage Error), which are computed on the testing set and used to validate the models.
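
For reference, here is a minimal sketch of those accuracy measures as they might be computed on a held-out test set; the array names y_true and y_pred are illustrative.

```python
# Minimal sketches of the accuracy measures mentioned above.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    # assumes y_true contains no zeros
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```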

All my commits are in a single pull request, and the fork to which I am pushing my changes can be found here:

https://github.com/abhijeetpanda12/statsmodels/tree/auto-forecast-1

Please give me any feedback that would help me do better on my GSoC project and on this blog.

Check out my code contributions

In response to the recent requirement to put up a blog posting my code publicly (with new code at least once a week), here is the link to the branch where I commit my code:

https://github.com/abhijeetpanda12/statsmodels/tree/auto-forecast-1

This branch contains all my code contributions to my fork of the statsmodels repository. So far I am even with my first-week target and looking forward to working on my second-week milestone.

My project is about building an automatic forecasting module for the statsmodels package. This module will help automatically determine the parameters for different time series models (SARIMAX and ES).

Say hello to the summer of code

This summer is going to be great. It was my first attempt at GSoC, and I’m glad I made it through.
I have been selected as a Google Summer of Code 2018 student at Statsmodels under the Python Software Foundation where I’ll be responsible for developing an Automatic forecasting model for time-series data.
The aim of the project is to implement an automatic forecasting infrastructure for statsmodels, similar to auto.arima()/ets() from the ‘forecast’ package in R. The goal is to use existing statsmodels models like SARIMAX and ES to build a forecasting method that automatically detects the best model and forecasts values based on that model.
Automatic forecasting algorithms determine an appropriate time series model, estimate the parameters and compute the forecasts. They are appropriate for various time series patterns, and applicable to large numbers of series without user intervention.
As of now, I plan to start the project by creating a modular infrastructure for the complete automatic forecasting process, into which I should be able to fit any new models or variations as required.
I have prepared myself with all the basics I need in terms of theory (a statistics background) and good hands-on experience with the Python language, which should help me kick-start the project.
I’ll keep posting updates on this project here on this blog.