
Google Summer of Code 2015

Josef Perktold edited this page Mar 17, 2015 · 14 revisions

Statsmodels has participated in GSoC for six years under the umbrella of the Python Software Foundation. The focus in previous years has been on adding new models. There are still several areas where statsmodels is missing commonly used models. We also have several models that have been worked on but still need work to finish them, add unit tests, and integrate them into statsmodels. Finally, there are several areas where existing models can be extended. One important consideration in selecting a project is the background of the student: it is an advantage if the student is familiar with the topic and may also be using it in her or his research.

Introduction

Statsmodels is a library for statistics and econometrics written in Python, with some extensions written in Cython. By now it contains many of the most commonly used models for estimation, hypothesis tests, and statistical graphs. See our documentation for more information. The developer pages describe in more detail how to contribute to statsmodels and our workflow for pull requests. Our issues are also on GitHub; they include bug reports, wishlist items, and enhancement plans and ideas.

Guidelines & requirements

Statsmodels will participate in GSoC 2015 under the umbrella of Python Software Foundation.

PSF student guidelines: http://wiki.python.org/moin/SummerOfCode/Expectations

Advice on writing a proposal (written with the Mailman project in mind, but generally applicable)

The most important requirement that we expect from students is a sufficient background in statistics or econometrics. Students should be comfortable with Python (intermediate level). Knowing how to use Git is also important; this can be learned before the official start of GSoC if needed.

Advice

Potential candidates should take a look at the guidelines on how to contribute to statsmodels. Making a small enhancement, bug fix, or documentation fix to statsmodels before applying for GSoC is a requirement from the PSF; it does not need to be related to your proposal. It can also help you get an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Ideas

We encourage students to propose their own projects, but we also have several areas that are relatively high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.

Note that the difficulty level depends on the student's statistics/econometrics background and on familiarity with the current statsmodels code.

common to all projects:

  • domain-specific knowledge: high level of statistics or econometrics knowledge for the specific topic
  • programming language: Python, intermediate level

Add Maximum Likelihood Models for other distributions

This is a relatively easy project in the sense that it can largely follow the patterns of the existing models. There is a large variety of distributions that can be added as Maximum Likelihood Models. One example is additional count models: zero-inflated models, hurdle models, generalized distributions such as generalized Poisson or NegativeBinomial, Poisson-inverse Gaussian, and so on. Another example would be parametric survival or failure models, especially accelerated failure time models and similar. Another area that is not yet covered is models for compositional data (shares or proportions that add up to one, or to a constant).
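To make the zero-inflated case concrete, here is a minimal sketch of a zero-inflated Poisson log-likelihood with constant parameters and no covariates, fitted with a simple EM iteration. It uses only the Python standard library and does not follow the statsmodels model API; the function names `zip_loglike` and `fit_zip_em` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def zip_loglike(counts, pi, lam):
    """Log-likelihood of a zero-inflated Poisson sample.

    pi is the probability of a structural zero, lam the Poisson mean.
    Both are constants here (no explanatory variables).
    """
    ll = 0.0
    for y in counts:
        log_pois = -lam + y * math.log(lam) - math.lgamma(y + 1)
        if y == 0:
            # a zero is either structural or a Poisson zero
            ll += math.log(pi + (1.0 - pi) * math.exp(log_pois))
        else:
            ll += math.log(1.0 - pi) + log_pois
    return ll


def fit_zip_em(counts, pi=0.5, lam=1.0, n_iter=500):
    """Fit (pi, lam) by EM, treating the structural-zero indicator as missing."""
    n = len(counts)
    total = sum(counts)
    n_zero = sum(1 for y in counts if y == 0)
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing probability and the Poisson mean
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return pi, lam


# Toy data with more zeros than a plain Poisson would produce (made up).
counts = [0] * 60 + [1] * 10 + [2] * 15 + [3] * 10 + [4] * 5
pi_hat, lam_hat = fit_zip_em(counts)
print(pi_hat, lam_hat, zip_loglike(counts, pi_hat, lam_hat))
```

A GSoC implementation would instead let both parameters depend on explanatory variables and would provide standard errors and the usual results statistics.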

difficulty: easy to intermediate

mentor: Josef Perktold

Panel Data

This is still one large category of basic models that is currently missing in statsmodels. There is a pull request for the standard econometrics model (PR #1133), but there is no work yet on extensions or on dynamic panel data models.
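As a rough illustration of what a basic panel model involves, the following sketch computes the fixed-effects (within) slope estimate for a single regressor by demeaning within each entity. This is an assumption about the kind of model meant here, not a description of the code in the pull request; the function name `within_estimator` and the data are hypothetical.

```python
from collections import defaultdict


def within_estimator(entity, x, y):
    """Fixed-effects slope for one regressor via within-entity demeaning."""
    groups = defaultdict(list)
    for g, xi, yi in zip(entity, x, y):
        groups[g].append((xi, yi))

    sxy = 0.0
    sxx = 0.0
    for obs in groups.values():
        mean_x = sum(xi for xi, _ in obs) / len(obs)
        mean_y = sum(yi for _, yi in obs) / len(obs)
        for xi, yi in obs:
            # demeaning removes the entity-specific intercept
            sxy += (xi - mean_x) * (yi - mean_y)
            sxx += (xi - mean_x) ** 2
    return sxy / sxx


# Two entities with different intercepts but a common slope (made-up data).
entity = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
y = [12.0, 14.0, 16.0, -1.0, 3.0, 7.0]
print(within_estimator(entity, x, y))
```

A full implementation would also need multiple regressors, time effects, and appropriate (for example cluster-robust) standard errors.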

difficulty: intermediate

mentor: Kevin Sheppard?, Josef Perktold

Mixed Effects Models

statsmodels now has the basic linear mixed effects model. It can be extended to more general cases of linear models, or to non-Gaussian models such as generalized linear models or discrete models.

difficulty: hard

mentor: Kerby Shedden, Josef Perktold

Extensions to State Space Models

Statsmodels now includes a general-purpose Kalman filter and state space model. The main model currently built on top of it is the SARIMAX model. There are many ways in which the state space models can be extended:

  • Additional models: VAR, VARMA, structural time series, dynamic factors, etc. (there are many more)
  • Additional estimation capability: EM algorithm, exact diffuse initialization
  • Additional post-estimation capability: test residuals for normality, heteroskedasticity, serial correlation (these should be very easy and are not worth a GSoC project unless more is added)
  • Additional techniques: switching models (add the Kim filter which is a state space generalization of the Hamilton filter), non-linear and / or non-Gaussian filtering (see chapters 9 and 10 of the Durbin and Koopman book; examples include adding support for exponential family errors, stochastic volatility models, the extended Kalman filter, and the unscented Kalman filter).
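For readers new to the topic, the simplest structural time series model, the local level model, shows the recursion that a Kalman filter performs. The sketch below is plain Python and is independent of the statsmodels state space code; the name `local_level_filter` is hypothetical, and the large initial variance `p0` is only an approximate diffuse initialization, not the exact one mentioned in the list above.

```python
import math


def local_level_filter(y, sigma2_eps, sigma2_eta, a0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t.

    Returns the filtered state means and the Gaussian log-likelihood.
    """
    a, p = a0, p0
    filtered = []
    loglike = 0.0
    for obs in y:
        v = obs - a                # one-step-ahead prediction error
        f = p + sigma2_eps         # variance of the prediction error
        k = p / f                  # Kalman gain
        loglike += -0.5 * (math.log(2.0 * math.pi * f) + v * v / f)
        a = a + k * v              # filtered state mean
        p = p * (1.0 - k)          # filtered state variance
        filtered.append(a)
        p = p + sigma2_eta         # predicted variance for the next period
    return filtered, loglike


# Made-up series; the variances are fixed here rather than estimated.
series = [4.9, 5.3, 5.1, 6.0, 6.4, 6.1, 7.2]
states, llf = local_level_filter(series, sigma2_eps=0.5, sigma2_eta=0.1)
print(states[-1], llf)
```

Estimating the two variances by maximizing the returned log-likelihood gives the usual maximum likelihood fit of this model.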

difficulty: intermediate to hard

mentor: Chad Fulton

Survival Models

statsmodels includes the Cox proportional hazards model, and a pull request for Kaplan-Meier and similar estimators is almost finished. One possible extension would be to extend the Cox proportional hazards model to time-varying explanatory variables and to add a Poisson or generalized linear model representation that can be used for semi-parametric estimation, e.g. using splines for the baseline hazard.
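For orientation, the Kaplan-Meier product-limit estimator itself is short. The following sketch is a standalone illustration in plain Python, not the code from the pull request; the function name `kaplan_meier` and the data are hypothetical. An event indicator of 1 means the event was observed and 0 means the observation was censored.

```python
def kaplan_meier(times, events):
    """Return (time, survival probability) pairs at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_tied = 0
        # collect all observations that share this time point
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_tied += 1
            i += 1
        if n_events:
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        # events and censored observations both leave the risk set
        n_at_risk -= n_tied
    return curve


# Made-up durations; 0 marks a censored observation.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```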

difficulty: intermediate

mentor: Kerby Shedden, Josef Perktold

Propensity score matching, and treatment effects estimation

This is another area that is currently missing in statsmodels. There are some projects outside of statsmodels that partially implement it in Python. One possibility is to implement the equivalent of Stata's psmatch or the new teffects, or of similar packages in R, or GSoC-sized parts of them. PR #2288 has an implementation of the basic parts; related discussion is in issue #858.

difficulty: intermediate

mentor: Josef Perktold, Kerby Shedden

Other possible projects

bring your own

penalization or regularization approaches for generalized linear or maximum likelihood models: Currently the only models with penalized estimation are L1-penalization for discrete models, and L1/L2 penalization for linear models and Cox proportional hazards models. We don't want to duplicate the excellent facilities of scikit-learn, but there is a large range of use cases and models.

classical multivariate analysis: PCA, factor analysis and canonical correlation analysis. There are algorithms for some of these in other Python packages, but they either don't provide the full statistical model or don't have the associated statistical results. PCA is now available in statsmodels.

  • ...
  • ...

Cleanup, Refactor and integrate unfinished projects

difficulty: hard for GSOC, requires familiarity with large parts of current statsmodels code.

The general objective is to increase unit test coverage and to bring pull requests and higher-priority code in the sandbox into a condition in which they can be merged. Additional improvements and enhancements can also be added to the current core code. There are many improvements that will not require a large amount of time; below is a non-exhaustive list of ideas that are mostly larger in terms of the required time. The issues on GitHub will provide a starting point for most cases.

Close gaps in unit test coverage and fix bugs if necessary: Almost all core code has good functional coverage (verifying correctness) but less common code paths and unusual user inputs are insufficiently tested. Some code on the "fringes" has insufficient test coverage. Some functions need updating for the full integration with and support of pandas data structures.

system of equations, simultaneous equations: a previous GSOC project that needs to be updated to the current statsmodels code base, plus missing test coverage, and possibly additional results.

repeated measures anova: rewrite code in pull request to integrate with pandas and conform to statsmodels code structure.

Migrate Pandas.stats to statsmodels: see https://github.com/pydata/pandas/issues/6077

Power and effect size: Currently, the power and sample size calculations mainly provide a low-level interface. We need additional effect size calculations and additional functions that make power and sample size calculations easier to use.
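As an illustration of the kind of convenience functions meant here, the sketch below computes Cohen's d from two samples and then solves for the per-group sample size of a two-sided two-sample test using the normal approximation. It is standalone Python (it needs `statistics.NormalDist`, available from Python 3.8) and does not use the statsmodels power classes; all function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled)


def ztest_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test with equal group sizes."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n_per_group / 2.0)
    # probability of rejecting in either tail
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


def solve_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Smallest per-group sample size that reaches the requested power."""
    if effect_size == 0:
        raise ValueError("effect_size must be nonzero")
    n = 2
    while ztest_power(effect_size, n, alpha) < power:
        n += 1
    return n


# Made-up pilot samples used only to obtain an effect size.
pilot_a = [2.0, 4.0, 6.0]
pilot_b = [1.0, 3.0, 5.0]
d = cohens_d(pilot_a, pilot_b)
print(d, solve_n_per_group(d))
```

A t-test based calculation would use the noncentral t distribution instead of the normal approximation and gives slightly larger sample sizes.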

Bootstrap, resampling methods: We have bootstrap methods incorporated in several models, and there are additional examples and scripts inside and outside of statsmodels. statsmodels is still missing a consistent framework, helper functions, and integration of these with existing models.
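A helper function of the kind that is missing could look roughly like the following percentile bootstrap for an arbitrary statistic. This is a minimal standalone sketch using only the standard library; the name `bootstrap_ci`, its signature, and the data are hypothetical and not part of statsmodels.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    n = len(data)
    # resample with replacement and recompute the statistic each time
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = reps[int((alpha / 2.0) * n_boot)]
    upper = reps[min(n_boot - 1, int((1.0 - alpha / 2.0) * n_boot))]
    return lower, upper


# Made-up sample.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
print(bootstrap_ci(sample))
print(bootstrap_ci(sample, stat=statistics.median))
```

Integration with models would additionally require resampling schemes that respect the data structure, for example residual or block bootstrap.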
