Ideas for Enhancement Projects
editorial comment: This is an updated version of Ideas for Google Summer of Code 2012 Projects
Under Pages there are additional drafts of statsmodels enhancement proposals (SMEPs).
There are still many areas where coverage in statsmodels is lacking. So, if a student has a strong preference for a topic, it should be possible to cover it.
The basic idea is to pick your favorite chapters in an econometrics or statistics book, or a topic covered by an R package, Stata, or any other package for statistical analysis, see what is missing in statsmodels, and identify what would be most useful to add first.
Of course, support for a topic will also depend on the availability of a mentor with sufficient expertise to advise.
The following are some ideas. If you are interested in one of the topics, we can also help with additional information.
Status: implemented and will be released in statsmodels 0.5
Author: Skipper Seabold, based on patsy, Nathaniel Smith's formula package
Convenient support for categorical explanatory variables is still largely lacking in statsmodels. This can follow up on the existing formula implementation of Jonathan and of Nathaniel, and the start of the integration in the statsmodels account on github. The topic is pretty complex and I would recommend it only to someone familiar with the formula framework in R.
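As a rough illustration of what support for categorical explanatory variables involves, the following is a minimal sketch of treatment (dummy) coding, where one level serves as the reference and every other level gets an indicator column. The helper `treatment_code` is hypothetical and written only for this example; it is not the patsy or statsmodels API.

```python
def treatment_code(values, reference=None):
    """Dummy-code a categorical variable, dropping the reference level.

    Hypothetical helper for illustration only (not part of patsy or
    statsmodels). Returns the kept levels and one indicator row per value.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level is the baseline
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


columns, design = treatment_code(["a", "b", "c", "a", "b"])
print(columns)
for row in design:
    print(row)
```

Dropping the reference level keeps the indicator columns from being collinear with an intercept, which is the main reason a formula framework has to track the coding of each factor.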
Status: was a GSOC 2012 project, partially finished, Pull Request
Linear_model, robust_linear_model and generalized_linear_model could all take a given non-linear function y = f(x, parameters) instead of the current linear version y = X*beta. Technically this can follow mostly the pattern of the current linear versions, but requires that one gets familiar with all three models.
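To make the step from y = X*beta to y = f(x, parameters) concrete, here is a minimal sketch of nonlinear least squares for one specific function, f(x, a, b) = a * exp(b * x), using a Gauss-Newton iteration with step halving. The model, the function names, the starting values, and the noiseless synthetic data are all assumptions made for this example; it does not reflect how the pull request or statsmodels implements the estimator.

```python
import math


def model(x, a, b):
    # Assumed example of a nonlinear mean function f(x, parameters).
    return a * math.exp(b * x)


def ssr(xs, ys, a, b):
    # Sum of squared residuals for the current parameters.
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def fit_gauss_newton(xs, ys, a, b, max_iter=100, tol=1e-12):
    """Damped Gauss-Newton for the two-parameter model above."""
    for _ in range(max_iter):
        resid = [y - model(x, a, b) for x, y in zip(xs, ys)]
        # Jacobian columns: derivatives of f with respect to a and b.
        ja = [math.exp(b * x) for x in xs]
        jb = [a * x * math.exp(b * x) for x in xs]
        # Normal equations (J'J) d = J'r, solved directly for the 2x2 case.
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ga = sum(u * e for u, e in zip(ja, resid))
        gb = sum(v * e for v, e in zip(jb, resid))
        det = saa * sbb - sab * sab
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        # Halve the step until the sum of squares does not increase.
        current = ssr(xs, ys, a, b)
        step = 1.0
        while step > 1e-10 and ssr(xs, ys, a + step * da, b + step * db) > current:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return a, b


# Noiseless synthetic data generated with a = 2.0 and b = 0.5.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = fit_gauss_newton(xs, ys, a=1.0, b=0.1)
print(a_hat, b_hat)
```

Each iteration solves a linear least squares problem on the Jacobian, which is why the nonlinear versions can largely follow the pattern of the existing linear models.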
Status: GSOC 2012 project, Pull Request to be merged
multivariate models, seemingly unrelated regression, simultaneous equation models
Status: GSOC 2012 project, first part merged, will be released in statsmodels 0.5
Status: GSOC 2012 project, merged, will be released in statsmodels 0.5, some parts in sandbox
Status: Pull Request
Status: work in progress Pull Request, under refactoring by Skipper
Status: WIP Pull Request (#452) by Josef, additional work by Virgile
LTS, ELTS, MM-Estimators
Status: partially implemented, some in WIP, others missing
The coverage of statistical hypothesis tests is increasing. There are still tests that are missing in statsmodels or scipy.stats, or that have only limited options. Results classes for the outcome of statistical tests are also mostly missing, and they need supporting methods (plot, summary, confidence intervals, ...).
Work on additional support for power and sample size calculations and for effect size calculations has just started.
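As a small illustration of what a power and sample size calculation looks like, the sketch below computes the power of a two-sided one-sample z-test for a standardized effect size and searches for the smallest sample size that reaches a target power. The function names are hypothetical and the z-test with known variance is an assumption chosen to keep the example within the Python standard library; it is not the statsmodels implementation.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (known variance assumed)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5  # mean of the test statistic under H1
    # Probability of rejecting in the upper tail plus the lower tail.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


def required_n(effect_size, power=0.8, alpha=0.05):
    """Smallest n whose power reaches the target, found by linear search."""
    if effect_size == 0:
        raise ValueError("effect_size must be nonzero")
    n = 1
    while z_test_power(effect_size, n, alpha) < power:
        n += 1
    return n


print(z_test_power(0.5, 20))
print(required_n(0.5))
```

With an effect size of zero the power reduces to the significance level, which is a convenient sanity check for any such function.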
Generic GMM is mostly implemented in the sandbox, but it has missing pieces. Except for the two-stage least squares case, no specific models that use GMM are implemented. The possible application areas are wide; one possibility that has been popular in recent years would be support for weak instruments.
These are models with an additional random component that can be implemented from either a statistics or an econometrics viewpoint. The topic is large, so some selection has to be made.
Vincent has a pull request for the basic panel data model (within, between, and one-random-factor models)
similar ideas but different implementation from a statistics or an econometrics viewpoint. Estimation and inference based on moment conditions or estimating equations based on a panel or longitudinal structure of the data.
Review for Generalized Linear Mixed Models: Dean, C. B., and Jason D. Nielsen. 2007. “Generalized Linear Mixed Models: a Review and Some Extensions.” Lifetime Data Analysis 13 (November 14): 497–512. doi:10.1007/s10985-007-9065-x.
A wide range of models where statsmodels is completely lacking. Examples would be threshold models, Markov switching models, ...
mainly Stock and Watson and offspring. Interesting would be also to link this up with some of the variable selection procedures in sklearn similar to Bai and Ng.
extending current vector_ar models to include VECM representation and estimation and the corresponding cointegration estimation.
adapt and integrate Wes's DLM code (JP: I don't know what the status is.)
large parts for univariate GARCH are written and in the sandbox, but need cleanup, enhancements and verification.
Statsmodels is missing a systematic framework for bootstrap and other resampling approaches. Some bootstrap is included in several models and parts of statsmodels. A basic framework needs to make the tools (iterators) available, and tie it in with various models, or add them to statistical tests. (Might require more familiarity with the model structure in statsmodels.)
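The "tools (iterators)" part of such a framework could look roughly like the sketch below: a generator that yields resampled observation indices, and a percentile confidence interval built on top of it. The names `bootstrap_indices` and `bootstrap_ci` are hypothetical, and the sketch uses only the standard library; it is one possible design under these assumptions, not an existing statsmodels interface.

```python
import random
import statistics


def bootstrap_indices(n_obs, n_rep, seed=None):
    """Yield n_rep lists of indices drawn with replacement from range(n_obs)."""
    rng = random.Random(seed)
    for _ in range(n_rep):
        yield [rng.randrange(n_obs) for _ in range(n_obs)]


def bootstrap_ci(data, stat, n_rep=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    estimates = sorted(
        stat([data[i] for i in idx])
        for idx in bootstrap_indices(len(data), n_rep, seed)
    )
    lower = estimates[int((1 - level) / 2 * n_rep)]
    upper = estimates[int((1 + level) / 2 * n_rep) - 1]
    return lower, upper


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
low, high = bootstrap_ci(data, statistics.mean)
print(low, high)
```

Keeping the index iterator separate from the statistic means the same resampling scheme could be reused by a model's fit method or by a hypothesis test, which is the kind of tie-in described above.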
Status: slowly increasing, pandas had a GSOC 2012 project
statsmodels has some plots with matplotlib included, but compared to other statistical packages there are still gaps. One idea would be to implement graphics with a coverage similar to other statistical packages in a user-friendly way.
Other software packages promise: "Provides functions for multivariate and propensity score matching and for finding optimal balance based on a genetic search algorithm. A variety of univariate and multivariate metrics to determine if balance has been obtained are also provided."
For example:
- http://cran.r-project.org/web/packages/Matching/index.html
- http://cran.r-project.org/web/packages/MatchIt/index.html
- http://cran.r-project.org/web/packages/optmatch/index.html
- http://cran.r-project.org/web/packages/cem/index.html
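To give a sense of the simplest building block behind such packages, here is a minimal sketch of greedy one-to-one nearest-neighbor matching without replacement on a scalar score (for example an estimated propensity score). The function name, the dictionary-based input format, and the optional caliper are assumptions for this example; the R packages listed above use more elaborate algorithms.

```python
def nearest_neighbor_match(treated, control, caliper=None):
    """Greedy 1:1 matching on a scalar score, without replacement.

    treated, control: dicts mapping a unit id to its score.
    caliper: optional maximum allowed score distance for a match.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # copy, so the caller's dict is not modified
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control unit for this treated unit.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is None or abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


treated = {"t1": 0.30, "t2": 0.62}
control = {"c1": 0.10, "c2": 0.33, "c3": 0.58, "c4": 0.90}
pairs = nearest_neighbor_match(treated, control)
print(pairs)
```

Balance checks would then compare covariate distributions between the matched treated and control units, which is where the univariate and multivariate metrics mentioned in the quote come in.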
two stage models (e.g. Heckman sample selection)
extension to discrete models
non-parametric estimation, extension to kernel regression
....