GSoC 2015 Proposal: Improvements to Mixed Effects Models
statsmodels - Python Software Foundation Google Summer of Code 2015
Improvements to Mixed Effects Models
-
Proposal Title: Improvements to Mixed Effects Models
-
Proposal Abstract:
-
The `statsmodels.regression.mixed_linear_model` module is relatively new. Mixed effects models are often used to model experiments in which a measurement is repeated on the same individual multiple times; they incorporate both fixed-effect parameters and unobserved random effects. A typical application of linear mixed models (LMMs) is longitudinal data, for example modeling the response times of individuals under the influence of different types of drugs, where each individual contributes a random effect.
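For reference, a linear mixed model for individual (group) i is commonly written as below. This is the standard textbook form rather than anything specific to statsmodels, and the symbols (X_i, Z_i, beta, b_i, Psi, sigma^2) are chosen here purely for illustration.

```latex
y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad
b_i \sim \mathcal{N}(0, \Psi), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_{n_i})
% marginal covariance of the n_i observations from individual i
\operatorname{Cov}(y_i) = Z_i \Psi Z_i^{\top} + \sigma^2 I_{n_i}
```

Here X_i and Z_i are the fixed-effect and random-effect design matrices, beta holds the fixed effects shared by all individuals, and b_i is the unobserved random effect of individual i.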
-
From a user's point of view, the main features that are lacking include:
-
Variance component models: Though distinct random effects can be constrained to be independent, their variances cannot be constrained to be equal. This has been implemented in Kerby's PR.
-
**Heteroscedastic residual errors:** Related issue (tangentially): https://github.com/statsmodels/statsmodels/issues/1948
-
Crossed random effects / nested random effects: The current mixed linear model module can model only random effects arising from a single factor. Cross-classified data, where several factors are expected to have random effects, therefore cannot be modeled. Examples include gene expression studies in which a certain set of genes from different individuals is put under certain categories of stress and the expression level is measured. Patients constitute a random sample of the population, and so do the stress levels, which makes a crossed design the more suitable choice for such studies. Nested random effects can then be handled implicitly, where each 'sample' might be 'nested' within other covariates. This should also implicitly support uncorrelated random effects [implemented in Kerby's PR]. Related issue: https://github.com/statsmodels/statsmodels/issues/1946
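To make the crossed case concrete, here is a minimal NumPy sketch of the random-effects design matrix for two crossed factors. The data, the variance values and the helper name `indicator_matrix` are all made up for illustration; none of this is statsmodels code.

```python
import numpy as np


def indicator_matrix(labels):
    """Return (Z, levels): a one-hot matrix with one column per distinct label."""
    levels, codes = np.unique(labels, return_inverse=True)
    Z = np.zeros((len(codes), len(levels)))
    Z[np.arange(len(codes)), codes] = 1.0
    return Z, levels


# Hypothetical toy data: every patient is measured under every stress level,
# so the two factors are crossed rather than nested.
patient = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])
stress = np.array(["heat", "cold", "heat", "cold", "heat", "cold"])

Z_patient, patient_levels = indicator_matrix(patient)
Z_stress, stress_levels = indicator_matrix(stress)

# Crossed random effects: both factors contribute columns to one design matrix.
Z = np.hstack([Z_patient, Z_stress])

# Marginal covariance under independent random intercepts for each factor
# (the variance values are invented for this example).
var_patient, var_stress, var_resid = 2.0, 0.5, 1.0
V = (var_patient * Z_patient @ Z_patient.T
     + var_stress * Z_stress @ Z_stress.T
     + var_resid * np.eye(len(patient)))
```

Observations from different patients that share a stress level are correlated, so `V` is no longer block diagonal by patient. That is why a model that splits the data into independent groups along a single factor cannot represent this design.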
-
A few discussions requesting these features can be found at https://github.com/statsmodels/statsmodels/issues/646 and https://groups.google.com/forum/#!topic/pystatsmodels/CrHCZkIWj4w
-
Deliverables/End Goals & Challenges
-
Support for crossed random effects: This will probably be the most challenging part. The plan is to implement a non-CHOLMOD-based way to compute the Cholesky decomposition of the error covariance matrix D. Two known approaches are documented in references [1] and [2]. This will also require significant benchmarking against lme4 methods. The general methodology is detailed in [12].
-
Support for variance component models and heteroscedastic residuals: nlme supports heteroscedasticity [6]; the implementation details are in [3]. This may also require sparse methods and would then depend on the item above. A non-sparse, slower implementation is, however, independent of it.
-
Support for other MLE and REML estimation methods: The `lmm` package in R [11] implements several methods for rapid MLE and REML convergence. Methods that could likely be ported to statsmodels include `fastml`, `fastmode` and `fastrml`.
-
Support for nonlinear mixed effects models: This is an optional goal, given that nonlinear mixed models have rather specific use cases [4].
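The role of the Cholesky decomposition in the crossed random effects item can be sketched with a dense NumPy factorization. This is only an illustration of how a Gaussian log-likelihood is evaluated through a Cholesky factor; it is not one of the sparse approaches from [1] or [2], and the function name and toy numbers are assumptions made for this example.

```python
import numpy as np


def gaussian_loglike_chol(y, X, beta, V):
    """Log-likelihood of y ~ N(X @ beta, V) via a dense Cholesky factor of V."""
    resid = y - X @ beta
    L = np.linalg.cholesky(V)  # lower triangular, V = L @ L.T
    # Whitened residual L^{-1} r. A triangular solver would be cheaper;
    # np.linalg.solve keeps this sketch NumPy-only.
    w = np.linalg.solve(L, resid)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log|V| from the factor
    n = y.shape[0]
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + w @ w)


# Toy problem with made-up numbers: 2 groups of 3 observations,
# a random intercept per group (variance 2.0) and unit residual variance.
Z = np.kron(np.eye(2), np.ones((3, 1)))
V = 2.0 * Z @ Z.T + np.eye(6)
X = np.column_stack([np.ones(6), np.arange(6.0)])
beta = np.array([1.0, 0.5])
y = np.array([1.2, 1.4, 2.1, 2.4, 3.2, 3.3])

ll = gaussian_loglike_chol(y, X, beta, V)
```

With a single grouping factor, `V` is block diagonal and each block can be factorized separately. With crossed factors the blocks couple, which is where a sparse factorization becomes attractive.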
-
-
More notebooks, more examples, more unit tests
-
Rigorous benchmarking of Kerby's PR, followed by optimization wherever possible (Cythonising can probably be avoided unless really essential)
-
Optimisation using sparse matrices?
-
The vcomp formula should be parsed once and for all rather than reparsing for each group from scratch as in the current implementation.
-
Figure out the best `scipy` optimiser for our use case
-
Support for autoregressive covariance in the model
-
Support for heteroscedastic residuals
-
[Graphics] Two-dimensional profile likelihood plots of random effects params
-
A separate mixed-design ANOVA convenience class? (Need to read up on mixed-design ANOVA!)
-
More post-estimation enhancements: More plots for visualising random effects
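As a rough illustration of the autoregressive covariance item above, the sketch below builds a first-order (AR(1)) covariance matrix for equally spaced observations within one group. The proposal does not fix the order or the parameterisation, so both are assumptions here, as is the function name `ar1_covariance`.

```python
import numpy as np


def ar1_covariance(n_obs, rho, sigma2=1.0):
    """Covariance of a stationary AR(1) process: sigma2 * rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    idx = np.arange(n_obs)
    lags = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    return sigma2 * rho ** lags


# Example: four repeated measurements on one individual.
cov = ar1_covariance(4, rho=0.5, sigma2=2.0)
```

In a mixed model, a matrix of this form would replace the scaled identity usually assumed for the within-group residual covariance.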
| Week | Tasks |
| --- | --- |
| Community Bonding Period (27 April - 24 May) | |
| Week 1 (May 25 - May 31) | Improvements on top of [Kerby's PR](https://github.com/statsmodels/statsmodels/pull/2363): add more unit tests and more notebooks following examples covered in linear models texts (see the reference texts at the end); documentation and unit tests |
| Week 2 and Week 3 (June 1 - June 7, June 8 - June 14) | Support for heteroscedastic residual errors; notebooks explaining everything with examples; documentation and unit tests |
| Week 3 and Week 4 (June 15 - June 21) | Kerby's PR uses Patsy to process the formula for every group, which may be slow; figure out an alternative way to do this; documentation and unit tests |
| Week 5 (June 22 - June 28) | Add support for autoregressive covariance in the model; tests and documentation |
| Week 6 (June 29 - July 5) | Post-estimation improvements: add graphics support for two-dimensional profile likelihood plots; tests and documentation |
| Midterm Deliverables | Support for heteroscedasticity and autoregressive covariance; improvements, fixes and notebooks for Kerby's PR (more examples, unit tests, a better way to handle the formula than the current approach); benchmarking and optimisation |
| Week 7 (July 6 - July 12) | Port `fastml` methods from `lmm`; tests and documentation |
| Week 8 (July 13 - July 19) | Port `fastml` methods from `lmm`; tests and documentation |
| Week 9 (July 20 - July 26) | Port `fastmode` from the `lmm` R package; tests and documentation |
| Week 10 and Week 11 (July 27 - August 2) | Support for nonlinear models; tests and documentation |
| Week 12 (August 3 - August 9) | Code profiling; IPython Notebook demos; tests and documentation |
| Week 13 (August 10 - August 17, August 18 - August 24) | Code profiling; IPython Notebook demos; tests and documentation |
| Term End Deliverables | Support for nonlinear mixed effects; IPython notebooks with examples benchmarking/comparing methods against methods from |
I have deliberately kept the last two weeks as buffer periods in order to accommodate any pending/overdue tasks from previous weeks.
References for Examples/Data
This is more of a scratchpad for picking example data (I have not gone through all of them).
-
Linear Mixed Models: A Practical Guide Using Statistical Software, Brady T. West et al.
-
Mixed Models: Theory and Applications with R by Eugene Demidenko
-
[Linear Mixed-Effects Models Using R: A Step-by-Step Approach by Andrzej Galecki et al.](http://www.amazon.com/Linear-Mixed-Effects-Models-Using-Step/dp/1461438993/)
-
lme4 preprint: Fitting Linear Mixed-Effects Models using lme4
References:
[1] Parameter Estimation in High Dimensional Gaussian Distributions: http://www.math.ntnu.no/preprint/statistics/2012/S5-2012.pdf
[2] FaST linear mixed models for genome-wide association studies (see supplement): http://www.nature.com/nmeth/journal/v8/n10/abs/nmeth.1681.html
[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. Springer.
[4] Nonlinear Mixed Effects Models: http://www4.stat.ncsu.edu/~davidian/nlmmtalk.pdf
[5] Fast algorithms for ML and RML estimation in linear models: http://raptor1.bizlab.mtsu.edu/s-drive/TEFF/Rlib/library/lmm/doc/improve.pdf
[6] lme4 heteroscedasticity: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q4/022753.html
[7] Current status of mixed linear models in statsmodels: http://statsmodels.sourceforge.net/devel/mixed_linear.html
[8] lme4 book: http://lme4.r-forge.r-project.org/book/
[9] lme4 implementation: http://econ.ucsb.edu/~doug/245a/Papers/Mixed%20Effects%20Implement.pdf and M.J. Lindstrom, D.M. Bates (1988). "Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data"
[10] Julia implementation: https://github.com/dmbates/MixedModels.jl
[11] lmm in R: http://cran.r-project.org/web/packages/lmm/lmm.pdf
[12] Mixed-effects modeling with crossed random effects for subjects and items: http://www.sciencedirect.com/science/article/pii/S0749596X07001398
-
Name: Saket Choudhary
-
Email: [email protected]
-
Telephone: +1-213-477-3770
-
Time zone: GMT -0800 Pacific Time Zone
-
IRC handle: [email protected]
-
Source control usernames:
-
Github: https://github.com/saketkc/
-
Bitbucket: https://bitbucket.org/saketkc
-
-
Twitter: https://twitter.com/saketkc
-
Home Page: http://saket-choudhary.me
-
Blog:
-
GSoC Blog RSS feed:
-
Other personal information you think we would find relevant:
-
I was part of GSoC’12 and worked on improving the Slideshow publishing API for OERPub with Connexions, a Rice University project: [Proposal]
-
As part of GSoC’13, I worked with the Galaxy Project on its Python codebase, implementing changes to the workflow system. A major part of my code didn’t make it into the codebase, but I still contribute to the project. We also submitted a preprint covering part of our work. [Proposal]
-
I also participated in GSoC’14 with BioJS and implemented a Human Genetic Variation Viewer; a manuscript is under submission. [Proposal]
-
I also contribute occasionally to scipy, pgmpy, homebrew-science
-
-
University: University of Southern California
-
Major: Computational Biology & Bioinformatics
-
Current Year: 1st
-
Expected Graduation date: 2019
-
Degree: Ph.D
-
Patches contributed to statsmodels:
-
On Hold
-
Fix trendorder for trend only models in VAR: https://github.com/statsmodels/statsmodels/pull/2327
-
Doc fix for hazard_regression: https://github.com/statsmodels/statsmodels/pull/2236
-
-
Accepted and Merged
-
Check internet availability before running tests: https://github.com/statsmodels/statsmodels/pull/2247
-
Raise exception on incorrect trend type: https://github.com/statsmodels/statsmodels/pull/2329
-
Doc fixes for MixedLinear: https://github.com/statsmodels/statsmodels/pull/2333
-
-
I will probably be taking a course during summer 2015. Besides this, I do not have any other commitments during the coding period.