
Improve scoring method due to shrinkage #28

Open · william-hula opened this issue on Oct 11, 2021 · 5 comments

@william-hula (Collaborator)

Rob, Gerasimos, and Alex,
In sussing out the CI stuff, I discovered that the Bayesian shrinkage we're getting with extreme score estimates may be too large to be tolerated, especially in the context of the precision added by the SD = 10 prior. The concern is that a clinician might administer a 10-item short form in the acute setting, obtain 10 incorrect responses, and get a T-score (SEM) of 26.6 (5.01), and then administer the full test a month later, on which the person also gets zero correct, leading to an estimate (SEM) of 17.7 (4.01). If they apply the correct math (which they probably won't, unless we do it for them), this will suggest a 1-tailed 92% probability that the patient has gotten worse.

Now arguably, that might be an appropriate conclusion under normal circumstances, but Gerasimos convinced me this is something to be concerned about. As I write this, I'm thinking that a quick Monte Carlo simulation with a constant extreme low generating theta, comparing score estimates for the CAT-10 and the full test, might be in order.

In any case, if this shrinkage is too much to bear, options for addressing it include EAP with a uniform prior, or ML estimation with fences: two dummy items at the extremes of the ability range that are always administered and scored correct (the low one) and incorrect (the high one), to put some bounds on the ML procedure. Apparently it produces less error than EAP (https://journals.sagepub.com/doi/full/10.1177/0146621616631317?casa_token=iSbh34Qp4woAAAAA%3A6VadRw5h1_Nhg3hA_vxQEjK1LrrAjAwCV5e3jHCgT_lHjPmkVG5g4mzAkQLJI6HCiCKVJgYHo61GXw). catR will implement this easily, and I'm currently playing around with it.
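For concreteness, here's a minimal R sketch of both pieces of the above: the 1-tailed change probability implied by the two score estimates, and MLE with fences via catR's thetaEst()/semTheta(), with the fences appended as dummy rows to the item parameter matrix. The 2PL bank parameters are made up for illustration; only the catR calls themselves are real.

```r
library(catR)

## 1-tailed probability of decline implied by the two estimates:
## T (SEM) = 26.6 (5.01) on the 10-item form, 17.7 (4.01) on the full test
z <- (26.6 - 17.7) / sqrt(5.01^2 + 4.01^2)
pnorm(z)   # ~0.92, the "92% probability" above

## MLE with fences: two dummy items at the extremes of the ability range,
## always "administered" and scored correct (low fence) and incorrect
## (high fence), bounding the ML estimate.
## Illustrative 2PL bank on the standard theta metric; T-scores of 5 and
## 95 correspond to theta = -4.5 and +4.5.
bank   <- cbind(a = rep(1.5, 10), b = seq(-2, 2, length.out = 10),
                c = 0, d = 1)
fences <- rbind(c(4.5, -4.5, 0, 1),   # low fence, ~3x bank discrimination
                c(4.5,  4.5, 0, 1))   # high fence
x  <- rep(0, 10)                      # all-incorrect response string
it <- rbind(bank, fences)
xf <- c(x, 1, 0)                      # fence responses: correct, incorrect

th  <- thetaEst(it, xf, method = "ML", range = c(-6, 6))
sem <- semTheta(th, it, xf, method = "ML")
c(T = 50 + 10 * th, SEM = 10 * sem)   # back to the T-score metric
```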

@rbcavanaugh (Owner)

Hi Will - glad you're working this out. Just to clarify - the 10-item PNT that is currently in the app is just for testing purposes, and I had planned on removing it before the final release. Are you saying that you want to keep it in for situations like acute-care assessment?

@william-hula (Collaborator, Author)

william-hula commented Oct 11, 2021 via email

@william-hula (Collaborator, Author)

Here are a couple of plots of preliminary results from 10- and 30-item CAT simulations comparing EAP scoring with MLE with fences (MLEF). EAP used a normal(50, 10) prior; the fences were at 5 and 95, with discrimination set to 3x the constant estimated discrimination. The theta generating distribution was uniform(5, 95) with 1800 simulees, originally intended to give 200 each in 9 theta bands, but then I realized that catR divides the distribution up into deciles automatically, so that's what's plotted.
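For anyone who wants to poke at this, here's a stripped-down sketch of one arm of this kind of simulation using catR::randomCAT(), with a generated 2PL bank standing in for the real item parameters (bank size, prior, and selection rule are all illustrative assumptions), plus a manual decile bias/RMSE summary like the one catR reports:

```r
library(catR)

set.seed(1)
bank   <- genDichoMatrix(175, model = "2PL")   # stand-in for the real bank
thetas <- runif(1800, -4.5, 4.5)               # uniform(5, 95) in T units

est <- sapply(thetas, function(th) {
  res <- randomCAT(trueTheta = th, itemBank = bank,
                   start = list(nrItems = 1, theta = 0),
                   test  = list(method = "EAP", priorDist = "norm",
                                priorPar = c(0, 1), itemSelect = "MFI"),
                   stop  = list(rule = "length", thr = 10),   # 10-item CAT
                   final = list(method = "EAP"))
  res$thFinal
})

## bias and RMSE by decile of the generating distribution
dec  <- cut(thetas, quantile(thetas, 0:10 / 10), include.lowest = TRUE)
bias <- tapply(est - thetas, dec, mean)
rmse <- tapply(est - thetas, dec, function(e) sqrt(mean(e^2)))
```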

The upshot is that EAP is superior in the range of about 30 to 70, while MLEF is better in the tails, which is not unexpected.

[Figures: bias_plot and rmse_plot, bias and RMSE by generating-theta decile]

I'm currently running another set of simulations parallel to this one, but using a skew-normal theta distribution based on the empirical estimates from the latest sample of 335 MAPPD and R03 subjects. I'm running that one with 1000 simulees total.

I think the question here will be whether the worse performance of MLEF in the center of the theta distribution is tolerable, given the shrinkage issue we've identified with EAP, which I think remains an issue even with a 30-item CAT and extreme response strings. Or are we worrying too much about edge cases that will affect a small number of users? I should be able to post the results from the empirically derived theta distribution later today. Given the higher density in the middle of the theta distribution there, I expect they'll show a larger average advantage for EAP.

Let me know if you'd like to see other conditions as well.

@rbcavanaugh (Owner)

  • We can also include a disclaimer stating that comparing tests with different numbers of items when performance is at the extremes can lead to less reliable results (or however you would like to say this). We could even automate this message by popping it up when (a) the current test differs in length from the previously uploaded test scores, and/or (b) one of the tests includes a final estimate < 25 or > 75. We could also include a brief explanation of why this happens.
  • Is there any model stacking/averaging in IRT/catR? It seems like combining information from both models might be advantageous.
  • Similarly, if the score is < 25 or > 75, would it be reasonable to provide an alternative final estimate using the MLEF approach? (A rough sketch of this fallback is below.)
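Something like the third bullet could be a small branch at scoring time. A rough sketch, where eap_score() and mlef_score() are hypothetical wrappers around the two catR scoring calls, each returning a list with a T-score element:

```r
## Hypothetical hybrid rule: report EAP, but attach an MLEF-based
## alternative whenever the EAP estimate lands in the extremes.
score_with_fallback <- function(it, x) {
  eap <- eap_score(it, x)                  # hypothetical EAP wrapper
  out <- list(primary = eap)
  if (eap$T < 25 || eap$T > 75) {
    out$alternative <- mlef_score(it, x)   # hypothetical MLEF wrapper
  }
  out
}
```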

@william-hula (Collaborator, Author)

Ok, sorry folks, we need to hold up on interpreting those charts I sent. I took a deeper look at the simulation results and something is weird: the correlations between the generating and estimated thetas are in the 0.7-0.75 range, which seems super low, and the scatter plots of estimated theta over generating theta look weird for both MLEF and EAP, e.g.:
[Figure: scatter plot of estimated theta over generating theta]
I need some time to see what's going on here. It might have something to do with the T-score transformation (I doubt it, but maybe).

@rbcavanaugh rbcavanaugh changed the title we need to tweak the score method due to shrinkage Improve scoring method due to shrinkage Nov 8, 2022