Feat/filter outlier in trainset #119
I rewrote the comments in the last commit. |
Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset? |
Yes. It's consistent with the Python optimizer. |
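For readers following along, the behaviour confirmed above (dropping a card from the pretrain dataset also drops all of its reviews from the trainset) can be sketched roughly as below. This is only an illustration, not the code from this PR: the `Review` type, the `filter_trainset` helper, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Review(NamedTuple):
    # Hypothetical, simplified review record (not the project's real type).
    card_id: int
    delta_t: int  # days since the previous review of this card
    rating: int


def filter_trainset(trainset, kept_card_ids):
    """Keep only reviews whose card survived the pretrain outlier filter."""
    kept = set(kept_card_ids)
    return [review for review in trainset if review.card_id in kept]


# Made-up data: card 2 was removed by the pretrain filter,
# so every one of its reviews disappears from the trainset too.
trainset = [
    Review(1, 0, 3),
    Review(1, 9, 3),
    Review(2, 0, 3),
    Review(2, 10, 4),
    Review(3, 0, 1),
]
kept_card_ids = {1, 3}
filtered = filter_trainset(trainset, kept_card_ids)
```

The point of the sketch is that the filter works per card, not per review: once a card is out, none of its later reviews reach the train function.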
As I mentioned in #88 (comment), this outlier filter is too aggressive (it removes too many reviews). This is fine for the pretrain function (because the pretrain method is fragile). However, the train function should have access to more reviews. Let me explain with an example: Let's say that for first_rating = 3 and delta_t = 9 days, there are 8 cards, and for delta_t = 10 days, there are 100 cards. I agree that some sort of outlier filter is also required for the trainset, but I think the same outlier filter should not be used for both. |
I don't think so. If a card is reviewed too early or too late, the first response will be unreliable. If we use the same model to calculate the initial memory state from that unreliable response, the subsequent training will be polluted. |
I agree, but 9 days can't be called "too early" when compared to 10 days. This is the reason I am suggesting to use a less aggressive outlier filter for the trainset. |
Taking the example of my collection (see first review data at the bottom of the comment),
So, what about using an outlier filter based on the ratio between delta_t and stability for the trainset? First review data: stability_for_pretrain.zip |
The current outlier filter removes only 5% of cards. Is that aggressive? |
Yes (for the trainset) / No (for pretrain). See my comment above; it includes specific examples. |
It's really hard to design a perfect solution for that. And I think |
Yes, but we can try to design a solution better than the current one. As I said in #119 (comment), what if we create another condition based on the ratio (or something similar) of delta_t to stability, and then remove the reviews of cards that fulfill both conditions? This suggestion is for the trainset only. I recommend keeping the pretrain filter unchanged.
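To make the two-condition idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `cards_to_drop` helper, the `min_ratio`/`max_ratio` thresholds, and the stability value of 10 days are made up and are not taken from this PR or from the Python optimizer.

```python
def cards_to_drop(first_reviews, pretrain_outliers, stability_by_rating,
                  min_ratio=0.25, max_ratio=4.0):
    """Return the cards whose reviews would be removed from the trainset.

    A card is dropped only if BOTH conditions hold:
      1. the pretrain outlier filter already flagged it, and
      2. its first delta_t is far from the stability estimated for its
         first rating (ratio outside [min_ratio, max_ratio]).
    The thresholds are hypothetical placeholders.
    """
    drop = set()
    for card_id, (first_rating, delta_t) in first_reviews.items():
        ratio = delta_t / stability_by_rating[first_rating]
        ratio_is_extreme = ratio < min_ratio or ratio > max_ratio
        if card_id in pretrain_outliers and ratio_is_extreme:
            drop.add(card_id)
    return drop


# Made-up numbers: assume a stability of 10 days for first_rating = 3.
stability_by_rating = {3: 10.0}
# card_id -> (first_rating, delta_t of the first review in days)
first_reviews = {1: (3, 9), 2: (3, 100), 3: (3, 10)}
# Cards 1 and 2 were flagged by the pretrain filter.
pretrain_outliers = {1, 2}

dropped = cards_to_drop(first_reviews, pretrain_outliers, stability_by_rating)
```

With these numbers, card 1 (delta_t = 9 days, ratio 0.9) stays in the trainset even though the pretrain filter flagged it, while card 2 (delta_t = 100 days, ratio 10) is still removed.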
Based on these values, this group had only 1 lapse. Any group with only 1 lapse is too small to be used for calculating the stability. |
I recommend opening a new issue to discuss it. |