-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dedupe edx content files further #1945
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing it seemed to work as intended. left minor note about potential edge case
|
||
yield ( | ||
filebytes, | ||
{ | ||
"key": f"{path}/{filename}", | ||
"key": checksum, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious if there are unexpected cases here since it seems like the "key" is unique with the associated run. are there cases where we could have 2 separate files in the same run that have the same content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario you are describing is exactly what this pr is fixing - we found that there are duplicate files with different file names and the same content in lots of edx course folders. This pr fixes that issue.
There is an edge case where files with different content that isn't parsed fully by tikka for whatever reason might end up looking like the same (almost blank) file. But since those files don't have useful content anyway, that's ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
67b6a59
to
e7836d0
Compare
What are the relevant tickets?
closes https://github.com/mitodl/hq/issues/6343
Description (What does it do?)
The edx course folders contain duplicates of some files. This PR updates the file etl to dedupe files that have the same run and content
How can this be tested?
If you don't have xpro files already populated, run
Run
docker-compose run web ./manage.py backpopulate_xpro_data
,docker-compose run web ./manage.py backpopulate_xpro_files --overwrite
in the main branch
From the shell run
You should have about 7187 xpro content files and 33 content files with checksum c701f34e5593021eec27db37561c45de
In this branch run
docker-compose run web ./manage.py backpopulate_xpro_files --overwrite
From the shell run
You will have about 5555 xpro content files and 1 file with checksum c701f34e5593021eec27db37561c45de