Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dedupe edx content files further #1945

Merged
merged 1 commit into from
Jan 9, 2025
Merged

dedupe edx content files further #1945

merged 1 commit into from
Jan 9, 2025

Conversation

abeglova
Copy link
Contributor

@abeglova abeglova commented Jan 7, 2025

What are the relevant tickets?

closes https://github.com/mitodl/hq/issues/6343

Description (What does it do?)

The edx course folders contain duplicates of some files. This PR updates the file etl to dedupe files that have the same run and content

How can this be tested?

If you don't have xpro files already populated, run
Run docker-compose run web ./manage.py backpopulate_xpro_data , docker-compose run web ./manage.py backpopulate_xpro_files --overwrite
in the main branch

From the shell run

from learning_resources.models import *
ContentFile.objects.filter(run__learning_resource__platform_id='xpro').count()
ContentFile.objects.filter(checksum='c701f34e5593021eec27db37561c45de').count()

You should have about 7187 xpro content files and 33 content files with checksum c701f34e5593021eec27db37561c45de

In this branch run
docker-compose run web ./manage.py backpopulate_xpro_files --overwrite

From the shell run

from learning_resources.models import *
ContentFile.objects.filter(run__learning_resource__platform_id='xpro').count()
ContentFile.objects.filter(checksum='c701f34e5593021eec27db37561c45de').count()

You will have about 5555 xpro content files and 1 file with checksum c701f34e5593021eec27db37561c45de

@abeglova abeglova marked this pull request as ready for review January 7, 2025 22:19
@shanbady shanbady self-assigned this Jan 8, 2025
Copy link
Contributor

@shanbady shanbady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing it seemed to work as intended. left minor note about potential edge case


yield (
filebytes,
{
"key": f"{path}/{filename}",
"key": checksum,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if there are unexpected cases here since it seems like the "key" is unique with the associated run. are there cases where we could have 2 separate files in the same run that have the same content?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario you are describing is exactly what this pr is fixing - we found that there are duplicate files with different file names and the same content in lots of edx course folders. This pr fixes that issue.

There is an edge case where files with different content that isn't parsed fully by tikka for whatever reason might end up looking like the same (almost blank) file. But since those files don't have useful content anyway, that's ok

Copy link
Contributor

@shanbady shanbady left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@abeglova abeglova force-pushed the ab-dedupe-edx-further branch from 67b6a59 to e7836d0 Compare January 9, 2025 15:58
@abeglova abeglova merged commit 161a4f2 into main Jan 9, 2025
11 checks passed
@odlbot odlbot mentioned this pull request Jan 10, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants