Implement VTT DAG process, resolves #20 #21
Conversation
This pull request introduces 6 alerts when merging af50098 into d8eb21a - view on LGTM.com.
Have tested locally and it works correctly, including manually verifying filenames against UCAM. I've added minor suggestions below on how the code might be improved, but otherwise it's perfect 👍🏼
This pull request introduces 1 alert when merging b725075 into d8eb21a - view on LGTM.com.
This pull request introduces 1 alert when merging cb462bc into d8eb21a - view on LGTM.com.
@jawrainey, I have implemented all comments as discussed, except for implementing the … Can you check my latest changes and approve/merge if appropriate?
Great work @davidverweij 👍🏼 Tested locally and works for me as before.
@jawrainey, just putting this here before I forget: I see a lot of discussion online about AWS billing when interacting with AWS services and resources, especially from people new to AWS (like me!). Do we have an agreement with VTT on limits for their bucket? Just thinking ahead to prevent costs and potential headaches along the way.
This PR implements the VTT SMA data download, processing, and upload pipeline, following the Dreem implementation in #14. It uses the UCAM DB mock service (#18) to translate VTT hashes into Patient IDs and more. The library `boto3` is used to connect to the AWS S3 bucket, and `mypy_boto3_s3` for typing.

Note that the folder structure in S3 buckets is symbolic: instead of navigating the bucket, connecting returns a list of `ObjectSummary` objects whose keys encode that symbolic structure.
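As a hypothetical illustration (the bucket layout and hash values are invented; the real filtering lives in `lib > vttsma.py` and may differ), two files in the same symbolic 'folder' surface as two separate, near-identical keys, from which the set of present patients can be recovered:

```python
# Hypothetical S3 object keys: the leading path segment plays the role
# of the symbolic 'folder' (here, a per-patient VTT hash).
keys = [
    "users/ab12cd34/dump-2021-01-15.zip",
    "users/ab12cd34/dump-2021-01-16.zip",  # same 'folder', near-identical key
    "users/ef56gh78/dump-2021-01-15.zip",
]

# Collapse the near-duplicate keys into the set of patient hashes.
patients = {key.split("/")[1] for key in keys}
# patients == {"ab12cd34", "ef56gh78"}
```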
Note that two files in the same 'folder' result in two almost identical keys. The logic in `lib > vttsma.py` filters these duplicates to determine which patients are present, and works from there.

Other changes/notes
- A method is added to `dmpy.py` to zip a folder and immediately remove the original. This is used to combine the files downloaded from the S3 bucket into one file, as `task_prepare_data` does not cope with directories.
- When a `Record` is created for the MongoDB, only one dump date is stored. See `vttsma_dump_date=item['dumps'][0],`
- `DeviceTypes` is added to `utils`. Enums were used for code readability (and to provide code suggestions in your editor), to avoid hardcoding mistakes, and for other benefits. I did change TMA to SMA from the original list; do correct me if preferred otherwise.

Test
1. Add `ucam_db.csv` to your root folder.
2. Update `dtransfer.env` to include the keys below, where the VTTSMA global ID is the ID for all SMA 'devices' for the DMP, as communicated by email.
3. Run `poetry install` to include `boto3` and the other dependencies.
4. Run `poetry compose` to boot up the MongoDB, and erase any documents if present (as new attributes are required).
5. Run `python data_transfer/main.py` and monitor the `data > input` and `data > uploading` folders. The code potentially fails if these folders are not present.

To Do
I have not coded any tests yet, as I deemed it a priority to get this stage finished for now. We could possibly open a new issue for this, but if required I am happy to spend a bit more time writing tests. Thoughts?
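As an aside on the final test step: since `main.py` potentially fails when the data folders are absent, a small defensive pre-check could be added before running the pipeline. This is only a sketch; the folder paths are assumed from the step's wording and should be adjusted to the real layout:

```python
from pathlib import Path

# Create the working folders the pipeline expects, if they are missing.
# (Paths assumed from the test steps: data > input and data > uploading.)
for folder in ("data/input", "data/uploading"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```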