Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement VTT DAG process, resolves #20 #21

Merged
merged 15 commits into from
Feb 2, 2021
Merged

Implement VTT DAG process, resolves #20 #21

merged 15 commits into from
Feb 2, 2021

Conversation

davidverweij
Copy link
Member

This PR implements the VTT SMA data download, processing and upload pipeline following the Dreem implementation in #14. It hereby uses the UCAM DB mock service (#18) to translate VTT hashes into Patient ID's and more. The library boto3 is used to connect to the AWS S3 bucket, and mypy_boto3_s3 for typing.

Note that the folder structure in S3 buckets are symbolic. Instead of navigating the bucket, connecting returns a list of ObjectSummary with keys, where keys represent that symbolic structure, i.e.:

[
    ObjectSummary(key="date/raw/patienthash/file.zip", ...), 
    ObjectSummary(key="date/raw/patienthash/file.nfo", ...),
....
]

Note that the above two files are in the same 'folder', but result in two almost identical keys. The logic in lib > vttsma.py filters these duplicates to determine the present patients and work from there.

Other changes/notes

  • added a method in dmpy.py to .zip and immediately remove the original folder. This is used to combined the downloaded files from the S3 bucket in one file. task_prepare_data does not cope with directories.
  • the S3 bucket has various data dump folders. Currently the logic does not look across dumps to determine if patients exists in both. It does collect the various data dumps associated for each patient - but when the Record is created for the MongoDB, only one dump date is stored. See vttsma_dump_date=item['dumps'][0],
  • A list of DeviceTypes is added to utils. Enums were used for code readability (and providing code suggestions in your editor) and to avoid hardcoding mistakes - and other benefits. I did change TMA to SMA from the original list - do correct me if preferred otherwise.

Test

  • Add the latest ucam_db.csv to your root folder
  • update .dtransfer.env to include the keys below, where the VTTSMA global ID is the ID for all SMA 'devices' for the DMP as communicated on email.
    • VTTSMA_AWS_ACCESSKEY=""
    • VTTSMA_AWS_SECRET_ACCESSKEY=""
    • VTTSMA_AWS_BUCKET_NAME=""
    • VTTSMA_GLOBAL_DEVICE_ID=""
  • run poetry install to include boto3 and other dependencies
  • run poetry compose to boot up the MongoDB. Erase any documents if present (as new attributes are required)
  • run python data_transfer/main.py and monitor the 'data > input and data > uploading' folder. The code potentially fails if these folders are not present

To Do

I have not coded any tests yet, as I deemed it a priority to get this stage finished for now. Could possibly open a new Issue for this. But if required I am happy to spend a bit more time on writing tests. Thoughts?

@davidverweij davidverweij added VTT FS Device data-transfer Data Transfer Protocol labels Feb 2, 2021
@davidverweij davidverweij self-assigned this Feb 2, 2021
@lgtm-com
Copy link

lgtm-com bot commented Feb 2, 2021

This pull request introduces 6 alerts when merging af50098 into d8eb21a - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 1 for Unused local variable

@davidverweij davidverweij linked an issue Feb 2, 2021 that may be closed by this pull request
6 tasks
Copy link
Member

@jawrainey jawrainey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have tested locally and works correctly, including manually verifiying filenames against UCAM. I've added minor suggestions below on how the code might be improved, but otherwise it's perfect 👍🏼

data_transfer/lib/vttsma.py Outdated Show resolved Hide resolved
data_transfer/lib/vttsma.py Outdated Show resolved Hide resolved
data_transfer/devices/vttsma.py Outdated Show resolved Hide resolved
data_transfer/devices/vttsma.py Show resolved Hide resolved
data_transfer/devices/vttsma.py Outdated Show resolved Hide resolved
data_transfer/lib/vttsma.py Outdated Show resolved Hide resolved
data_transfer/lib/vttsma.py Outdated Show resolved Hide resolved
data_transfer/schemas/record.py Outdated Show resolved Hide resolved
data_transfer/services/dmpy.py Outdated Show resolved Hide resolved
data_transfer/services/ucam.py Outdated Show resolved Hide resolved
@lgtm-com
Copy link

lgtm-com bot commented Feb 2, 2021

This pull request introduces 1 alert when merging b725075 into d8eb21a - view on LGTM.com

new alerts:

  • 1 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Feb 2, 2021

This pull request introduces 1 alert when merging cb462bc into d8eb21a - view on LGTM.com

new alerts:

  • 1 for Unused import

@davidverweij
Copy link
Member Author

@jawrainey have implemented all comments as discussed - except for implementing the SMA_Record data class. I was running into challenges regarding converting a lists of paths (from the S3 bucket) into unique patient SMA_Records without too much logic. In the end my approach became too complex for what it is currently worth I think. I've set up issue #26 if we want to pursue this further.

Can you check my latest changes and approve/merge if appropriate?

@jawrainey jawrainey merged commit a6ba0ad into master Feb 2, 2021
@jawrainey
Copy link
Member

Great work @davidverweij 👍🏼 Tested locally and works for me as before

@jawrainey jawrainey deleted the vtt branch February 2, 2021 17:08
@davidverweij
Copy link
Member Author

@jawrainey , just putting it here before I forget. I see a lot of discussion online about AWS billing when interacting with AWS services and resources - especially by people new to AWS (like me!). Do we have an agreement with VTT on limits to their bucket? Just thinking ahead of preventing costs and potential headaches along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-transfer Data Transfer Protocol VTT FS Device
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data transfer protocol VTT SMA
2 participants