Implement VTT DAG process, resolves #20 #21

davidverweij · 2021-02-02T11:27:38Z

This PR implements the VTT SMA data download, processing and upload pipeline, following the Dreem implementation in #14. It uses the UCAM DB mock service (#18) to translate VTT hashes into patient IDs and more (e.g. wear times). The boto3 library is used to connect to the AWS S3 bucket, and mypy_boto3_s3 provides typing.

Note that the folder structure in an S3 bucket is symbolic. Rather than navigating folders, connecting to the bucket returns a list of ObjectSummary items whose keys represent that symbolic structure, for example:

[
    ObjectSummary(key="date/raw/patienthash/file.zip", ...),
    ObjectSummary(key="date/raw/patienthash/file.nfo", ...),
    ...
]

Note that the above two files are in the same 'folder' but have two almost identical keys. The logic in lib > vttsma.py filters out these near-duplicates to determine which patients are present and works from there.
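The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib > vttsma.py: it assumes keys follow the `date/raw/patienthash/file` layout shown above, and the function name `patients_from_keys` is hypothetical. With boto3 the keys would come from the ObjectSummary items (for instance via `bucket.objects.all()`); plain strings stand in here so the sketch runs without AWS access.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def patients_from_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group S3 object keys by patient hash, collecting each patient's dump dates.

    Hypothetical helper; assumes keys look like "date/raw/patienthash/filename".
    """
    dumps_by_patient: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        parts = key.split("/")
        # Skip keys that do not match the expected four-part layout.
        if len(parts) != 4 or parts[1] != "raw":
            continue
        dump_date, _, patient_hash, _ = parts
        # Several files share the same 'folder', so record each date only once.
        if dump_date not in dumps_by_patient[patient_hash]:
            dumps_by_patient[patient_hash].append(dump_date)
    return dict(dumps_by_patient)


# Made-up example keys; real keys come from the bucket listing.
keys = [
    "2021-01-01/raw/abc123/file.zip",
    "2021-01-01/raw/abc123/file.nfo",
    "2021-01-15/raw/abc123/file.zip",
    "2021-01-15/raw/def456/file.zip",
]
print(patients_from_keys(keys))
```

Keeping the dump dates per patient in a list mirrors the `item['dumps']` structure mentioned in the notes, though the actual field names in the PR may differ.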

Other changes/notes

Added a method in dmpy.py to .zip a folder and immediately remove the original. This is used to combine the files downloaded from the S3 bucket into one file, because task_prepare_data does not cope with directories.
The S3 bucket has various data dump folders. Currently the logic does not look across dumps to determine whether a patient exists in more than one. It does collect the data dumps associated with each patient, but when the Record is created for MongoDB only one dump date is stored. See vttsma_dump_date=item['dumps'][0],
A list of DeviceTypes is added to utils. Enums were used for code readability (including code suggestions in your editor) and to avoid hardcoding mistakes, among other benefits. I did change TMA to SMA from the original list; do correct me if you prefer otherwise.
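The zip-and-remove step from the first note could look roughly like the sketch below. It uses only the standard library; the function name `zip_and_remove` and its signature are assumptions for illustration and may differ from the method actually added to dmpy.py.

```python
import shutil
import tempfile
from pathlib import Path


def zip_and_remove(folder: Path) -> Path:
    """Compress `folder` into `<folder>.zip` next to it, then delete the folder."""
    # make_archive appends ".zip" itself, so pass the folder path as the base name.
    archive = shutil.make_archive(str(folder), "zip", root_dir=str(folder))
    # Only remove the original once the archive has been written.
    shutil.rmtree(folder)
    return Path(archive)


# Usage on a throwaway directory, so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "patienthash"
    folder.mkdir()
    (folder / "file.nfo").write_text("metadata")
    archive = zip_and_remove(folder)
    print(archive.name, folder.exists())
```

Writing the archive as a sibling of the folder, rather than inside it, avoids the archive trying to include itself.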

Test

Add the latest ucam_db.csv to your root folder
Update .dtransfer.env to include the keys below, where the VTTSMA global ID is the ID for all SMA 'devices' for the DMP, as communicated by email.
- VTTSMA_AWS_ACCESSKEY=""
- VTTSMA_AWS_SECRET_ACCESSKEY=""
- VTTSMA_AWS_BUCKET_NAME=""
- VTTSMA_GLOBAL_DEVICE_ID=""
Run poetry install to include boto3 and other dependencies
Run poetry compose to boot up the MongoDB. Erase any existing documents (as new attributes are required)
Run python data_transfer/main.py and monitor the 'data > input' and 'data > uploading' folders. The code may fail if these folders are not present
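Since the run can fail when the data folders are missing, a small pre-flight check along these lines may help before starting main.py. This is a sketch only: the helper names `missing_vttsma_keys` and `ensure_data_folders` are hypothetical, and it assumes the variables from .dtransfer.env are available as a mapping (such as `os.environ`) and that the folders live under `data/` in the project root.

```python
from pathlib import Path
from typing import List, Mapping

# The four keys listed in the test instructions above.
REQUIRED_KEYS = (
    "VTTSMA_AWS_ACCESSKEY",
    "VTTSMA_AWS_SECRET_ACCESSKEY",
    "VTTSMA_AWS_BUCKET_NAME",
    "VTTSMA_GLOBAL_DEVICE_ID",
)


def missing_vttsma_keys(env: Mapping[str, str]) -> List[str]:
    """Return the required VTTSMA_* keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def ensure_data_folders(root: Path) -> List[Path]:
    """Create data/input and data/uploading under `root` if they do not exist."""
    folders = [root / "data" / "input", root / "data" / "uploading"]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

For example, calling `missing_vttsma_keys(os.environ)` and `ensure_data_folders(Path.cwd())` from the project root before the first run would report any unset keys and create both folders.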

To Do

I have not written any tests yet, as I deemed it a priority to get this stage finished first. We could open a new issue for this, but if required I am happy to spend a bit more time writing tests. Thoughts?

data_transfer/lib/vttsma.py

lgtm-com · 2021-02-02T11:39:11Z

This pull request introduces 6 alerts when merging af50098 into d8eb21a - view on LGTM.com

new alerts:

5 for Unused import
1 for Unused local variable

jawrainey

Have tested locally and it works correctly, including manually verifying filenames against UCAM. I've added minor suggestions below on how the code might be improved, but otherwise it's perfect 👍🏼

data_transfer/lib/vttsma.py

data_transfer/devices/vttsma.py

data_transfer/lib/vttsma.py

data_transfer/schemas/record.py

data_transfer/services/dmpy.py

data_transfer/services/ucam.py

data_transfer/devices/vttsma.py

lgtm-com · 2021-02-02T14:53:28Z

This pull request introduces 1 alert when merging b725075 into d8eb21a - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2021-02-02T15:07:02Z

This pull request introduces 1 alert when merging cb462bc into d8eb21a - view on LGTM.com

new alerts:

1 for Unused import

davidverweij · 2021-02-02T15:49:07Z

@jawrainey I have implemented all comments as discussed, except for the SMA_Record data class. I ran into challenges converting a list of paths (from the S3 bucket) into unique patient SMA_Records without too much logic. In the end my approach became too complex for what it is currently worth, I think. I've set up issue #26 if we want to pursue this further.

Can you check my latest changes and approve/merge if appropriate?

jawrainey · 2021-02-02T17:08:17Z

Great work @davidverweij 👍🏼 Tested locally and works for me as before

davidverweij · 2021-02-03T11:25:45Z

@jawrainey, just putting it here before I forget. I see a lot of discussion online about AWS billing when interacting with AWS services and resources, especially from people new to AWS (like me!). Do we have an agreement with VTT on limits to their bucket? Just thinking ahead to prevent costs and potential headaches along the way.

davidverweij added 10 commits February 1, 2021 14:57

initial mirror for vtt based on Dreem implementation

2c8858a

establish connection with S3 bucket and pull metadata

1331cdc

download meta AND download files working

1321f67

name change from vtt to vttsma

ca68ed7

add typing for boto3

5b00198

use UCAM api to retrieve patientID and wear times

1581e9a

add global VTT SMA Device ID

3c8f4c5

download original data from patient, including audio files and metadata. Zips all data together and removes folder.

6d95eae

implemented device type, see also #20

5f7e65e

slight comment cleanup

af50098

davidverweij added VTT FS Device data-transfer Data Transfer Protocol labels Feb 2, 2021

davidverweij requested a review from jawrainey February 2, 2021 11:27

davidverweij self-assigned this Feb 2, 2021

davidverweij commented Feb 2, 2021

data_transfer/lib/vttsma.py (outdated)

davidverweij commented Feb 2, 2021

data_transfer/lib/vttsma.py (outdated)

davidverweij linked an issue Feb 2, 2021 that may be closed by this pull request

Data transfer protocol VTT SMA #17

Closed

6 tasks

jawrainey reviewed Feb 2, 2021

data_transfer/devices/vttsma.py

jawrainey reviewed Feb 2, 2021

data_transfer/devices/vttsma.py (outdated)

This was referenced Feb 2, 2021

VTT SMA: check if patients have data across individual data dumps #23

Closed

Abstract download_file protocols across devices due to similarity #24

Open

Re-evaluate wether VTT_hash has a one-to-one mapping with a Patient #25

Closed

reflect general PR comments

b725075

davidverweij added 2 commits February 2, 2021 15:00

refactor get_record for abstraction

cb462bc

remove unused dataclass import

6774338

davidverweij mentioned this pull request Feb 2, 2021

Create a VTT SMA data class for clarity across scripts #26

Open

clean up list comprehensions

b86756a

rename dumps --> exports

4fdb7ed

davidverweij mentioned this pull request Feb 2, 2021

Initial Data protocol for Byteflies #27

Closed

4 tasks

jawrainey merged commit a6ba0ad into master Feb 2, 2021

jawrainey deleted the vtt branch February 2, 2021 17:08
