Post: Encrypting PHI in data resources #27

Open
gadenbuie opened this issue Feb 21, 2019 · 4 comments

gadenbuie commented Feb 21, 2019

encryptr is interesting and allows you to do something like

gp %>% 
  encrypt(postcode, telephone)

This encrypts the columns postcode and telephone, so the data can be shared without the risk of exposing PHI.

encryptr uses RSA, so it has a similar authentication model to ssh, except it seems that the private key is required for decryption.

Decryption requires the private key generated using genkeys() and the password set at the time.

The package README really doesn't spend much time explaining how to use and share keys with others.

From "How does RSA work?" (https://hackernoon.com/how-does-rsa-work-f44918df914b):

"RSA is an asymmetric system, which means that a key pair will be generated (we will see how soon), a public key and a private key, obviously you keep your private key secure and pass around the public one."

A blog post could explore an example with more details about key generation, key sharing, etc.
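As a starting point for such an example, here is a minimal sketch of column-level RSA encryption written directly against the openssl package. It is not encryptr's own implementation, and the data frame and the helpers `encrypt_col()` and `decrypt_col()` are hypothetical names made up for illustration. It only shows the general idea: anyone holding the public key can encrypt a column, and only the private key holder can read it back.

```r
library(openssl)

# Hypothetical stand-in for the `gp` data used above
gp <- data.frame(
  id = 1:2,
  postcode = c("AB1 2CD", "EF3 4GH"),
  stringsAsFactors = FALSE
)

# The data owner generates an RSA key pair; only the public half is shared
key <- rsa_keygen(bits = 2048)
pubkey <- key$pubkey

# Hypothetical helper: encrypt each value with the public key and
# base64-encode the ciphertext so it fits in a character column
encrypt_col <- function(x, pubkey) {
  vapply(
    x,
    function(v) base64_encode(rsa_encrypt(charToRaw(v), pubkey)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical helper: reverse the steps above using the private key
decrypt_col <- function(x, key) {
  vapply(
    x,
    function(v) rawToChar(rsa_decrypt(base64_decode(v), key)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Encrypt the sensitive column before sharing
shared <- gp
shared$postcode <- encrypt_col(gp$postcode, pubkey)

# Only the holder of the private key can restore the original values
restored <- shared
restored$postcode <- decrypt_col(shared$postcode, key)
```

In this sketch the private key never leaves the owner's session. Writing it to disk with a password, as genkeys() does, would be the next step to cover.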

There is also a related rOpenSci package, cyphr, which seems to be more oriented towards encrypting files. This might be a better package choice (better community support, etc.), but its README has a gap when it comes to column-specific encryption.

Finally, another interesting package for secret sharing is secret by Gabor Csardi et al. This package is oriented towards sharing API keys, but the useR! 2017 presentation about secret could provide a good starting point for sketching out the ideal key-sharing workflow.

@gadenbuie

There's also the rOpenSci package sodium, which has a pretty decent overview of how encryption can be handled in R:

library(sodium)

# Bob's keypair:
bob_key <- keygen()
bob_pubkey <- pubkey(bob_key)

# Alice's keypair:
alice_key <- keygen()
alice_pubkey <- pubkey(alice_key)

# Bob sends encrypted message for Alice:
msg <- charToRaw("TTIP is evil")
ciphertext <- auth_encrypt(msg, bob_key, alice_pubkey)

# Alice verifies and decrypts with her key
out <- auth_decrypt(ciphertext, alice_key, bob_pubkey)
stopifnot(identical(out, msg))

# Alice sends encrypted message for Bob
msg <- charToRaw("Let's protest")
ciphertext <- auth_encrypt(msg, alice_key, bob_pubkey)

# Bob verifies and decrypts with his key
out <- auth_decrypt(ciphertext, bob_key, alice_pubkey)
stopifnot(identical(out, msg))

@gadenbuie

The main idea behind the private/public key pair is that users share their public keys with others. Data is encrypted for a particular person by using their public key (and your private key). They can then decrypt using the reverse keys, i.e. their private key and your public key.

diagram
sequenceDiagram
  participant O as Data Owner
  participant U as User
  Note over O: Has Pub/Private Key
  Note over U: Has Pub/Private Key
  O->>U: Here's my public key
  U->>O: Cool, here's my public key too
  Note over O: Encodes data with<br/>Owner's Private Key<br/>+ User's Pub Key
  O->>U: Here's the data
  Note over U: Decodes data using<br/>User's Private Key<br/>+ Owner's Pub Key

The main objective is that you need a public and private key pair to decrypt the data, and in all cases the private key should not be transmitted, moved, or sent.

So when @tgerke and I talked about this originally, we thought we could later provide keys to the end user to let them decrypt data they have. This probably wouldn't be a good idea from a security perspective.

What we could do instead would be to initially deliver data encrypted using the owner's private/public keys, knowing that it will not be decryptable by anyone else. If at a later point the user is granted access, we could

  1. Regenerate the data set using the user's public key and send them the new data
  2. Have the user return the encrypted data, which is then decrypted using the owner's key pair and then re-encrypted using the user's public key.

In both cases end users can use/manipulate/etc the unencrypted data as they see fit. In the first case, the regenerated data might be updated, contain more records, etc. but would hopefully be the same shape. The second case could be used for any derivative data or for situations where the source data may have changed but the user only has access to the version they received.
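Here is a rough sketch of the second option using sodium's `auth_encrypt()` and `auth_decrypt()`. The data frame and variable names are made up for illustration, and everything happens in one session, which glosses over how the ciphertext is actually transported between owner and user.

```r
library(sodium)

# Each party generates their own key pair; private keys are never exchanged
owner_key <- keygen()
owner_pub <- pubkey(owner_key)
user_key <- keygen()
user_pub <- pubkey(user_key)

# Hypothetical data containing PHI, serialized to a raw vector
dat <- data.frame(id = 1:2, telephone = c("555-0100", "555-0101"))
payload <- serialize(dat, NULL)

# Initial delivery: encrypted with the owner's own private/public keys,
# so only the owner can decrypt it
locked <- auth_encrypt(payload, owner_key, owner_pub)

# The user holds the file but cannot read it with their own key
user_can_read <- tryCatch(
  {
    auth_decrypt(locked, user_key, owner_pub)
    TRUE
  },
  error = function(e) FALSE
)

# Later, access is granted: the user returns the encrypted data, the owner
# decrypts it and re-encrypts it for the user's public key
plain <- auth_decrypt(locked, owner_key, owner_pub)
for_user <- auth_encrypt(plain, owner_key, user_pub)

# The user decrypts with their private key and the owner's public key
received <- unserialize(auth_decrypt(for_user, user_key, owner_pub))
```

As far as I can tell, `auth_encrypt()` stores the nonce as an attribute on the ciphertext, so a real workflow would need to make sure it survives being written to disk and sent around.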


tgerke commented Feb 22, 2019

Good find re: not providing keys later. Does the "Providing a public key" section of https://github.com/SurgicalInformatics/encryptr help? TBH I don't think I fully understand how that's different from the initial solution, but it must be, since it has a section of its own.

@gadenbuie

I'm not sure I fully understand either, so I think that's where the blog post can go: walking through a scenario with multiple collaborators sharing data.

My current understanding is that sharing the Owner's (or data pool's) public key would handle the first arrow above, i.e. the "User" getting the Owner's pub key. But I still think the data needs to be encrypted for someone specific, otherwise anyone with the data pool public key could just decrypt the data.
