split S3 files into smaller files to send large union file #77

Open
wants to merge 3 commits into
base: main

yuyashiraki

Summary:

Context

We found that the AWS SDK S3 API fails when we try to write more than 5 GB of data. This is blocking us from doing capacity testing for a larger Fargate container.

In this diff, as mentioned in the post, we split the union file based on the number of rows.
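
As a rough illustration of the row-based split, the number of output files can be derived from the row count with a ceiling division. This is only a sketch: `num_split_for` is a hypothetical helper name, and treating a limit of 0 as "splitting disabled" is an assumption, not something the diff states.

```rust
/// Hypothetical helper (not from the diff): number of files needed so that
/// no file holds more than `max_rows` rows. Returns `None` when the data
/// already fits in one file.
fn num_split_for(total_rows: usize, max_rows: usize) -> Option<usize> {
    // Assumption: a limit of 0 means "do not split".
    if max_rows == 0 || total_rows <= max_rows {
        return None;
    }
    // Ceiling division. Safe from underflow because total_rows > max_rows >= 1.
    Some((total_rows - 1) / max_rows + 1)
}
```

For example, 25 rows with a limit of 10 rows per file would need 3 files.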

Description

We made the following changes:

  • Added a new arg s3api_max_rows to the private-id-multi-key-client and private-id-multi-key-server binaries. It is used to split a file for S3 upload.
  • Added an optional arg num_split to save_id_map() and writer_helper(). When num_split is specified, the arg path is used as a prefix and the files are saved as {path}_0, {path}_1, etc.
  • rpc_server.rs and client.rs calculate num_split based on s3api_max_rows and pass the num_split arg for S3 only. copy_from_local() is then called for each split file.
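
The naming scheme and the per-file write described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this diff: `split_paths` and `write_rows_split` are hypothetical names, rows are assumed to be plain strings written one per line, and distributing rows in contiguous chunks is an assumption about how the split is done.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

/// Hypothetical helper: the paths a split write would produce,
/// `{path}_0`, `{path}_1`, ... With `None`, the original path is used as-is.
fn split_paths(path: &str, num_split: Option<usize>) -> Vec<String> {
    match num_split {
        Some(n) if n > 0 => (0..n).map(|i| format!("{}_{}", path, i)).collect(),
        _ => vec![path.to_string()],
    }
}

/// Hypothetical helper: writes `rows` one per line, spread over the files
/// from `split_paths` in contiguous chunks, and returns the written paths.
fn write_rows_split(
    rows: &[String],
    path: &str,
    num_split: Option<usize>,
) -> io::Result<Vec<String>> {
    let paths = split_paths(path, num_split);
    // Rows per file, rounded up; at least 1 because `chunks(0)` panics.
    let per_file = ((rows.len() + paths.len() - 1) / paths.len()).max(1);
    let mut chunks = rows.chunks(per_file);
    for p in &paths {
        let mut out = BufWriter::new(File::create(p)?);
        // Trailing files stay empty if there are fewer rows than files.
        for row in chunks.next().unwrap_or_default() {
            writeln!(out, "{}", row)?;
        }
        out.flush()?;
    }
    Ok(paths)
}
```

The returned list of paths is what a caller would iterate over to upload each file separately, for instance by calling copy_from_local() once per path.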

Differential Revision: D39219674

@facebook-github-bot added the CLA Signed and fb-exported labels Sep 4, 2022
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D39219674

yuyashiraki pushed a commit to yuyashiraki/Private-ID that referenced this pull request Sep 4, 2022
…esearch#77)

Summary:
Pull Request resolved: facebookresearch#77

# Context
We found that the AWS SDK S3 API fails when we try to write more than 5 GB of data. This is blocking us from doing capacity testing for a larger Fargate container.

In this diff, as mentioned in [the post](https://fb.workplace.com/groups/pidmatchingxfn/posts/493743615908631), we split the union file based on the number of rows.

# Description
We made the following changes:
- Added a new arg `s3api_max_rows` to the private-id-multi-key-client and private-id-multi-key-server binaries. It is used to split a file for S3 upload.
- Added an optional arg `num_split` to save_id_map() and writer_helper(). When `num_split` is specified, the arg `path` is used as a prefix and the files are saved as `{path}_0`, `{path}_1`, etc.
- rpc_server.rs and client.rs calculate `num_split` based on `s3api_max_rows` and pass the `num_split` arg for S3 only. copy_from_local() is then called for each split file.

Differential Revision: D39219674

fbshipit-source-id: 82dc1788b0d4db5cf9c3de07178b52a8cc11633c

Jian Cao and others added 3 commits September 4, 2022 15:44
Summary:
# What
* Add unit tests for the encrypt and create_id_map functions on the partner side.
* Add a create_key function to create fixed keys for testing.
* The encrypt and create_id_map functions both use partner.private_keys.1 to encrypt.
* self_permutation also needs to be fixed when we test create_id_map().

# Why
* need to improve code coverage

Differential Revision: https://internalfb.com/D39127178

fbshipit-source-id: 22acb4c9d2d642b8df1348547098a7539f6ce7df
Summary:
Pull Request resolved: facebookresearch#76

# What
* Add unit tests for the save_id_map function on the partner side.
* The save_id_map function is called after create_id_map().
* Add a create_key function to create fixed keys for testing.
* The create_id_map function uses partner.private_keys.1 to encrypt.
* self_permutation also needs to be fixed when we test create_id_map().
* Create a temp file, pass its path to save_id_map(), and check that the string written to the file is correct.
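
The temp-file check in the last item follows a common pattern: write to a temporary path, read the file back, and compare against the expected string. The sketch below shows only that pattern. `save_ids` is a hypothetical stand-in for the function under test, and it skips the encryption, fixed keys, and fixed permutation that the real test relies on to make the expected string deterministic.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the function under test: one ID per line.
fn save_ids(ids: &[&str], path: &Path) -> io::Result<()> {
    let mut f = fs::File::create(path)?;
    for id in ids {
        writeln!(f, "{}", id)?;
    }
    Ok(())
}

/// Builds a per-process temp path using only the standard library.
fn temp_path(name: &str) -> PathBuf {
    std::env::temp_dir().join(format!("{}_{}", name, std::process::id()))
}

/// Saves to a temp file, reads the contents back, and removes the file.
fn save_and_read_back(ids: &[&str]) -> io::Result<String> {
    let path = temp_path("save_ids_test");
    save_ids(ids, &path)?;
    let contents = fs::read_to_string(&path)?;
    fs::remove_file(&path)?;
    Ok(contents)
}
```

A test would then assert on the returned string, for example that saving the IDs `a` and `b` yields two lines.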

# Why
* need to improve code coverage

Differential Revision: D39142927

fbshipit-source-id: 82884647935873fe1f2feef5b061f3cc5385bba2
…esearch#77)

Differential Revision: D39219674

fbshipit-source-id: 871df40d1a377ef8115422e39a868a26e09e027d