Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cldsrv 546 post object #5601

Open
wants to merge 23 commits into
base: epic/RING-45960-postObject-api
Choose a base branch
from

Conversation

KazToozs
Copy link

@KazToozs KazToozs commented Jun 25, 2024

Description

Motivation and context

This API adds the ability to upload objects via HTML web forms per customer demand. This is a preliminary implementation and does not yet include Authz, POST policy checking or extra parameters.

Related issues

ARSN-422 - Adds support for postObject API

Notes

  • AWS has different error response XMLs. This is noted in this conversation.
  • AWS has a single upload object size limit of 5Gb; our products do not.

@bert-e
Copy link
Contributor

bert-e commented Jun 25, 2024

Hello kaztoozs,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@KazToozs KazToozs changed the base branch from development/8.8 to development/7.70 June 25, 2024 15:51
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch 6 times, most recently from 7948c37 to 1f132e5 Compare July 1, 2024 19:44
@KazToozs KazToozs marked this pull request as ready for review July 1, 2024 19:54
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 1f132e5 to 084eb5b Compare July 1, 2024 20:12
@scality scality deleted a comment from bert-e Jul 1, 2024
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 084eb5b to 7c42b2f Compare July 2, 2024 09:38
@bert-e
Copy link
Contributor

bert-e commented Jul 2, 2024

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 7c42b2f to 625603a Compare July 2, 2024 10:58
@scality scality deleted a comment from bert-e Jul 2, 2024
@scality scality deleted a comment from bert-e Jul 2, 2024
@KazToozs KazToozs changed the base branch from development/7.70 to epic/RING-45960-postObject-api July 4, 2024 09:49
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 625603a to d18f9ec Compare July 4, 2024 14:09
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/objectPost.js Show resolved Hide resolved
@KazToozs KazToozs requested a review from dvasilas July 8, 2024 13:28
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/api.js Outdated Show resolved Hide resolved
lib/api/api.js Outdated Show resolved Hide resolved
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 57f34b9 to d3b76f0 Compare July 9, 2024 10:01
lib/api/api.js Outdated Show resolved Hide resolved
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 00752ae to 252205a Compare July 17, 2024 15:03
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch 2 times, most recently from 5325ca1 to 7ef48e1 Compare July 22, 2024 11:38
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from 7ef48e1 to 95140e3 Compare July 22, 2024 11:48
@KazToozs KazToozs force-pushed the feature/CLDSRV-546-post-object branch from c8e4647 to 92d9273 Compare July 22, 2024 12:50
@KazToozs KazToozs requested a review from BourgoisMickael July 23, 2024 11:11
/* eslint-disable no-param-reassign */
let formDataParser;
try {
formDataParser = busboy({ headers: request.headers });
Copy link
Contributor

@BourgoisMickael BourgoisMickael Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more details about multipart/form-data, parsing by busboy and aws behavior with some links to documentation:

Content-Type

Documentation for the Content-Type, with the boundary directive and example with body: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

Again, busboy will accept application/x-www-form-urlencoded (https://github.com/fastify/busboy/blob/master/lib/types/urlencoded.js#L9)

That will unnecessarily read the whole body and trigger an error 400 InvalidArgument instead of the 412 Precondition Failed returned by aws and stopping the request early.

Content-Disposition

The documentation about Content-Disposition gives you details for its usage in multipart/form-data: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition

In the body, a part will look like

--boundary
Content-Disposition: form-data; name="f1"

value1
--boundary

Here busboy will parse this and emit a field event with fieldname being f1

You can send this with:

  • an html form with an <input type="text" name="f1">
  • curl (https://curl.se/docs/manpage.html#-F)
    • curl -F f1=value1
    • curl -F "f1=</etc/hostname" (this put the content of the file but not the filename)
  • nodejs with form-data or raw

You can also have

--boundary
Content-Disposition: form-data; name="f2"; filename="example.txt"

value2
--boundary

You can send this with:

Notice the filename directive, this will make busboy emit a file event with fieldname being f2.
As described in busboy documentation: https://www.npmjs.com/package/@fastify/busboy#busboy-methods with the default config isPartAFile.

You can notice in AWS documentation about the file form field: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html#RESTObjectPOST-requests-form-fields

The file or text content.
...
Type: File or text content

So you can either send:

  • Content-Disposition: form-data; name="file" (with an input of type text)
  • Content-Disposition: form-data; name="file"; filename="whatever.txt"

Which means by default the file form field can be emitted either by the field or file busboy event.

Also by testing on AWS, it accepts other form fields like key, acl having a filename directive (meaning coming from an input type="file"), example this should be possible:

  • Content-Disposition: form-data; name="key"; filename="whatever.txt"
  • Content-Disposition: form-data; name="acl"; filename="whatever.txt"

So any form field can be emitted by either the field or file busboy event.
The processing / validation should be performed based on the fieldname and not the busboy event.

Note that it could be interesting to replace the default behavior of busboy and defining isPartAFile to consider a file only the fieldname === 'file' (with potentially presence of filename).


When doing that there should also be some limits defined on busboy, if a form field other than file is being a big file and now parsed as a string instead of a stream, limit would help busboy to stop reading it. (The default limit.fieldSize is 1MiB but AWS accepts only 20KB for the form-data exluding file).

The form data and boundaries (excluding the contents of the file) cannot exceed 20KB.
Note: it looks like it's not just the field value but potentially the whole boundary with Content-Disposition and potential headers of the part, but I'm not sure, might need to be tested.

${filename} replacement

About ${filename} replacement, it's not only on the key, but all form fields: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTForms.html

The variable ${filename} is automatically replaced with the name of the file provided by the user and is recognized by all form fields. If the browser or client provides a full or partial path to the file, only the text following the last slash (/) or backslash () is used (for example, C:\Program Files\directory1\file.txt is interpreted as file.txt). If no file or file name is provided, the variable is replaced with an empty string.

Also the documentation defines how filename should be sanitized (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#filename)
Busboy should sanitize it, but you could add some tests with filename containing ../xxx, a/b/c, etc to ensure the ${filename} works as aws.

And validation on form fields, like key length, should then be performed after ${filename} replacement as it could impact the value.

Additional validation

You can receive a Content-Disposition: form-data without name, that would make an undefined fieldname and crash:
TypeError: Cannot read properties of undefined (reading 'toLowerCase')

Also by testing on AWS, they seem to trim the fieldname of whitespaces: name=" \n key \n " will should match key.


It's also possible to have multiple parts with the same name.
For example with this:

<input type="text" name="key">
<input type="text" name="key">
<input type="text" name="acl">
<input type="text" name="acl">
<input type="file" name="file" multiple>

or curl -F f1=value1 -F f1=value2

We should ensure, just like AWS that we don't have multiple times the same form field, whether busboy emits it via field or file.


Maybe the file should not be read if the parsing failed and an error was caught.
And when reading (pipe) the file stream, we need to catch error, otherwise it can crash:
Error: Part terminated early due to unexpected end of multipart data

It seems the finish event is triggered once the file is completely read, and there is a validation logic in there, the logic could happen earlier before writting file to disk.


There is a validation on Content-Length as well:

    <Code>EntityTooLarge</Code>
    <Message>Your proposed upload exceeds the maximum allowed size</Message>
    <ProposedSize>6442453100</ProposedSize>
    <MaxSizeAllowed>5368730624</MaxSizeAllowed>

The key should be validated just like it's done for regular PutObject.
See https://github.com/scality/Arsenal/blob/7eb2701f21aba91d45c2454ee8c5c257fea1c0db/lib/s3routes/routes.ts#L65-L73
Maybe there could be some blacklisted object prefix on cloudserver


You should find attached a javascript script to write the raw body form-data easier than using `nc`. It allows to test invalid bodies that are usually well formated by lib like `form-data` or by `curl` or `postman`.

Also you can add a delay between writes to test server timeout behavior.

const http = require('http');

const CRLF = '\r\n';

const body = `\
--test
Content-Disposition: form-data; name="key"

some_key
--test
Content-Disposition: form-data; name="file"; filename="some_name"

a
--test--
`.replace(/\n/g, CRLF);

const options = {
    method: 'POST',
    headers: {
        'Host': 'micktestpost.localhost',
        'Content-Type': 'multipart/form-data; boundary=test',
        'Content-Length': Buffer.byteLength(body),
    },
};

const req = http.request('http://localhost:8000', options, (res) => {
    console.log(req)
    console.log('RESPONSE', { statusCode: res.statusCode, message: res.statusMessage, headers: res.headers });

    res.on('data', (chunk) => {
        console.log(`BODY: ${chunk}`);
    });
    res.on('end', () => {
        console.log('END');
    });
});

req.on('error', (e) => {
    console.error(`problem with request: ${e.toString()}`);
});

// Split body in block of 20 chars
const split = body.match(/[\s\S]{1,20}/g);
// Increase delay without writting to test server timeout
const writeDelayMs = 1_000;

function writeBodyPart(i) {
    if (i < split.length) {
        console.log('Writting', i, split[i])
        req.write(split[i]);
        setTimeout(() => writeBodyPart(i + 1), writeDelayMs);
    } else {
        console.log('Done writting body')
        req.end();
    }
}

// setTimeout(() => writeBodyPart(0), writeDelayMs)

req.write(body);
req.end();

Generating the body raw without a lib can help you test any kind of invalid body to make sure you use busboy correctly and validate all input before using them to prevent a crash of the server.

Busboy already takes care of all the parsing, we need to configure it correctly (cf busboy readme) to match AWS spec and validate inputs to prevent a crash.

Ideally you could extract the parsing / validation (parseFormData function) with busboy into its own isolated class to make it easier for testing. And if you want to be consistent with callbacks usage instead of promises and async await, you can pass a callback as input of your parsing function to call once parsing /validation is done.

request.formData[lowerFieldname] = val;
}
// add only the recognized fields to the formData object
if (POST_OBJECT_OPTIONAL_FIELDS.some(field => lowerFieldname.startsWith(field))) {
Copy link
Contributor

@BourgoisMickael BourgoisMickael Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startsWith will allow things like acl abc. Known fields should be checked with equality and the special x-amz-meta- can check checked aside

const xml2js = require('xml2js');
const axios = require('axios');
const crypto = require('crypto');
const FormData = require('form-data');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency as well does not appear on package.json.
This is present only because it's a dependency of a dependency of a dependency, and is an old version.

Make sure to install dependencies you want to use, with latest version if possible (in devDependencies for tests)

> npm ls form-data
[email protected] /home/mickael/scality/cloudserver
└─┬ [email protected]
  └─┬ [email protected]
    └── [email protected]

> npm why form-data
[email protected]
node_modules/form-data
  form-data@"~2.3.2" from [email protected]
  node_modules/request
    request@"^2.86.0" from [email protected]
    node_modules/azure-storage
      azure-storage@"^2.1.0" from the root project
      azure-storage@"~2.10.7" from [email protected]
      node_modules/arsenal
        arsenal@"git+https://github.com/scality/arsenal#4ef5748c028619edff10d6d38b21df43c8d63d88" from the root project
      azure-storage@"~2.10.7" from [email protected]
      node_modules/bucketclient/node_modules/arsenal
      azure-storage@"~2.10.7" from [email protected]
      node_modules/utapi/node_modules/arsenal
      azure-storage@"~2.10.7" from [email protected]
      node_modules/vaultclient/node_modules/arsenal
    request@"^2.79.0" from [email protected]
    node_modules/google-auto-auth
      google-auto-auth@"^0.9.1" from the root project
    request@"^2.88.0" from [email protected]
    node_modules/oas-tools
      oas-tools@"^2.2.2" from [email protected]
      node_modules/utapi
        utapi@"git+https://github.com/scality/utapi#7.70.4" from the root project

method: 'POST',
hostname: url.hostname,
port: url.port,
path: url.pathname + url.search,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review the documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html#RESTObjectPOST-requests

POST / HTTP/1.1
Host: destinationBucket.s3.amazonaws.com

Anything (more than /) in the path should return a 405 (see #5601 (comment) and scality/Arsenal#2248 (comment))

And any querystring (url.search) should return a 400 (see scality/Arsenal#2248 (comment))

Don't put them in tests if you don't want them to fail once url validation is correct


req.on('response', res => {
try {
assert.equal(res.statusCode, 204);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer strictEqual

const req = http.request(options);

req.on('response', res => {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this request handling make the same code snippet duplicated in all tests.

Ideally, you could encapsulate all this request sending and response / error callback.

You could promisify the request to simplify the code snippet in every tests, (some most recent tests with the aws-sdk uses promise).

Comment on lines +167 to +170
formData.getLength((err, length) => {
if (err) {
return done(err);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getLengthSync can be interesting to not have a callback

});
});

it('should handle error when bucket does not exist', done => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could include the statusCode in description for quicker troubleshooting when it fails

});
});

it('should successfully upload an object using a POST form', done => {
Copy link
Contributor

@BourgoisMickael BourgoisMickael Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for organization you could have an additionnal sub describe to regroup testing of successful request and testing of errors

const req = http.request(options);

req.on('response', res => {
if (res.statusCode === 404) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally it should look like assert.strictEqual(res.statusCode, 404) so it's easy to see in the logs tha actual value vs expected


req.on('response', res => {
if (res.statusCode === 204) {
s3.listObjectsV2({ Bucket: bucketName }, (err, data) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a HeadObject or GetObject would be more appropriate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants