Binary file altered after transfer via HttpDataSource #2360
-
Today in the Q&A, I mentioned a strange behavior when transferring a binary file from an HttpDataSource, which I noticed during some tests. At first, I did a smaller transfer of plain data to test the transfer in general, which worked so far. The next test was to transfer a small binary file (a PNG), but the result in both destinations was a changed file (in size and checksum). So the problem could be my testing HTTP backend (shown below), which reads the file and serves it to the provider's DataSource. Or the HttpDataSource itself could be the problem. Or did I miss something here?

```python
import asyncio
import datetime

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import FileResponse, StreamingResponse

app = FastAPI()


@app.post("/large/")
async def postFile(request: Request):
    print("request received...")
    print(request.headers)
    fname = datetime.datetime.now().strftime("%y.%m.%d-%H:%M:%S")
    with open(fname, "a+b") as input_file:
        b = 0
        print(f"write into {fname}, wait for data...")
        async for data in request.stream():
            b += input_file.write(data)
        print(b)
    return {}


@app.get("/large/")
async def getLarge():
    headers = {"content-type": "application/octet-stream"}
    return FileResponse("./sample", headers=headers)
```

The asset registration is done with the following curl command.
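To rule out the test backend itself, independent of the connector setup, one option is a direct round trip against it. A minimal sketch, assuming the server above runs on localhost:8000 and a file named `sample` sits next to it:

```python
import hashlib

import requests


def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


# Download path: GET /large/ serves ./sample, so the response body
# should hash identically to the file on disk.
resp = requests.get("http://localhost:8000/large/")
print("download intact:", hashlib.sha256(resp.content).hexdigest() == sha256("sample"))

# Upload path: POST the raw bytes; the server writes them to a
# timestamped file whose checksum can then be compared on disk.
with open("sample", "rb") as f:
    requests.post("http://localhost:8000/large/", data=f.read())
```

If both directions come back clean, the backend can be ruled out and the alteration happens somewhere in the dataplane.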
-
Is the data base64 encoded?
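If anyone wants to test that hypothesis quickly, a sketch; `original.png` and `transferred.png` are placeholder names for the source file and the file that arrived:

```python
import base64

with open("original.png", "rb") as f:
    original = f.read()
with open("transferred.png", "rb") as f:
    transferred = f.read()

# If the dataplane base64-encoded the body, the transferred file will be
# the base64 encoding of the original (and roughly 4/3 of its size).
print("base64 of original:", transferred.strip() == base64.b64encode(original))
```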
-
There are many things that can go wrong here, like padding, offsets, endianness, encoding, or file-system and OS specific quirks. The HTTP dataplane is intended for structured data (e.g. JSON) only, because binary data shouldn't be transmitted in HTTP responses (webserver-specific timeouts, body size limits, etc.). Transmitting binary is what the S3/BlobStore framework is optimised for.

Or you could send back a URL in the HTTP response that points to the binary file. Then, in your client code, simply download the file from that URL. Of course, the URL must be accessible to the client.
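A minimal sketch of that last pattern, assuming a FastAPI backend; the `/files/{name}` route and the `my-backend` host are illustrative, not part of any EDC API:

```python
from fastapi import FastAPI
from fastapi.responses import FileResponse

app = FastAPI()


@app.get("/large/")
async def get_large():
    # Structured JSON through the HTTP dataplane: only a pointer to the
    # binary, which the client downloads out-of-band.
    return {"url": "http://my-backend:8000/files/sample"}


@app.get("/files/{name}")
async def download(name: str):
    # Plain download route, reachable by the client directly.
    # Sketch only: a real implementation should validate `name`.
    return FileResponse(f"./{name}", media_type="application/octet-stream")
```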
-
Sorry for coming back to such an old discussion, but I am running into a similar problem. I have an existing ecosystem that provides data in various formats (images, JSON, CSV, NetCDF) via HTTPS. The data also varies in size from a few kilobytes to gigabytes, but everything works well over plain HTTPS. My main problem is that images are getting modified during the transfer (pull) through the connector. This can easily be checked by following the pull transfer sample with an image asset and afterwards comparing the checksums of the original and the connector-transferred file (see the sketch below), or by just trying to open the image. I have read in multiple discussions and issues that the HTTP dataplane is not intended to transmit large data and that one should use the S3/BlobStore framework or FTP. This seems to be a great overhead to integrate my backend into a dataspace. What would you (@paullatzelsperger) suggest in this case to make it work?
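For reference, the comparison step is roughly the following (file names are placeholders for the original asset and the connector-transferred copy):

```python
import hashlib

# Print both digests; they differ for every image asset I transfer.
for path in ("original.png", "transferred.png"):
    with open(path, "rb") as f:
        print(path, hashlib.sha256(f.read()).hexdigest())
```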
-
I suggest using S3 or a specialized protocol that can handle binary files properly. Integrating these technologies is straightforward.
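For the S3 route, a minimal sketch with boto3; the bucket and key names are placeholders, and credentials are assumed to come from the standard AWS configuration:

```python
import boto3

# Push the binary to a bucket once; the S3 dataplane then transfers it
# from there instead of streaming it through an HTTP response.
s3 = boto3.client("s3")
s3.upload_file("./sample", "my-dataspace-bucket", "assets/sample")
```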
-
What does this mean in terms of my backend? Do you suggest pushing the data from my backend to S3 and then using the S3 dataplane? I have the feeling that I am missing something.