Binary file altered after transfer via HttpDataSource #2360
-
Today in the Q&A, I mentioned a strange behavior when transferring a binary file from an HttpDataSource, which I noticed during some tests. At first, I did a smaller transfer of plain data to test the transfer in general, which worked so far. The next test was to transfer a small binary file (a PNG), but the result in both destinations was a changed file (in size and checksum). So the problem could be my testing HTTP backend (shown below), which reads the file and serves it to the provider's DataSource. Or the HttpDataSource itself could be the problem. Or did I miss something here?

```python
import asyncio
import datetime

import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import FileResponse, StreamingResponse

app = FastAPI()


@app.post("/large/")
async def postFile(request: Request):
    print("request received...")
    print(request.headers)
    fname = datetime.datetime.now().strftime("%y.%m.%d-%H:%M:%S")
    with open(fname, "a+b") as input_file:
        b = 0
        print(f"write into {fname}, wait for data...")
        async for data in request.stream():
            b += input_file.write(data)
        print(b)
    return {}


@app.get("/large/")
async def getLarge():
    headers = {"content-type": "application/octet-stream"}
    return FileResponse("./sample", headers=headers)
```

The asset registration is done with the following curl command.
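To rule out the test backend itself, independent of the connector setup, one option is a direct round trip against it. A minimal sketch, assuming the server above runs on localhost:8000 and a file named `sample` sits next to it:

```python
import hashlib

import requests


def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


# Download path: GET /large/ serves ./sample, so the response body
# should hash identically to the file on disk.
resp = requests.get("http://localhost:8000/large/")
print("download intact:", hashlib.sha256(resp.content).hexdigest() == sha256("sample"))

# Upload path: POST the raw bytes; the server writes them to a
# timestamped file whose checksum can then be compared on disk.
with open("sample", "rb") as f:
    requests.post("http://localhost:8000/large/", data=f.read())
```

If both directions come back clean, the backend can be ruled out and the alteration happens somewhere in the dataplane.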
-
Is the data base64 encoded?
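If anyone wants to test that hypothesis quickly, a sketch; `original.png` and `transferred.png` are placeholder names for the source file and the file that arrived:

```python
import base64

with open("original.png", "rb") as f:
    original = f.read()
with open("transferred.png", "rb") as f:
    transferred = f.read()

# If the dataplane base64-encoded the body, the transferred file will be
# the base64 encoding of the original (and roughly 4/3 of its size).
print("base64 of original:", transferred.strip() == base64.b64encode(original))
```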
-
There are many things that can go wrong here, like padding, offsets, endianness, encoding, or file-system and OS specific quirks. The HTTP dataplane is intended for structured data (e.g. JSON) only, because binary data shouldn't be transmitted in HTTP responses (webserver-specific timeouts, body size limits, etc.). Transmitting binary is what the S3/BlobStore framework is optimised for.

Or you could send back a URL in the HTTP response that points to the binary file. Then, in your client code, simply download the file from that URL. Of course, the URL must be accessible to the client.
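A minimal sketch of that last pattern, assuming a FastAPI backend; the `/files/{name}` route and the `my-backend` host are illustrative, not part of any EDC API:

```python
from fastapi import FastAPI
from fastapi.responses import FileResponse

app = FastAPI()


@app.get("/large/")
async def get_large():
    # Structured JSON through the HTTP dataplane: only a pointer to the
    # binary, which the client downloads out-of-band.
    return {"url": "http://my-backend:8000/files/sample"}


@app.get("/files/{name}")
async def download(name: str):
    # Plain download route, reachable by the client directly.
    # Sketch only: a real implementation should validate `name`.
    return FileResponse(f"./{name}", media_type="application/octet-stream")
```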
-
Sorry for coming back to such an old discussion, but I am running into a similar problem. I have an existing ecosystem that provides data in various formats (images, JSON, CSV, NetCDF) via HTTPS. The data also varies in size from a few kilobytes to gigabytes, but everything works well over plain HTTPS. My main problem is that images are getting modified during the transfer (pull) through the connector. This can easily be checked by following the pull transfer sample with an image asset and afterwards comparing the checksums of the original and the connector-transferred file (see the sketch below), or by just trying to open the image. I have read in multiple discussions and issues that the HTTP dataplane is not intended to transmit large data and that one should use the S3/BlobStore framework or FTP. This seems to be a great overhead to integrate my backend into a dataspace. What would you (@paullatzelsperger) suggest in this case to make it work?
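For reference, the comparison step is roughly the following (file names are placeholders for the original asset and the connector-transferred copy):

```python
import hashlib

# Print both digests; they differ for every image asset I transfer.
for path in ("original.png", "transferred.png"):
    with open(path, "rb") as f:
        print(path, hashlib.sha256(f.read()).hexdigest())
```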
-
I suggest using S3 or a specialized protocol that can handle binary files properly. Integrating these technologies is straightforward.
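For the S3 route, a minimal sketch with boto3; the bucket and key names are placeholders, and credentials are assumed to come from the standard AWS configuration:

```python
import boto3

# Push the binary to a bucket once; the S3 dataplane then transfers it
# from there instead of streaming it through an HTTP response.
s3 = boto3.client("s3")
s3.upload_file("./sample", "my-dataspace-bucket", "assets/sample")
```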
-
What does this mean in terms of my backend? Do you suggest pushing the data from my backend to S3 and then using the S3 dataplane? I have the feeling that I am missing something.