
hdfs support #196

Merged: 11 commits merged into lakehq:main on Sep 30, 2024

Conversation

@skewballfox (Contributor) commented Sep 14, 2024

Still trying to figure out a few things. This might be easier than I initially expected, given that HDFS is the default shared filesystem that other implementations have to be compatible with.

Some relevant info I've found so far:

All hits for object_store in the crates directory are within sail-plan, so I'm guessing this should primarily be implemented within that crate?
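
As a rough sketch of what I have in mind (purely illustrative, not the actual sail-plan code; it assumes the hdfs-native-object-store crate and DataFusion's object store registry, and the helper name register_hdfs is made up):

// Sketch only: register an HDFS-backed object store with a DataFusion session.
use std::sync::Arc;

use datafusion::execution::context::SessionContext;
use hdfs_native::Client;
use hdfs_native_object_store::HdfsObjectStore;
use url::Url;

fn register_hdfs(ctx: &SessionContext, url: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the namenode and wrap the client in an ObjectStore adapter.
    let client = Client::new(url)?;
    let store = HdfsObjectStore::new(client);
    // Register the store for the hdfs:// scheme so scans can resolve hdfs:// paths.
    ctx.runtime_env()
        .register_object_store(&Url::parse(url)?, Arc::new(store));
    Ok(())
}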

@shehabgamin linked an issue on Sep 15, 2024 that may be closed by this pull request
@shehabgamin (Contributor) commented Sep 15, 2024

Thank you @skewballfox -- this is a great first contribution and a great improvement!

For reference, you may find the following PRs helpful, as they encapsulate the Object Store integration work done by @linhr:
#146
#150

Side note: I linked your PR to the associated GitHub issue (#173).

@skewballfox (Contributor Author)

Hey, I'm trying to set up Spark for testing but I'm running into an issue. Even with a clean slate I'm getting an error somewhere.

rm -rf opt/*
git clone https://github.com/apache/spark.git opt/spark
git clone https://github.com/ibis-project/testing-data.git opt/ibis-testing-data
scripts/spark-tests/build-pyspark.sh

Here's the last bit of output. Scrolling up, I can't get past a long stream of "copying " statements to find where the error originates.

...
copying pyspark/tests/typing/test_resultiterable.yml -> pyspark-3.5.1/pyspark/tests/typing
copying pyspark.egg-info/SOURCES.txt -> pyspark-3.5.1/pyspark.egg-info
Writing pyspark-3.5.1/setup.cfg
creating dist
Creating tar archive
removing 'pyspark-3.5.1' (and everything under it)
error: Your local changes to the following files would be overwritten by checkout:
	pom.xml
Please commit your changes or stash them before you switch branches.
Aborting

Given that this post matches all the non-specific strings, I'm guessing this error comes from setup.py or immediately after setup.py is called. Any ideas what's going on?

@linhr (Contributor) commented Sep 16, 2024

We've seen this error occasionally; you can ignore it and continue with the next step. The PySpark package has been built, but the Spark patch somehow is not reverted correctly. (You can manually drop the change in the opt/spark directory.)

Sorry for the confusion! We'll update the documentation and look into the root cause.
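
For example, something along these lines should discard the leftover change (using the file named in the error above):

cd opt/spark
git checkout -- pom.xml   # discard the unreverted patch change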

@skewballfox (Contributor Author)

Would it be alright if I pushed some "off-topic" commits? I added some changes to the build-pyspark.sh script that make it easier to resume after a failed or partially completed run, and that make sure the venv is set up and sourced before running the Python command.

@shehabgamin (Contributor)

Yeah go for it!

@linhr (Contributor) commented Sep 16, 2024

I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.

Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

@linhr (Contributor) commented Sep 16, 2024

BTW, I like the idea of using a lock file to skip the expensive Maven build. I'll need some time to test it in CI, so let's revisit this in a future PR.

@shehabgamin (Contributor)

> I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
>
> Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

@skewballfox Feel free to put the changes in a separate draft PR so that none of your work is lost.

@skewballfox (Contributor Author)

> I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
>
> Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

No worries, I used git reset to undo the last commit, and I'll push the script to a separate branch. Sorry for the delayed response; I've been running into issues getting the test environment set up.

@shehabgamin (Contributor)

> > I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
> > Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.
>
> No worries, I used git reset to undo the last commit, and I'll push the script to a separate branch. Sorry for the delayed response; I've been running into issues getting the test environment set up.

You're actually quite prompt! We're thrilled to have your contribution, and if there's anything we can do to help with setting up your test environment, don't hesitate to reach out.

@skewballfox (Contributor Author)

A couple of things.

Is there a way to get pyo3 to use a uv-installed Python version? I tried running uv python install 3.11 prior to running start_server.sh, but I would still get linker errors for python3.11. I fixed this by installing it at the system level (sudo dnf install python3.11-devel), but I was wondering if there is a way to easily avoid this (I try to keep development dependencies at the user level).

Second, I'm having trouble getting the run-tests.sh script to work after everything has been set up. Here's the output:

user@fedora ~/t/sail (hdfs_support) [1]> scripts/spark-tests/run-tests.sh
Removing existing test logs...
Test suite: test-connect
ERROR: module or package not found: pyspark.sql.tests.connect (missing __init__.py?)

================================================= test session starts ==================================================
platform linux -- Python 3.11.9, pytest-8.3.3, pluggy-1.5.0
rootdir: /path/to/sail
configfile: pyproject.toml
plugins: hypothesis-6.112.1, xdist-3.6.1, timeout-2.3.1, snapshot-0.9.0, reportlog-0.4.0, repeat-0.9.3, mock-3.14.0, pytest_httpserver-1.1.0
collected 0 items

-------- generated report log file: /path/to/sail/tmp/spark-tests/latest/test-connect.jsonl ---------
================================================ no tests ran in 0.17s =================================================
Test suite: doctest-column
===================== test session starts ======================
platform linux -- Python 3.11.9, pytest-8.3.3, pluggy-1.5.0
rootdir: /path/to/sail
configfile: pyproject.toml
plugins: hypothesis-6.112.1, xdist-3.6.1, timeout-2.3.1, snapshot-0.9.0, reportlog-0.4.0, repeat-0.9.3, mock-3.14.0, pytest_httpserver-1.1.0
collected 33 items

.venvs/test/lib/python3.11/site-packages/pyspark/sql/column.py E [  3%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE                         [100%]

- generated report log file: /path/to/sail/tmp/spark-tests/latest/doctest-column.jsonl -
====================== 33 errors in 0.22s ======================

I've made sure hatch run test:install-pyspark has run, and run-server.sh is running in a separate terminal. I'm kind of scratching my head on this one; it looks as though the issue comes from pyspark. I tried cd-ing into opt/spark/python, running python setup.py sdist again, and rerunning the install script with hatch, but the error is the same. Is there some step I'm missing?

@shehabgamin (Contributor)

Have you gone through the Environment Setup steps?
https://docs.lakesail.com/sail/latest/development/setup/

After the above, it should just be the following commands to set up your test env:

git clone git@github.com:apache/spark.git opt/spark

scripts/spark-tests/build-pyspark.sh

hatch env create test
hatch run test:install-pyspark

Although, I'm also curious whether there might be a problem with the Java installation. Unfortunately, running the Spark tests requires Java to be installed.

Also, the GitHub Actions workflow runs the tests, so it may be good to manually follow the steps there (it looks like the workflow uses Corretto 17 for Java):
https://github.com/lakehq/sail/blob/main/.github/workflows/spark-tests.yml

Can we try to get some more verbose log outputs? It seems like doctest-column is able to run tests but they all fail. Maybe there are some useful error logs there:

export TEST_RUN_NAME=col && scripts/spark-tests/run-tests.sh --doctest-modules --pyargs pyspark.sql.column -v

Lastly, yes, it should be possible to get pyo3 to use a uv-installed Python version. I would refer to the Environment Setup steps that I linked at the very beginning of this message.
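
One approach that generally works with PyO3-based builds (untested here, and possibly redundant with the documented setup) is pointing PyO3 at the uv-managed interpreter explicitly:

uv python install 3.11
export PYO3_PYTHON="$(uv python find 3.11)"   # have PyO3 link against the uv-managed Python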

@skewballfox (Contributor Author)

> Although, I'm also curious whether there might be a problem with the Java installation. Unfortunately, running the Spark tests requires Java to be installed.

I have Java installed and JAVA_HOME set to /usr/lib/jvm/java-17-openjdk.

> Can we try to get some more verbose log outputs? It seems like doctest-column is able to run tests but they all fail. Maybe there are some useful error logs there:

When I ran that command, I got a series of "no such file or directory" errors for files such as '/path/to/sail/.venvs/test/lib/python3.11/site-packages/pyspark/python/test_support'. When I look in the parent directory mentioned, the only contents are .venvs/test/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip, but when listing the contents of the zip, the missing files (e.g. test_support) are there. I'm guessing some late-stage build step is supposed to unpack that file but isn't?
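
For reference, I checked the zip contents with something like:

unzip -l .venvs/test/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip | grep test_support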

@linhr (Contributor) commented Sep 18, 2024

Could you run the Spark build script unmodified and see if it works? I took another look at #200 and it seems to me that the patch lock file may cause the patch to be skipped when you build the Python package. If the patch is not applied, neither the tests nor the test support files will be in the correct location in the zip package.

The Maven lock file seems fine to me though.

@shehabgamin (Contributor)

> Could you run the Spark build script unmodified and see if it works? I took another look at #200 and it seems to me that the patch lock file may cause the patch to be skipped when you build the Python package. If the patch is not applied, neither the tests nor the test support files will be in the correct location in the zip package.
>
> The Maven lock file seems fine to me though.

Before doing this, I suggest going into the opt/spark directory and doing a git add . + git stash to make sure that everything in there is clean.
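
That is, roughly:

cd opt/spark
git add .
git stash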

@skewballfox (Contributor Author) commented Sep 18, 2024

I renamed my current copy of sail and recloned the repo. Surprisingly, it worked, but I'm not sure why, given that I had removed the opt directory a few times (including the lock files), so I retested on my original copy (pruned/recreated the virtual envs, removed opt, reran the build script, etc.). It's working now.

I think I may finally have a Hadoop container set up in pseudo-distributed mode for testing HDFS support. Where should it go? Under scripts, opt, or somewhere else?

EDIT: Right now it's a Dockerfile, but I can try to integrate it into the existing compose file. It might take me a bit, though, because I'm running rootless podman, which comes with a few extra caveats around container networking.

@linhr (Contributor) commented Sep 18, 2024

Glad to hear that you got the setup working!

Let's define the container in compose.yml.

Testing external systems with containers is not part of CI right now but it could be future work. For this PR, it would be good enough to have a local setup. Either Python or Rust tests would be great.

@skewballfox (Contributor Author)

Hey, I'm kind of stuck at the moment. I'm setting up HDFS in a local container, verifying it's working via the web portal, and then connecting to the HDFS container using the method mentioned in the dev docs.

p/t/sail> env SPARK_CONNECT_MODE_ENABLED=1 SPARK_REMOTE="sc://localhost:50051" hatch run pyspark
>>>path=hdfs://localhost:9000/user/skewballfox/test.json
>>>spark.read.json(path).show()

This throws an object store error originating from a generic IO error. I know it's connecting, since the client returns okay, and if I change the path to a non-existent file, I get an error indicating the resource doesn't exist. So it's making it somewhere past the object store instantiation and failing during execution (I think). I'm having trouble tracing it any further than that, though.

Here's the last bit of the traceback from Python:

File "/p/t/sail/.venvs/default/lib/python3.11/site-packages/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
    self._handle_rpc_error(error)
  File "p/t/sail/.venvs/default/lib/python3.11/site-packages/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
    raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.SparkRuntimeException: Object Store error: Generic HdfsObjectStore error: IO error occurred while communicating with HDFS

Here's the last trace from the server:

[2024-09-19T18:39:08Z DEBUG sail_spark_connect::server trace: 241683380366876879727100582833342366091 span: 5329696312847761413] ReleaseExecuteResponse { session_id: "487e2373-2881-41e9-ad87-d8362c5ef938", operation_id: Some("4e1997e1-2bd1-433a-a252-1826cb68986e") }

which is returned by release_execute.

Any ideas what the issue is, or suggestions on what I should try to track down the problem?

@shehabgamin (Contributor)

The default logging filter covers the Spark Connect server only. If you want expanded debug logging, you can start the server with env RUST_LOG=debug ...

Let's see if we're able to get any better logs that way.

I also highly recommend using the debugger. Let me know if you'd like help getting one set up!

@skewballfox (Contributor Author)

Just an update: I finally managed to read a JSON file from the HDFS instance; it was a configuration issue. Writes to HDFS are still failing, though that is probably also a configuration issue (working on fixing it).

Launching the HDFS container still doesn't work locally via podman-compose. It keeps dying after around 15 seconds without producing any error, returning a normal exit code. I'm running the Dockerfile under scripts/hadoop via:

env HADOOP_USER_NAME=sail podman build -t hadoop-container .
podman run -it -p 50070:9870 -p 9000:9000 -p 9864:9864 -p 9866:9866 hadoop-container /bin/bash
# then, inside the container, run: /hdfs-init.sh

I think I might have found a bug in the S3 code that may be upstream. If you upload a JSON file via the web portal and then try to read it via spark.read.json(path).show(), you'll get: Json error: Not valid JSON: EOF while parsing an object at line 1 column 1. It's definitely valid JSON; I tested with a few files uploaded to the MinIO instance.

This doesn't happen with the code from the docs, but the query spark.sql("SELECT 1").write.json(path) isn't doing what it looks like: bar.json is created as a directory, and JSON files are created under it by the write command. When you look at the metadata for those files, the content type is listed as binary/octet-stream, whereas the files it fails to read have the content type application/json. You can run JSON read commands on the octet-stream files without issue.

@linhr (Contributor) commented Sep 22, 2024

> bar.json is created as a directory, and JSON files are created under it by the write command.

This is the expected behavior. The content type needs investigation though.

@skewballfox (Contributor Author)

Related issue for hdfs-native-object-store.

@skewballfox marked this pull request as ready for review on September 28, 2024, 00:08
@skewballfox (Contributor Author)

So I got reading and writing files working, at least for JSON. What other functions should I implement and confirm working prior to merging this?

For testing, I'm still using the Dockerfile; bringing both containers up with the compose file doesn't seem to work, but it's most likely a misconfiguration of Hadoop or podman. Traffic goes into the container (write requests are fulfilled), but responses don't make it back out.

Given that, should I just document how to set up the Dockerfile?
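
For reference, the round trip I've been testing from the PySpark Connect shell looks roughly like this (the path is just an example from my local setup):

path = "hdfs://localhost:9000/user/sail/test.json"
spark.sql("SELECT 1 AS id").write.mode("overwrite").json(path)   # writes a directory of JSON part files
spark.read.json(path).show()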

@linhr (Contributor) left a comment

This is exciting progress!

I just triggered the workflow. Let's format the code using the command here: https://docs.lakesail.com/sail/main/development/build/rust.html

Reading and writing JSON files is a perfect scope for this PR. Let's add a FIXME comment in compose.yml for future investigation of the container issues, but there is no need to add documentation for manual container setup.

I have a few comments, but overall this is quite an impressive change! I can see especially that the Dockerfile required non-trivial effort. I think the PR is pretty close to ready.

Review comments on:
crates/sail-plan/src/object_store/registry.rs (outdated, resolved)
crates/sail-plan/src/object_store/registry.rs (outdated, resolved)
crates/sail-plan/src/resolver/data_source.rs (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
crates/sail-plan/src/object_store/config.rs (resolved)
compose.yml (outdated, resolved)
@skewballfox (Contributor Author) commented Sep 28, 2024

BTW, compose might work on other systems (with or without sudo). I think part of the issue might be differences in rootless network configuration between podman-compose and podman's default on Fedora 40. I know that when launching HDFS via compose, write requests go through, because the files are created; outbound traffic is just broken.

EDIT: So I sort of figured out the problem. podman-compose launches containers in bridge mode, whereas using podman directly uses pasta (a more recent rootless networking tool). In bridge mode I believe containers have a separate network namespace, and for some reason this breaks outbound traffic from the namenode. Binding the container's port 9000 to the host's port 9000 fixes it, and that's the only port that requires it.

Also, from trial and error it seems like trying to map port 9000 to another port breaks HDFS during initialization (at least with pasta). I'm guessing that's because of traffic it's trying to send to itself. It's really not designed to run the entire cluster in a single container, lol.

@linhr (Contributor) left a comment

Thanks for addressing the comments! Also thanks for the notes regarding container setup.

I have one more minor comment, but I feel this is in good shape!

It seems there is a formatting issue in compose.yml. Could you use the following commands to format the non-Rust code? More information can be found here.

pnpm install
pnpm run format
pnpm run lint

Also, when you commit the changes this time, you can add [spark tests] in the commit message so that we can get the Spark tests triggered. After this I think this PR is ready to be merged!

crates/sail-plan/src/resolver/data_source.rs (outdated, resolved)
@skewballfox (Contributor Author)

Sorry, I forgot to put [spark tests] in the commit message. Could you retrigger it on your end?

@linhr (Contributor) commented Sep 30, 2024

> Sorry, I forgot to put [spark tests] in the commit message. Could you retrigger it on your end?

No worries! Since this PR does not change PySpark logic, it's fine to skip the tests. The tests will be run after the code is merged to main.

I've also created an issue (#227) so that the tests can be triggered without the special commit message.

@linhr (Contributor) left a comment

@skewballfox We are so glad to have you as the first community contributor to Sail! Thank you for your contribution!

@linhr merged commit b5334f1 into lakehq:main on Sep 30, 2024
7 checks passed

Successfully merging this pull request may close: HDFS Support (#173)