
hdfs support #196

Merged: 11 commits merged into lakehq:main on Sep 30, 2024

Conversation

@skewballfox (Contributor) commented Sep 14, 2024

Still trying to figure out a few things. This might be easier than I initially expected, given that HDFS is the default shared filesystem that other implementations have to be compatible with.

Some relevant info I've found so far:

All hits for object_store in the crates directory are within sail-plan, so I'm guessing this should primarily be implemented within that crate?
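
As a rough sketch of what I have in mind (purely illustrative, not the actual sail-plan code; it assumes the hdfs-native-object-store crate and DataFusion's object store registry, and the helper name register_hdfs is made up):

// Sketch only: register an HDFS-backed object store with a DataFusion session.
use std::sync::Arc;

use datafusion::execution::context::SessionContext;
use hdfs_native::Client;
use hdfs_native_object_store::HdfsObjectStore;
use url::Url;

fn register_hdfs(ctx: &SessionContext, url: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the namenode and wrap the client in an ObjectStore adapter.
    let client = Client::new(url)?;
    let store = HdfsObjectStore::new(client);
    // Register the store for the hdfs:// scheme so scans can resolve hdfs:// paths.
    ctx.runtime_env()
        .register_object_store(&Url::parse(url)?, Arc::new(store));
    Ok(())
}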

@shehabgamin linked an issue on Sep 15, 2024 that may be closed by this pull request
@shehabgamin (Contributor) commented Sep 15, 2024

Thank you @skewballfox -- this is a great first contribution and a great improvement!

For reference, you may find the following PRs helpful, as they encapsulate the Object Store integration work done by @linhr:
#146
#150

Side note: I linked your PR to the associated GitHub issue (#173).

@skewballfox (Contributor Author)

Hey, I'm trying to set up Spark for testing but I'm running into an issue. Even with a clean slate I'm getting an error somewhere.

rm -rf opt/*
git clone https://github.com/apache/spark.git opt/spark
git clone https://github.com/ibis-project/testing-data.git opt/ibis-testing-data
scripts/spark-tests/build-pyspark.sh

Here's the last bit of output. Scrolling up, I can't get past a long stream of "copying " statements to find where the error originates.

...
copying pyspark/tests/typing/test_resultiterable.yml -> pyspark-3.5.1/pyspark/tests/typing
copying pyspark.egg-info/SOURCES.txt -> pyspark-3.5.1/pyspark.egg-info
Writing pyspark-3.5.1/setup.cfg
creating dist
Creating tar archive
removing 'pyspark-3.5.1' (and everything under it)
error: Your local changes to the following files would be overwritten by checkout:
	pom.xml
Please commit your changes or stash them before you switch branches.
Aborting

Given that this post matches all the non-specific strings, I'm guessing this error comes from setup.py or immediately after setup.py is called. Any ideas what's going on?

@linhr (Contributor) commented Sep 16, 2024

We've seen this error occasionally; you can ignore it and continue with the next step. The PySpark package has been built, but the Spark patch somehow is not reverted correctly. (You can manually drop the change in the opt/spark directory.)

Sorry for the confusion! We'll update the documentation and look into the root cause.
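
For example, something along these lines should discard the leftover change (using the file named in the error above):

cd opt/spark
git checkout -- pom.xml   # discard the unreverted patch change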

@skewballfox (Contributor Author)

Would it be alright if I pushed some "off-topic" commits? I added some changes to the build-pyspark.sh script that make it easier to resume after a failed or partially completed run, and that make sure the venv is set up and sourced before running the Python command.

@shehabgamin (Contributor)

Yeah go for it!

@linhr (Contributor) commented Sep 16, 2024

I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.

Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

@linhr (Contributor) commented Sep 16, 2024

BTW, I like the idea of using a lock file to skip the expensive Maven build. I'll need some time to test it in CI, so let's revisit this in a future PR.

@shehabgamin (Contributor)

> I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
>
> Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

@skewballfox Feel free to put the changes in a separate draft PR so that none of your work is lost.

@skewballfox (Contributor Author)

> I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
>
> Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.

No worries, I used git reset to undo the last commit, and I'll push the script to a separate branch. Sorry for the delayed response; I've been running into issues getting the test environment set up.

@shehabgamin (Contributor)

> > I just took a look at the script change and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow.
> > Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes though. We'll definitely consider a way to incorporate the idea to make the developer experience better.
>
> No worries, I used git reset to undo the last commit, and I'll push the script to a separate branch. Sorry for the delayed response; I've been running into issues getting the test environment set up.

You're actually quite prompt! We're thrilled to have your contribution, and if there's anything we can do to help with setting up your test environment, don't hesitate to reach out.

@skewballfox (Contributor Author)

A couple of things.

Is there a way to get pyo3 to use a uv-installed Python version? I tried running uv python install 3.11 prior to running start_server.sh, but I would still get linker errors for python3.11. I fixed this by installing it at the system level (sudo dnf install python3.11-devel), but I was wondering if there is a way to easily avoid this (I try to keep development dependencies at the user level).

Second, I'm having trouble getting the run-tests.sh script to work after everything has been set up. Here's the output:

user@fedora ~/t/sail (hdfs_support) [1]> scripts/spark-tests/run-tests.sh
Removing existing test logs...
Test suite: test-connect
ERROR: module or package not found: pyspark.sql.tests.connect (missing __init__.py?)

================================================= test session starts ==================================================
platform linux -- Python 3.11.9, pytest-8.3.3, pluggy-1.5.0
rootdir: /path/to/sail
configfile: pyproject.toml
plugins: hypothesis-6.112.1, xdist-3.6.1, timeout-2.3.1, snapshot-0.9.0, reportlog-0.4.0, repeat-0.9.3, mock-3.14.0, pytest_httpserver-1.1.0
collected 0 items

-------- generated report log file: /path/to/sail/tmp/spark-tests/latest/test-connect.jsonl ---------
================================================ no tests ran in 0.17s =================================================
Test suite: doctest-column
===================== test session starts ======================
platform linux -- Python 3.11.9, pytest-8.3.3, pluggy-1.5.0
rootdir: /path/to/sail
configfile: pyproject.toml
plugins: hypothesis-6.112.1, xdist-3.6.1, timeout-2.3.1, snapshot-0.9.0, reportlog-0.4.0, repeat-0.9.3, mock-3.14.0, pytest_httpserver-1.1.0
collected 33 items

.venvs/test/lib/python3.11/site-packages/pyspark/sql/column.py E [  3%]
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE                         [100%]

- generated report log file: /path/to/sail/tmp/spark-tests/latest/doctest-column.jsonl -
====================== 33 errors in 0.22s ======================

I've made sure hatch run test:install-pyspark has run, and run-server.sh is running in a separate terminal. I'm kind of scratching my head on this one; it looks as though the issue comes from pyspark. I tried cd-ing into opt/spark/python, running python setup.py sdist again, and rerunning the install script with hatch, but the error is the same. Is there some step I'm missing?

@shehabgamin (Contributor)

Have you gone through the Environment Setup steps?
https://docs.lakesail.com/sail/latest/development/setup/

After the above, it should just be the following commands to set up your test env:

git clone git@github.com:apache/spark.git opt/spark

scripts/spark-tests/build-pyspark.sh

hatch env create test
hatch run test:install-pyspark

Although, I'm also curious whether there might be a problem with the Java installation. Unfortunately, running the Spark tests requires Java to be installed.

Also, the GitHub Actions workflow runs the tests, so it may be good to manually follow the steps there (it looks like the workflow uses Corretto 17 for Java):
https://github.com/lakehq/sail/blob/main/.github/workflows/spark-tests.yml

Can we try to get some more verbose log outputs? It seems like doctest-column is able to run tests but they all fail. Maybe there are some useful error logs there:

export TEST_RUN_NAME=col && scripts/spark-tests/run-tests.sh --doctest-modules --pyargs pyspark.sql.column -v

Lastly, yes, it should be possible to get pyo3 to use a uv-installed Python version. I would refer to the Environment Setup steps that I linked at the very beginning of this message.
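
One approach that generally works with PyO3-based builds (untested here, and possibly redundant with the documented setup) is pointing PyO3 at the uv-managed interpreter explicitly:

uv python install 3.11
export PYO3_PYTHON="$(uv python find 3.11)"   # have PyO3 link against the uv-managed Python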

@skewballfox (Contributor Author)

> Although, I'm also curious whether there might be a problem with the Java installation. Unfortunately, running the Spark tests requires Java to be installed.

I have Java installed and JAVA_HOME set to /usr/lib/jvm/java-17-openjdk.

> Can we try to get some more verbose log outputs? It seems like doctest-column is able to run tests but they all fail. Maybe there are some useful error logs there:

When I ran that command, I got a series of "no such file or directory" errors for files such as '/path/to/sail/.venvs/test/lib/python3.11/site-packages/pyspark/python/test_support'. When I look in the parent directory mentioned, the only contents are .venvs/test/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip, but when listing the contents of the zip, the missing files (e.g. test_support) are there. I'm guessing some late-stage build step is supposed to unpack that file but isn't?
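
For reference, I checked the zip contents with something like:

unzip -l .venvs/test/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip | grep test_support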

@linhr (Contributor) commented Sep 18, 2024

Could you run the Spark build script unmodified and see if it works? I took another look at #200 and it seems to me that the patch lock file may cause the patch to be skipped when you build the Python package. If the patch is not applied, neither the tests nor the test support files will be in the correct location in the zip package.

The Maven lock file seems fine to me though.

@shehabgamin (Contributor)

> Could you run the Spark build script unmodified and see if it works? I took another look at #200 and it seems to me that the patch lock file may cause the patch to be skipped when you build the Python package. If the patch is not applied, neither the tests nor the test support files will be in the correct location in the zip package.
>
> The Maven lock file seems fine to me though.

Before doing this, I suggest going into the opt/spark directory and doing a git add . + git stash to make sure that everything in there is clean.
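
That is, roughly:

cd opt/spark
git add .
git stash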

@skewballfox (Contributor Author) commented Sep 18, 2024

I renamed my current copy of sail and recloned the repo. Surprisingly, it worked, but I'm not sure why, given that I had removed the opt directory a few times (including the lock files), so I retested on my original copy (pruned/recreated the virtual envs, removed opt, reran the build script, etc.). It's working now.

I think I may finally have a Hadoop container set up in pseudo-distributed mode for testing HDFS support. Where should it go? Under scripts, opt, or somewhere else?

EDIT: Right now it's a Dockerfile, but I can try to integrate it into the existing compose file. It might take me a bit, though, because I'm running rootless podman, which comes with a few extra caveats around container networking.

@linhr (Contributor) commented Sep 18, 2024

Glad to hear that you got the setup working!

Let's define the container in compose.yml.

Testing external systems with containers is not part of CI right now but it could be future work. For this PR, it would be good enough to have a local setup. Either Python or Rust tests would be great.

@skewballfox (Contributor Author)

Hey, I'm kind of stuck at the moment. I'm setting up HDFS in a local container, verifying it's working via the web portal, and then connecting to the HDFS container using the method mentioned in the dev docs.

p/t/sail> env SPARK_CONNECT_MODE_ENABLED=1 SPARK_REMOTE="sc://localhost:50051" hatch run pyspark
>>>path=hdfs://localhost:9000/user/skewballfox/test.json
>>>spark.read.json(path).show()

This throws an object store error originating from a generic IO error. I know it's connecting, since the client returns okay, and if I change the path to a non-existent file, I get an error indicating the resource doesn't exist. So it's making it somewhere past the object store instantiation and failing during execution (I think). I'm having trouble tracing it any further than that, though.

Here's the last bit of the traceback from Python:

File "/p/t/sail/.venvs/default/lib/python3.11/site-packages/pyspark/sql/connect/client/core.py", line 1503, in _handle_error
    self._handle_rpc_error(error)
  File "p/t/sail/.venvs/default/lib/python3.11/site-packages/pyspark/sql/connect/client/core.py", line 1539, in _handle_rpc_error
    raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.SparkRuntimeException: Object Store error: Generic HdfsObjectStore error: IO error occurred while communicating with HDFS

Here's the last trace from the server:

[2024-09-19T18:39:08Z DEBUG sail_spark_connect::server trace: 241683380366876879727100582833342366091 span: 5329696312847761413] ReleaseExecuteResponse { session_id: "487e2373-2881-41e9-ad87-d8362c5ef938", operation_id: Some("4e1997e1-2bd1-433a-a252-1826cb68986e") }

which is returned by release_execute.

Any ideas what the issue is, or suggestions on what I should try to track down the problem?

@shehabgamin (Contributor)

The default logging filter covers the Spark Connect server only. If you want expanded debug logging, you can start the server with env RUST_LOG=debug ...

Let's see if we're able to get any better logs that way.

I also highly recommend using the debugger. Let me know if you'd like help getting one set up!

@skewballfox (Contributor Author)

Just an update: I finally managed to read a JSON file from the HDFS instance; it was a configuration issue. Writes to HDFS are still failing, though that is probably also a configuration issue (working on fixing it).

Launching the HDFS container still doesn't work locally via podman-compose. It keeps dying after around 15 seconds without producing any error, returning a normal exit code. I'm running the Dockerfile under scripts/hadoop via:

env HADOOP_USER_NAME=sail podman build -t hadoop-container .
podman run -it -p 50070:9870 -p 9000:9000 -p 9864:9864 -p 9866:9866 hadoop-container /bin/bash
# then, inside the container, run: /hdfs-init.sh

I think I might have found a bug in the S3 code that may be upstream. If you upload a JSON file via the web portal and then try to read it via spark.read.json(path).show(), you'll get: Json error: Not valid JSON: EOF while parsing an object at line 1 column 1. It's definitely valid JSON; I tested with a few files uploaded to the MinIO instance.

This doesn't happen with the code from the docs, but the query spark.sql("SELECT 1").write.json(path) isn't doing what it looks like: bar.json is created as a directory, and JSON files are created under it by the write command. When you look at the metadata for those files, the content type is listed as binary/octet-stream, whereas the files it fails to read have the content type application/json. You can run JSON read commands on the octet-stream files without issue.

@linhr (Contributor) commented Sep 22, 2024

> bar.json is created as a directory, and JSON files are created under it by the write command.

This is the expected behavior. The content type needs investigation though.

@skewballfox (Contributor Author)

Related issue for hdfs-native-object-store.

@skewballfox marked this pull request as ready for review on September 28, 2024, 00:08
@skewballfox (Contributor Author)

So I got reading and writing files working, at least for JSON. What other functions should I implement and confirm working prior to merging this?

For testing, I'm still using the Dockerfile; bringing both containers up with the compose file doesn't seem to work, but it's most likely a misconfiguration of Hadoop or podman. Traffic goes into the container (write requests are fulfilled), but responses don't make it back out.

Given that, should I just document how to set up the Dockerfile?
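
For reference, the round trip I've been testing from the PySpark Connect shell looks roughly like this (the path is just an example from my local setup):

path = "hdfs://localhost:9000/user/sail/test.json"
spark.sql("SELECT 1 AS id").write.mode("overwrite").json(path)   # writes a directory of JSON part files
spark.read.json(path).show()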

@linhr (Contributor) left a comment

This is exciting progress!

I just triggered the workflow. Let's format the code using the command here: https://docs.lakesail.com/sail/main/development/build/rust.html

Reading and writing JSON files is a perfect scope for this PR. Let's add a FIXME comment in compose.yml for future investigation of the container issues, but there is no need to add documentation for manual container setup.

I have a few comments, but overall this is quite an impressive change! I can see especially that the Dockerfile required non-trivial effort. I think the PR is pretty close to ready.

Review comments on:
crates/sail-plan/src/object_store/registry.rs (outdated, resolved)
crates/sail-plan/src/object_store/registry.rs (outdated, resolved)
crates/sail-plan/src/resolver/data_source.rs (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
scripts/hadoop/Dockerfile (outdated, resolved)
crates/sail-plan/src/object_store/config.rs (resolved)
compose.yml (outdated, resolved)
@skewballfox (Contributor Author) commented Sep 28, 2024

BTW, compose might work on other systems (with or without sudo). I think part of the issue might be differences in rootless network configuration between podman-compose and podman's default on Fedora 40. I know that when launching HDFS via compose, write requests go through, because the files are created; outbound traffic is just broken.

EDIT: So I sort of figured out the problem. podman-compose launches containers in bridge mode, whereas using podman directly uses pasta (a more recent rootless networking tool). In bridge mode I believe containers have a separate network namespace, and for some reason this breaks outbound traffic from the namenode. Binding the container's port 9000 to the host's port 9000 fixes it, and that's the only port that requires it.

Also, from trial and error it seems like trying to map port 9000 to another port breaks HDFS during initialization (at least with pasta). I'm guessing that's because of traffic it's trying to send to itself. It's really not designed to run the entire cluster in a single container, lol.

@linhr (Contributor) left a comment

Thanks for addressing the comments! Also thanks for the notes regarding container setup.

I have one more minor comment, but I feel this is in good shape!

It seems there is a formatting issue in compose.yml. Could you use the following commands to format the non-Rust code? More information can be found here.

pnpm install
pnpm run format
pnpm run lint

Also, when you commit the changes this time, you can add [spark tests] in the commit message so that we can get the Spark tests triggered. After this I think this PR is ready to be merged!

crates/sail-plan/src/resolver/data_source.rs (outdated, resolved)
@skewballfox (Contributor Author)

Sorry, I forgot to put [spark tests] in the commit message. Could you retrigger it on your end?

@linhr (Contributor) commented Sep 30, 2024

> Sorry, I forgot to put [spark tests] in the commit message. Could you retrigger it on your end?

No worries! Since this PR does not change PySpark logic, it's fine to skip the tests. The tests will be run after the code is merged to main.

I've also created an issue (#227) so that the tests can be triggered without the special commit message.

@linhr (Contributor) left a comment

@skewballfox We are so glad to have you as the first community contributor to Sail! Thank you for your contribution!

@linhr merged commit b5334f1 into lakehq:main on Sep 30, 2024
7 checks passed

Successfully merging this pull request may close: HDFS Support (#173)