[ML] ELSER crashes in local serverless setup #106206
Pinging @elastic/ml-core (Team:ML)
I just confirmed that these steps don't cause a crash in the ESS CFT region running 8.14.0-SNAPSHOT. This is interesting, because the code should be very similar. Serverless is running on c6i instances in AWS. CFT is running on n2 instances in GCP. So the problem might be down to serverless or might be down to the exact type of hardware.
Logs show the crash happened on ARM:
ML nodes on serverless are supposed to be on Intel hardware. I just tried reproducing this in a serverless project and the steps worked fine. However, as expected, my ML node was on Intel. So it may be that the bug here is really "ELSER crashes on ARM". And then the next question would be: how did we end up with an ML node on ARM in serverless?
Just reading through the report more closely, this wasn't even using real serverless. It was using simulated serverless running locally on a Mac. That explains why it was on ARM. But also, running locally on a Mac, it's running Docker images in a Linux VM. We don't know how much memory that Linux VM had. It may be that it was trying to do too much in too little memory, and because of the vagaries of Docker on a Mac that ended up as a SEGV rather than an out-of-memory error. Given the circumstances I don't think this bug is anywhere near as serious as the title makes it sound.
I tried these steps on an […] (originally, I tried on an […]). Therefore, this problem really does seem to be confined to running in a Docker container in a Linux VM on ARM macOS. It's not great that this crash happens, and it's still a bug that running in Docker on a Mac doesn't work, but at least it's not going to affect customers in production.
I encountered this bug yesterday trying to set up some integration tests locally on my Mac through Docker. The problem is not ELSER-specific but happens for other trained models too. For local dev it would be quite nice to have this working.
@maxjakob Which other trained models did you try?
I deployed
and the logs showed
(line breaks are mine, to show that it's the same issue as reported above)
And I should add, this was with a regular Elasticsearch |
Looking back over the comments on this issue, I'm trying to understand whether the problem is running the Linux version of our inference code on ARM Macs. There is no reason to expect that instructions used by libtorch will be supported if they don't exist on the target platform: it uses a lot of hand-rolled SIMD code via the MKL. These instructions are sometimes emulated, but that isn't guaranteed. I would have bet that this was the cause, except that the latest error report was for a SIGSEGV (11) rather than a SIGILL (4). In any case, I think we need to understand exactly what build of our inference code is being run in this scenario.
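As a quick aside on the raw signal numbers mentioned above: 4 and 11 are the standard POSIX values for SIGILL and SIGSEGV. A minimal stdlib check (nothing here is Elasticsearch-specific) confirms the mapping:

```python
import signal

# Map the raw signal numbers from the crash reports to their names:
# 4  -> SIGILL  (illegal instruction, e.g. a SIMD opcode the CPU lacks)
# 11 -> SIGSEGV (invalid memory access)
for num in (4, 11):
    print(num, signal.Signals(num).name)
```

So an unsupported SIMD instruction would surface as signal 4, while the report here shows signal 11, a memory access fault.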
I've tested on a bunch of different Docker versions, and the good news is that before 8.13 you can run the ELSER model in Docker on macOS without it crashing. In 8.13, libtorch was upgraded (elastic/ml-cpp#2612) from 1.13 to 2.1.2. This was a major version upgrade and could have introduced some incompatibility. MKL was also upgraded in 8.13, but that shouldn't be a problem, as MKL is only used in the Linux x86 build and these crashes are on Aarch64. Perhaps something changed in the way the Docker image is created in 8.13; eliminating that possibility would be a good first step.
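When reproducing this, it helps to confirm which architecture the process actually sees inside the container, since an emulated x86 image hides the real host hardware. A minimal stdlib sketch (the listed values are assumptions about typical setups, not output from this issue):

```python
import platform

# Report the CPU architecture the interpreter is running on.
# Inside a native Linux ARM container (e.g. Docker on Apple silicon)
# this is typically "aarch64"; on Intel Linux it is "x86_64"; an
# emulated x86 image also reports "x86_64" even on an ARM host.
print(platform.machine())
```

Running this inside the Elasticsearch container would show whether the Aarch64 or x86_64 build of the inference code is in play.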
oneapi-src/oneDNN#1832 looks interesting. |
Including some ideas from @davidkyle
|
I am using macOS 13.6.6 on Intel hardware. I have self-hosted Elasticsearch version 8.13.2 on a local machine and I'm getting the same error while running infer on a Hugging Face model (sentence-transformers__stsb-distilroberta-base-v2).
Facing the same issue in different types of environments where I can't use ELSER (.elser_model_2_linux-x86_64).
Environment 1
Environment 2
Environment 3
Error
On every environment, the same behavior: I get the following error from the ELSER Linux version as soon as I try to ingest the Observability Knowledge Base. Let me know if I can help.
I also experience this issue, running plain Elasticsearch, not serverless. My setup is a MacBook M1 Pro running macOS Sonoma 14.5, and I am running Elasticsearch in a Docker container for local development and integration tests. I have tested it on 8.12.2, 8.14.1 and 8.15.0.
The crashes related to builds running
This bug seems to have been fixed back in March and hence |
Is there a documented procedure for the PyTorch upgrade in the Elasticsearch Docker image, or do we have to do it ourselves?
Elasticsearch 8.15.2 includes the PyTorch upgrade that fixes the crash. See elastic/ml-cpp#2705. Closing this issue as fixed in 8.15.2; please upgrade to the latest version.
Description
When interacting with ELSER in a local serverless setup, it crashes when attempting to perform inference.
Steps to reproduce
```shell
yarn es serverless --projectType=security --ssl
yarn start --serverless=security --ssl
```