[BUG] unable to run knn search with neural query on OS 2.16.0 #2838

Open
IanMenendez opened this issue Aug 19, 2024 · 28 comments
Labels
bug Something isn't working

@IanMenendez

IanMenendez commented Aug 19, 2024

What is the bug?
Searching with a neural query brings down OpenSearch 2.16.0.

This happens with the OS 2.16.0 image (FROM opensearchproject/opensearch:2.16.0) but not with 2.15.0 or lower.

How to reproduce the bug?

  1. Upload and deploy an ML model
We use a custom transformers model, but you can upload another one with:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT"
}

POST /_plugins/_ml/models/dxyObJEBGnTvwYNln7p8/_deploy
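
(Note: the _register call is asynchronous and returns a task ID, not a model ID. You can poll the task to get the model ID used in the _deploy call above and in the pipeline below; the IDs shown here are from my cluster and will differ on yours.)

GET /_plugins/_ml/tasks/<task_id>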

  2. Create an index with setting index.knn: true and an ingest pipeline for the model
PUT _ingest/pipeline/test
{
  "description": "",
  "processors": [
    {
      "text_embedding": {
        "model_id": "dxyObJEBGnTvwYNln7p8",
        "field_map": {
          "text": "text_embedding"
        }
      }
    }
  ]
}

PUT /testing
{
  "settings": {
    "index.knn": true,
    "index.default_pipeline": "test"
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "text_embedding": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}
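
(Optional sanity check: you can dry-run the ingest pipeline with the simulate API before indexing; the sample doc below is arbitrary.)

POST /_ingest/pipeline/test/_simulate
{
  "docs": [
    { "_source": { "text": "testing knn" } }
  ]
}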

  3. Ingest some docs
POST /testing/_doc
{
  "text": "testing knn"
}
  4. Search with a neural query
POST /testing/_search
{
  "query": {
    "neural": {
      "text_embedding": {
        "model_id": "dxyObJEBGnTvwYNln7p8",
        "query_text": "testing_neural"
      }
    }
  }
}
  5. Cluster goes down with:
opensearch-node2       | fatal error in thread [opensearch[opensearch-node2][refresh][T#3]], exiting
opensearch-node2       | java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries.findFromPaths(NativeLibraries.java:259)
opensearch-node2       | 	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:251)
opensearch-node2       | 	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2451)
opensearch-node2       | 	at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916)
opensearch-node2       | 	at java.base/java.lang.System.loadLibrary(System.java:2063)
opensearch-node2       | 	at org.opensearch.knn.jni.JNICommons.lambda$static$0(JNICommons.java:26)
opensearch-node2       | 	at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
opensearch-node2       | 	at org.opensearch.knn.jni.JNICommons.<clinit>(JNICommons.java:25)
opensearch-node2       | 	at org.opensearch.knn.index.codec.transfer.VectorTransferFloat.transfer(VectorTransferFloat.java:68)
opensearch-node2       | 	at org.opensearch.knn.index.codec.transfer.VectorTransferFloat.close(VectorTransferFloat.java:59)
opensearch-node2       | 	at org.opensearch.knn.index.codec.util.KNNCodecUtil.getPair(KNNCodecUtil.java:61)
opensearch-node2       | 	at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesConsumer.addKNNBinaryField(KNN80DocValuesConsumer.java:147)
opensearch-node2       | 	at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesConsumer.addBinaryField(KNN80DocValuesConsumer.java:87)
opensearch-node2       | 	at org.apache.lucene.index.BinaryDocValuesWriter.flush(BinaryDocValuesWriter.java:132)
opensearch-node2       | 	at org.apache.lucene.index.IndexingChain.writeDocValues(IndexingChain.java:424)
opensearch-node2       | 	at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:282)
opensearch-node2       | 	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445)
opensearch-node2       | 	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
opensearch-node2       | 	at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
opensearch-node2       | 	at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:643)
opensearch-node2       | 	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:578)
opensearch-node2       | 	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381)
opensearch-node2       | 	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355)
opensearch-node2       | 	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345)
opensearch-node2       | 	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
opensearch-node2       | 	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
opensearch-node2       | 	at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72)
opensearch-node2       | 	at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52)
opensearch-node2       | 	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
opensearch-node2       | 	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
opensearch-node2       | 	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433)
opensearch-node2       | 	at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413)
opensearch-node2       | 	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
opensearch-node2       | 	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
opensearch-node2       | 	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1865)
opensearch-node2       | 	at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1844)
opensearch-node2       | 	at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4648)
opensearch-node2       | 	at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1157)
opensearch-node2       | 	at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1301)
opensearch-node2       | 	at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
opensearch-node2       | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882)
opensearch-node2       | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
opensearch-node2       | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
opensearch-node2       | 	at java.base/java.lang.Thread.run(Thread.java:1583)

What is the expected behavior?
Don't crash; perform the search.

What is your host/environment?
Docker, with the image FROM opensearchproject/opensearch:2.16.0

NOTE: This issue only happens the first time you run a neural query. When you run a neural query again some time after the first one, the cluster seems to find the libraries. I do not know what changed from OS 2.15 to 2.16.

@IanMenendez added bug, untriaged labels Aug 19, 2024
@heemin32

Does it happen when you run a k-NN query directly as well, or only when you query the knn field through the neural plugin?
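
For example, something like this (a sketch; the vector values are placeholders, and the length must match the field's dimension):

POST /testing/_search
{
  "query": {
    "knn": {
      "text_embedding": {
        "vector": [0.1, 0.2],
        "k": 5
      }
    }
  }
}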

@naveentatikonda
Member

naveentatikonda commented Aug 19, 2024

This issue is coming from ml-commons through pytorch/djl. Similar issue - #2563

@IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue?

@ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

@IanMenendez
Author

Does it happen when you run a k-NN query directly as well, or only when you query the knn field through the neural plugin?

Tested this, and it only happens with the neural query, so I think it's better to move this to the neural-search or ml-commons repo. Can you do this?

@IanMenendez
Author

This issue is coming from ml-commons through pytorch/djl. Similar issue - opensearch-project/ml-commons#2563

@IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue?

@ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

Yes, I updated the issue description.

@IanMenendez
Author

IanMenendez commented Aug 20, 2024

I figured out that the issue resolves itself if you restart the cluster. For some reason the libraries are then found.

The problem is that we use the neural query in our testing pipeline, and we cannot just restart the cluster in the middle of it.

@navneet1v
Contributor

This issue is coming from ml-commons through pytorch/djl. Similar issue - opensearch-project/ml-commons#2563

@IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue?

@ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

@IanMenendez As noted by @naveentatikonda, the issue seems to be coming from ML Commons, so ideally it should be moved there. But we don't have permission to transfer the issue to ML Commons. @opensearch-project/admin, can you move this issue to the ML Commons repo?

@gaiksaya transferred this issue from opensearch-project/k-NN Aug 20, 2024
@Zhangxunmt
Collaborator

What operating system did you use to produce this error? @IanMenendez

@IanMenendez
Author

What operating system did you use to produce this error? @IanMenendez

@Zhangxunmt I replicated the issue in the OS 2.16 Docker container (FROM opensearchproject/opensearch:2.16.0).

But it's also happening on my local machine with Linux Mint 21.1.

@ylwu-amzn
Collaborator

ylwu-amzn commented Aug 29, 2024

From the error:

java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so)

it looks like the k-NN plugin can't load libopensearchknn_common.so.

@IanMenendez , can you run predict API directly ?

POST _plugins/_ml/models/<your_model_id>/_predict
{
  "text_docs": ["hello"]
}

If this works, then it's not from ml-commons, and the k-NN team can help take a look.

@IanMenendez
Author

@ylwu-amzn
I can confirm that using the predict API directly does not crash the cluster.

Using a neural query does crash the cluster.

@ylwu-amzn
Collaborator

@navneet1v, since the predict API works correctly, I think the issue is from k-NN; the log also shows it's related to k-NN (java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so). Can the k-NN team help take a look?

I have no permission on the k-NN plugin repo, so I can't transfer this issue there.

@yuye-aws
Member

I'm also hitting the same issue. Can you take a look, @navneet1v @martin-gaievski?

@yuye-aws
Member

@IanMenendez Maybe you can try this workaround: https://forum.opensearch.org/t/issue-with-opensearch-knn/12633/3

@yuye-aws
Member

Here is my error log:

=== Standard error of node `node{::integTest-0}` ===
»   ↓ last 40 non error or warning messages from /Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/logs/opensearch.stderr.log ↓
» WARNING: Using incubator modules: jdk.incubator.vector
»  WARNING: A terminally deprecated method in java.lang.System has been called
»  WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.OpenSearch (file:/Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/distro/3.0.0-ARCHIVE/lib/opensearch-3.0.0-SNAPSHOT.jar)
»  WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.OpenSearch
»  WARNING: System::setSecurityManager will be removed in a future release
»  WARNING: A terminally deprecated method in java.lang.System has been called
»  WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/distro/3.0.0-ARCHIVE/lib/opensearch-3.0.0-SNAPSHOT.jar)
»  WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
»  WARNING: System::setSecurityManager will be removed in a future release
»  fatal error in thread [opensearch[integTest-0][refresh][T#3]], exiting
»  java.lang.UnsatisfiedLinkError: no opensearchknn_common in java.library.path: /Users/yuyezhu/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
»       at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458)
»       at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916)
»       at java.base/java.lang.System.loadLibrary(System.java:2063)
»       at org.opensearch.knn.jni.JNICommons.lambda$static$0(JNICommons.java:26)
»       at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
»       at org.opensearch.knn.jni.JNICommons.<clinit>(JNICommons.java:25)
»       at org.opensearch.knn.index.codec.transfer.OffHeapFloatVectorTransfer.transfer(OffHeapFloatVectorTransfer.java:24)
»       at org.opensearch.knn.index.codec.transfer.OffHeapVectorTransfer.transfer(OffHeapVectorTransfer.java:57)
»       at org.opensearch.knn.index.codec.nativeindex.DefaultIndexBuildStrategy.buildAndWriteIndex(DefaultIndexBuildStrategy.java:70)
»       at org.opensearch.knn.index.codec.nativeindex.NativeIndexWriter.buildAndWriteIndex(NativeIndexWriter.java:154)
»       at org.opensearch.knn.index.codec.nativeindex.NativeIndexWriter.flushIndex(NativeIndexWriter.java:111)
»       at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.trainAndIndex(NativeEngines990KnnVectorsWriter.java:265)
»       at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.flush(NativeEngines990KnnVectorsWriter.java:87)
»       at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.flush(PerFieldKnnVectorsFormat.java:115)
»       at org.apache.lucene.index.VectorValuesConsumer.flush(VectorValuesConsumer.java:76)
»       at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:296)
»       at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445)
»       at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
»       at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
»       at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:643)
»       at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:578)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345)
»       at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
»       at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
»       at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72)
»       at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52)
»       at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
»       at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
»       at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433)
»       at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413)
»       at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
»       at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
»       at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1774)
»       at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1753)
»       at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4633)
»       at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1179)
»       at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1323)
»       at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
»       at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:923)
»       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
»       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
»       at java.base/java.lang.Thread.run(Thread.java:1583)

@martin-gaievski
Member

@yuye-aws This error looks specific to the native k-NN engines. Can you check which one you're using? If it's one of the native ones (faiss or nmslib), can you try defining lucene as your knn engine?

@ylwu-amzn
Collaborator

@martin-gaievski, can you transfer this issue to the k-NN plugin repo? I think changing to lucene can work, but we should also support the other two native engines. I suggest the k-NN team try this; it seems very easy to reproduce.

@yuye-aws
Member

yuye-aws commented Sep 11, 2024

@yuye-aws This error looks specific to the native k-NN engines. Can you check which one you're using? If it's one of the native ones (faiss or nmslib), can you try defining lucene as your knn engine?

My problem was resolved after switching to this index mapping. Thank you @martin-gaievski! Do you think the original issue can be resolved in a similar manner?

  "mappings": {
    "properties": {
      "text_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "lucene"
            }
          }
        }
      }
    }
  }

@IanMenendez
Author

IanMenendez commented Sep 12, 2024

Changing the index mapping to use lucene, as suggested, worked. But I still think OpenSearch should not crash the cluster if libstdc++.so.6 is not found; it should catch the exception and return an error.

@CorentinLimier

CorentinLimier commented Jan 7, 2025

Any update on this issue? I had a similar error (this time when indexing documents into an index that uses the faiss engine):

java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_faiss_avx2.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_faiss_avx2.so) 

The error occurs intermittently and crashes the cluster; after the cluster restarts, the library is somehow found again.

Switching to Lucene doesn't work in my case because of opensearch-project/k-NN#2347.

The OpenSearch version on my cluster is 2.17.

@heemin32

heemin32 commented Jan 7, 2025

@CorentinLimier Could you confirm whether the issue happens when you use the knn field directly, without neural?

@CorentinLimier

CorentinLimier commented Jan 7, 2025

@heemin32 Sadly, I can't. We rolled the index back to the lucene engine because this happened on a production cluster. Switching all the searches from neural to direct knn queries would be too costly.

I'm also still unsure when exactly this error happened: my first lead was that it occurred while indexing documents, but since we use neural queries as well, maybe a neural query was launched in parallel?
The indexing uses an ingest pipeline with an embedding model:

    "processors": [
      {
        "text_embedding": {
          "model_id": "XwhQQJQBZa8OPgyhKMzt",
          "field_map": {
            "metadata": "metadataEmbedding"
          }
        }
      }
    ]

Model :

{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_group_id": <model_group_id>,
  "model_format": "TORCH_SCRIPT"
}

We will try to upgrade to 2.18.0 and see if we still have the issue.

@0ctopus13prime

0ctopus13prime commented Jan 7, 2025

@IanMenendez
Hi, the linking error (GLIBCXX_3.4.21 not found) arises because the loader fails to find that symbol in the GCC standard library. This indicates that the libstdc++.so being loaded does not have the required symbol the k-NN shared library is looking for.
It can happen when different GCC versions were used at build time and at runtime (e.g., GCC 7 for building, GCC 5 at runtime).

So the hotfix you can take is to build a new image based on the one you're using and upgrade GCC to 7 inside it (actually, any version later than 5 will work):

sudo yum install centos-release-scl
sudo yum install devtoolset-7-gcc devtoolset-7-gcc-c++
scl enable devtoolset-7 bash
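
To confirm the mismatch, you can compare the GLIBCXX versions the k-NN library requires against the ones the loaded libstdc++ provides (paths taken from the stack trace above; assuming binutils is available in the container):

objdump -T /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so | grep GLIBCXX
strings /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 | grep GLIBCXX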

Could you share the output of running g++ --version within container?

@0ctopus13prime

0ctopus13prime commented Jan 7, 2025

@peterzhuamazon
Could you confirm which GCC version is being used within the container?
If it is later than 5, then LD_LIBRARY_PATH must be misconfigured, making the loader pick up an older libstdc++.
If it is still GCC 4 or 5, then I believe we need to consider bumping the GCC version.
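
A quick way to check which libstdc++ the loader actually resolves for the plugin library (assuming ldd is available in the image):

ldd /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so | grep libstdc++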

@CorentinLimier

@0ctopus13prime Awesome, thanks a lot for the lead. I will try that and tell you if it works, plus the output of g++ --version before and after.

One question though: from the logs I can see that it uses the libstdc++.so.6 from this path: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6

Full log :

java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_faiss_avx2.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_faiss_avx2.so) 

Would updating the image and installing another version of libstdc++.so.6 fix the one that OpenSearch uses there?

Thanks 🙏

@peterzhuamazon
Member

In 2.16 we switched the build image from CentOS 7 (glibc 2.17, gcc 9), as it is deprecated, to AL2 (glibc 2.26, gcc 10).

It is well-documented here:

In 2.19 we plan to upgrade to gcc 13 for a new knn feature soon:

In June 2025 we plan to move again from AL2 to AlmaLinux 8 (glibc 2.28, gcc 11/13) due to AL2 deprecation.

@peterzhuamazon
Member

The issue here is very similar to this libstdc++ issue:

cc: @ylwu-amzn

@CorentinLimier

@0ctopus13prime

Indeed when doing

strings /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 | grep "GLIBCXX"

We can see that GLIBCXX stops at version 3.4.19 (which explains the missing 3.4.21).

We installed the latest libstdc++ and updated the LD_LIBRARY_PATH env var in the Docker image, then rebuilt it:

ENV LD_LIBRARY_PATH="/usr/lib64/libstdc++.so.6:$LD_LIBRARY_PATH"
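
(Side note: LD_LIBRARY_PATH entries are normally directories rather than files, so the directory form may be what actually takes effect:)

ENV LD_LIBRARY_PATH="/usr/lib64:$LD_LIBRARY_PATH"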

We have two data nodes, and while both have a correct libstdc++ at /usr/lib64/libstdc++.so.6, the copy the plugin actually uses (checked with strings /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6) has the correct GLIBCXX versions on the first node but not on the second.

Both nodes run the exact same Docker image.

I suspect the file /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 is created after the image is built (I don't know how) and sometimes uses the correct libstdc++ version and sometimes not.

We are considering replacing /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 with a symlink directly to /usr/lib64/libstdc++.so.6, which we know contains the correct GLIBCXX versions on both nodes, but I'm not sure whether:

  • it will work as expected
  • there is a simpler way to guarantee that /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 contains the correct GLIBCXX versions.

What do you think?

@CorentinLimier

CorentinLimier commented Jan 10, 2025

We solved this issue by replacing /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6 with /usr/lib64/libstdc++.so.6, which contains the correct version of GLIBCXX (doing this once the container is running, since the file is generated after build time).
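
One way to script the replacement described above (a sketch; it must run after OpenSearch has populated ml_cache, e.g., from a startup hook):

ln -sf /usr/lib64/libstdc++.so.6 /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6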

The issue seems to remain in 2.18.0.
