Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the Artifact Build Process for k-NN Plugin #4386

Closed
naveentatikonda opened this issue Jan 30, 2024 · 26 comments
Closed

Change the Artifact Build Process for k-NN Plugin #4386

naveentatikonda opened this issue Jan 30, 2024 · 26 comments

Comments

@naveentatikonda
Copy link
Member

naveentatikonda commented Jan 30, 2024

Problem Statement

For the 2.12 release, in k-NN plugin a new feature Faiss SQFP16 is being supported which also includes the SIMD optimization (Single Instruction Multiple Data) through AVX2 on x86 architecture and using NEON on ARM architecture. This SIMD optimization helps to boost the overall performance a lot. But, this might be a breaking change for the users who are upgrading from version 2.11 to 2.12 if their system or processor doesn’t support these optimizations which might result in a crash.

Proposed Solution

To solve this issue, we will be building two different versions of Faiss library(with same name) in k-NN plugin, one with SIMD optimization enabled and the other one without SIMD optimization and store them in different sub directories (as they both have same name). By default, during runtime we will load the lib without SIMD optimization.

For those users who know that their system supports these latest optimizations and want to try them, they can enable it by using an optional experimental feature flag something like -Dopensearch.kn.experimental.feature.faiss.simd.enabled=true. For this, they need to stop the existing cluster and need to pass the above optional flag during bootup which will load the other Faiss library with SIMD optimization(AVX2 or NEON depending on architecture).

@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Jan 30, 2024
@gaiksaya
Copy link
Member

[Triage] Removing the untriage label here since @peterzhuamazon @prudhvigodithi @bbarani are aware of the issue and have discussed the possible next steps. Would be helpful if someone can post them here.
Thanks!

@peterzhuamazon
Copy link
Member

peterzhuamazon commented Jan 30, 2024

Work with Naveen to help him disable avx (avx / avx2 / avx512......) instruction set on an existing EC2 instance:

  • Open /etc/default/grub, update GRUB_CMDLINE_LINUX_DEFAULT line with noxsave.
  • On Debian/Ubuntu, run sudo update-grub; on RHEL/CentOS, run sudo grub2-mkconfig -o /boot/grub2/grub.cfg.
  • Reboot, use lscpu to check the instruction set.

Before disable AVX:

    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq s
                         sse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq r
                         dseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

After disable AVX:

    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq s
                         sse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb ida arat

@naveentatikonda will take a look and test both lib files to confirm the results. Also, will post comparisons on the SIMD disabled vs. enabled performance metrics.

Thanks.

@bbarani
Copy link
Member

bbarani commented Jan 30, 2024

Tagging @dblock @elfisher K-NN will be adding optional experimental feature flag -Dopensearch.kn.experimental.feature.faiss.simd.enabled=true to enable SIMD optimization as some processors doesn't support these optimizations resulting in a crash of K-NN plugin.

@naveentatikonda Do we need to add any documentation around this flag? If so, can you please create a documentation issue and associated PR's in documentation-website repo?

@naveentatikonda
Copy link
Member Author

naveentatikonda commented Jan 30, 2024

@naveentatikonda Do we need to add any documentation around this flag? If so, can you please create a documentation issue and associated PR's in documentation-website repo?

@bbarani I have opened a PR for Faiss SQFP16 feature documentation. Will include the documentation for these changes in that PR.
opensearch-project/documentation-website#6249

@naveentatikonda
Copy link
Member Author

naveentatikonda commented Jan 30, 2024

The cluster crashed resulting in below error when I tried to copy the AVX2 enabled Faiss library after disabling AVX2 on the machine, spin up the cluster and ran a benchmarking test which hits the AVX2 logic in Faiss.

Later, I replaced the Faiss library with the default one (without AVX2) and spinned the cluster. Now, it works fine when I resumed the same test and there is no loss of data.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f5aac4b5188, pid=17599, tid=18074
#
# JRE version: OpenJDK Runtime Environment (17.0+35) (build 17+35-2724)
# Java VM: OpenJDK 64-Bit Server VM (17+35-2724, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libopensearchknn_faiss.so+0xb5188]  _GLOBAL__sub_I_IndexHNSW.cpp+0x8
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/ec2-user/opensearch-2.11.1/core.17599)
#
# An error report file with more information is saved as:
# logs/hs_err_pid17599.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Similarly, I ran another test on AVX2 enabled machine. I spinned up the cluster with the default Faiss library (without AVX2), and ingested some data. Later, I terminated the cluster and replaced the Faiss library with the one that has AVX2 enabled and spinned up the cluster. I don't see any data loss and the test ran successfully without any drop in recall.

@peterzhuamazon
Copy link
Member

After talking to @prudhvigodithi @naveentatikonda @navneet1v here is the path forward:

  1. On all distribution startup scripts add if statement based on env var KNN_ENABLE_SIMD. If true, use simd lib, else use old one.
  • TAR
  • Docker / Docker-compose
  • Helm
  • Ansible
  • DEB
  • RPM
  • ZIP
  1. On KNN repository, change build.gradle to build both libs, change build.sh to put the libs into different subfolders.

Thanks.

@naveentatikonda
Copy link
Member Author

  1. On KNN repository, change build.gradle to build both libs, change build.sh to put the libs into different subfolders.

@peterzhuamazon any specific naming convention for the subfolders ?
Can we name it lib and lib_simd ?

@peterzhuamazon
Copy link
Member

  1. On KNN repository, change build.gradle to build both libs, change build.sh to put the libs into different subfolders.

@peterzhuamazon any specific naming convention for the subfolders ? Can we name it lib and lib_simd ?

sure 👍

@peterzhuamazon
Copy link
Member

peterzhuamazon commented Feb 5, 2024

Changed to KNN_SIMD_ENABLED in sync with opensearch-project/k-NN#1449:

  • TAR: ✅
  • Docker / Docker-compose / Readme: ✅
  • Helm
  • Ansible
  • DEB
  • RPM
  • ZIP

@Pallavi-AWS
Copy link
Member

If this change is resulting in significant impact to the build process, we should revisit if 2.12 is the right version given its short notice.

@bbarani
Copy link
Member

bbarani commented Feb 5, 2024

If this change is resulting in significant impact to the build process, we should revisit if 2.12 is the right version given its short notice.

I agree. This requires good amount of effort from build team to accommodate this change and impacts other RC related activities. I would recommend to move this feature to 2.13.0 release as this is an experimental feature and requires additional testing before it can become GA feature. CC: @vamshin @naveentatikonda

@vamshin
Copy link
Member

vamshin commented Feb 6, 2024

Couple options we see here

Option 1: Remove the SIMD library and ship fp_16 feature in 2.12
pros: no changes required to build process, around 50% memory improvement, get feedback from community on feature.
cons: impacts search performance and poor customer experience

Option 2: Move the feature to 2.13 and have SIMD library shipped in the artifacts
pros: around 50% memory improvement and improved performance
cons: Changes required to bulild process, delays the feature by couple months

Considering the impact to search performance with Option1, we are ok to defer to 2.13 release and have the SIMD library support for better customer experience. For 2.13 we cannot afford to have this as experimental release as we now have enough head start to test mechanism to ship these libraries.

cc: @bbarani @Pallavi-AWS

@bbarani
Copy link
Member

bbarani commented Feb 6, 2024

@elfisher @dblock We need your inputs on the above options. Basically, we would need to ship a new SIMD library along with existing library to support new feature opensearch-project/k-NN#1138 through AVX2 on x86 architecture and using NEON on ARM architecture. This might be a breaking change for the users who are upgrading from version 2.11 to 2.12 if their system or processor doesn’t support these optimizations which might result in a crash.

To solve this issue, we were planning to build two different versions of Faiss library(with same name) in k-NN plugin, one with SIMD optimization enabled and the other one without SIMD optimization and allow user to take action but need your inputs before we finalize this approach for 2.13.0 release.

@elfisher
Copy link
Contributor

elfisher commented Feb 6, 2024

How does this break users if both libraries are packaged?

@bbarani
Copy link
Member

bbarani commented Feb 6, 2024

By default, during runtime they load the lib without SIMD optimization. @naveentatikonda Can you confirm the experience if user access the new feature opensearch-project/k-NN#1138 using the library without SIMD optimization?

For those users who know that their system supports these latest optimizations and want to try them, the plan was to enable it using a flag ( We initially thought of using experimental flag due to time crunch but I think we should add a dedicated flag for 2.13.0 release).

@naveentatikonda
Copy link
Member Author

By default, during runtime they load the lib without SIMD optimization. @naveentatikonda Can you confirm the experience if user access the new feature opensearch-project/k-NN#1138 using the library without SIMD optimization?

Without SIMD Optimization, users can still use the Faiss SQFP16 feature and there is no change in the UX or API parameters input. But, there is a considerable drop in the search performance upto 3X or 4X (depending on the dataset).

@naveentatikonda
Copy link
Member Author

These are some of the benchmarking results on x86 for Faiss SQFP16 comparing with and without AVX2. We can see a performance boost in Indexing Throughput and Search after enabling AVX2.

  • Indexing Throughput (docs/s) - The number of documents per second that can be ingested
Document_Cnt / (total_index_time_s + total_refresh_time_s)
Document_Cnt/ (( ingest_took_total)_s + (refresh_index_took_total)_s)
  • Query Time (ms) - Time per query, summarized as mean, p50, p90, p99
S. No. Datasets Dimension of Vector Primary Shards Space Type faiss sqfp16 - Indexing Throughput faiss sqfp16 AVX2 - Indexing Throughput faiss sqfp16 - Query time p50 (ms) faiss sqfp16 AVX2 - Query time p50 (ms) faiss sqfp16 - Query time p90 (ms) faiss sqfp16 AVX2 - Query time p90 (ms) faiss sqfp16 - Query time p99 (ms) faiss sqfp16 AVX2 - Query time p99 (ms)
1 gist-960-euclidean 960 24 L2 956 1007 84.67 25.1 101 28.3 112 30.5
2 sift-128-euclidean 128 24 L2 5692 12690 19.6 7.2 22.2 8.2 24.4 9.1
3 mnist-784-euclidean 784 24 L2 2823 5796 29 8 32 8 35 9.1

@dblock
Copy link
Member

dblock commented Feb 6, 2024

I don't think I understand the concern: after reading this there's no breaking change, and a new feature flag to enable an optimization.

By default, during runtime they load the lib without SIMD optimization.

So there's no breaking change to the user.

For those users who know that their system supports these latest optimizations and want to try them, the plan was to enable it using a flag.

This is like all feature flags.

For the long run you want to flip this when SIMD support is available, dynamically. Meaning that the software should be smart enough to pick the optimized path when possible (packaging both libraries and dynamically loading the right one).

@bbarani
Copy link
Member

bbarani commented Feb 6, 2024

Thanks @dblock. Option 1 will be a breaking change so we will have to go with Option 2 with a feature flag.

@bbarani bbarani moved this from In Progress to Backlog in OpenSearch Engineering Effectiveness Feb 12, 2024
@naveentatikonda
Copy link
Member Author

After today's discussion with @peterzhuamazon and @prudhvigodithi we figured out these four options which could be feasible:

  1. Detect if the underlying hardware supports SIMD and load the library dynamically through java from k-NN plugin after the cluster starts (use custom-codecs as a reference).
  2. Use the startup script and detect if SIMD is supported by hardware even before the cluster starts and load the library dynamically.
  3. If either Option1 or Option2 works then, provide an override flag or environment variable for the user to disable SIMD. In that case if it is set to true, even if the underlying hardware supports SIMD the libraries without SIMD support are loaded.
  4. If Option1 or Option2 doesn’t work then by default we will load the libraries without SIMD and if user enables SIMD either through feature flag or environment variable (which needs further discussion) after stopping the cluster. Then, during bootup the SIMD enabled libraries are loaded.

cc: @vamshin @bbarani

@peterzhuamazon
Copy link
Member

peterzhuamazon commented Mar 6, 2024

@peterzhuamazon
Copy link
Member

Close this as all the necessary changes are done for 2.13.0.

@navneet1v
Copy link
Contributor

Work with Naveen to help him disable avx (avx / avx2 / avx512......) instruction set on an existing EC2 instance:

  • Open /etc/default/grub, update GRUB_CMDLINE_LINUX_DEFAULT line with noxsave.
  • On Debian/Ubuntu, run sudo update-grub; on RHEL/CentOS, run sudo grub2-mkconfig -o /boot/grub2/grub.cfg.
  • Reboot, use lscpu to check the instruction set.

Before disable AVX:

    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq s
                         sse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq r
                         dseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

After disable AVX:

    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq s
                         sse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb ida arat

@naveentatikonda will take a look and test both lib files to confirm the results. Also, will post comparisons on the SIMD disabled vs. enabled performance metrics.

Thanks.

@prudhvigodithi I disabled the avx2 as mentioned in the above steps. But when I tried to enable it back by reverting the changes I did above, it didn't enable back the avx2. Can you please provide some resolution for this.

cc: @naveentatikonda

@bbarani
Copy link
Member

bbarani commented Apr 8, 2024

@navneet1v Can you please open a separate issue to track this issue since this issue is for the implementation?

@navneet1v
Copy link
Contributor

@navneet1v Can you please open a separate issue to track this issue since this issue is for the implementation?

its not the issue with implementation. I just have a specific question on how to revert the change of disabling simd on a machine.

@navneet1v
Copy link
Contributor

So I am finally able to resolve the issue. Thanks to @naveentatikonda .

So after removing the noxsave, I need to run this command: sudo grub2-mkconfig -o /boot/grub2/grub.cfg for my AL2 machine. I think we should update the comment to be more precise on how to enable or disable avx2. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

10 participants