[XLA:CPU] Allow convert natively on supported CPUs #17222
Conversation
You're linking a private commit on an Intel branch; which actual commit is it?
Thank you for the PR and sorry for the delay!
xla/service/cpu/ir_emitter.cc
Outdated
#if defined(INTEL_MKL) && defined(ENABLE_ONEDNN_V3)
        !IsNativeConvertSupportedOnThisCPU()
#else
        true
#endif
We would like to avoid having #ifdefs littered in the code as much as possible. I think a better long-term solution is to have a generic IsNativeConvertSupported that can be called regardless of whether INTEL_MKL is defined or not. I'd like to keep this PR short and focused, so I'll do some refactoring (of this method and other utils) in a future PR instead.
I think we should do this in the current CL. Even as-is, IsNativeConvertSupported seems pretty generic. It may need to be extended in the future, but that's OK.

Let's please move IsNativeConvertSupported directly into cpu/ir_emitter.cc and call it unconditionally.
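For illustration, a minimal sketch of the shape being asked for here; the helper body and call site are assumptions based on this thread, not the PR's actual diff:

```cpp
// Sketch only: centralize the preprocessor guard inside one helper so that
// call sites can invoke it unconditionally, with no #ifdef at each use.
bool IsNativeConvertSupported() {
#if defined(INTEL_MKL) && defined(ENABLE_ONEDNN_V3)
  return IsNativeConvertSupportedOnThisCPU();
#else
  return false;  // Assume no native BF16 convert support on other builds.
#endif
}

// Call site (sketch): no guard needed.
// if (!IsNativeConvertSupported()) { /* keep the conservative convert path */ }
```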
@kanvi-nervana Could you please implement the change above (or comment if you think that's not a good idea).

Another issue: IsNativeConvertSupported tests the CPU we're compiling on, not necessarily the target CPU. Could the two be different?
> Another issue: IsNativeConvertSupported tests the CPU we're compiling on, not necessarily the target CPU. Could the two be different?

Yes, they could for AOT mode (although the current thunk runtime doesn't support this mode yet).

@kanvi-nervana Please check the feature string from IrEmitter's target_machine_features() instead. You can get the string by calling getTargetFeatureString() from target_machine_.

Since we are adding a new function that checks the feature string, it's probably good to add a small unit test for it as well (feeding in feature strings with and without "amx-bf16" and "avxneconvert").
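A rough sketch of that lookup, assuming the member and method names mentioned above:

```cpp
// Sketch: query the *target* machine's feature string instead of the host
// CPU. llvm::TargetMachine::getTargetFeatureString() returns an
// llvm::StringRef such as "+avx2,+amx-bf16,...".
std::string features = target_machine_->getTargetFeatureString().str();
bool native_convert_supported =
    absl::StrContains(features, "+avxneconvert") ||
    absl::StrContains(features, "+amx-bf16");
```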
Thanks for the review, I will update the PR with the requested changes.
@penpornk I have addressed the comments. Please review. Thanks!
Thanks for making the changes. It looks good. All new code must have a test, so please add a small unit test; see Penporn's suggestion from a few days ago.
xla/service/cpu/ir_emitter.cc
Outdated
@@ -147,6 +148,16 @@ class IrEmitter::CpuElementalIrEmitter : public ElementalIrEmitter {
    return hlo_module_config_.debug_options().xla_cpu_enable_fast_min_max();
  }

  bool IsNativeConvertSupportedOnThisCPU(IrEmitter* ir_emitter) {
Nit:

- bool IsNativeConvertSupportedOnThisCPU(IrEmitter* ir_emitter) {
+ bool IsNativeConvertSupportedOnTargetCPU(IrEmitter* ir_emitter) {
You can add a unit test for this function by making it take feature_string as a parameter instead. Then the unit test can just call the function with a few feature strings to demonstrate that it works fine.
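For example, something along these lines, assuming the helper takes the feature string directly and is renamed per the earlier nit (test and suite names here are hypothetical):

```cpp
#include <gtest/gtest.h>

TEST(IsNativeConvertSupportedTest, ChecksTargetFeatureString) {
  // Feature strings that advertise native BF16 convert support.
  EXPECT_TRUE(IsNativeConvertSupportedOnTargetCPU("+avx512f,+amx-bf16"));
  EXPECT_TRUE(IsNativeConvertSupportedOnTargetCPU("+avxneconvert"));
  // Feature strings without either feature.
  EXPECT_FALSE(IsNativeConvertSupportedOnTargetCPU("+avx2,+avx512f"));
  EXPECT_FALSE(IsNativeConvertSupportedOnTargetCPU(""));
}
```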
Thank you for the changes!
xla/service/cpu/ir_emitter.cc
Outdated
if (absl::StrContains(feature_string, "+avxneconvert") ||
    absl::StrContains(feature_string, "+amx-bf16")) {
  return true;
}
return false;
One more nit, please: this can be simplified to:

- if (absl::StrContains(feature_string, "+avxneconvert") ||
-     absl::StrContains(feature_string, "+amx-bf16")) {
-   return true;
- }
- return false;
+ return absl::StrContains(feature_string, "+avxneconvert") ||
+        absl::StrContains(feature_string, "+amx-bf16");
Done
Thank you again for the PR!
Imported from GitHub PR openxla/xla#17222

The performance of some workloads dropped, and git bisect points to this commit (openxla/xla@c48011a) on XLA as the cause. The comments there indicate that LLVM optimizations are suppressed when converting from FP32 to BF16 and back, since optimizing this conversion may cause performance degradation on some CPUs. Since other CPUs can handle BF16 efficiently, the suppression is not required there and can be bypassed.

Copybara import of the project:

-- 36d28839d38860dd4a222cba2da95f07083b74c1 by Kanvi Khanna <[email protected]>: allow convert natively

-- a7f6f71a80e4848dce281aaf56cdc07de72cf7ee by Kanvi Khanna <[email protected]>: Address comments

-- 14225831b897ffcc47baefccef98b9d925cdfea1 by Kanvi Khanna <[email protected]>: Add test

-- 9693d9e90d89fef9c7c063aa9d9aa6c648915145 by Kanvi Khanna <[email protected]>: address comment

Merging this change closes #17222

PiperOrigin-RevId: 684858625
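Putting the thread together, the resulting logic is roughly the following; this is a sketch assembled from the review comments above, not the literal merged diff:

```cpp
// Sketch: suppress LLVM's FP32<->BF16 convert optimizations only when the
// target CPU lacks native convert support, i.e. when the target feature
// string contains neither "+avxneconvert" nor "+amx-bf16".
bool IsNativeConvertSupportedOnTargetCPU(absl::string_view feature_string) {
  return absl::StrContains(feature_string, "+avxneconvert") ||
         absl::StrContains(feature_string, "+amx-bf16");
}

// At the emitter's call site (names assumed from the thread):
// if (!IsNativeConvertSupportedOnTargetCPU(
//         target_machine_->getTargetFeatureString().str())) {
//   // Keep the conservative, optimization-suppressing convert path.
// }
```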