Support complex types in sparksql hash and xxhash64 function (facebookincubator#9414)

Summary:
Currently, sparksql hash functions only support primitive types. This patch adds the implementation for complex types, including array, map and row. The expected results in UT are obtained from Spark's output.

Spark's implementation:
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing for complex types and align with Spark's implementation, this patch uses a per-row virtual function call, and the function is implemented as a vector function rather than a simple function. Below are some notes from the benchmark results.

Virtual function call per row vs. type switch per row:
The virtual function call performs 15% better due to having 20% fewer instructions. The switch statement involves more branch instructions, but both methods have a similar branch misprediction rate of 2.8%. The switch statement doesn't show a higher branch misprediction rate because its fixed pattern allows the BPU to handle it effectively. However, if the schema becomes very complex and exceeds the BPU's history track buffer (currently at 1000 levels), the misprediction rate may increase.

VectorFunction vs. simple function:
Since the function doesn't apply default null behavior, a simple function has to check each field for null inside the per-row call. In contrast, a vector function first filters the null values per column, avoiding null checks in the top-level loop. By evaluating the implementation across all null ratios for simple and vector functions, we observed that the simple function can take up to 3.5 times longer than the vector function. Checking for null values row by row within the loop can lead to a high branch misprediction ratio due to the randomness of null values, while the vector function maintains a consistent branch misprediction ratio across all null ratios during vector processing.
Pull Request resolved: facebookincubator#9414

Reviewed By: mbasmanova

Differential Revision: D56783038

Pulled By: pedroerp

fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093
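The design described in the commit message (one virtual call per row, with nulls filtered per column before the hashing loop) can be sketched roughly as follows. This is an illustration only, not the code from the patch: `ColumnHasher`, `IntColumnHasher`, `IntArrayColumnHasher`, `hashColumns` and `toyMix` are hypothetical names, `std::optional` stands in for Velox's null representation, and `toyMix` is a placeholder mixing step rather than Murmur3 or xxHash64. The sketch assumes Spark's chaining behavior, where each value is hashed with the running result as its seed and a null leaves the running result unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Placeholder mixing step (assumption: NOT Murmur3 or xxHash64). It only
// serves to show how the running hash is threaded through as the seed.
inline uint32_t toyMix(uint32_t value, uint32_t seed) {
  uint32_t h = seed ^ (value * 0x9e3779b1u);
  h = (h << 13) | (h >> 19);
  return h * 5u + 0xe6546b64u;
}

// One hasher per input column. The column type is resolved once, when the
// hasher is built, so hashing a row costs a virtual call instead of a
// type switch.
struct ColumnHasher {
  virtual ~ColumnHasher() = default;
  virtual bool isNullAt(std::size_t row) const = 0;
  // Precondition: the value at `row` is not null.
  virtual uint32_t hashNotNullAt(std::size_t row, uint32_t seed) const = 0;
};

struct IntColumnHasher : ColumnHasher {
  explicit IntColumnHasher(std::vector<std::optional<int32_t>> v)
      : values(std::move(v)) {}
  bool isNullAt(std::size_t row) const override {
    return !values[row].has_value();
  }
  uint32_t hashNotNullAt(std::size_t row, uint32_t seed) const override {
    return toyMix(static_cast<uint32_t>(*values[row]), seed);
  }
  std::vector<std::optional<int32_t>> values;
};

using IntArray = std::vector<std::optional<int32_t>>;

struct IntArrayColumnHasher : ColumnHasher {
  explicit IntArrayColumnHasher(std::vector<std::optional<IntArray>> v)
      : values(std::move(v)) {}
  bool isNullAt(std::size_t row) const override {
    return !values[row].has_value();
  }
  // Folds the elements left to right; a null element keeps the running hash.
  uint32_t hashNotNullAt(std::size_t row, uint32_t seed) const override {
    uint32_t h = seed;
    for (const auto& element : *values[row]) {
      if (element.has_value()) {
        h = toyMix(static_cast<uint32_t>(*element), h);
      }
    }
    return h;
  }
  std::vector<std::optional<IntArray>> values;
};

// Column-at-a-time evaluation: the first pass over a column collects its
// non-null rows, and the second pass hashes only those rows, so the hashing
// loop itself contains no top-level null check.
std::vector<uint32_t> hashColumns(
    const std::vector<std::unique_ptr<ColumnHasher>>& columns,
    std::size_t numRows,
    uint32_t seed) {
  std::vector<uint32_t> result(numRows, seed);
  std::vector<std::size_t> nonNullRows;
  for (const auto& column : columns) {
    nonNullRows.clear();
    for (std::size_t row = 0; row < numRows; ++row) {
      if (!column->isNullAt(row)) {
        nonNullRows.push_back(row);
      }
    }
    for (std::size_t row : nonNullRows) {
      result[row] = column->hashNotNullAt(row, result[row]);
    }
  }
  return result;
}
```

With an integer column and an array column as inputs, a row whose values are all null keeps the initial seed as its hash, and a row with values `1` and `[2, null, 3]` hashes the same as chaining `1`, `2` and `3` in order.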
amitkdutta · May 23, 2024 · 066a72f
1 parent 96c51ae
commit 066a72f
Showing 7 changed files with 586 additions and 87 deletions.
diff --git a/velox/docs/functions/spark/binary.rst b/velox/docs/functions/spark/binary.rst
@@ -10,30 +10,21 @@ Binary Functions
 
     Computes the hash of one or more input values using seed value of 42. For
     multiple arguments, their types can be different.
-    Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
-    VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.
-
 
 .. spark:function:: hash_with_seed(seed, x, ...) -> integer
 
     Computes the hash of one or more input values using specified seed. For
     multiple arguments, their types can be different.
-    Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
-    VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.
 
 .. spark:function:: xxhash64(x, ...) -> bigint
 
     Computes the xxhash64 of one or more input values using seed value of 42.
     For multiple arguments, their types can be different.
-    Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
-    VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.
 
 .. spark:function:: xxhash64_with_seed(seed, x, ...) -> bigint
 
     Computes the xxhash64 of one or more input values using specified seed. For
     multiple arguments, their types can be different.
-    Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
-    VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.
 
 .. spark:function:: md5(x) -> varbinary