Support complex types in sparksql hash and xxhash64 function (facebookincubator#9414)

Summary:
Currently, the Spark SQL hash functions support only primitive types.
This patch adds implementations for the complex types array, map, and row.

The expected results in the unit tests are taken from Spark's output.

Spark's implementation:
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing of complex types and stay aligned with Spark's implementation,
this patch uses a virtual function call per row and implements the function
as a vector function rather than a simple function.
Notes from the benchmark results follow.

Virtual function call per row vs. type switch per row:
The virtual function call performs 15% better because it executes 20% fewer
instructions. The switch statement involves more branch instructions, but both
methods have a similar branch misprediction rate of 2.8%. The switch statement
does not show a higher misprediction rate because its fixed pattern allows the
branch prediction unit (BPU) to handle it effectively. However, if the schema
becomes very complex and exceeds the BPU's history track buffer (currently at
1000 levels), the misprediction rate may increase.
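
The two dispatch strategies compared above can be sketched roughly as follows.
This is an illustrative Python sketch, not the Velox C++ code: the schema
encoding, the class names, and the `mix` function are hypothetical, and `mix`
is a placeholder rather than Murmur3 or xxHash64. It assumes, following the
linked Spark code as I read it, that a null value leaves the running hash
unchanged and that complex values are hashed by folding over their children.

```python
MASK32 = 0xFFFFFFFF


def mix(value, seed):
    """Placeholder 32-bit mixer (hypothetical; NOT Murmur3 or xxHash64)."""
    return ((seed * 31) ^ (value & MASK32)) & MASK32


# Strategy 1: resolve the schema once into a tree of hasher objects, then
# make one polymorphic ("virtual") call per row.
class IntHasher:
    def hash(self, value, seed):
        return seed if value is None else mix(value, seed)


class ArrayHasher:
    def __init__(self, element):
        self.element = element

    def hash(self, value, seed):
        if value is None:
            return seed
        for item in value:  # fold the running hash over the elements
            seed = self.element.hash(item, seed)
        return seed


class MapHasher:
    def __init__(self, key, val):
        self.key, self.val = key, val

    def hash(self, value, seed):
        if value is None:
            return seed
        for k, v in value:  # a map is modelled as a list of (key, value) pairs
            seed = self.val.hash(v, self.key.hash(k, seed))
        return seed


class RowHasher:
    def __init__(self, fields):
        self.fields = fields

    def hash(self, value, seed):
        if value is None:
            return seed
        for field_hasher, field_value in zip(self.fields, value):
            seed = field_hasher.hash(field_value, seed)
        return seed


def build_hasher(schema):
    """Translate a schema tuple into a hasher tree (done once, not per row)."""
    kind = schema[0]
    if kind == "int":
        return IntHasher()
    if kind == "array":
        return ArrayHasher(build_hasher(schema[1]))
    if kind == "map":
        return MapHasher(build_hasher(schema[1]), build_hasher(schema[2]))
    if kind == "row":
        return RowHasher([build_hasher(f) for f in schema[1]])
    raise ValueError(f"unsupported type: {kind}")


# Strategy 2: re-inspect the type tag for every value of every row.
def hash_by_switch(schema, value, seed):
    if value is None:
        return seed
    kind = schema[0]
    if kind == "int":
        return mix(value, seed)
    if kind == "array":
        for item in value:
            seed = hash_by_switch(schema[1], item, seed)
        return seed
    if kind == "map":
        for k, v in value:
            seed = hash_by_switch(schema[1], k, seed)
            seed = hash_by_switch(schema[2], v, seed)
        return seed
    if kind == "row":
        for field_schema, field_value in zip(schema[1], value):
            seed = hash_by_switch(field_schema, field_value, seed)
        return seed
    raise ValueError(f"unsupported type: {kind}")
```

Both strategies compute the same value; they differ only in where the type
decision is made. Python cannot reproduce the instruction-count or
misprediction numbers quoted above, so the sketch shows structure only.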

Vector function vs. simple function:
Since the function does not apply default null behavior, a simple function must
check each field for null inside the per-row call.
In contrast, a vector function first filters out the null values per column,
avoiding null checks in the top-level loop.
Evaluating both implementations across all null ratios, we observed that the
simple function can take up to 3.5 times longer than the vector function.
Checking for nulls row by row within the loop can lead to a high branch
misprediction ratio because nulls are randomly distributed, while the vector
function maintains a consistent branch misprediction ratio across all null
ratios.
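
The difference in null handling can be sketched as below. This is a hedged
Python illustration with hypothetical function names and a placeholder `mix`
function (not Murmur3 or xxHash64); columns are modelled as plain lists with
`None` for null, which is an assumption of the sketch and not how Velox
vectors are represented.

```python
MASK32 = 0xFFFFFFFF


def mix(value, seed):
    """Placeholder 32-bit mixer (hypothetical; NOT Murmur3 or xxHash64)."""
    return ((seed * 31) ^ (value & MASK32)) & MASK32


def hash_row_at_a_time(columns, seed=42):
    """Simple-function style: a null check for every field of every row."""
    num_rows = len(columns[0]) if columns else 0
    result = []
    for row in range(num_rows):
        h = seed
        for column in columns:
            value = column[row]
            if value is not None:  # data-dependent branch inside the hot loop
                h = mix(value, h)
        result.append(h)
    return result


def hash_column_at_a_time(columns, seed=42):
    """Vector-function style: select the non-null rows of a column first."""
    num_rows = len(columns[0]) if columns else 0
    hashes = [seed] * num_rows
    for column in columns:
        # One pass decides which rows participate for this column.
        non_null_rows = [r for r, v in enumerate(column) if v is not None]
        # The hashing loop itself then runs without any null check.
        for r in non_null_rows:
            hashes[r] = mix(column[r], hashes[r])
    return hashes
```

Both functions return the same hashes; a row whose fields are all null keeps
the initial seed. The performance gap reported above comes from branch
behavior in compiled code and is not something this sketch measures.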

Pull Request resolved: facebookincubator#9414

Reviewed By: mbasmanova

Differential Revision: D56783038

Pulled By: pedroerp

fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093
marin-ma authored and facebook-github-bot committed May 23, 2024
1 parent 96c51ae commit 066a72f
Showing 7 changed files with 586 additions and 87 deletions.
9 changes: 0 additions & 9 deletions velox/docs/functions/spark/binary.rst
@@ -10,30 +10,21 @@ Binary Functions
Computes the hash of one or more input values using seed value of 42. For
multiple arguments, their types can be different.
Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.


.. spark:function:: hash_with_seed(seed, x, ...) -> integer
Computes the hash of one or more input values using specified seed. For
multiple arguments, their types can be different.
Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.

.. spark:function:: xxhash64(x, ...) -> bigint
Computes the xxhash64 of one or more input values using seed value of 42.
For multiple arguments, their types can be different.
Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.

.. spark:function:: xxhash64_with_seed(seed, x, ...) -> bigint
Computes the xxhash64 of one or more input values using specified seed. For
multiple arguments, their types can be different.
Supported types are: BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, VARCHAR,
VARBINARY, REAL, DOUBLE, HUGEINT and TIMESTAMP.

.. spark:function:: md5(x) -> varbinary