AVX-512 support for RSA Signing #1273

Merged (36 commits) on Sep 17, 2024
Changes from 10 commits
Commits
b9088fc
Use IFMA_AVX512 when possible for modular exponentiation.
pittma Aug 7, 2023
e6269ff
Add test coverage for consttime_x2 mod exp function
pittma Oct 23, 2023
6d2ece9
Add fuzzer coverage for BN_mod_exp_mont_consttime_x2
pittma Oct 23, 2023
e0ad9da
prevent empty translation units for compilers that don't like them
pittma Oct 30, 2023
024a9ec
properly handle AVX-512 build conditions
pittma Oct 31, 2023
cd2a3d1
fips builds require subsections
pittma Oct 31, 2023
d4d89fc
fix disallowed interaction with `OPENSSL_ia32_cap_P` in fips mode
pittma Nov 2, 2023
a0f3737
reset sections when they change for variable declaration
pittma Nov 2, 2023
8e55af5
include avx512ifma flag
pittma Nov 3, 2023
7d1ea20
handle AVX-512 mask register usage in fips delocation process
pittma Nov 15, 2023
407df8d
address review comments
pittma Jan 30, 2024
e67bbda
regen generated source
pittma Feb 1, 2024
b33709e
regenerate delocate parser
pittma Feb 1, 2024
0e7c607
AVX-512 RSA Signing: address first PR review
pittma Apr 10, 2024
b2d1327
Merge remote-tracking branch 'origin/main'
pittma Apr 10, 2024
14fefe0
Still export the parallel mod_exp implementation
pittma Apr 12, 2024
5e1c7ee
second set of review comments and documentation
pittma Apr 24, 2024
73d389d
fix generated source conflict
pittma Apr 24, 2024
087bf5c
Merge branch 'main' of github.com:aws/aws-lc into pmain
pittma Jul 25, 2024
c439bf0
address review 3 comments
pittma Jul 25, 2024
abe1124
Merge branch 'main' of github.com:aws/aws-lc
pittma Aug 7, 2024
37b4a4a
Merge branch 'main' of github.com:aws/aws-lc into pmain
pittma Sep 5, 2024
e06d8d0
further review comments
pittma Sep 4, 2024
bf9fc29
add ABI tests for new RSA AVX-512 assembly routines
pittma Sep 5, 2024
e626c2c
add dispatch tests for AVX-512 enabled RSA signing
pittma Sep 5, 2024
92b9e3f
fix dispatch test
pittma Sep 6, 2024
1055b42
Merge remote-tracking branch 'origin/main'
pittma Sep 6, 2024
58af762
Merge branch 'main' of github.com:aws/aws-lc
pittma Sep 9, 2024
56d8fd6
fix conditional build logic in dispatch test
pittma Sep 9, 2024
f925e7c
generated asm should properly exclude when using old assembler
pittma Sep 9, 2024
2473469
Merge branch 'main' of github.com:aws/aws-lc
pittma Sep 10, 2024
ef26ced
in ninja-based build, old assembler logic is already handled
pittma Sep 10, 2024
73b7b8f
Merge branch 'main' of github.com:aws/aws-lc
pittma Sep 10, 2024
506dced
Increasing the capacity of ubuntu2004_android_fips_static_release.
nebeid Sep 11, 2024
0dd53a1
Merge branch 'main' into main
nebeid Sep 11, 2024
f3715bb
Merge branch 'main' of github.com:aws/aws-lc
pittma Sep 16, 2024
42 changes: 42 additions & 0 deletions .github/workflows/mingw.yml
@@ -0,0 +1,42 @@
name: MinGW
Review comment (Contributor):

Why do we need this? Is this not covered by our existing Intel SDE tests?

Review comment (Contributor):

I'm not sure how this file got into this PR. It appears to duplicate CI tests we already have: https://github.com/aws/aws-lc/blob/main/.github/workflows/windows-alt.yml#L11

Reply (Contributor Author):

It must have come in with an intermediate merge somewhere along the way. I will remove it.

on:
  pull_request:
    branches: [ '*' ]
  push:
    branches: [ '*' ]

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
  cancel-in-progress: true
jobs:
  mingw:
    if: github.repository == 'aws/aws-lc'
    runs-on: windows-latest
    steps:
      - name: Install NASM
        uses: ilammy/[email protected]
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup MinGW
        uses: egor-tensin/[email protected]
        id: setup_mingw
        with:
          static: 0
      - name: Setup CMake
        uses: threeal/[email protected]
        with:
          generator: Ninja
          c-compiler: ${{ steps.setup_mingw.outputs.gcc }}
          cxx-compiler: ${{ steps.setup_mingw.outputs.gxx }}
          options: |
            CMAKE_SYSTEM_NAME=Windows \
            CMAKE_SYSTEM_PROCESSOR=x86_64 \
            CMAKE_BUILD_TOOL=C:/ProgramData/chocolatey/lib/mingw/tools/install/mingw64/bin/ninja.exe \
            CMAKE_FIND_ROOT_PATH=C:/ProgramData/chocolatey/lib/mingw/tools/install/mingw64 \
            CMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
            CMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
            CMAKE_FIND_ROOT_PATH_MODE_INCLUDE=ONLY \
      - name: Build Project
        run: cmake --build ./build --target all
      - name: Run tests
        run: cmake --build ./build --target run_tests
2 changes: 1 addition & 1 deletion crypto/fipsmodule/CMakeLists.txt
@@ -393,7 +393,7 @@ if(FIPS_DELOCATE)
# The flags are not required for any other compiler we are running in the CI.
if (CLANG AND (CMAKE_ASM_COMPILER_ID MATCHES "Clang" OR CMAKE_ASM_COMPILER MATCHES "clang") AND
(CMAKE_C_COMPILER_VERSION VERSION_LESS "7.0.0") AND (ARCH STREQUAL "x86_64"))
-set_source_files_properties(${CMAKE_CURRENT_BINARY_DIR}/bcm-delocated.S PROPERTIES COMPILE_FLAGS "-mavx512f -mavx512bw -mavx512dq -mavx512vl")
+set_source_files_properties(${CMAKE_CURRENT_BINARY_DIR}/bcm-delocated.S PROPERTIES COMPILE_FLAGS "-mavx512f -mavx512bw -mavx512dq -mavx512vl -mavx512ifma")
endif()

add_library(
148 changes: 45 additions & 103 deletions crypto/fipsmodule/bn/asm/rsaz-2k-avx512.pl
@@ -75,54 +75,20 @@
*STDOUT=*OUT;

if ($avx512ifma>0) {{{
-@_6_args_universal_ABI = ("%rdi","%rsi","%rdx","%rcx","%r8","%r9");

-$code.=<<___;
-.text
-.extern OPENSSL_ia32cap_P
-.globl ossl_rsaz_avx512ifma_eligible
-.type ossl_rsaz_avx512ifma_eligible,\@abi-omnipotent
-.align 32
-ossl_rsaz_avx512ifma_eligible:
-leaq OPENSSL_ia32cap_P(%rip),%r11
-mov 8(%r11),%r11d
-xor %eax,%eax
-and \$`1<<31|1<<21|1<<17|1<<16`, %r11d # avx512vl + avx512ifma + avx512dq + avx512f
-cmp \$`1<<31|1<<21|1<<17|1<<16`, %r11d
-cmove %r11d,%eax
-ret
-.size ossl_rsaz_avx512ifma_eligible, .-ossl_rsaz_avx512ifma_eligible
-___
+@_6_args_universal_ABI = $win64 ?
+("%rcx","%rdx","%r8","%r9","%r10","%r11") :
+("%rdi","%rsi","%rdx","%rcx","%r8","%r9");
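The `ossl_rsaz_avx512ifma_eligible` routine in this hunk loads the 32-bit word at offset 8 of `OPENSSL_ia32cap_P`, masks it against four feature bits, and returns a nonzero value only when all four are set. A minimal Python sketch of that test follows, for illustration only. The constant and function names are hypothetical; the bit-to-feature mapping is the one given in the comment on the `and` instruction.

```python
# Feature bits as listed in the assembly comment:
# avx512vl + avx512ifma + avx512dq + avx512f (names here are illustrative).
AVX512F = 1 << 16
AVX512DQ = 1 << 17
AVX512IFMA = 1 << 21
AVX512VL = 1 << 31
REQUIRED = AVX512F | AVX512DQ | AVX512IFMA | AVX512VL


def ifma_eligible(cap_word: int) -> bool:
    """Return True only if every required feature bit is set in cap_word.

    cap_word stands in for the 32-bit value read from offset 8 of the
    capability vector; how that value is obtained is outside this sketch.
    """
    return (cap_word & REQUIRED) == REQUIRED
```

Comparing the masked value against the full mask, rather than testing for nonzero, is what makes a single missing feature disqualify the fast path.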

###############################################################################
-# Almost Montgomery Multiplication (AMM) for 20-digit number in radix 2^52.
-#
-# AMM is defined as presented in the paper [1].
-#
-# The input and output are presented in 2^52 radix domain, i.e.
-# |res|, |a|, |b|, |m| are arrays of 20 64-bit qwords with 12 high bits zeroed.
-# |k0| is a Montgomery coefficient, which is here k0 = -1/m mod 2^64
-#
-# NB: the AMM implementation does not perform "conditional" subtraction step
-# specified in the original algorithm as according to the Lemma 1 from the paper
-# [2], the result will be always < 2*m and can be used as a direct input to
-# the next AMM iteration. This post-condition is true, provided the correct
-# parameter |s| (notion of the Lemma 1 from [2]) is chosen, i.e. s >= n + 2 * k,
-# which matches our case: 1040 > 1024 + 2 * 1.
-#
-# [1] Gueron, S. Efficient software implementations of modular exponentiation.
-# DOI: 10.1007/s13389-012-0031-5
-# [2] Gueron, S. Enhanced Montgomery Multiplication.
-# DOI: 10.1007/3-540-36400-5_5
-#
-# void ossl_rsaz_amm52x20_x1_ifma256(BN_ULONG *res,
+# void rsaz_amm52x20_x1_ifma256(BN_ULONG *res,
# const BN_ULONG *a,
# const BN_ULONG *b,
# const BN_ULONG *m,
# BN_ULONG k0);
###############################################################################
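The header comment above describes operands stored as 20 digits in radix 2^52 and a multiplication that skips the final conditional subtraction. The following Python sketch illustrates that arithmetic with big integers, under the assumptions stated in the comment (odd `m` below 2^1024, inputs below `2*m`). It is not the vectorized routine from the PR; all function names are hypothetical.

```python
DIGIT_BITS = 52                      # radix 2^52
NUM_DIGITS = 20                      # 20 digits cover 1040 bits
MASK52 = (1 << DIGIT_BITS) - 1


def to_radix52(x: int) -> list[int]:
    """Split x into 20 little-endian 52-bit digits (each fits a 64-bit qword)."""
    return [(x >> (DIGIT_BITS * i)) & MASK52 for i in range(NUM_DIGITS)]


def from_radix52(digits: list[int]) -> int:
    """Inverse of to_radix52."""
    return sum(d << (DIGIT_BITS * i) for i, d in enumerate(digits))


def amm52x20(a: int, b: int, m: int) -> int:
    """Almost Montgomery multiplication for odd m < 2^1024 and a, b < 2*m.

    The result is congruent to a * b * 2^-1040 (mod m) and stays below 2*m,
    but it is not necessarily below m: no final subtraction is performed.
    """
    k0 = (-pow(m, -1, 1 << 64)) % (1 << 64)   # k0 = -1/m mod 2^64
    acc = 0
    for b_i in to_radix52(b):
        acc += a * b_i
        y = (acc * k0) & MASK52               # makes acc + y*m divisible by 2^52
        acc = (acc + y * m) >> DIGIT_BITS     # exact division by the radix
    return acc
```

Because the output stays below `2*m`, it can be fed straight back in as an operand of the next multiplication, which is the property the comment relies on when it omits the subtraction step.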
{
-# input parameters ("%rdi","%rsi","%rdx","%rcx","%r8")
+# input parameters
my ($res,$a,$b,$m,$k0) = @_6_args_universal_ABI;
Review comment (Contributor):

It's not clear to me if this takes win64 into account.

Reply (Contributor Author):

I think you're right; I don't think it does either! I've updated this with a ternary check.
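The ternary check mentioned in the reply selects one of two register lists depending on the target ABI, and the function's parameters are then unpacked from the chosen list. A small Python sketch of that selection, mirroring the two lists as they appear in the diff; the helper name `arg_registers` is hypothetical.

```python
# Register lists copied from the diff; the helper below is illustrative only.
SYSV_ARGS = ("%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9")
WIN64_ARGS = ("%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11")


def arg_registers(win64: bool) -> tuple[str, ...]:
    """Pick the argument-register list for the target ABI."""
    return WIN64_ARGS if win64 else SYSV_ARGS


# Unpack the five parameters of the AMM routine, as the Perl code does.
res, a, b, m, k0 = arg_registers(win64=False)[:5]
```

Note that the Microsoft x64 calling convention itself passes only the first four integer arguments in registers, so the last two entries of the Win64 list are reproduced from the diff as written rather than taken from the ABI.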


my $mask52 = "%rax";
@@ -325,10 +291,10 @@ sub amm52x20_x1_norm {
$code.=<<___;
.text

-.globl ossl_rsaz_amm52x20_x1_ifma256
-.type ossl_rsaz_amm52x20_x1_ifma256,\@function,5
+.globl rsaz_amm52x20_x1_ifma256
+.type rsaz_amm52x20_x1_ifma256,\@function,5
.align 32
-ossl_rsaz_amm52x20_x1_ifma256:
+rsaz_amm52x20_x1_ifma256:
.cfi_startproc
endbranch
Review comment (Contributor):

Just to confirm my understanding, this is effectively the same as

#define _CET_ENDBR

right?

Reply (Contributor Author):

When CET is available, yes. Here is its definition in GCC, where endbranch is the generalized version of endbr[64|32]. If CET is not available, endbranch is a nop.

push %rbx
@@ -343,7 +309,7 @@ sub amm52x20_x1_norm {
.cfi_push %r14
push %r15
.cfi_push %r15
-.Lossl_rsaz_amm52x20_x1_ifma256_body:
+.Lrsaz_amm52x20_x1_ifma256_body:

# Zeroing accumulators
vpxord $zero, $zero, $zero
@@ -396,10 +362,10 @@ sub amm52x20_x1_norm {
.cfi_restore %rbx
lea 48(%rsp),%rsp
.cfi_adjust_cfa_offset -48
-.Lossl_rsaz_amm52x20_x1_ifma256_epilogue:
+.Lrsaz_amm52x20_x1_ifma256_epilogue:
ret
.cfi_endproc
-.size ossl_rsaz_amm52x20_x1_ifma256, .-ossl_rsaz_amm52x20_x1_ifma256
+.size rsaz_amm52x20_x1_ifma256, .-rsaz_amm52x20_x1_ifma256
___

$code.=<<___;
@@ -414,27 +380,20 @@ sub amm52x20_x1_norm {
___

###############################################################################
-# Dual Almost Montgomery Multiplication for 20-digit number in radix 2^52
-#
-# See description of ossl_rsaz_amm52x20_x1_ifma256() above for details about Almost
-# Montgomery Multiplication algorithm and function input parameters description.
-#
-# This function does two AMMs for two independent inputs, hence dual.
-#
-# void ossl_rsaz_amm52x20_x2_ifma256(BN_ULONG out[2][20],
-# const BN_ULONG a[2][20],
-# const BN_ULONG b[2][20],
-# const BN_ULONG m[2][20],
-# const BN_ULONG k0[2]);
+# void rsaz_amm52x20_x2_ifma256(BN_ULONG out[2][20],
+# const BN_ULONG a[2][20],
+# const BN_ULONG b[2][20],
+# const BN_ULONG m[2][20],
+# const BN_ULONG k0[2]);
###############################################################################

$code.=<<___;
.text

-.globl ossl_rsaz_amm52x20_x2_ifma256
-.type ossl_rsaz_amm52x20_x2_ifma256,\@function,5
+.globl rsaz_amm52x20_x2_ifma256
+.type rsaz_amm52x20_x2_ifma256,\@function,5
.align 32
-ossl_rsaz_amm52x20_x2_ifma256:
+rsaz_amm52x20_x2_ifma256:
.cfi_startproc
endbranch
push %rbx
@@ -449,7 +408,7 @@ sub amm52x20_x1_norm {
.cfi_push %r14
push %r15
.cfi_push %r15
-.Lossl_rsaz_amm52x20_x2_ifma256_body:
+.Lrsaz_amm52x20_x2_ifma256_body:

# Zeroing accumulators
vpxord $zero, $zero, $zero
@@ -514,27 +473,18 @@ sub amm52x20_x1_norm {
.cfi_restore %rbx
lea 48(%rsp),%rsp
.cfi_adjust_cfa_offset -48
-.Lossl_rsaz_amm52x20_x2_ifma256_epilogue:
+.Lrsaz_amm52x20_x2_ifma256_epilogue:
ret
.cfi_endproc
-.size ossl_rsaz_amm52x20_x2_ifma256, .-ossl_rsaz_amm52x20_x2_ifma256
+.size rsaz_amm52x20_x2_ifma256, .-rsaz_amm52x20_x2_ifma256
___
}

###############################################################################
# Constant time extraction from the precomputed table of powers base^i, where
# i = 0..2^EXP_WIN_SIZE-1
#
# The input |red_table| contains precomputations for two independent base values.
# |red_table_idx1| and |red_table_idx2| are corresponding power indexes.
#
# Extracted value (output) is 2 20 digit numbers in 2^52 radix.
#
-# void ossl_extract_multiplier_2x20_win5(BN_ULONG *red_Y,
+# void extract_multiplier_2x20_win5(BN_ULONG *red_Y,
# const BN_ULONG red_table[1 << EXP_WIN_SIZE][2][20],
# int red_table_idx1, int red_table_idx2);
#
# EXP_WIN_SIZE = 5
###############################################################################
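The comment above describes a constant-time lookup: every table entry is visited, and the wanted entry is kept by masking rather than by indexing with the secret value. The Python sketch below illustrates that select-by-mask pattern for a table shaped like `red_table[1 << EXP_WIN_SIZE][2][20]`. It is an illustration of the idea only: Python itself gives no timing guarantees, and the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1
DIGITS = 20


def extract_multiplier_2x20(red_table, idx1, idx2):
    """Return (entry idx1 of base 0, entry idx2 of base 1) by scanning all rows.

    red_table[i][k] is a list of 20 digits: power i of base k (k = 0 or 1).
    """
    out1 = [0] * DIGITS
    out2 = [0] * DIGITS
    for i, (row1, row2) in enumerate(red_table):
        # All-ones mask when i is the wanted index, all-zeros otherwise.
        mask1 = -int(i == idx1) & MASK64
        mask2 = -int(i == idx2) & MASK64
        for j in range(DIGITS):
            out1[j] |= row1[j] & mask1
            out2[j] |= row2[j] & mask2
    return out1, out2
```

The memory access pattern is the same for every index pair, which is the point of the technique: the table is read in full regardless of which powers are requested.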
{
# input parameters
@@ -553,9 +503,9 @@ sub amm52x20_x1_norm {
.text

.align 32
-.globl ossl_extract_multiplier_2x20_win5
-.type ossl_extract_multiplier_2x20_win5,\@abi-omnipotent
-ossl_extract_multiplier_2x20_win5:
+.globl extract_multiplier_2x20_win5
+.type extract_multiplier_2x20_win5,\@abi-omnipotent
+extract_multiplier_2x20_win5:
.cfi_startproc
endbranch
vmovdqa64 .Lones(%rip), $ones # broadcast ones
@@ -597,7 +547,7 @@ sub amm52x20_x1_norm {
$code.=<<___;
ret
.cfi_endproc
-.size ossl_extract_multiplier_2x20_win5, .-ossl_extract_multiplier_2x20_win5
+.size extract_multiplier_2x20_win5, .-extract_multiplier_2x20_win5
___
$code.=<<___;
.section .rodata
@@ -707,47 +657,39 @@ sub amm52x20_x1_norm {

.section .pdata
.align 4
-.rva .LSEH_begin_ossl_rsaz_amm52x20_x1_ifma256
-.rva .LSEH_end_ossl_rsaz_amm52x20_x1_ifma256
-.rva .LSEH_info_ossl_rsaz_amm52x20_x1_ifma256
+.rva .LSEH_begin_rsaz_amm52x20_x1_ifma256
+.rva .LSEH_end_rsaz_amm52x20_x1_ifma256
+.rva .LSEH_info_rsaz_amm52x20_x1_ifma256

-.rva .LSEH_begin_ossl_rsaz_amm52x20_x2_ifma256
-.rva .LSEH_end_ossl_rsaz_amm52x20_x2_ifma256
-.rva .LSEH_info_ossl_rsaz_amm52x20_x2_ifma256
+.rva .LSEH_begin_rsaz_amm52x20_x2_ifma256
+.rva .LSEH_end_rsaz_amm52x20_x2_ifma256
+.rva .LSEH_info_rsaz_amm52x20_x2_ifma256

.section .xdata
.align 8
-.LSEH_info_ossl_rsaz_amm52x20_x1_ifma256:
+.LSEH_info_rsaz_amm52x20_x1_ifma256:
.byte 9,0,0,0
.rva rsaz_def_handler
-.rva .Lossl_rsaz_amm52x20_x1_ifma256_body,.Lossl_rsaz_amm52x20_x1_ifma256_epilogue
-.LSEH_info_ossl_rsaz_amm52x20_x2_ifma256:
+.rva .Lrsaz_amm52x20_x1_ifma256_body,.Lrsaz_amm52x20_x1_ifma256_epilogue
+.LSEH_info_rsaz_amm52x20_x2_ifma256:
.byte 9,0,0,0
.rva rsaz_def_handler
-.rva .Lossl_rsaz_amm52x20_x2_ifma256_body,.Lossl_rsaz_amm52x20_x2_ifma256_epilogue
+.rva .Lrsaz_amm52x20_x2_ifma256_body,.Lrsaz_amm52x20_x2_ifma256_epilogue
___
}
}}} else {{{ # fallback for old assembler
$code.=<<___;
.text

-.globl ossl_rsaz_avx512ifma_eligible
-.type ossl_rsaz_avx512ifma_eligible,\@abi-omnipotent
-ossl_rsaz_avx512ifma_eligible:
-xor %eax,%eax
-ret
-.size ossl_rsaz_avx512ifma_eligible, .-ossl_rsaz_avx512ifma_eligible

-.globl ossl_rsaz_amm52x20_x1_ifma256
-.globl ossl_rsaz_amm52x20_x2_ifma256
-.globl ossl_extract_multiplier_2x20_win5
-.type ossl_rsaz_amm52x20_x1_ifma256,\@abi-omnipotent
-ossl_rsaz_amm52x20_x1_ifma256:
-ossl_rsaz_amm52x20_x2_ifma256:
-ossl_extract_multiplier_2x20_win5:
+.globl rsaz_amm52x20_x1_ifma256
+.globl rsaz_amm52x20_x2_ifma256
+.globl extract_multiplier_2x20_win5
+.type rsaz_amm52x20_x1_ifma256,\@abi-omnipotent
+rsaz_amm52x20_x1_ifma256:
+rsaz_amm52x20_x2_ifma256:
+extract_multiplier_2x20_win5:
.byte 0x0f,0x0b # ud2
ret
-.size ossl_rsaz_amm52x20_x1_ifma256, .-ossl_rsaz_amm52x20_x1_ifma256
+.size rsaz_amm52x20_x1_ifma256, .-rsaz_amm52x20_x1_ifma256
___
}}}
