[openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Discussion:

Gueron, Shay via RT

2013-10-22 17:30:20 UTC

Permalink

Hello all,

This patch is a contribution to OpenSSL.

It offers an efficient and constant-time implementation of EC multiplication for NIST P256 curve.
It accelerated ECDSA (sign and verify) as well as ECDH (compute and generate key), for the P256 curve.

The implementation is based on Montgomery multiplication, specifically optimized for the P-256 prime, using x86-64 assembly. It is intended to run on any x86-64 CPU (running Linux).

We measured the performance benefit of this patch on Intel(R) CPUs Codename Haswell and Sandy Bridge, using 'openssl speed'. Here are the numbers.

Single thread, single core:

Sandy ***@3.4GHz:
ECDSA sign:
OpenSSL - 8,759 sign/sec
OpenSSL with NISTP enabled - 8,822 sign/sec
This patch - 17,111 sign/sec
1.95/1.94X speedup
ECDSA verify:
OpenSSL - 2,303 verify/sec
OpenSSL with NISTP enabled - 3,831 verify/sec
This patch - 6,193 verify/sec
2.69/1.62X speedup
ECDH:
OpenSSL - 2,795 op/sec
OpenSSL with NISTP enabled - 5,189 op/sec
This patch - 8,087 op/sec
2.89X/1.56X speedup

***@3.4GHz:
ECDSA sign:
OpenSSL - 10,212 sign/sec
OpenSSL with NISTP enabled - 10,499 sign/sec
This patch - 22,747 sign/sec
2.22X/2.16X speedup
ECDSA verify:
OpenSSL - 2,647 verify/sec
OpenSSL with NISTP enabled - 4,384 verify/sec
This patch - 7,622 verify/sec
2.88X/1.74X speedup
ECDH:
OpenSSL - 3,193 op/sec
OpenSSL with NISTP enabled - 5,932 op/sec
This patch - 10,288 op/sec
3.22X/1.73X speedup

Two threads, single core:

Sandy ***@3.4GHz:
ECDSA sign:
OpenSSL - 10,076 sign/sec
OpenSSL with NISTP enabled - 8,911 sign/sec
This patch - 19,901 sign/sec
2/2.2X speedup
ECDSA verify:
OpenSSL - 2,480 verify/sec
OpenSSL with NISTP enabled - 3,840 verify/sec
This patch - 7,056 verify/sec
2.8/1.8X speedup
ECDH:
OpenSSL - 2,983 op/sec
OpenSSL with NISTP enabled - 5,290 op/sec
This patch - 9,657 op/sec
3.2X/1.8X speedup

***@3.4GHz:
ECDSA sign:
OpenSSL - 11,004 sign/sec
OpenSSL with NISTP enabled - 10,153 sign/sec
This patch - 25,094 sign/sec
2.3X/2.5X speedup
ECDSA verify:
OpenSSL - 2,598 verify/sec
OpenSSL with NISTP enabled - 3,947 verify/sec
This patch - 7,705 verify/sec
3X/2X speedup
ECDH:
OpenSSL - 3,166 op/sec
OpenSSL with NISTP enabled - 5,387 op/sec
This patch - 10,438 op/sec
3.3X/1.9X speedup

Reference:
[1] S. Gueron, V. Krasnov: "Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes"

Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2013, Intel Corp.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2013-10-23 02:38:34 UTC

Permalink

The patch file (in compressed form) is attached.
Shay Gueron
-----Original Message-----
From: The default queue via RT [mailto:***@openssl.org]
Sent: Tuesday, October 22, 2013 20:30
To: Gueron, Shay
Subject: [openssl.org #3149] AutoReply: [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Greetings,

This message has been automatically generated in response to the creation of a trouble ticket regarding:
"[patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms", a summary of which appears below.

There is no need to reply to this message right now. Your ticket has been assigned an ID of [openssl.org #3149].

Please include the string:

[openssl.org #3149]

in the subject line of all future correspondence about this issue. To do so, you may reply to this message.

Thank you,
***@openssl.org

-------------------------------------------------------------------------
Hello all,

This patch is a contribution to OpenSSL.

It offers an efficient and constant-time implementation of EC multiplication for NIST P256 curve.
It accelerated ECDSA (sign and verify) as well as ECDH (compute and generate key), for the P256 curve.

The implementation is based on Montgomery multiplication, specifically optimized for the P-256 prime, using x86-64 assembly. It is intended to run on any x86-64 CPU (running Linux).

We measured the performance benefit of this patch on Intel(R) CPUs Codename Haswell and Sandy Bridge, using 'openssl speed'. Here are the numbers.

Single thread, single core:

Sandy ***@3.4GHz:
ECDSA sign:
OpenSSL - 8,759 sign/sec
OpenSSL with NISTP enabled - 8,822 sign/sec
This patch - 17,111 sign/sec
1.95/1.94X speedup
ECDSA verify:
OpenSSL - 2,303 verify/sec
OpenSSL with NISTP enabled - 3,831 verify/sec
This patch - 6,193 verify/sec
2.69/1.62X speedup
ECDH:
OpenSSL - 2,795 op/sec
OpenSSL with NISTP enabled - 5,189 op/sec
This patch - 8,087 op/sec
2.89X/1.56X speedup

***@3.4GHz:
ECDSA sign:
OpenSSL - 10,212 sign/sec
OpenSSL with NISTP enabled - 10,499 sign/sec
This patch - 22,747 sign/sec
2.22X/2.16X speedup
ECDSA verify:
OpenSSL - 2,647 verify/sec
OpenSSL with NISTP enabled - 4,384 verify/sec
This patch - 7,622 verify/sec
2.88X/1.74X speedup
ECDH:
OpenSSL - 3,193 op/sec
OpenSSL with NISTP enabled - 5,932 op/sec
This patch - 10,288 op/sec
3.22X/1.73X speedup

Two threads, single core:

Sandy ***@3.4GHz:
ECDSA sign:
OpenSSL - 10,076 sign/sec
OpenSSL with NISTP enabled - 8,911 sign/sec
This patch - 19,901 sign/sec
2/2.2X speedup
ECDSA verify:
OpenSSL - 2,480 verify/sec
OpenSSL with NISTP enabled - 3,840 verify/sec
This patch - 7,056 verify/sec
2.8/1.8X speedup
ECDH:
OpenSSL - 2,983 op/sec
OpenSSL with NISTP enabled - 5,290 op/sec
This patch - 9,657 op/sec
3.2X/1.8X speedup

***@3.4GHz:
ECDSA sign:
OpenSSL - 11,004 sign/sec
OpenSSL with NISTP enabled - 10,153 sign/sec
This patch - 25,094 sign/sec
2.3X/2.5X speedup
ECDSA verify:
OpenSSL - 2,598 verify/sec
OpenSSL with NISTP enabled - 3,947 verify/sec
This patch - 7,705 verify/sec
3X/2X speedup
ECDH:
OpenSSL - 3,166 op/sec
OpenSSL with NISTP enabled - 5,387 op/sec
This patch - 10,438 op/sec
3.3X/1.9X speedup

Reference:
[1] S. Gueron, V. Krasnov: "Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes"

Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2013, Intel Corp.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Bodo Moeller via RT

2013-10-24 16:17:44 UTC

Permalink

Thanks for the submission!

It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for
constant-time signing) is entirely independent of the remainder of the patch,
and should be considered separately.

Regarding your reference 'S.Gueron and V.Krasnov, "Fast Prime Field Elliptic
Curve Cryptography with 256 Bit Primes"' for you NIST P-256 code, is that
document available? (Web search only pointed me back to your patch.)

I've noticed that for secret-independent constant-time memory access, your code
relies on the scattering approach. However
http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that
apparently this doesn't actually work as intended. (Dan Bernstein's earlier
references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming;
http://cr.yp.to/mac/athlon.html.)

Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global
variables is not done in a thread-safe way. How about entirely avoiding this
global state, and passing pointers down to the implementations?

Your ec_p256_points_mul implementation is much worse than necessary when then
input comprises many points (more precisely, more than one point other than the
group generator), because you call ec_p256_windowed_mul multiple times
separately and add the results. I'd suggest instead to implement this modeled
on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right
point multiplication. (This avoids the additional point-double operations from
the separate point multiplication algorithm executions going through each
additional scalar.) Your approach for precomputation also is different (using
fewer point operations based on a larger precomputed table than the one we
currently use in ec_GFp_nistp256_points_mul) -- that table size still seems
appropriate, so keeping that probably makes sense.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2013-10-29 08:44:11 UTC

Permalink

Thanks you Bodo, for the comments.

Here are some quick answers

Post by Bodo Moeller via RT
It seems that the BN_MONT_CTX-related code

The optimization made for the computation of the modular inverse in the ECDSA sigh, is using const-time mod-exp.
Indeed, this is independent of the rest of the patch, and it can be used independently (for other usages of the library).
We included this addition in the patch for the particular usage in ECDSA.

The paper: it will be posted soon.

Post by Bodo Moeller via RT
Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way.

This initialization is used for selecting a code path that would use ADCX/ADOX instructions when the processor supports them. The outcome depends only on the appropriate CPUID bits. Therefore, there is no “thread-safe” issue (because any thread would select the same path).
Of course, feel free to use the patch code and modify this initialization to match OpenSSL conventions.

Post by Bodo Moeller via RT
Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points

Indeed right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. If there are interesting cases - optimized multi-point addition can be added.

Regards,
Shay Gueron

-----Original Message-----
From: Bodo Moeller via RT [mailto:***@openssl.org]
Sent: Thursday, October 24, 2013 19:18
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Thanks for the submission!

It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately.

Regarding your reference 'S.Gueron and V.Krasnov, "Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes"' for you NIST P-256 code, is that document available? (Web search only pointed me back to your patch.)

I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier
references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming;
http://cr.yp.to/mac/athlon.html.)

Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations?

Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest instead to implement this modeled on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay

2013-10-29 08:43:49 UTC

Permalink

Thanks you Bodo, for the comments.

Here are some quick answers

Post by Bodo Moeller via RT
It seems that the BN_MONT_CTX-related code

Post by Bodo Moeller via RT
Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way.

Post by Bodo Moeller via RT
Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points

Indeed right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. If there are interesting cases - optimized multi-point addition can be added.

Regards,
Shay Gueron

-----Original Message-----
From: Bodo Moeller via RT [mailto:***@openssl.org]
Sent: Thursday, October 24, 2013 19:18
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Thanks for the submission!

It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately.

Regarding your reference 'S.Gueron and V.Krasnov, "Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes"' for you NIST P-256 code, is that document available? (Web search only pointed me back to your patch.)

I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier
references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming;
http://cr.yp.to/mac/athlon.html.)

Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations?

Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest instead to implement this modeled on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
��H��7��m��
)z{,�� ޖ�fz{Lj)b��)z{,�ׯ��h�

Bodo Moeller via RT

2013-10-29 09:46:37 UTC

Permalink

Post by Gueron, Shay via RT
This initialization is used for selecting a code path that would use

ADCX/ADOX

Post by Gueron, Shay via RT
instructions when the processor supports them. The outcome depends only on
the appropriate CPUID bits. Therefore, there is no “thread-safe” issue

(because

Post by Gueron, Shay via RT
any thread would select the same path).

I understand that that's the idea, and would have considered the code to be
safe a while ago (and might have written the initialization just like that),
but actually compiler transformations that are legal with the C memory model
could break this. See
http://static.usenix.org/event/hotpar11/tech/final_files/Boehm.pdf for
inspiration.

Post by Gueron, Shay via RT

Post by Bodo Moeller via RT
Your ec_p256_points_mul implementation is much worse than necessary when

then input comprises many points

Post by Gueron, Shay via RT
Indeed right. However, this patch is intended to optimize ECDSA sign/verify

(and ECDH). This usage does not require adding more than a single point.

Sure, but there's no compelling reason to make the other (rarer) use cases
slower. Also, adapting the addition/subtraction chain used in the existing
crypto/ec/ecp_nistp256.c (modified Booth encoding instead of unsigned windows)
could make the new implementation even faster.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2013-11-08 08:42:06 UTC

Permalink

OK Bodo,

Here is an updated version of the patch.

Addressing a) "pointer to the function" (to select ADCX/ADOX) and b) multiple points addition

There is (only) ~1% performance deterioration in due to the pointer being passed now, instead of (originally) being static. You can choose which style is preferable.

Shay

-----Original Message-----
From: Bodo Moeller via RT [mailto:***@openssl.org]
Sent: Tuesday, October 29, 2013 11:47
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Post by Gueron, Shay via RT
This initialization is used for selecting a code path that would use

ADCX/ADOX

Post by Gueron, Shay via RT
instructions when the processor supports them. The outcome depends
only on the appropriate CPUID bits. Therefore, there is no
âthread-safeâ issue

(because

Post by Gueron, Shay via RT
any thread would select the same path).

I understand that that's the idea, and would have considered the code to be safe a while ago (and might have written the initialization just like that), but actually compiler transformations that are legal with the C memory model could break this. See http://static.usenix.org/event/hotpar11/tech/final_files/Boehm.pdf for inspiration.

Post by Gueron, Shay via RT

Post by Bodo Moeller via RT
Your ec_p256_points_mul implementation is much worse than necessary when

then input comprises many points

Post by Gueron, Shay via RT
Indeed right. However, this patch is intended to optimize ECDSA sign/verify

(and ECDH). This usage does not require adding more than a single point.

Sure, but there's no compelling reason to make the other (rarer) use cases slower. Also, adapting the addition/subtraction chain used in the existing crypto/ec/ecp_nistp256.c (modified Booth encoding instead of unsigned windows) could make the new implementation even faster.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Bodo Moeller via RT

2013-11-08 10:08:40 UTC

Permalink

Post by Gueron, Shay via RT
Here is an updated version of the patch.
Addressing a) "pointer to the function" (to select ADCX/ADOX) and b)
multiple points addition
There is (only) ~1% performance deterioration in due to the pointer being
passed now, instead of (originally) being static. You can choose which
style is preferable.

Nico Williams

2013-11-08 18:48:34 UTC

Permalink

Post by Bodo Moeller via RT
Alternatives would be (a) using a new lock for safe static initialization,

Maybe you could try my patches on my thread_safety branch of my github
clone of OpenSSL? (https://github.com/nicowilliams/openssl)

Nico
--
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Andy Polyakov via RT

2013-11-08 20:43:00 UTC

Permalink

Post by Bodo Moeller via RT

While if (functiona==NULL || functionb==NULL) { asssign functiona,
functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign
functiona } followed by if (functionb) { assign functionb } is.

Post by Bodo Moeller via RT
Do you have any comment from Intel on the concerns regarding the scattering
technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)?

As discussed off-list in this case the discrepancy is because so called
memory disambiguation logic attempting to move loads ahead of stores,
and failing when the least significant bits are same. Naturally load
ought to be given "opportunity" to *try* to get ahead. I mean if there
is enough "work done" between store and potentially conflicting load,
then load won't "try" to get ahead of the store and variation . And
indeed, if you add instructions to the test program the variation
disappears. On Intel CPUs amount "worth" 5 cycles appears to be
sufficient. For record, update for x86_64-mont modules is being prepared...

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Nico Williams

2013-11-09 01:26:24 UTC

Permalink

Post by Andy Polyakov via RT

Post by Bodo Moeller via RT
Alternatives would be (a) using a new lock for safe static initialization,
or (b) more code duplication to avoid the need for an explicit pointer
(there could be two separate implementations for the higher-level
routines). However, given the 1% performance penalty, that's a minor issue
at this point.

That's not thread-safe. There's no memory barrier so the writes can
be reordered. The compiler itself could reorder those instructions.

Nico
--
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Andy Polyakov

2013-11-09 13:20:55 UTC

Permalink

Not sent to RT, to <openssl-dev> only.

Post by Nico Williams

Post by Andy Polyakov via RT

That's not thread-safe. There's no memory barrier so the writes can
be reordered. The compiler itself could reorder those instructions.

I want to remind that in this case assignments are constant in sense
that all threads will assign same value. So that disjoint if's work out
even if threads are reading/writing out of order. As for reorder by
compiler. No, it won't do that. If it actually would have this option,
then "if (ptr) *ptr" wouldn't be safe. If compiler [or hardware] does
that kind of speculation it is also obliged to verify the assumption and
act accordingly. Or do you mean that it could reorder it as
if-if-assign-assign? This is not a problem. The only *formally* valid
concerns are pointed by Bodo in another message, and they are that
function[ab] are misaligned (and therefore access is not atomic), and
that they are used for something else earlier. As for latter. If
compiler was to use explicitly initialized shared variable for something
else, it would have to do in a way invisible to program code. It is only
possible after *full* inter-procedural analysis for given *binary*. But
then analyzer would have to be aware of transition from single- to
multi-threaded operation and stop using variable for something else
before multi-threaded part [and even initialize it accordingly]. And as
for former. In the context one should recognize that we are not talking
about *universally* portable code. The code in question *is*
platform-specific and we are free to take into consideration such
implementation specifics as object alignment [in supported compilers].
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Bodo Moeller via RT

2013-11-09 01:26:53 UTC

Permalink

Post by Andy Polyakov via RT
While if (functiona==NULL || functionb==NULL) { asssign functiona,
functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign
functiona } followed by if (functionb) { assign functionb } is.

We're implicitly assuming here that (thanks to alignment, etc.) each
pointer can be accessed "atomically", which so far seems reasonable given
the particular platform this code is for. However, the C11 memory model
also allows the compiler to assume there's no write race, and it thus
could, for example, use the same memory location to hold other temporary
values, which could then be misinterpreted as the function pointer by
concurrent threads. See
http://static.usenix.org/event/hotpar11/tech/final_files/Boehm.pdf for
ideas how this might break -- maybe not right now, but possibly with future
compilers, possibly after this code has evolved a bit.

(I'm not promising that it will actually break, but thread-safety analysis
tools are likely to complain loudly. And at some point the code might
actually fail spectacularly.)

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2014-01-24 17:28:03 UTC

Permalink

---
Here is the ECC P-256 patch (rev 7).

Including some fixes and edits : register r14 in p256_div_by_2 cleared; MONT_CTX duplicated for thread-safe execution; some reformatting of the C code to match OpenSSL style (thanks to Adam Langley of Google, for his comments!).

Note: there are some performance changes, due to OpenSSLâs recent decision to disable the ârdrandâ engine by default (http://git.openssl.org/gitweb/?p=openssl.git;a=commit;h=8a1956f3eac8b164f8c741ff1a259008bab3bac1).
This affects the current ECC OpenSSL implementation as well, in a similar way.

Shay

Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2014, Intel Corp.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Andy Polyakov via RT

2013-11-09 11:08:32 UTC

Permalink

Post by Andy Polyakov via RT

Post by Bodo Moeller via RT
Do you have any comment from Intel on the concerns regarding the scattering
technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)?

This might be misinterpreted. It's not necessarily that 5 cycles is
sufficient for load not to "try" to get ahead, but it's sufficient to
amortize eventual variations.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2013-11-13 05:16:23 UTC

Permalink

Post by Gueron, Shay via RT

Do you have any comment from Intel on the concerns regarding the scattering technique
(http://cryptojedi.org/peter/data/chesrump-20130822.pdf)?

First, a comment: it is difficult to actually understand the precise claim by the authors, from these 6 slides. The code snippet does not include sufficient information to determine what the experiment really is. With this disclaimer, my understanding is this: the authors of the presentation claim that accessing different locations on a cache line shows different latencies – depending on the location (within the cache line). The graph shows a *significantly* larger latency for doubleword #0. As a result, they insinuate that the scattering-gathering method for accessing tables does not protect against side channel analysis.

This, however, seems to me to be incorrect.

Perhaps what the experiment shows is due to “cache line splits”, where the data that is being accessed does not reside completely in a single cahce line. For example, when aligned on 56 bytes, and reading 64-bit quadwords, the first one (say Q0) is in one cache line and the rest are in another. So, when flushing (only) the cache line that holds Q0, subsequent reading of Q0 would be “slow” but reading the rest would be fast.

It is not possible to see how the array was aligned in the discussed presentation, because the code snippet does not include this part.

Anyway, this phenomenon is not a vulnerability - it is only about a proper implementation of the gather/scatter method: the elements of the table simply need to be naturally aligned in a way that every single element belongs to the same cache line. The programmer needs make sure that the elements of the table are naturally aligned (e.g., aligned (64)). As long as there are no cache line splits, there should be no timing difference between accesses to a cache line.

Shay

-----Original Message-----
From: Bodo Moeller via RT [mailto:***@openssl.org]
Sent: Friday, November 08, 2013 12:09
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Post by Gueron, Shay via RT
Here is an updated version of the patch.
Addressing a) "pointer to the function" (to select ADCX/ADOX) and b)
multiple points addition
There is (only) ~1% performance deterioration in due to the pointer
being passed now, instead of (originally) being static. You can choose
which style is preferable.

Thanks!

Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point.

Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)?

Bodo

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2013-11-17 11:51:02 UTC

Permalink

Sending with the compressed patch (OpenSSL server rejects attachments > 500KB).

-----Original Message-----
From: Gueron, Shay
Sent: Sunday, November 17, 2013 12:32
To: ***@openssl.org
Cc: Gueron, Shay
Subject: RE: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

This version is fixing a mistake in the ADCX/ADOX branch of these computations. Shay
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Gueron, Shay via RT

2013-11-24 19:15:21 UTC

Permalink

In ECDSA (and ECDH) the secret is used only once, so it is not clear how information collected (statistical?) as in "chesrump-20130822.pdf" would be useful in the discussed context. However, to avoid such discussions, here is another version of the patch: here, the elements from the lookup table are selected by accessing all elements of the table sequentially, and masking only the required value, but with a SIMD implementation to minimize the performance degradation. Indeed, when AVX2 is available, the performance hit is negligible, and with âSSE selectâ, it is ~5% (compared to the gather/scatter method).
So, for AVX2, the choice is clear. For "SSE select" - the choice between " gather/scatter" and "select" methods is a matter an option (left for the OpenSSL team).
The performance (with both "pointer to the function" & "SIMD select") is:

Ops/sec
3.4GHz sign Verify comp key
Sandy Bridge 16,153 5,949 7,901
Haswell 22,450 7,653 10,246

-- Shay

-----Original Message-----
From: Bodo Moeller via RT [mailto:***@openssl.org]
Sent: Friday, November 08, 2013 12:09
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Post by Gueron, Shay via RT
Here is an updated version of the patch.
Addressing a) "pointer to the function" (to select ADCX/ADOX) and b)
multiple points addition
There is (only) ~1% performance deterioration in due to the pointer
being passed now, instead of (originally) being static. You can choose
which style is preferable.

Gueron, Shay via RT

2013-11-25 18:29:22 UTC

Permalink

Here is an improved version of the patch (and also smaller tables ~150KB instead of 172KB).

The numbers:

Ops/sec
3.4GHz ECDSA sign ECDSA verify comp key
Sandy Bridge 17,144 6,330 8,394
Haswell 24,194 8,120 10,788

Shay

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Adam Langley via RT

2014-03-11 16:50:57 UTC

Permalink

I believe that the code submitted here is solid and it's working in
all our testing.

I've attached the version of the patch that we have used for testing.
There are a couple of tweaks to the .C code, but the biggest change is
that the Broadwell code is, sadly, effectively disabled behind an
INTEL_BROADWELL define because our assemblers don't recognise the
instructions. Additionally, the .set and .macros in the asm files have
been programmatically expanded because Clang doesn't support those
directives.

Gueron, Shay

2014-06-26 10:17:23 UTC

Permalink

Here is the updated patch: it is the same as Adamâs April 10 version that was used for testing, but this version has the Apache 2.0 license in it.
Shay

-----Original Message-----
From: Adam Langley via RT [mailto:***@openssl.org]
Sent: Tuesday, March 11, 2014 18:51
To: Gueron, Shay
Cc: openssl-***@openssl.org
Subject: [openssl.org #3149]

I believe that the code submitted here is solid and it's working in all our testing.

I've attached the version of the patch that we have used for testing.
There are a couple of tweaks to the .C code, but the biggest change is that the Broadwell code is, sadly, effectively disabled behind an INTEL_BROADWELL define because our assemblers don't recognise the instructions. Additionally, the .set and .macros in the asm files have been programmatically expanded because Clang doesn't support those directives.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Adam Langley via RT

2014-04-10 16:35:28 UTC

Permalink

My last was a little premature. In order to prevent misbehavior in the
case where some field elements are less than 4 limbs long (including a
crash when dealing with the point at infinity), the following change
should also be applied.

Cheers

AGL

Bodo Moeller via RT

2014-04-11 06:57:00 UTC

Permalink

For the record, I have reviewed Adam's versions of the code before these were
posted here, and Adam has resolved the problems that I pointed out. As of the
latest patch, I think the code is suitable for inclusion in OpenSSL. The final
missing part is support that makes it easy to build with or without this NIST
P-256 implementation, depending on whether the target platform supports it,
similar to the "enable-ec_nistp_64_gcc_128" config option for the 64-bit
optimized implementations using type __uint128_t. (The current patch
unconditionally links in the new files, but we may not even be able to process
the new assembler files.)

Also, it would be nice to have Shay review our changes to his contribution (the
March 11 "patch.bz2" as further changed by the April 10 "patch") to make sure
that while fixing the problems we found, we didn't do unwanted changes.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org

Gueron, Shay via RT

2014-06-27 05:45:57 UTC

Permalink

Here is the updated patch: it is the same as Adam's April 10 version that was used for testing, but this version has the Apache 2.0 license in it.

Shay

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Andy Polyakov via RT

2014-10-01 21:51:06 UTC

Permalink

For public reference. The final version committed to repository with
following changes:

- nameing re-biased to ecp_nistz256, both filenames and functions;
- assembly optimized for processors other than Intel Core family;
- assembly modules adapted even for non-ELF platforms;
- some of higher level subroutines are moved to assembly to improve
performance even further;
- code adapted even for 32-bit platforms;

As for latter. Effort is ongoing to initially support ARM and x86 and
preliminary results indicate ~2x performance improvement. Point worth
mentioning in the context is that I'm considering switching back to
scatter-gather method. Basically for 32-bit sake, because current method
is a bit too slow on non-SIMD platforms. But then it would be simpler to
maintain same method in all cases including x86_64 one. Objection
against scatter-gather method was based on assertion in
http://cryptojedi.org/peter/data/chesrump-20130822.pdf that timing
within cache line varies. While phenomena is real, one has to recognize
that its effect on gather procedure is either 0 or at most few cycles.
This is because if conflict can occur it occurs only once per gather
procedure. And since it's only few cycles it's not given that you can
actually measure it, because tick counter resolution is actually limited
on contemporary processors. Not to mention that conflict can be avoided
by aligning stack frame in specific manner relative to scatter table
(method used in x86_64-mont5 module by the way).

______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-***@openssl.org
Automated List Manager ***@openssl.org