The radix 2^51 trick (www.chosenplaintext.ca)
from tedu@inks.tedunangst.com to inks@inks.tedunangst.com on 31 May 00:51
https://inks.tedunangst.com/l/5244

The obvious solution would be to break up each 256-bit number into four 64-bit pieces (commonly referred to as “limbs”).
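As a rough illustration of that four-limb layout, here is a minimal Python sketch. The helper names (`to_limbs64`, `from_limbs64`, `add_limbs64`) are made up for this example and are not from the linked article. Python integers are arbitrary precision, so 64-bit words are simulated with a mask, and the explicit `carry` variable stands in for the CPU carry flag.

```python
MASK64 = (1 << 64) - 1  # simulates a 64-bit machine word


def to_limbs64(x):
    """Split a 256-bit integer into four 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(4)]


def from_limbs64(limbs):
    """Reassemble an integer from 64-bit limbs (least significant first)."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def add_limbs64(a, b):
    """Add two four-limb numbers, passing the carry from each limb to the next.

    Returns (limbs, carry_out). Each step depends on the carry from the
    previous one, which is the role adc plays on x86.
    """
    out = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry
        out.append(total & MASK64)  # low 64 bits stay in this limb
        carry = total >> 64         # overflow moves to the next limb
    return out, carry
```

Each loop iteration needs the carry produced by the one before it, so the four additions form a serial chain.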

The first reason is that adc is just slower to execute than a normal add on most popular x86 CPUs. Since adc has a third input (the carry flag), it’s a more complex instruction than add. It’s also used less often than add, so there is less incentive for CPU designers to spend chip area on optimizing adc performance.

The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end.
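The delayed-carry idea can be sketched the same way. This is an illustrative Python model under stated assumptions, not the article's code: the function names are hypothetical, 64-bit words are simulated with an explicit overflow check, and the top limb is assumed to hold whatever bits remain above the four 51-bit limbs (up to 52 bits for a 256-bit input).

```python
MASK51 = (1 << 51) - 1
MASK64 = (1 << 64) - 1  # each limb must still fit in a 64-bit word


def to_limbs51(x):
    """Split a 256-bit integer into five limbs, least significant first.

    The low four limbs hold 51 bits each; the top limb holds the rest.
    """
    limbs = [(x >> (51 * i)) & MASK51 for i in range(4)]
    limbs.append(x >> (51 * 4))
    return limbs


def add_no_carry(a, b):
    """Add limb by limb with no carry between limbs.

    The spare high bits of each word absorb the overflow, so the five
    additions are independent of one another.
    """
    out = [x + y for x, y in zip(a, b)]
    if any(v > MASK64 for v in out):
        raise OverflowError("limb no longer fits in 64 bits; normalize sooner")
    return out


def normalize(limbs):
    """Propagate all the saved-up carries in one pass."""
    out = []
    carry = 0
    for limb in limbs[:-1]:
        total = limb + carry
        out.append(total & MASK51)  # keep 51 bits in this limb
        carry = total >> 51         # push the excess upward
    out.append(limbs[-1] + carry)   # top limb absorbs the final carry
    return out


def from_limbs51(limbs):
    """Reassemble an integer; also valid for limbs that are not normalized."""
    return sum(limb << (51 * i) for i, limb in enumerate(limbs))
```

In this model each low limb has 13 spare bits, so on the order of thousands of additions can be accumulated with `add_no_carry` before a single `normalize` call is needed.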

#cpu #math #perf #programming

