"Clean" Code, Horrible Performance in Rust (chrs.dev)
from chrs@programming.dev to rust@programming.dev on 08 Apr 2024 03:39
https://programming.dev/post/12503418

#rust

Turun@feddit.de on 08 Apr 2024 06:17 next collapse

It would be interesting to see if an iterator instead of a manual for loop would increase the performance of the base case.

My guess is no, because the compiler should know they are equivalent, but it would be interesting to check anyway.

Deebster@programming.dev on 08 Apr 2024 07:00 next collapse

I wonder if the compiler checks to see if the calls are pure and are therefore safe to run in parallel. It seems like the kind of thing the Rust compiler should be able to do.

TehPers@beehaw.org on 08 Apr 2024 10:13 collapse

If by parallel you mean across multiple threads in some map-reduce algorithm, the compiler will not do that automatically, since that would be extremely surprising behavior and would, in most cases, make performance worse (it’d be interesting to see just how many shapes you’d need to iterate over before you start seeing benefits from map-reduce). If you’re referring to vectorization, then the Rust compiler does do that automatically in some cases, and I imagine it depends on how the area is calculated and whether the implementation can be inlined.
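
The thread-level version is an explicit opt-in, e.g. via the rayon crate (a sketch; the Shape enum here is my stand-in, not the article’s):

use rayon::prelude::*;

enum Shape {
    Square { side: f64 },
    Circle { radius: f64 },
}

impl Shape {
    fn area(&self) -> f64 {
        match self {
            Shape::Square { side } => side * side,
            Shape::Circle { radius } => std::f64::consts::PI * radius * radius,
        }
    }
}

// Explicit thread-level map-reduce; for small inputs the thread
// overhead usually outweighs the gain.
fn total_area(shapes: &[Shape]) -> f64 {
    shapes.par_iter().map(Shape::area).sum()
}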

onlinepersona@programming.dev on 08 Apr 2024 08:03 collapse

Do you mean this for loop?

for shape in &shapes {
  accum += shape.area();
}

That does use an iterator. As the Rust docs put it:

for-in-loops, or to be more precise, iterator loops, are a simple syntactic sugar over a common practice within Rust, which is to loop over anything that implements IntoIterator until the iterator returned by .into_iter() returns None (or the loop body uses break).
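
Roughly, that means the loop above expands to something like this (a sketch, not the compiler’s exact expansion):

let mut iter = (&shapes).into_iter();
loop {
    match iter.next() {
        Some(shape) => accum += shape.area(),
        None => break,
    }
}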

Anti Commercial AI thingy

CC BY-NC-SA 4.0

arendjr@programming.dev on 08 Apr 2024 08:44 collapse

I think they meant using for accumulating, like this:

shapes.iter().map(Shape::area).sum()
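
(If shapes is a Vec<Box<dyn Shape>>, which I assume is the article’s setup, Shape::area won’t coerce through the Box there; you’d need a closure instead: shapes.iter().map(|s| s.area()).sum::<f64>().)
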
onlinepersona@programming.dev on 08 Apr 2024 09:01 next collapse

Oh, I see. That would be interesting to benchmark too 👍

Anti Commercial AI thingy

CC BY-NC-SA 4.0

sugar_in_your_tea@sh.itjust.works on 08 Apr 2024 18:58 collapse

Off-topic, but does that “Anti Commercial AI thingy” actually work? I would assume OpenAI would just ignore it, and you’d have to prove that they did so.

onlinepersona@programming.dev on 08 Apr 2024 21:09 collapse

Dunno if it works. AI has been tricked into revealing its training data, so it’s possible that this happens and they get sued for using copyrighted material.

This is my drop in the ocean.

Anti Commercial AI thingy

CC BY-NC-SA 4.0

sugar_in_your_tea@sh.itjust.works on 08 Apr 2024 21:58 collapse

Maybe I’ll join you. :)

onlinepersona@programming.dev on 08 Apr 2024 22:30 collapse

Welcome 🙂 A drop more.

Btw, if you’re using Linux and X11, you can bind a keyboard shortcut to the following shell script (you’ll probably need to install xte, from the xautomation package).

#!/usr/bin/env bash
# Wait briefly so focus returns to the text field before typing starts.
sleep 0.5
# xte (from xautomation) synthesizes X11 keyboard events.
xte "str ::: spoiler Anti Commercial AI thingy"
xte "key Return"
xte "str [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
xte "key Return"
xte "str :::"

Anti Commercial AI thingy

CC BY-NC-SA 4.0

sugar_in_your_tea@sh.itjust.works on 09 Apr 2024 01:37 collapse

I’m on Wayland, but I’m sure I can figure something out.

I do most of my lemmy-ing on mobile, so I’ll probably make a bot to auto-edit my posts or something.

onlinepersona@programming.dev on 09 Apr 2024 07:09 collapse

Have fun! I’m curious how you’ll do it. If you figure out a way on Wayland, it would be great to read about it!

Anti Commercial AI thingy

CC BY-NC-SA 4.0

Turun@feddit.de on 08 Apr 2024 18:25 collapse

Yes. That’s what I meant.

Though I strongly expect the Rust compiler to produce identical assembly for both types of iteration.

BB_C@programming.dev on 08 Apr 2024 08:04 next collapse

No

struct Shapes<const N: usize>([Shape; N]);

impl<const N: usize> Shapes<N> {
    const fn area(&self) -> f64 { /* ... */ }
}

Bad article 🤨

zweieuro@lemmy.world on 08 Apr 2024 08:22 next collapse

Correct me if I am wrong, but isn’t “loop unrolling/unwinding” something that the C++ and Rust compilers do? Why doesn’t the loop here get unrolled?

Giooschi@lemmy.world on 08 Apr 2024 18:37 collapse

Loop unrolling is not really the speedup here; autovectorization is. Loop unrolling often helps with autovectorization, but it is not enough, especially with floating point numbers. The accumulation operation needs to be associative for vectorization, and floating point addition is not associative (i.e. (x + y) + z is not always equal to x + (y + z)). Hence autovectorizing the code would change its semantics, and the compiler is not allowed to do that.
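
For example, with f64 (values picked so the rounding is visible):

let (x, y, z) = (1e30_f64, -1e30_f64, 1.0_f64);
assert_eq!((x + y) + z, 1.0); // 0.0 + 1.0
assert_eq!(x + (y + z), 0.0); // the 1.0 is absorbed by -1e30 before the cancel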

bonus_crab@lemmy.world on 08 Apr 2024 19:49 collapse

So if (somehow) the accumulator were an integer, this loop would autovectorize and the performance difference would be smaller?

Giooschi@lemmy.world on 09 Apr 2024 10:17 collapse

Very likely, yes.
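
For floats you can get a similar effect by reordering the summation yourself, e.g. with multiple accumulators (a sketch, not code from the article):

fn sum_areas(areas: &[f64]) -> f64 {
    // Four independent accumulators make the reordering explicit,
    // so the compiler is free to map them onto SIMD lanes.
    let mut acc = [0.0_f64; 4];
    let mut chunks = areas.chunks_exact(4);
    for c in &mut chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    acc.iter().sum::<f64>() + chunks.remainder().iter().sum::<f64>()
}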

gedhrel@lemmy.world on 08 Apr 2024 09:49 next collapse

Casey’s video is interesting, but his example frames moving from 35 cycles/object to 24 cycles/object as a 1.5x speedup.

Another way to look at it: that’s an 11-cycle speedup per object.

If you’re writing a shader or a physics sim this is a massive difference.

If you’re building typical business software, it isn’t; that 10,000-line monster method does crop up, and it’s a maintenance disaster.

I think extracting “clean code principles lead to a 50% cost increase” from this is a message that needs to be taken with a degree of context.

sugar_in_your_tea@sh.itjust.works on 08 Apr 2024 18:56 next collapse

Yup. If that 11-cycle speedup is in a hot loop, then yeah: throw a bunch of comments and tests around it, perhaps keep the “clean” version around for illustrative purposes, and do the fast thing. Perhaps throw in a feature flag to switch between the “clean” and “fast but a little sketchy” versions, and maybe someone will write a way to memoize pure functions generically so the “clean” version can be used with minimal performance overhead.

Clean code should be the default, optimizations should come later as necessary.

coloredgrayscale@programming.dev on 10 Apr 2024 07:43 collapse

Keeping the clean version around seems like dangerous advice.

You know it won’t get maintained when there are changes or fixes. So by the time someone needs to rewrite that part, or the application, many years later (think migration to a different language), it will confuse more than help.

sugar_in_your_tea@sh.itjust.works on 10 Apr 2024 08:05 collapse

Easy solution: write tests to ensure equivalent behavior.
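
Something like this (a sketch; the names are made up):

fn sum_clean(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

fn sum_fast(xs: &[f64]) -> f64 {
    // stand-in for the hand-optimized version
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

#[test]
fn fast_matches_clean() {
    let xs = [1.5, 2.5, 3.0];
    assert!((sum_clean(&xs) - sum_fast(&xs)).abs() < 1e-9);
}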

bonus_crab@lemmy.world on 08 Apr 2024 19:59 collapse

For what it’s worth, the cache locality of Vec<Box<dyn Trait>> is terrible in general; I feel like if you’re iterating over a large array of things and applying a polymorphic function, you’re making a mistake.
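
To make that concrete, compare the two layouts (a sketch; the types are made up):

// Inline, cache-friendly: all variants live contiguously in the Vec.
enum Shape {
    Square { side: f64 },
    Circle { radius: f64 },
}
type Inline = Vec<Shape>;

// Pointer-chasing: each element is its own heap allocation, so a
// polymorphic loop can take a cache miss on every access.
trait Area {
    fn area(&self) -> f64;
}
type Boxed = Vec<Box<dyn Area>>;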

Cache locality isn’t a problem when you’re only accessing something once, though.

So IMO polymorphism has its place for non-iterative-compute work, e.g. web server handler functions and event-driven systems.

TehPers@beehaw.org on 08 Apr 2024 09:57 next collapse

I agree with the conclusion, and the exploration is interesting enough that I think it was worth sharing. While the author seemingly knows this already based on their conclusion, it’s still worth stressing: these kinds of microbenchmarks rarely reflect real-world performance.

This toy case doesn’t have many (if any) real-world performance-sensitive applications. At best, shapes in games come to mind, but shapes there are often represented as meshes, and if you really need the area that often, you might find that precalculating it once has more impact on performance than optimizing how fast it is calculated.

Still, the author seems aware, and it seems to just be the author sharing their fun experiment.

fzz@programming.dev on 08 Apr 2024 19:05 next collapse

Does the author know about the difference between static and dynamic dispatch? 🤦🏻‍♂️
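
For reference, the difference (a sketch, not the article’s code):

trait Shape {
    fn area(&self) -> f64;
}

// Dynamic dispatch: every call goes through a vtable at runtime,
// which blocks inlining.
fn total_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

// Static dispatch: monomorphized per concrete type, so calls can
// be inlined (and potentially vectorized).
fn total_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}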

steventrouble@programming.dev on 10 Apr 2024 23:32 collapse

So the problem discovered is that the compiler can’t optimize the loops by unrolling them (due to floating point issues?). The proposed solution is to unroll the loop manually.

But wouldn’t it be much cleaner, and more performant, to just have an attribute like #![allow lossy floating point optimizations] and let the compiler figure out the rest?
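
(As far as I know, stable Rust has no such switch today; the closest is nightly’s unstable fadd_fast/fmul_fast intrinsics, which opt individual operations into exactly these lossy reorderings.)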

Compilers exist to compile and optimize things. Manually unrolling loops might improve performance on one machine but pessimize it on another. It’s much safer, and more performant in the long term, to give the compiler the leeway to optimize things the way it sees fit.