Skip to content

Missed optimization: multiple instances of a small struct don't reuse the stack allocation #141649

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ohadravid opened this issue May 27, 2025 · 11 comments
Labels
A-codegen Area: Code generation A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html A-mir-opt Area: MIR optimizations C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team

Comments

@ohadravid
Copy link
Contributor

ohadravid commented May 27, 2025

When creating multiple instances of a small struct, each instance will be allocated separately on the stack even if they are known never to overlap.

Example: the following code will generate two alloca calls that are not optimized away by LLVM:

(Godbolt)

pub struct WithOffset<T> {
    pub data: T,
    pub offset: usize,
}

#[inline(never)]
pub fn use_w(w: WithOffset<&[u8; 16]>) {
    std::hint::black_box(w);
}

#[inline(never)]
pub fn peek_w(w: &WithOffset<&[u8; 16]>) {
    std::hint::black_box(w);
}

pub fn offsets(buf: [u8; 16]) {
    let w = WithOffset {
        data: &buf,
        offset: 0,
    };

    peek_w(&w);
    use_w(w);

    let w2 = WithOffset {
        data: &buf,
        offset: 1,
    };

    peek_w(&w2);
    use_w(w2);
}

LLVM IR:

; playground::offsets
; Function Attrs: noinline nounwind
define internal fastcc void @playground::offsets(ptr noalias nocapture noundef nonnull readonly align 1 dereferenceable(16) %buf) unnamed_addr #0 {
start:
  %w2 = alloca [16 x i8], align 8
  %w = alloca [16 x i8], align 8
  store ptr %buf, ptr %w, align 8
  %0 = getelementptr inbounds nuw i8, ptr %w, i64 8
  store i64 0, ptr %0, align 8
; call playground::peek_w
  call fastcc void @playground::peek_w(ptr noalias noundef readonly align 8 dereferenceable(16) %w) #88
; call playground::use_w
  call fastcc void @playground::use_w(ptr noalias noundef readonly align 1 dereferenceable(16) %buf, i64 noundef 0) #88
  store ptr %buf, ptr %w2, align 8
  %1 = getelementptr inbounds nuw i8, ptr %w2, i64 8
  store i64 1, ptr %1, align 8
; call playground::peek_w
  call fastcc void @playground::peek_w(ptr noalias noundef readonly align 8 dereferenceable(16) %w2) #88
; call playground::use_w
  call fastcc void @playground::use_w(ptr noalias noundef readonly align 1 dereferenceable(16) %buf, i64 noundef 1) #88
  ret void
}

It seems like a call to @llvm.lifetime.{start,end}.p0 is missing. If we instead use:

pub fn closures(buf: [u8; 16]) {
    (|| {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };

        peek_w(&w);
        use_w(w);
    })();

    (|| {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };

        peek_w(&w2);
        use_w(w2);
    })();
}

We do get them and the second alloca is optimized away (see the Godbolt link).

I encountered this when working on memorysafety/rav1d#1402, where this misoptimization results in over 100 bytes of extra allocations in a specific function, which slows down the entire binary by ~0.5%.

This might also be related to #138544

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label May 27, 2025
@ohadravid ohadravid changed the title Missed optimization: creating multiple instances of a small struct results don't reuse the stack allocation Missed optimization: multiple instances of a small struct don't reuse the stack allocation May 27, 2025
@saethlin
Copy link
Member

This might also be related to #138544

Can you explain how? That issue is about a missing MIR optimization, and this issue is about an LLVM optimization.

@ohadravid
Copy link
Contributor Author

I thought it might be related to that issue because there's a "missing" StorageDead call - but maybe I just got confused 🙂

@lolbinarycat lolbinarycat added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 27, 2025
@hanna-kruppe
Copy link
Contributor

I suspect the desired optimization depends on unresolved details of Rust's operational semantic (e.g., rust-lang/unsafe-code-guidelines#188). Does moving out of w to call use_w have any special significance, in particular, does it guarantee the storage can be deallocated eagerly since nothing appears to re-initialize it? What if WithOffset were Copy? It's possible that stacked borrows or tree borrows can justify it by saying that the reference that escaped into peek_w can't be used anymore after the function returns (but I wouldn't bet on it either). However, it's hard to justify optimizations based on an aliasing model that isn't agreed upon yet.

@hanna-kruppe
Copy link
Contributor

For example, miri (without any flags, at least) doesn't take any issue with this program that stashes a pointer to x1 in the first call to peek and then reads through that pointer in the second call to peek:

use std::ptr;
use std::cell::Cell;

thread_local !{
    static LAST_X: Cell<*const u32> = const { Cell::new(ptr::null()) };
} 

fn peek(x: &u32) -> u32 {
    let last_x: *const u32 = LAST_X.get();
    let result: u32 = if last_x.is_null() {
        *x
    } else {
        unsafe { *last_x }
    };
    LAST_X.set(x);
    result
}


fn main() {
    let x1 = 5;
    dbg!(peek(&x1));
    let x2 = 8;
    dbg!(peek(&x2));
}

@saethlin
Copy link
Member

The discussion in #138544 about storage markers is not related. It is very easy to think you've found a pattern in MIR that can be optimized on with trivial analysis but write a pass that is unsound. I've done it too.

To this specific issue, I think the storage markers are what we want in -Zmir-opt-level=0, and yet the assembly seems unchanged. @hanna-kruppe can you check me on that?

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented May 27, 2025

In the MIR I'm looking at the relevant locals in offsets are _2 and _9, both of which are live until right before the return in bb4. (_8 and _15 are short-lived temporaries only live around during the use_w calls.) This is what I would expect: matching the surface Rust semantics before any optimizations, but would mean they have to go in disjoint stack slots because they have overlapping lifetimes.

@saethlin
Copy link
Member

Ah! I was looking at the wrong locals. You are right.

@ohadravid
Copy link
Contributor Author

ohadravid commented May 27, 2025

@hanna-kruppe that's a very surprising example 😮
BTW - in the Rav1d case, it is a Copy struct.

I think there's something more going on. If I use blocks like this:

pub fn offsets(buf: [u8; 16]) {
    {
        let w = WithOffset {
            data: &buf,
            offset: 0,
        };

        peek_w(&w);
        use_w(w);
    }

    {
        let w2 = WithOffset {
            data: &buf,
            offset: 1,
        };

        peek_w(&w2);
        use_w(w2);
    }
}

There are still be two alloca calls, but Miri will complain about:

fn main() {
    {
        let x1 = 5;
        dbg!(peek(&x1));
    }
    {
        let x2 = 8;
        dbg!(peek(&x2));
    }
}

so I maybe in this case the semantics are defined?

@hanna-kruppe
Copy link
Contributor

Yeah, I haven't looked into it any further, but I would expect that adding blocks so that the first local is dead before the second one is introduced would allow overlapping their storage.

@tmiasko
Copy link
Contributor

tmiasko commented May 27, 2025

In a variant with separate scopes #141649 (comment) overlapping storage of w and w2 would be valid. At the moment there are two MIR passes that inhibit such opimization by removing storage markers: CopyProp and GVN. For example, compare builds with -O and -O -Zmir-enable-passes=-CopyProp,-GVN.

@tmiasko tmiasko added the A-mir-opt Area: MIR optimizations label May 27, 2025
@nikic nikic added A-mir-opt Area: MIR optimizations and removed A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations labels May 27, 2025
@saethlin saethlin added A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html T-opsem Relevant to the opsem team labels May 27, 2025
@saethlin
Copy link
Member

I don't think it's fair to call this just a mir-opt bug, because even without any optimizations we still generate MIR that is incompatible with the desired optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html A-mir-opt Area: MIR optimizations C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team
Projects
None yet
Development

No branches or pull requests

7 participants