Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler: use the LLVM Attributor to move allocations to the stack #4582

Open
eliasnaur opened this issue Nov 4, 2024 · 3 comments
Open
Labels
core enhancement New feature or request

Comments

@eliasnaur
Copy link
Contributor

eliasnaur commented Nov 4, 2024

As suggested by @aykevl, the LLVM Attributor's HeapToStack optimization pass potentially replaces TinyGo's limited ability to prove allocations safe to move to the stack.

I've played with the attributor in LLVM 18.1 and so far have limited luck. First and foremost, merely replacing the function-attrs pass with the attributor pass,

diff --git a/transform/optimizer.go b/transform/optimizer.go
index 54f9762b..f600e1e5 100644
--- a/transform/optimizer.go
+++ b/transform/optimizer.go
@@ -53,7 +51,7 @@ func Optimize(mod llvm.Module, config *compileopts.Config) []error {
                // Run some preparatory passes for the Go optimizer.
                po := llvm.NewPassBuilderOptions()
                defer po.Dispose()
-               optPasses := "globaldce,globalopt,ipsccp,instcombine<no-verify-fixpoint>,adce,function-attrs"
+               optPasses := "globaldce,globalopt,ipsccp,instcombine<no-verify-fixpoint>,adce,attributor"
                if llvmutil.Version() < 18 {
                        // LLVM 17 doesn't have the no-verify-fixpoint flag.
                        optPasses = "globaldce,globalopt,ipsccp,instcombine,adce,function-attrs"

results in significant slowdown in compilation of a medium sized test program, from a few seconds to almost 2 minutes. I don't know whether the attributor can be run incrementally to amortize the time cost. There is also the much faster attributor-light pass, but I haven't seen it eliminate any allocations, not even for LLVM's own heap_to_stack.ll tests. I suspect the H2S pass doesn't work or is disabled in attributor-light.

Then, I failed to eliminate allocations for "append" style functions (append, strconv.Append* etc.). For example:

declare dso_local noalias nonnull ptr @runtime.alloc(i64 %size, ptr %layout) local_unnamed_addr allockind("alloc,zeroed") allocsize(0)

declare fastcc void @runtime.printuint8(i8 %n)

declare fastcc void @runtime.memmove(ptr nocapture nofree %dst, ptr nocapture nofree %src) 

define internal fastcc { ptr, i64, i64 } @runtime.sliceGrow(ptr %oldBuf) {
entry:
  %0 = insertvalue { ptr, i64, i64 } zeroinitializer, ptr %oldBuf, 0
  ret { ptr, i64, i64 } %0
}

define fastcc void @escaping_slice() unnamed_addr {
entry:
  %makeslice = call ptr @runtime.alloc(i64 noundef 20, ptr inttoptr (i64 3 to ptr))

  %elemsBuf.priv = alloca i8, align 1

  %1 = call fastcc { ptr, i64, i64 } @runtime.sliceGrow(ptr %makeslice)

  %2 = extractvalue { ptr, i64, i64 } %1, 0
  call fastcc void @runtime.memmove(ptr %2, ptr %elemsBuf.priv)

  ; inlined and simplified print.
  %4 = load i8, ptr %makeslice, align 1
  call fastcc void @runtime.printuint8(i8 %4)
  ret void
}

define fastcc void @stack_slice() unnamed_addr {
entry:
  %makeslice = call ptr @runtime.alloc(i64 noundef 20, ptr inttoptr (i64 3 to ptr))

  %elemsBuf.priv = alloca i8, align 1

  ; This is runtime.sliceGrow inlined, the only difference to escaping_slice.
  %1 = insertvalue { ptr, i64, i64 } zeroinitializer, ptr %makeslice, 0

  %2 = extractvalue { ptr, i64, i64 } %1, 0
  call fastcc void @runtime.memmove(ptr %2, ptr %elemsBuf.priv)

  ; inlined and simplified print.
  %4 = load i8, ptr %makeslice, align 1
  call fastcc void @runtime.printuint8(i8 %4)
  ret void
}

is a reduced form of the translation of the program

package main

func main() {
	buf := make([]byte, 20)
	buf = append(buf[:0], 'g')
	for _, c := range buf {
		println(c)
	}
}

The result of running the same optimization passes that TinyGo runs:

$ opt -S -passes='globaldce,globalopt,ipsccp,instcombine<no-verify-fixpoint>,attributor' ~/Downloads/alloca.ll
; ModuleID = '/Users/a/Downloads/alloca.ll'
source_filename = "/Users/a/Downloads/alloca.ll"

; Function Attrs: allockind("alloc,zeroed") allocsize(0)
declare dso_local noalias nonnull ptr @runtime.alloc(i64, ptr) local_unnamed_addr #0

declare fastcc void @runtime.printuint8(i8) local_unnamed_addr

declare fastcc void @runtime.memmove(ptr nocapture nofree, ptr nocapture nofree) local_unnamed_addr

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal fastcc { ptr, i64, i64 } @runtime.sliceGrow(ptr noalias nofree noundef nonnull readnone dereferenceable(20) %oldBuf) unnamed_addr #1 {
entry:
  %0 = insertvalue { ptr, i64, i64 } zeroinitializer, ptr %oldBuf, 0
  ret { ptr, i64, i64 } %0
}

define fastcc void @escaping_slice() unnamed_addr {
entry:
  %makeslice = call nonnull dereferenceable(20) ptr @runtime.alloc(i64 noundef 20, ptr noundef nonnull inttoptr (i64 3 to ptr))
  %elemsBuf.priv = alloca i8, align 1
  %0 = call fastcc { ptr, i64, i64 } @runtime.sliceGrow(ptr noalias nofree noundef nonnull readnone dereferenceable(20) %makeslice) #3
  %1 = extractvalue { ptr, i64, i64 } %0, 0
  call fastcc void @runtime.memmove(ptr %1, ptr nocapture noundef nonnull %elemsBuf.priv)
  %2 = load i8, ptr %makeslice, align 1
  call fastcc void @runtime.printuint8(i8 %2)
  ret void
}

define fastcc void @stack_slice() unnamed_addr {
entry:
  %makeslice.h2s = alloca i8, i64 20, align 1
  call void @llvm.memset.p0.i64(ptr %makeslice.h2s, i8 0, i64 20, i1 false)
  %elemsBuf.priv = alloca i8, align 1
  call fastcc void @runtime.memmove(ptr nocapture nofree noundef nonnull %makeslice.h2s, ptr nocapture noundef nonnull %elemsBuf.priv)
  %0 = load i8, ptr %makeslice.h2s, align 1
  call fastcc void @runtime.printuint8(i8 %0)
  ret void
}

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: write)
declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg) #2

attributes #0 = { allockind("alloc,zeroed") allocsize(0) }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
attributes #2 = { nocallback nofree nounwind willreturn memory(argmem: write) }
attributes #3 = { nounwind memory(none) }

Note how stack_slice has its allocation eliminated, whereas escaping_slice hasn't. The only difference between the functions is the call to sliceGrow.

@eliasnaur
Copy link
Contributor Author

#4604 doesn't seem to me that H2S can fix, because it needs Go's concept of escape analysis of variable contents.

@dgryski
Copy link
Member

dgryski commented Nov 26, 2024

@eliasnaur Did the added pass remove allocations? I wonder if that's a flag we'd like to allow users to enable despite the significant hit to compile times?

@eliasnaur
Copy link
Contributor Author

I didn't do the experiments to determine whether the attributor H2S pass removed allocations. I'm under the impression that for the H2S pass to be effective, LLVM needs further work or that every pointer parameter must be marked nofree. See #4512 (comment).

@deadprogram deadprogram added enhancement New feature or request core labels Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants