
Segmentation fault on a linear regression model (arm & x86 chips) #500

Closed
freddycct opened this issue Oct 10, 2022 · 6 comments · Fixed by #509
Labels
gc garbage collection

Comments

@freddycct

There are two main issues here.

  1. Segmentation fault
  2. autodiff produces incorrect results with the mapreduce version of the loss. This is less of an issue because we can avoid using mapreduce.

This is the MWE:

using Lux
using Enzyme
using Random
using Optimisers

function lossMapReduce(model, X, Y, ps, st)
	# this version gives wrong results under autodiff (issue 2)
	mapreduce(+, zip(X, Y)) do (x,y)
		yhat = Lux.apply(model, x, ps, st)[1]
		(yhat[1] - y)^2
	end
end

function loss(model, X, Y, ps, st)
	# this works, but eventually triggers a segmentation fault (issue 1)
	ll = 0.0f0
	for (x, y) in zip(X, Y)
		yhat = Lux.apply(model, x, ps, st)[1]
		ll += (yhat[1] - y)^2
	end
	return ll
end

function generateToyData(rng, K, N)
	W = randn(rng, Float32, K) # this is the "real" parameters
	b = randn(rng, Float32)    # bias
	
	X = map(x -> rand(rng, Float32, K), 1:N) # features
	Y = map(x -> W' * x + b, X) # ground truth
	return X, Y, W, b
end

function main()
	rng = Random.default_rng()
	Random.seed!(rng, 0)
	
	# Lux-specific code
	model = Dense(16, 1)
	ps, st = Lux.setup(rng, model) # ps is the parameters, st is the state

	# generate some toy data
	X, Y, W, b = generateToyData(rng, 16, 1000)

	# setup optimisers
	optRule = Optimisers.Adam()
	optState = Optimisers.setup(optRule, ps)  # optimiser state based on model parameters

	# println("ps = ", ps)
	totalLoss = loss(model, X, Y, ps, st)
	println("0/100: loss = $(totalLoss)")

	totalEpochs1 = 200
	totalEpochs2 = totalEpochs1 + 10

	# this causes a segmentation fault after some 50+ epochs
	for epoch=1:totalEpochs1
		# zero the cache
		grads = Lux.fmap(zero, ps)

		# calculate gradients
		autodiff(Reverse, loss, Active, Const(model), Const(X), Const(Y), Duplicated(ps, grads), Const(st))

		# gradient update using adam optimizer
		optState, ps = Optimisers.update!(optState, ps, grads)

		totalLoss = loss(model, X, Y, ps, st)
		println("$(epoch)/$(totalEpochs1): loss = $(totalLoss)")
	end

	# this uses a different loss function (observe that the loss doesn't reduce)
	for epoch=totalEpochs1+1:totalEpochs2
		# zero the cache
		grads = Lux.fmap(zero, ps)

		# calculate gradients
		autodiff(Reverse, lossMapReduce, Active, Const(model), Const(X), Const(Y), Duplicated(ps, grads), Const(st))

		# gradient update using adam optimizer
		optState, ps = Optimisers.update!(optState, ps, grads)

		totalLoss = lossMapReduce(model, X, Y, ps, st)
		println("$(epoch)/$(totalEpochs2): lossMapReduce = $(totalLoss)")
	end
end

main()

Here are the error messages:
On Apple Silicon

signal (11): Segmentation fault: 11
in expression starting at /Users/freddy/enzyme_demo.jl:80
gc_setmark_pool_ at /Users/freddy/apps/julia/src/gc.c:0 [inlined]
gc_setmark_pool at /Users/freddy/apps/julia/src/gc.c:827 [inlined]
gc_setmark at /Users/freddy/apps/julia/src/gc.c:834 [inlined]
gc_mark_loop at /Users/freddy/apps/julia/src/gc.c:2771
_jl_gc_collect at /Users/freddy/apps/julia/src/gc.c:3098
ijl_gc_collect at /Users/freddy/apps/julia/src/gc.c:3327
maybe_collect at /Users/freddy/apps/julia/src/gc.c:903 [inlined]
jl_gc_pool_alloc_inner at /Users/freddy/apps/julia/src/gc.c:1247 [inlined]
jl_gc_pool_alloc_noinline at /Users/freddy/apps/julia/src/gc.c:1306
jl_gc_alloc_ at /Users/freddy/apps/julia/src/./julia_internal.h:369 [inlined]
ijl_box_int64 at /Users/freddy/apps/julia/src/datatype.c:1181
Allocations: 72478261 (Pool: 72422195; Big: 56066); GC: 34

On x86

signal (11): Segmentation fault
in expression starting at /home/freddy/enzyme_demo.jl:86
page_metadata at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.h:450 [inlined]
gc_setmark_pool at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:827 [inlined]
gc_setmark at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:834 [inlined]
gc_mark_loop at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:2771
_jl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:3098
ijl_gc_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:3327
maybe_collect at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:903 [inlined]
jl_gc_pool_alloc_inner at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:1247 [inlined]
jl_gc_pool_alloc_noinline at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:1306 [inlined]
jl_gc_alloc_ at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/julia_internal.h:369 [inlined]
jl_gc_alloc at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/gc.c:3372
_new_array_ at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/array.c:134 [inlined]
_new_array at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/array.c:198 [inlined]
ijl_alloc_array_1d at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-8/src/array.c:436
unknown function (ip: 0x7f9ecb4b1ff3)
Allocations: 70127916 (Pool: 70074613; Big: 53303); GC: 32
Segmentation fault
@wsmoses wsmoses added the gc garbage collection label Oct 10, 2022
@vchuravy
Member

I am trying to reproduce this, but for future reference it would be ideal if the MWE were smaller and used fewer external packages.

@freddycct
Author

This is the smallest I can get. I hope it helps.

using Enzyme

function loss(X, Y, ps, bs)
	ll = 0.0f0
	for (x, y) in zip(X, Y)
		yhat = ps * x .+ bs
		ll += (yhat[1] - y)^2
	end
	return ll
end

function main()
	ps = randn(Float32, (1, 5))
	bs = randn(Float32)

	X = map(x->rand(Float32, 5), 1:1000)
	Y = map(x->rand(Float32), 1:1000)

	grads = zero(ps)
	for epoch=1:1000
		println("$(epoch)")
		fill!(grads, 0)
		autodiff(Reverse, loss, Const(X), Const(Y), Duplicated(ps, grads), Active(bs))
	end
end

main()

@vchuravy
Member

Thanks, that was very helpful.

(rr) p jl_(vt)
Enzyme.Compiler.EnzymeTape{1024, NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5")), Tuple{NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5"), Symbol("6"), Symbol("7")), Tuple{Core.LLVMPtr{Float32, 0}, Core.LLVMPtr{Float32, 0}, UInt64, UInt8, UInt32, Core.LLVMPtr{Float32, 0}, Core.LLVMPtr{Float32, 0}}}, Any, Any, UInt64, Bool}}}
(rr) p vt->size
$9 = 8

That's clearly off, and it should have hit an assert to begin with.

@vchuravy
Member

NT = NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5")), Tuple{NamedTuple{(Symbol("1"), Symbol("2"), Symbol("3"), Symbol("4"), Symbol("5"), Symbol("6"), Symbol("7")), Tuple{Core.LLVMPtr{Float32, 0}, Core.LLVMPtr{Float32, 0}, UInt64, UInt8, UInt32, Core.LLVMPtr{Float32, 0}, Core.LLVMPtr{Float32, 0}}}, Any, Any, UInt64, Bool}}

julia> sizeof(NT)
80

julia> sizeof(NTuple{80, NT})
6400

julia> sizeof(Ref{NTuple{80, NT}}())
6400

But EnzymeTape is off...

@vchuravy
Member

Even worse, the allocation site thinks sz should be 81920:

#2  0x00007f866ccc0dc2 in jl_gc_alloc (ptls=0x56286f6cbe70, sz=81920, ty=0x7f864c6355f0)

So no one agrees on what the size of this type should be.
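For what it's worth, the three reported sizes are reconcilable arithmetically: the tape type is EnzymeTape{1024, NT} and sizeof(NT) is 80, so the allocator's 81920 is exactly 1024 * 80, while vt->size == 8 is merely pointer-sized. A small illustrative sketch (constants copied from the observations above, nothing measured fresh):

```julia
# Reconciling the sizes reported in this thread (illustrative arithmetic only).
nt_size  = 80      # sizeof(NT) from the REPL session above
tape_len = 1024    # the N parameter in EnzymeTape{1024, NT}
alloc_sz = 81920   # sz passed to jl_gc_alloc at the crash site

# The allocation site sizes the full inline tape...
@assert tape_len * nt_size == alloc_sz

# ...but vt->size == 8 is pointer-sized, so the GC's view of the object
# disagrees with what was actually allocated, which would explain the
# crash inside the GC mark loop.
println(tape_len * nt_size)
```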

@freddycct
Author

Related issue: #510
