Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCIS 7 crashing at regular intervals #10785

Open
mcarbonne opened this issue Dec 18, 2024 · 6 comments
Open

OCIS 7 crashing at regular intervals #10785

mcarbonne opened this issue Dec 18, 2024 · 6 comments
Assignees
Labels
Priority:p2-high Escalation, on top of current planning, release blocker Type:Bug

Comments

@mcarbonne
Copy link

Describe the bug

I upgrade my ocis instance from 5 to 7 (I followed this)
Since the update, my instance is always crashing before 10 minutes of runtime.
I've got a very long stack track, listing all routines, but the most relevant log is this Dec 18 17:40:13 : fatal error: concurrent map writes.

Dec 18 17:37:25 : 2024-12-18T17:37:25Z ERR error happened in MultiHostReverseProxy error="dial tcp 127.0.0.1:33877: connect: connection refused" fragment= m>
Dec 18 17:37:31 : 2024-12-18T17:37:31Z ERR error happened in MultiHostReverseProxy error="dial tcp 127.0.0.1:33877: connect: connection refused" fragment= m>
Dec 18 17:37:31 : 2024-12-18T17:37:31Z ERR error happened in MultiHostReverseProxy error="dial tcp 127.0.0.1:33877: connect: connection refused" fragment= m>
Dec 18 17:40:13 : 2024-12-18T17:40:13Z ERR failed to authenticate the request error="failed to verify access token: token has invalid claims: token is expir>
Dec 18 17:40:13 : 2024-12-18T17:40:13Z ERR failed to authenticate the request error="failed to verify access token: token has invalid claims: token is expir>
Dec 18 17:40:13 : fatal error: concurrent map writes
Dec 18 17:40:13 : 
Dec 18 17:40:13 : goroutine 7053 [running]:
Dec 18 17:40:13 : go-micro.dev/v4/registry/cache.(*cache).isValid(0xc0021158c0, {0xc016f470a0, 0x1, 0x18?}, {0xc008219ce0?, 0x0?, 0x5bc37cd2bd40?})
Dec 18 17:40:13 :         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:40:13 : go-micro.dev/v4/registry/cache.(*cache).get(0xc0021158c0, {0x5bc378aa4fef, 0x18})
Dec 18 17:40:13 :         go-micro.dev/[email protected]/registry/cache/cache.go:145 +0x136
Dec 18 17:40:13 : go-micro.dev/v4/registry/cache.(*cache).GetService(0x5bc378aa4fef?, {0x5bc378aa4fef?, 0x5bc37a310f90?}, {0x1?, 0x0?, 0xf27e04afc8fd965?})
Dec 18 17:40:13 :         go-micro.dev/[email protected]/registry/cache/cache.go:462 +0x18
Dec 18 17:40:13 : github.com/cs3org/reva/v2/pkg/rgrpc/todo/pool.(*Selector[...]).Next(0x5bc37acfb630, {0x0, 0x0, 0x682e6ff35b9cca4f})
Dec 18 17:40:13 :         github.com/cs3org/reva/[email protected]/pkg/rgrpc/todo/pool/selector.go:117 +0x2b4
Dec 18 17:40:13 : github.com/cs3org/reva/v2/internal/grpc/services/sharesstorageprovider.(*service).fetchAcceptedShares(0xc0098d75c0, {0x5bc37ad23ec0, 0xc00>
Dec 18 17:40:13 :         github.com/cs3org/reva/[email protected]/internal/grpc/services/sharesstorageprovider/sharesstorageprovider.go:1167 +0x72
Dec 18 17:40:13 : github.com/cs3org/reva/v2/internal/grpc/services/sharesstorageprovider.(*service).ListStorageSpaces(0xc0098d75c0, {0x5bc37ad23ec0, 0xc000e>
Dec 18 17:40:13 :         github.com/cs3org/reva/[email protected]/internal/grpc/services/sharesstorageprovider/sharesstorageprovider.go:411 +0x6ab
Dec 18 17:40:13 : github.com/cs3org/go-cs3apis/cs3/storage/provider/v1beta1._SpacesAPI_ListStorageSpaces_Handler.func1({0x5bc37ad23ec0?, 0xc000eeef60?}, {0x>
Dec 18 17:40:13 :         github.com/cs3org/[email protected]/cs3/storage/provider/v1beta1/spaces_api_grpc.pb.go:177 +0xcb
Dec 18 17:40:13 : github.com/cs3org/reva/v2/internal/grpc/interceptors/prometheus.interceptorFromConfig.func1({0x5bc37ad23ec0?, 0xc000eeef60?}, {0x5bc37aaac>
Dec 18 17:40:13 :         github.com/cs3org/reva/[email protected]/internal/grpc/interceptors/prometheus/prometheus.go:72 +0x3d
Dec 18 17:40:13 : github.com/cs3org/reva/v2/pkg/rgrpc.(*Server).getInterceptors.ChainUnaryServer.func8.1({0x5bc37ad23ec0?, 0xc000eeef60?}, {0x5bc37aaacb00?,>
Dec 18 17:40:13 :         github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
Dec 18 17:40:13 : github.com/cs3org/reva/v2/internal/grpc/interceptors/auth.NewUnary.func2({0x5bc37ad23ec0, 0xc000eee8a0}, {0x5bc37aaacb00, 0xc006210bd0}, 0>
Dec 18 17:40:13 :         github.com/cs3org/reva/[email protected]/internal/grpc/interceptors/auth/auth.go:160 +0x9ef
Dec 18 17:40:13 : github.com/cs3org/reva/v2/pkg/rgrpc.(*Server).getInterceptors.ChainUnaryServer.func8.1({0x5bc37ad23ec0?, 0xc000eee8a0?}, {0x5bc37aaacb00?,>
Dec 18 17:40:13 :         github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
Dec 18 17:40:13 : github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1({0x5bc37ad23ec0?, 0xc000eee8a0?}, {0x5bc37aaacb00?, 0xc>
Dec 18 17:40:13 :         github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:33 +0xb0

(the trace is very very long and contains all goroutines).

I also noticed some strange logs before (error happened in MultiHostReverseProxy error="dial tcp 127.0.0.1:33877: connect: connection refused).
I don't now if I can just ignore them, but just after the upgrade, when I tried to browse my files in the web ui, I was somethings getting "no resource founds" when opening my spaces. A simple refresh was enough to get my files backup (it happened several times). Is there any background migration task when upgrading from 5 to 7 ? Most likely, this isn't related to my main issue.

I also noticed a regular pattern, crash seems to occurs at regular intervals:

$ journalctl -u ocis --since "2 hours ago" |grep "concurrent" -A5
Dec 18 17:30:11 ocis[1214005]: fatal error: concurrent map writes
Dec 18 17:30:11 ocis[1214005]: 
Dec 18 17:30:11 ocis[1214005]: goroutine 35569 [running]:
Dec 18 17:30:11 ocis[1214005]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc0060fe120, {0xc00d159ed8, 0x1, 0x18?}, {0xc00c6e2e80?, 0x0?, 0x55e36608cd40?})
Dec 18 17:30:11 ocis[1214005]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:30:11 ocis[1214005]: go-micro.dev/v4/registry/cache.(*cache).get(0xc0060fe120, {0x55e361e05fef, 0x18})
--
Dec 18 17:35:12 ocis[1218239]: fatal error: concurrent map writes
Dec 18 17:35:12 ocis[1218239]: fatal error: concurrent map writes
Dec 18 17:35:12 ocis[1218239]: fatal error: concurrent map writes
Dec 18 17:35:12 ocis[1218239]: 
Dec 18 17:35:12 ocis[1218239]: goroutine 5497 [running]:
Dec 18 17:35:12 ocis[1218239]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc00931d560, {0xc00babd1b8, 0x1, 0x18?}, {0xc007ee0040?, 0x0?, 0x643d29b0bd40?})
Dec 18 17:35:12 ocis[1218239]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:35:12 ocis[1218239]: go-micro.dev/v4/registry/cache.(*cache).get(0xc00931d560, {0x643d25884fef, 0x18})
--
Dec 18 17:40:13 ocis[1220216]: fatal error: concurrent map writes
Dec 18 17:40:13 ocis[1220216]: 
Dec 18 17:40:13 ocis[1220216]: goroutine 7053 [running]:
Dec 18 17:40:13 ocis[1220216]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc0021158c0, {0xc016f470a0, 0x1, 0x18?}, {0xc008219ce0?, 0x0?, 0x5bc37cd2bd40?})
Dec 18 17:40:13 ocis[1220216]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:40:13 ocis[1220216]: go-micro.dev/v4/registry/cache.(*cache).get(0xc0021158c0, {0x5bc378aa4fef, 0x18})
--
Dec 18 17:49:14 ocis[1222037]: fatal error: concurrent map writes
Dec 18 17:49:14 ocis[1222037]: fatal error: concurrent map writes
Dec 18 17:49:14 ocis[1222037]: 
Dec 18 17:49:14 ocis[1222037]: goroutine 8954 [running]:
Dec 18 17:49:14 ocis[1222037]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc0092bb290, {0xc0093a7680, 0x1, 0x16?}, {0xc00b500300?, 0x0?, 0x639321992d40?})
Dec 18 17:49:14 ocis[1222037]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:49:14 ocis[1222037]: go-micro.dev/v4/registry/cache.(*cache).get(0xc0092bb290, {0x63931d7013b1, 0x16})
--
Dec 18 17:52:15 ocis[1224786]: fatal error: concurrent map writes
Dec 18 17:52:15 ocis[1224786]: 
Dec 18 17:52:15 ocis[1224786]: goroutine 4142 [running]:
Dec 18 17:52:15 ocis[1224786]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc0079938c0, {0xc013ee4348, 0x1, 0x16?}, {0xc0081626c0?, 0x0?, 0x61fc6abe1d40?})
Dec 18 17:52:15 ocis[1224786]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:52:15 ocis[1224786]: go-micro.dev/v4/registry/cache.(*cache).get(0xc0079938c0, {0x61fc669503b1, 0x16})
--
Dec 18 17:55:15 ocis[1225967]: fatal error: concurrent map writes
Dec 18 17:55:15 ocis[1225967]: 
Dec 18 17:55:15 ocis[1225967]: goroutine 2001 [running]:
Dec 18 17:55:15 ocis[1225967]: go-micro.dev/v4/registry/cache.(*cache).isValid(0xc007ca1440, {0xc0128c9f10, 0x1, 0x16?}, {0x64134f045f3b?, 0x0?, 0x641355c35d40?})
Dec 18 17:55:15 ocis[1225967]:         go-micro.dev/[email protected]/registry/cache/cache.go:103 +0x1aa
Dec 18 17:55:15 ocis[1225967]: go-micro.dev/v4/registry/cache.(*cache).get(0xc007ca1440, {0x6413519a4409, 0x16})

Steps to reproduce

I haven't tried to reproduce the whole path but here is what I've done

  1. a running ocis 5 instance (+ keycloak)
  2. upgrading to ocis 7 following the migration manual
  3. now, ocis is crashing every ~10 minutes.

Expected behavior

ocis shall not crash without any reasons.

Actual behavior

ocis restarting every few minutes.

Setup

I'm upgrading from the latest ocis 5 to ocis 7 with keycloak SSO + podman (instead of docker).
My setup is very similar to https://owncloud.dev/ocis/deployment/ocis_keycloak/
Regarding data, I have 2 main spaces with ~95Gb and ~25Gb of data ( + a few other small spaces)

@mcarbonne
Copy link
Author

mcarbonne commented Dec 18, 2024

I've done some research. The issue seems to be inside go-micro:


func (c *cache) get(service string) ([]*registry.Service, error) {
	// read lock
	c.RLock()

	// check the cache first
	services := c.cache[service]
	// get cache ttl
	ttl := c.ttls[service]
	// make a copy
	cp := util.Copy(services)

	// got services, nodes && within ttl so return cache
	if c.isValid(cp, ttl) {
		c.RUnlock()
		// return services
		return cp, nil
	}

A read-only lock is taken, but inside isValid, a delete (so a write) might occur: delete(c.nttls, s.Name).

EDIT: I opened an issue micro/go-micro#2743

@micbar micbar added the Priority:p2-high Escalation, on top of current planning, release blocker label Dec 18, 2024
@micbar micbar moved this from Qualification to Prio 2 in Infinite Scale Team Board Dec 18, 2024
@kobergj kobergj moved this from Prio 2 to In progress in Infinite Scale Team Board Dec 19, 2024
@kobergj
Copy link
Collaborator

kobergj commented Dec 19, 2024

@mcarbonne thanks for finding this! We are working on a fix in go-micro. Meanwhile, if you are not using a distributed environment you could try using inmemory registry MICRO_REGISTRY=memory

@mcarbonne
Copy link
Author

@mcarbonne thanks for finding this! We are working on a fix in go-micro. Meanwhile, if you are not using a distributed environment you could try using inmemory registry MICRO_REGISTRY=memory

I can confirm that this workaround is effective => no crash after 1 hour and no more strange logs (the one with connection refused)

@2403905
Copy link
Contributor

2403905 commented Dec 20, 2024

@mcarbonne Could you please give me more context about your case?
The problem is clear but I wouldn't like to fix it blindly. Unfortunately, I can't reproduce.
Are any clients (desktop to mobile) pointed to your instance?
How many web clients could use the instance concurrently?

Hoe did I test
I added several spaces and ran several culr requesters in a loop

IDM_ADMIN_PASSWORD=admin \
IDM_CREATE_DEMO_USERS=true \
PROXY_ENABLE_BASIC_AUTH=true \
OCIS_INSECURE=true ./ocis/bin/ocis server
for i in {1..10000}
do
curl -XGET 'https://localhost:9200/graph/v1beta1/me/drives?%24orderby=name+asc&%24filter=driveType+eq+project' -k -uadmin:admin
curl -XGET 'https://localhost:9200/graph/v1beta1/me/drives?%24orderby=name+asc&%24filter=driveType+eq+project' -k -umoss:vista

curl -XGET 'https://localhost:9200/graph/v1beta1/me/drive/sharedWithMe' -k -uadmin:admin
curl -XGET 'https://localhost:9200/graph/v1beta1/me/drive/sharedWithMe' -k -umoss:vista

done

@mcarbonne
Copy link
Author

I have multiple clients (android, desktop app, Web).
Maybe try to parallelize your curl commands. As this is clearly a thread safety issue, sequential curls will most likely not trigger the bug.

@2403905
Copy link
Contributor

2403905 commented Dec 23, 2024

Here is an optimistic fix but it sill should be validated 2403905/go-micro@6307b41

@2403905 2403905 moved this from In progress to blocked in Infinite Scale Team Board Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority:p2-high Escalation, on top of current planning, release blocker Type:Bug
Projects
Status: blocked
Development

No branches or pull requests

4 participants