Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do a trace and pprof run on Gaia #205

Open
Skarlso opened this issue Aug 25, 2019 · 19 comments
Open

Do a trace and pprof run on Gaia #205

Skarlso opened this issue Aug 25, 2019 · 19 comments
Assignees

Comments

@Skarlso
Copy link
Member

Skarlso commented Aug 25, 2019

No description provided.

@Skarlso Skarlso changed the title It's high time. Do a trace and pprof run on Gaia Do a trace and pprof run on Gaia Aug 25, 2019
@Skarlso
Copy link
Member Author

Skarlso commented Aug 25, 2019

Actually, I know there are more immediate issues out there which need attention, but man I ran into a race condition with the tests again today. So I'm a bit afraid that we should optimise the routines a bit.

@michelvocks
Copy link
Member

Yeah, sounds like a good idea. Especially the scheduler is critical and could potentially have some leaking routines.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 3, 2019

Aye. I agree. I'll get into it, once the worker rebuild is finished. I'm really eager to trace things. :)

@Skarlso Skarlso self-assigned this Sep 5, 2019
@Skarlso
Copy link
Member Author

Skarlso commented Sep 7, 2019

Alright, found my first data race already. Going to work to fix this one first as it's the easiest to find and reproduce.

==================
WARNING: DATA RACE
Read at 0x00c0003a8768 by goroutine 97:
  bufio.(*Writer).Flush()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:590 +0x79
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).FlushLogs()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:322 +0x50
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler.func1()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:556 +0x178

Previous write at 0x00c0003a8768 by goroutine 26:
  bufio.(*Writer).Write()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:638 +0x415
  github.com/hashicorp/go-plugin.(*Client).logStderr()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:759 +0x321

Goroutine 97 (running) created at:
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:551 +0x393
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduledJobs()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:504 +0xd7
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:222 +0x11c8
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0

Goroutine 26 (running) created at:
  github.com/hashicorp/go-plugin.(*Client).Start()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:554 +0x1723
  github.com/hashicorp/go-plugin.(*Client).Client()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:296 +0x63
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).Init()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:169 +0xbac
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:206 +0xbf2
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0
==================

@Skarlso
Copy link
Member Author

Skarlso commented Sep 8, 2019

@michelvocks At the top of your head, do you know the reason for implementing Plugin with pointers? I saw no reason in particular. Plugins aren't shared as far as I can see.

The Flush operation is causing a race condition through the scheduler. Apparently more than one routine at the same time can't flush. This is the most obvious race in Gaia.

Using values would fix it since plugins wouldn't be shared through out gaia. I don't know yet what that would mean in terms of memory usage though. I haven't tested it yet. I plan on doing so. :)

@michelvocks
Copy link
Member

@Skarlso Yes. One plugin instance is always one plugin run. We need to have an instance since we need to run multiple jobs synchronously via this instance. If we don't have a plugin instance, we would have to start the plugin for every pipeline job which is very insufficient.

Additionally, a user potentially wants to share data/information between jobs within a pipeline run. If we create a new plugin instance for every job, you couldn't share in-memory data between jobs and you are literally forced to write information to disk or to a remote cache/database which is more work.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019

That could be achieved by giving back a modified version of the struct, instead of passing it around as pointers, not?

@michelvocks
Copy link
Member

I don't know how but maybe we are talking past each other. Could you elaborate a bit more on the problem/solution?

In my opinion, we could easily add a mutex to the Plugin struct and add a lock to the FlushLogs method?

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019 via email

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019

That said, it can occur that a mutex IS the solution. But I would like to make sure, I'm not just patching a sinking boat. :)

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019

I'm gonna show you in a draft pr what I meant and like'd your opinion on it. :)

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019

Ah, yes. It's way too much work and ends up being ugly. Mutex it is I guess. :)

@Skarlso
Copy link
Member Author

Skarlso commented Sep 9, 2019

Hah no, I succeeded, but get this. Because of other things, like Job and such, I still got a data race at flush! But this is a good thing. Because now I verified that indeed a mutex is a good approach at this point. :) Awesome.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

Get this! :) Just so I could ease my mind, I got further and re-wrote a large chunk of passing around the plugin to values instead of pointers. And when there is a mutation, I pass the mutated value back instead of the pointer semantics. I got rid of the race. :) However, I don't like the resulting code that much. It's not intuitive enough and sometime it passes it back sometimes it doesn't. So I'm not going to go with that. Not to mention that I don't have data about the memory allocation increase because of the copy. Aw well. 🤷‍♂

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

Huh, so guess what... The mutex didn't solve the data race.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

Oh I know why.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

Well this is problematic. Gaia is racing with go-plugin's logStderr. c.config.Stderr.Write([]byte(line)) since we have stderr replaced with our buffer. We can't just lock the flush. The flush is racing with go-plugin. Eh.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

So the flush happens so that the logs are displayed in the log window, if I see this correctly. But the plugin is also trying to write / read to the buffer we are providing it. This is problematic, because even if we lock on our end the plugin will not lock of course, because it's using a simple Write. Which means a data race on the buffer.

@Skarlso
Copy link
Member Author

Skarlso commented Sep 10, 2019

Okay, so the solution must be that we implement our own writer using an underlying buffer and io writer. On which the Write is basically a bufio.Write but with a lock in it. And we then pass that to go-plugin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants