Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential for an open connection leak #8654

Closed
shashankmehra opened this issue Apr 1, 2020 · 18 comments
Closed

Potential for an open connection leak #8654

shashankmehra opened this issue Apr 1, 2020 · 18 comments
Labels
bug Used to indicate a potential bug core/client

Comments

@shashankmehra
Copy link

I was trying to analyze memory usage of my application when I noticed the following pprof output:

list RawRequestWithContext
Total: 180.79MB
ROUTINE ======================== github.com/hashicorp/vault/api.(*Client).RawRequestWithContext in github.com/hashicorp/vault/api/client.go
    1.12MB    42.63MB (flat, cum) 23.58% of Total
         .          .    795:	}
         .          .    796:
         .          .    797:	var result *Response
         .          .    798:	resp, err := client.Do(req)
         .          .    799:	if resp != nil {
         .     4.45MB    800:		result = &Response{Response: resp}
         .          .    801:	}
         .          .    802:	if err != nil {
         .          .    803:		if strings.Contains(err.Error(), "tls: oversized") {
         .          .    804:			err = errwrap.Wrapf(
         .          .    805:				"{{err}}\n\n"+
         .          .    806:					"This error usually means that the server is running with TLS disabled\n"+
         .          .    807:					"but the client is configured to use TLS. Please either enable TLS\n"+
         .          .    808:					"on the server or run the client with -address set to an address\n"+
         .          .    809:					"that uses the http protocol:\n\n"+
         .          .    810:					"    vault <command> -address http://<address>\n\n"+
         .          .    811:					"You can also set the VAULT_ADDR environment variable:\n\n\n"+
         .          .    812:					"    VAULT_ADDR=http://<address> vault <command>\n\n"+
         .          .    813:					"where <address> is replaced by the actual address to the server.",
         .          .    814:				err)
         .          .    815:		}
         .          .    816:		return result, err
         .          .    817:	}
         .          .    818:
         .          .    819:	// Check for a redirect, only allowing for a single redirect
         .          .    820:	if (resp.StatusCode == 301 || resp.StatusCode == 302 || resp.StatusCode == 307) && redirectCount == 0 {
         .     4.32MB    821:		// Parse the updated location
         .          .    822:		respLoc, err := resp.Location()
         .     3.43MB    823:		if err != nil {
         .          .    824:			return result, err
         .          .    825:		}
         .          .    826:
         .          .    827:		// Ensure a protocol downgrade doesn't happen
         .          .    828:		if req.URL.Scheme == "https" && respLoc.Scheme != "https" {
         .          .    829:			return result, fmt.Errorf("redirect would cause protocol downgrade")
         .          .    830:		}
         .          .    831:
         .          .    832:		// Update the request
         .          .    833:		r.URL = respLoc
         .          .    834:
         .          .    835:		// Reset the request body if any
         .          .    836:		if err := r.ResetJSONBody(); err != nil {
         .          .    837:			return result, err
         .          .    838:		}
         .          .    839:
    1.12MB     1.12MB    840:		// Retry the request
         .          .    841:		redirectCount++
         .          .    842:		goto START
         .          .    843:	}
         .    29.31MB    844:
         .          .    845:	if err := result.Error(); err != nil {
         .          .    846:		return result, err
         .          .    847:	}
         .          .    848:
         .          .    849:	return result, nil

I am trying to understand why 29.31MB was accumulated at line 844. My hypothesis is that at line 846 https://github.com/hashicorp/vault/blob/master/api/client.go#L846 in function RawRequestWithContext a response object is created whose body is to be closed by the caller of this function.

However consider the case of a redirect where at line 890 the statement goto START is executed. The line 846 will be executed again creating another response object. The previously created response object's body is never closed causing a connection and memory leak for every redirect.

@tyrannosaurus-becks tyrannosaurus-becks added bug Used to indicate a potential bug waiting-for-response labels Apr 1, 2020
@tyrannosaurus-becks
Copy link
Contributor

Hi! Thanks for the report! Can you provide further steps to reproduce this? Information such as what Vault version you're on, what API calls you're making, and any further information that might help us reproduce this locally on our own machines. Also, are you seeing ongoing growth in memory usage, or does memory tend to stabilize like this?

@lawliet89
Copy link
Contributor

lawliet89 commented May 4, 2020

I am observing the same memory leak on my Vault 1.4.0 server running on Kubernetes.

You can see the memory usage graph below:

image

It might be useful to note that this replica became the active server some time on April 22.

I have a bunch of secret engines mounted:

  • GCP
  • AWS
  • KV
  • Postgres

And the following auth mounts:

  • Kubernetes
  • LDAP

I am also having Prometheus constantly scrape Vault server for metrics.

Is there anything more I can do to provide more information?

The pod will probably OOM soon and be killed by Kubernetes.

@kalafut
Copy link
Contributor

kalafut commented May 4, 2020

@lawliet89 You can run Vault debug (https://www.vaultproject.io/docs/commands/debug) which will, among other things, capture heap usage stats. We can help analyze these logs (they may be emailed or otherwise shared outside of GitHub).

@lawliet89
Copy link
Contributor

Thanks @kalafut

May I know who I should email the logs to?

@kalafut
Copy link
Contributor

kalafut commented May 4, 2020

@lawliet89 You can send them to me (email is in Vault's commits).

@kalafut
Copy link
Contributor

kalafut commented May 4, 2020

@lawliet89 Thanks for sending those. There is almost 1GB in use by the write buffer for GCS storage (in a function in the GCS sdk). I'm not familiar enough with the GCS storage details to know why, however. I do wonder what effect (if any) https://www.vaultproject.io/docs/configuration/storage/google-cloud-storage#chunk_size would have.

image

@lawliet89
Copy link
Contributor

lawliet89 commented May 4, 2020

@kalafut

I haven't really looked into it. but maybe this can be fixed with a bump to the Google Cloud SDK.

An update was recently reverted in ea1564b due to etcd-io/etcd#11154 of which the fix should be released shortly.

I will try halving the chunk size to see if it makes any difference.

FWIW, I'm in the process of migrating to Raft storage. This might not be a fix for future readers, however.

@lawliet89
Copy link
Contributor

lawliet89 commented May 6, 2020

image

Maybe this is too early to say, but halving the chunk size seems to have fixed the memory leak issue. I will report back if this is not the case next week.

EDIT: One whole week of data. Looks good.

image

@kalafut
Copy link
Contributor

kalafut commented May 6, 2020

Thanks, that is a useful datapoint.

@sidewinder12s
Copy link

I may also be seeing similar memory usage growth, though we've been running into memory exhaustion issues generally that make the graphs not super clear. Though I am on version 1.1.0.

@lawliet89 I assume you halved it to 4096?

@lawliet89
Copy link
Contributor

@sidewinder12s Yes.

@rokrokss
Copy link

스크린샷 2020-08-13 오후 10 01 51

https://github.com/hashicorp/vault/blob/master/api/logical.go#L132

experienced the same issue and it was created by the write api calls. read didn't make this.

@sidewinder12s
Copy link

For other folks, I suspect this issue may be compounding any performance issues I'm having with GCS: #8761

I've got 1000s of new nodes doing GCP logins daily that probably are not helping my GCS performance.

@elsesiy
Copy link
Contributor

elsesiy commented Nov 3, 2020

@kalafut I'm investigating a somewhat related memory leak and noticed the following using vault 1.5.4:

(pprof) list wrapGenericHandler
Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:		maxRequestDuration = props.ListenerConfig.MaxRequestDuration
         .          .    268:		maxRequestSize = props.ListenerConfig.MaxRequestSize
         .          .    269:	}
         .          .    270:	if maxRequestDuration == 0 {
         .          .    271:		maxRequestDuration = vault.DefaultMaxRequestDuration
         .   396.62MB    272:	}
         .          .    273:	if maxRequestSize == 0 {
         .          .    274:		maxRequestSize = DefaultMaxRequestSize
         .          .    275:	}
         .          .    276:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    277:		// Set the Cache-Control header for all the responses returned
         .          .    278:		// by Vault
         .          .    279:		w.Header().Set("Cache-Control", "no-store")
         .          .    280:
         .   184.52MB    281:		// Start with the request context
         .          .    282:		ctx := r.Context()
         .          .    283:		var cancelFunc context.CancelFunc
         .          .    284:		// Add our timeout, but not for the monitor endpoint, as it's streaming
       8MB       50MB    285:		if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    286:			ctx, cancelFunc = context.WithCancel(ctx)
      19MB       60MB    287:		} else {
         .   512.12kB    288:			ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .   407.57MB    289:		}
         .          .    290:		// Add a size limiter if desired
         .          .    291:		if maxRequestSize > 0 {
         .          .    292:			ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    293:		}
         .          .    294:		ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .          .    295:		r = r.WithContext(ctx)
         .          .    296:		r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    297:
         .          .    298:		switch {
         .          .    299:		case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    300:			newR, status := adjustRequest(core, r)
         .          .    301:			if status != 0 {
         .          .    302:				respondError(w, status, nil)
         .          .    303:				cancelFunc()
         .          .    304:				return
         .          .    305:			}
         .          .    306:			r = newR
         .          .    307:
         .     8.09GB    308:		case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    309:		default:
         .          .    310:			respondError(w, http.StatusNotFound, nil)
         .          .    311:			cancelFunc()
         .          .    312:			return
         .          .    313:		}

Seems like the string matching is very expensive here

@calvn
Copy link
Contributor

calvn commented Nov 4, 2020

     .   184.52MB    281:		// Start with the request context

The line references on mem usage like the one above seem to be off, can you do pprof with the -source_path flag and point it to the local vault repository location with the commit for 1.5.4 checked out (should be 1a730771ec70149293efe91e1d283b10d255c6d1)?

@elsesiy
Copy link
Contributor

elsesiy commented Nov 5, 2020

@calvn my bad, I was on master.

Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:		maxRequestSize = DefaultMaxRequestSize
         .          .    268:	}
         .          .    269:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    270:		// Set the Cache-Control header for all the responses returned
         .          .    271:		// by Vault
         .   396.62MB    272:		w.Header().Set("Cache-Control", "no-store")
         .          .    273:
         .          .    274:		// Start with the request context
         .          .    275:		ctx := r.Context()
         .          .    276:		var cancelFunc context.CancelFunc
         .          .    277:		// Add our timeout, but not for the monitor endpoint, as it's streaming
         .          .    278:		if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    279:			ctx, cancelFunc = context.WithCancel(ctx)
         .          .    280:		} else {
         .   184.52MB    281:			ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .          .    282:		}
         .          .    283:		// Add a size limiter if desired
         .          .    284:		if maxRequestSize > 0 {
       8MB       50MB    285:			ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    286:		}
      19MB       60MB    287:		ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .   512.12kB    288:		r = r.WithContext(ctx)
         .   407.57MB    289:		r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    290:
         .          .    291:		switch {
         .          .    292:		case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    293:			newR, status := adjustRequest(core, r)
         .          .    294:			if status != 0 {
         .          .    295:				respondError(w, status, nil)
         .          .    296:				cancelFunc()
         .          .    297:				return
         .          .    298:			}
         .          .    299:			r = newR
         .          .    300:
         .          .    301:		case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    302:		default:
         .          .    303:			respondError(w, http.StatusNotFound, nil)
         .          .    304:			cancelFunc()
         .          .    305:			return
         .          .    306:		}
         .          .    307:
         .     8.09GB    308:		h.ServeHTTP(w, r)
         .          .    309:
         .          .    310:		cancelFunc()
         .          .    311:		return
         .          .    312:	})
         .          .    313:}

I'm happy to provide more debugging context and info on the issue, can I email you directly?

@calvn
Copy link
Contributor

calvn commented Nov 5, 2020

@elsesiy sure thing! You can send it to calvin{at}hashicorp{dot}com.

@ncabatoff
Copy link
Collaborator

Looks like there were a couple of different investigations going on here. I'm not sure if there were any resolutions or not, but it's been a long time and I suspect people have moved on. If people think they're seeing connection or memory leaks in current Vault versions, please open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug core/client
Projects
None yet
Development

No branches or pull requests

10 participants