Receive: poor memory efficiency when (un)marshaling data #5751

Open

fpetkovski opened this issue Oct 3, 2022 · 11 comments

@fpetkovski
Contributor

fpetkovski commented Oct 3, 2022

Is your proposal related to a problem?

I took a profile of a bloated Thanos Receiver and noticed that a huge amount of memory is being allocated just for marshaling and unmarshaling data. This is a big surprise, as I would expect the majority of the memory usage to come from TSDBs.

Describe the solution you'd like

We should investigate why marshaling data has such a significant impact on memory.

Describe alternatives you've considered

N/A

Additional context

Here is a link to the profile: https://pprof.me/ccc9a28/

[Screenshots of the memory profile]

@bwplotka
Member

bwplotka commented Oct 4, 2022

Agreed, thanks! 💪🏽 I think we are seeing something similar on our side. @matej-g @philipgough

@douglascamata
Contributor

@fpetkovski: Why would TSDBs allocate more memory than marshalling? Aren't TSDBs just pointing to memory previously allocated by the marshalling? 🤔

@fpetkovski
Contributor Author

TSDBs hold memory for a longer time period, and I assume we don't marshal everything from them during series requests.

@matej-g
Collaborator

matej-g commented Oct 5, 2022

I think we've seen similar numbers in our profiles. For us, the biggest issue was the switch to snappy compression in 0.28.0, as I mentioned in #5575 (comment); after we switched back to no compression, we got back to numbers similar to 0.27.0. @philipgough has now run a couple more tests comparing with / without compression, and the issue seems to be especially pronounced on instances where we ingest big requests. @douglascamata @philipgough feel free to add more from your observations.

But speaking solely about (un)marshalling, yes, that could be the next thing to look at, although I wonder where / what to start with 🤔.
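
For illustration, here is a rough sketch of why per-message snappy decoding is allocation-heavy and how a pooled buffer could amortize it. This is hypothetical code using github.com/golang/snappy, not the actual Thanos/gRPC compression path:

package main

import (
	"sync"

	"github.com/golang/snappy"
)

// decodeBufs reuses decompression buffers between messages so that every
// incoming frame does not allocate a fresh slice of the full decoded size.
var decodeBufs = sync.Pool{
	New: func() interface{} { return []byte(nil) },
}

// decodePooled decompresses a snappy block into a pooled buffer and returns a
// release function to call once the decoded bytes are no longer referenced
// (e.g. after the protobuf message has been unmarshaled and copied out).
func decodePooled(src []byte) ([]byte, func(), error) {
	n, err := snappy.DecodedLen(src)
	if err != nil {
		return nil, nil, err
	}
	buf, _ := decodeBufs.Get().([]byte)
	if cap(buf) < n {
		buf = make([]byte, n)
	}
	dst, err := snappy.Decode(buf[:n], src)
	if err != nil {
		return nil, nil, err
	}
	release := func() { decodeBufs.Put(dst[:0]) }
	return dst, release, nil
}

func main() {
	compressed := snappy.Encode(nil, []byte("example remote-write payload"))
	decoded, release, err := decodePooled(compressed)
	if err != nil {
		panic(err)
	}
	_ = decoded // the protobuf unmarshal would happen here
	release()
}

Without pooling, each large request pins a decoded buffer of its full size until the GC catches up, which would be consistent with the issue being more pronounced for big requests.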

@douglascamata
Contributor

One of the questions I have in mind at the moment is: are marshalling's high in-use memory numbers due to a few huge gRPC messages? Does it get better in a scenario with more numerous but smaller messages?

@matej-g
Collaborator

matej-g commented Oct 6, 2022

Another related thought: I'm wondering what kind of improvement we'd see if / when we move to the Vitess framework (#4557).

@fpetkovski
Contributor Author

I was going to suggest the same; we might be able to start pooling requests once that is merged.
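
For illustration, a rough sketch of what request pooling could look like. The types below are simplified stand-ins for the actual protobuf messages, since the real implementation would depend on the generated code supporting message reuse:

package main

import (
	"fmt"
	"sync"
)

// writeRequest is a simplified stand-in for the message Receive unmarshals on
// every remote-write call (hypothetical type, for illustration only).
type writeRequest struct {
	Timeseries []timeseries
}

type timeseries struct {
	Labels  []label
	Samples []sample
}

type label struct{ Name, Value string }

type sample struct {
	Value     float64
	Timestamp int64
}

// reqPool keeps request objects (and the slices backing them) alive between
// calls, so the unmarshal path can reuse capacity instead of allocating a
// fresh message per gRPC request.
var reqPool = sync.Pool{
	New: func() interface{} { return &writeRequest{} },
}

func getRequest() *writeRequest { return reqPool.Get().(*writeRequest) }

// putRequest truncates the slice but keeps its capacity for the next call.
func putRequest(r *writeRequest) {
	r.Timeseries = r.Timeseries[:0]
	reqPool.Put(r)
}

func main() {
	r := getRequest()
	r.Timeseries = append(r.Timeseries, timeseries{
		Labels:  []label{{Name: "__name__", Value: "up"}},
		Samples: []sample{{Value: 1, Timestamp: 1664755200000}},
	})
	fmt.Println("series in request:", len(r.Timeseries))
	putRequest(r) // the next request can reuse the Timeseries capacity
}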

@fpetkovski
Contributor Author

Could be related to klauspost/compress#442

@juanrh
Contributor

juanrh commented Oct 14, 2022

Hi,

I'm also seeing this in a prod cluster. In a memory profile taken just before the OOM, the top for -sample_index=inuse_space shows 97% of the memory allocated in:

google.golang.org/grpc.(*parser).recvMsg
github.com/thanos-io/thanos/pkg/store/storepb.(*Chunk).Unmarshal
github.com/thanos-io/thanos/pkg/store/storepb.(*Series).Unmarshal
github.com/thanos-io/thanos/pkg/query.removeExactDuplicates
github.com/thanos-io/thanos/pkg/store/storepb.(*AggrChunk).Unmarshal

so most of that is gRPC deserialization, but query.removeExactDuplicates is also allocating 1.94GB (flat 12.53%) for ret := make([]storepb.AggrChunk, 0, len(chks)). From what I see, removeExactDuplicates is only called as s.currChunks = removeExactDuplicates(s.currChunks) in func (s *promSeriesSet) Next() bool, so a complementary optimization could be removing the duplicates in place, for example as follows:

$ git diff
diff --git a/pkg/query/iter.go b/pkg/query/iter.go
index 6e9e051fe..2b49e6174 100644
--- a/pkg/query/iter.go
+++ b/pkg/query/iter.go
@@ -77,17 +77,14 @@ func removeExactDuplicates(chks []storepb.AggrChunk) []storepb.AggrChunk {
        if len(chks) <= 1 {
                return chks
        }
-
-       ret := make([]storepb.AggrChunk, 0, len(chks))
-       ret = append(ret, chks[0])
-
+       i := 0
        for _, c := range chks[1:] {
-               if ret[len(ret)-1].Compare(c) == 0 {
-                       continue
+               if chks[i].Compare(c) != 0 {
+                       i++
+                       chks[i] = c
                }
-               ret = append(ret, c)
        }
-       return ret
+       return chks[:i+1]
 }
 
 func (s *promSeriesSet) At() storage.Series {

What do you think?

@matej-g
Collaborator

matej-g commented Oct 17, 2022

@juanrh definitely! Could you open a PR (even if a draft) with the proposed changes? We could discuss it in full there and perhaps look at some benchmarks.
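
For example, a rough benchmark sketch that such a PR could start from, placed next to the helper in pkg/query (the AggrChunk field names are assumed from the proto definition):

package query

import (
	"testing"

	"github.com/thanos-io/thanos/pkg/store/storepb"
)

// BenchmarkRemoveExactDuplicates exercises the dedup helper on a slice of
// mostly unique chunks and reports allocations, so the allocating and the
// in-place variants can be compared with benchstat.
func BenchmarkRemoveExactDuplicates(b *testing.B) {
	base := make([]storepb.AggrChunk, 1000)
	for i := range base {
		base[i].MinTime = int64(i)
		base[i].MaxTime = int64(i + 1)
	}

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Copy the input each iteration so an in-place implementation does not
		// mutate the data used by the next iteration.
		in := append([]storepb.AggrChunk(nil), base...)
		_ = removeExactDuplicates(in)
	}
}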

@juanrh
Contributor

juanrh commented Oct 17, 2022

@matej-g I just sent #5795 for that
