Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: optimize guid generation #663

Merged
merged 1 commit into from
Oct 1, 2015
Merged

Conversation

ploxiln
Copy link
Member

@ploxiln ploxiln commented Sep 29, 2015

Humble idea inspired by #658

Current-nanosecond-time and 64-bit division operations
appear to be slow on some ARM systems, so minimize them:

  • share time.Time from each loop of idPump() with NewGUID()
  • avoid adding a second to lastError on each iteration,
    by making it nextErrorLog instead
  • replace division by 1000000 with a bitshift
    equivalent to division by 1048576

My benchmarks on darwin / x86_64 laptop (which will depend on temperature and other factors of course) show a negligible improvement, but it will be interesting to see the results for an ARM system (I have a raspberry pi I could dig out and try, maybe later).

before:

PUB: [bench_writer] 2015/09/29 19:52:57 duration: 10.000785444s - 42.458mb/s - 222602.516ops/s - 4.492us/op
SUB: [bench_reader] 2015/09/29 19:53:07 duration: 10.01740169s - 21.631mb/s - 113407.751ops/s - 8.818us/op

after:

PUB: [bench_writer] 2015/09/29 19:41:31 duration: 10.001047234s - 43.300mb/s - 227016.226ops/s - 4.405us/op
SUB: [bench_reader] 2015/09/29 19:41:41 duration: 10.023406992s - 23.684mb/s - 124172.051ops/s - 8.053us/op

@ploxiln ploxiln mentioned this pull request Sep 29, 2015
@ploxiln
Copy link
Member Author

ploxiln commented Sep 30, 2015

btw test failure is just go1.4.2 on i386, TestEphemeralTopicsAndChannels got non-zero number of topics at the end

@ploxiln
Copy link
Member Author

ploxiln commented Sep 30, 2015

OK, I feel pretty silly now for not seemingly not reading idPump() - this won't make much of a difference if there isn't a constant stream of errors from NewGUID() (well, except for the int64 division trick)

@mreiferson
Copy link
Member

@ploxiln any of this worth landing? I'm a little concerned about micro-optimizations like bit-shift-for-division, but maybe that one?

@ploxiln
Copy link
Member Author

ploxiln commented Sep 30, 2015

I'm not sure this should be merged as-is, I expect we'll want to tweak it. We may not need the subjective uglyness of the additional argument to NewGUID() - it seems that errors shouldn't be that common ...

I like the shift optimization - it makes sense that it could be noticeably helpful on ARM, I'll check.

@mreiferson
Copy link
Member

Let's drop the time stuff and leave the shift optimization and land it.

@mreiferson
Copy link
Member

Thanks @ploxiln

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

I've reduced this PR to just the shift optimization. But I haven't tested it on arm yet, and I don't think there's any rush.

I'm in the process of getting stuff set up on a raspberrypi 2 which has 4 Cortex A7 cores, which are lower-power and in-order rather than out-of-order but from the same generation as the Cortex A15. And it's also an interesting target in any case.

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

It's probably worth mentioning that the time-part of the id will jump back 18577 hours. I don't think there's much chance of overlap; that's about as long as nsq has existed.

func main() {
    ns := time.Now().UnixNano()
    ms := ns / 1e6
    pms := ns >> 20
    fmt.Printf("milliseconds = %d\n", ms);
    fmt.Printf("pseudo-milliseconds = %d\n", pms);
    fmt.Printf("difference = %d\n", ms - pms);
    fmt.Printf("hours = %d\n", (ms - pms) / (1000 * 3600));
}
[pierce@plo-pro current]$ go run test.go
milliseconds = 1443665428226
pseudo-milliseconds = 1376786640383
difference = 66878787843
hours = 18577

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

(maybe I should adjust twepoch?)

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

Trying to use the ubuntu-packaged go1.3.3 on arm, the build fails with "undefined: atomic.Value". Accidental compatibility slip, or just an arm limitation?

Anyway, continuing with testing with a binary build of go 1.4.2 for arm I got from http://dave.cheney.net/unofficial-arm-tarballs (seems legit, I even think I've heard of this guy...). I may be able to build and test go 1.5.1 but that won't be until late tomorrow if I do.

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

Did have a noticeable improvement for pub:

Switched to branch 'master'
# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
PUB: [bench_writer] 2015/10/01 01:25:05 duration: 10.004711496s - 5.838mb/s - 30605.580ops/s - 32.674us/op
SUB: [bench_reader] 2015/10/01 01:25:15 duration: 10.016315311s - 1.698mb/s - 8900.379ops/s - 112.355us/op
Switched to branch 'time_div_optim'
# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
PUB: [bench_writer] 2015/10/01 01:26:25 duration: 10.004926446s - 6.470mb/s - 33923.288ops/s - 29.478us/op
SUB: [bench_reader] 2015/10/01 01:26:35 duration: 10.019001199s - 1.698mb/s - 8904.979ops/s - 112.297us/op
Switched to branch 'master'
# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
PUB: [bench_writer] 2015/10/01 01:27:31 duration: 10.004017781s - 5.941mb/s - 31147.486ops/s - 32.105us/op
SUB: [bench_reader] 2015/10/01 01:27:41 duration: 10.025512223s - 1.694mb/s - 8879.945ops/s - 112.613us/op
Switched to branch 'time_div_optim'
# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
PUB: [bench_writer] 2015/10/01 01:28:36 duration: 10.004349052s - 6.558mb/s - 34385.046ops/s - 29.082us/op
SUB: [bench_reader] 2015/10/01 01:28:46 duration: 10.015671236s - 1.712mb/s - 8977.831ops/s - 111.385us/op

Also: many ARM SoC have multiple cores, and GOMAXPROCS=2 almost eliminates the difference this branch makes, for this simple bench.sh (pretty consistent pub perf of 10.5 mb/s -> 10.8 mb/s)

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

There were a few options for twepoch, I decided to keep it almost exactly the same but update the comment.

I've done as much half-hearted ARMv7 benchmarking as I'm inclined to do. fwiw

64-bit division operations appear to be slow on some
ARM systems, so replace division by 1000000 with a
bitshift equivalent to division by 1048576.

adapt guid twepoch comment for revised guid timestamp scheme
@mreiferson
Copy link
Member

@ploxiln correct me if I'm wrong, but won't this also change the theoretical number of unique IDs we're able to generate per second?

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

Yes, there will be 5% fewer unique IDs per second.

>>> (10**9 / 2**20) * 1.0 / 1e3 
0.953

@ploxiln
Copy link
Member Author

ploxiln commented Oct 1, 2015

So, it reduces the theoretical max messages per workerID per second from 4,096,000 to 3,903,488

mreiferson added a commit that referenced this pull request Oct 1, 2015
nsqd: optimize guid generation
@mreiferson mreiferson merged commit ba7c4b4 into nsqio:master Oct 1, 2015
@ploxiln ploxiln deleted the time_div_optim branch April 22, 2016 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants