Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add unique_rowid_nonserial() builtin #7186

Closed
vivekmenezes opened this issue Jun 13, 2016 · 13 comments
Closed

sql: add unique_rowid_nonserial() builtin #7186

vivekmenezes opened this issue Jun 13, 2016 · 13 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Milestone

Comments

@vivekmenezes
Copy link
Contributor

unique_rowid() creates values in the sequential order of time. This can lead to a write hotspot on writes on a table. We should provide another builtin unique_rowid_nonserial() that spreads new unique values around. Placing the nodeid in the higher order bits and the timestamp in the lower order bits is a solution--albeit not a very good one--because it spreads new values across a few buckets (number of nodes),

@danhhz
Copy link
Contributor

danhhz commented Jun 13, 2016

alternate proposal: implement this function by hashing the output of unique_rowid

@bdarnell
Copy link
Contributor

Why not just random?

@danhhz
Copy link
Contributor

danhhz commented Jun 13, 2016

Right, of course. +1 for random

@maddyblue
Copy link
Contributor

Random makes sense to me.

@tbg
Copy link
Member

tbg commented Jun 13, 2016

On a somewhat related note, how do we handle the situation in which there is "morally" a unique primary key but ordering is explicitly not desired? For example, ingesting many rows which are only ever going to be accessed individually is a bit tricky.

You could assign random uuids, but then you must save that uuid for later retrieving the value (or a secondary index is needed, which again has hot spots). Instead, the reasonable thing to do is to chose as the "real" primary key a hash of the "moral" primary key. For example, if a table is used to look up users by email, then one would want a primary key column int64 DEFAULT some_hash(email).

That makes sure that writes are spread out over the keyspace, but there is no overhead to look up data (though the lookup would have to query WHERE pk = some_hash($email)). This effectively emulates how hashing data distribution works.

Do we have something like that already?

@bdarnell
Copy link
Contributor

We already have a random() function that returns a float between 0 and 1, so it sounds like all we need is a random_int64() function. (-1 on giving it a name like unique_rowid_nonserial()).

@tschottdorf it sounds reasonable to me to use id BYTES DEFAULT sha256(email) PRIMARY KEY, but we don't currently allow references to other columns in default values (nor do postgres and mysql). But since the application would need to do the hashing in their WHERE clauses as well, I'm not sure it's worth trying to make the default more magical.

Another possibility would be to make the hashing a part of the index definition, so the SQL would just see the raw email, but the index would encode things using the hash.

@tbg
Copy link
Member

tbg commented Jun 13, 2016

The latter option has the benefit of fitting in nicely as a special case of user-defined index functions (for example a secondary index on LOWER(username) should work much the same as a primary key HASH(username)).

@maddyblue maddyblue removed their assignment Jul 6, 2016
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Oct 27, 2016
@knz
Copy link
Contributor

knz commented Nov 8, 2016

Supposing we had such a random function, would it make sense to use it as the default for newly created table from that point forward? Since SQL does not guarantee that rows are placed in the same order in storage (more specifically, that order on select is not guaranteed unless there's an explicit "order by" clause) as they were inserted, this would be semantically correct -- and wouldn't it buy us protection against hotspots during large batched inserts?

@maddyblue
Copy link
Contributor

I think we discussed this and decided that sometimes you do want rows to be inserted generally near each other, so it's not clear that this is always a good idea.

@bdarnell
Copy link
Contributor

Yeah, my sense is that approximate locality based on time of insertion is desired more often than random distribution. That's the whole reason behind the design of the unique_rowid function. It's not suitable in all cases but we have to pick one way or the other to be the default and I think we made the right choice.

@knz
Copy link
Contributor

knz commented May 9, 2018

I am providing the integer random function in #25388.
This together with computed columns, which can be used as index keys, satisfy the various use cases that were discussed in the history of this issue.

@knz
Copy link
Contributor

knz commented May 9, 2018

(Also since then we have built-in hash functions.)

@knz
Copy link
Contributor

knz commented May 14, 2018

@knz knz closed this as completed May 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

7 participants