Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make / behave in SQL as it does in R #1862

Closed
iangow opened this issue May 26, 2016 · 8 comments
Closed

Make / behave in SQL as it does in R #1862

iangow opened this issue May 26, 2016 · 8 comments

Comments

@iangow
Copy link

iangow commented May 26, 2016

See #1799 (now closed).

There we discussed the fact that / will behave differently in SQL from how it behaves in R. This breaks a nice feature of dplyr, which is that analysis can be moved from R to SQL as it proceeds with little or no changes to code. I suggested that / should get translated so as to behave the same in both places:


@hadley responded: "Hmmm, that's not a bad idea, but what should it get translated to?"


I suggested that translating / as %* 1.0 /% might work:

> pg <- src_postgres()
> stuff <- tbl(pg, sql("SELECT 5 AS a, 2 AS b"))
> stuff
Source: postgres 9.5.2 [igow@localhost:5432/crsp]
From: <derived table> [?? x 2]

       a     b
   (int) (int)
1      5     2
..   ...   ...
> stuff %>% mutate(c = a %/% b)
Source: postgres 9.5.2 [igow@localhost:5432/crsp]
From: <derived table> [?? x 3]

       a     b     c
   (int) (int) (int)
1      5     2     2
..   ...   ...   ...
> stuff %>% mutate(c = a %* 1.0 /% b)
Source: postgres 9.5.2 [igow@localhost:5432/crsp]
From: <derived table> [?? x 3]

       a     b     c
   (int) (int) (dbl)
1      5     2   2.5
..   ...   ...   ...

I'm not sure if it was intended to close this issue along with the original "too esoteric" issue from #1799, so I am making this a new one (perhaps to be closed again) just in case.

@hadley
Copy link
Member

hadley commented Jun 7, 2016

This just feels like a step too far to me. I can't put it precisely into words, but my intuition tells me its the wrong thing to do.

@hadley hadley closed this as completed Jun 7, 2016
@iangow
Copy link
Author

iangow commented Jun 7, 2016

I think this XKCD applies here too. That is, this is perhaps best left as a matter of documentation in the databases vignette (e.g., / gets translated(?) into / in SQL, which does truncated division if the two arguments are integers, so SELECT 3/2 yields 1) or

> tbl(pg, sql("SELECT 3 AS a, 2 AS b")) %>% mutate(c=a/b)
Source: postgres 9.6.0 [igow@localhost:5432/crsp]
From: <derived table> [?? x 3]

       a     b     c
   (int) (int) (int)
1      3     2     1
..   ...   ...   ...

@krlmlr
Copy link
Member

krlmlr commented Jun 8, 2016

I would expect at least the basic operators to behave identically in all backends. However, if we change the behavior now, it's a truly breaking change: It affects both the results and the data type of the output of a division, and also all operations that are based on the result.

How about introducing and supporting a new %//% operator that works like R's / in both SQL and data frame? Then we could gradually phase out usage of the original division operator (by first giving a warning, later perhaps an error), or simply leave matters unchanged.

@iangow
Copy link
Author

iangow commented Jun 8, 2016

I think this may just be a matter of dividing the world into SQL and non-SQL backends. In SQL, / has it's well-defined meaning (i.e., truncated division for integers) and in non-SQL, it has the regular R meaning. Hadley's intuition may be that there are enough differences between SQL and non-SQL that it might be too much to try to stamp them all out (e.g., sum with NA/NULL values), as elegant as it would be if it could be done. (Perhaps it could be that there's a option that can be set to choose greater conformity.)

> pg <- src_postgres()
> df <- tbl(pg, sql("SELECT UNNEST(ARRAY[1,2,3,NULL]) AS a"))
> df
Source: postgres 9.5.3 [igow@localhost:5432/crsp]
From: <derived table> [?? x 1]

       a
   (int)
1      1
2      2
3      3
4     NA
..   ...
> df %>% summarize(sum(a))
Source: postgres 9.5.3 [igow@localhost:5432/crsp]
From: <derived table> [?? x 1]

   sum(a)
    (dbl)
1       6
..    ...
> df %>% collect() %>% summarize(sum(a))
Source: local data frame [1 x 1]

  sum(a)
   (int)
1     NA

(Note that sum quietly changes type from input to output in PostgreSQL.)

Of course, // could be confusing for Python (3) users (meanings would be flipped).

2009-Mac-Pro:~ igow$ ipython3
Python 3.4.4 (default, Mar  2 2016, 03:31:27) 
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: 3/2
Out[1]: 1.5

In [2]: 3//2
Out[2]: 1

In [3]: exit
2009-Mac-Pro:~ igow$ ipython2
Python 2.7.11 (default, Mar  1 2016, 18:40:10) 
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: 3/2
Out[1]: 1

In [2]: 3//2
Out[2]: 1

@hadley
Copy link
Member

hadley commented Jun 8, 2016

It's simply not possible for all operators to act identically - even basic things like mean and sum work differently. This is a delicate balancing act, but I just don't think it's worth it in this case.

@krlmlr
Copy link
Member

krlmlr commented Jun 8, 2016

I thought it's possible at least for operators, but perhaps not. How about printing a message if a backend's behavior is incompatible with different from R (e.g., using / with integer operators)?

@hadley
Copy link
Member

hadley commented Jun 8, 2016

I'm not sure even that's true - consider NA vs NaN vs NULL etc. Are the semantics always going to be identical?

@krlmlr
Copy link
Member

krlmlr commented Jun 8, 2016

Correct roundtrip should be done by the DBI backends. There are still rough edges, but I think valid input should give correct output, or warn.

With sum() and mean(): Here, a message could appear if the user doesn't explicitly state na.rm = TRUE, but that's probably overkill.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants