Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize table names in copy_to() #611

Closed
richierocks opened this issue Apr 11, 2017 · 6 comments
Closed

Sanitize table names in copy_to() #611

richierocks opened this issue Apr 11, 2017 · 6 comments

Comments

@richierocks
Copy link
Contributor

Dots in destination table names cause errors.

library(dplyr)
library(sparklyr)

local_tbl <- data_frame(
  x = runif(5)
)
spark_conn <- spark_connect("local")
spark_tbl <- copy_to(spark_conn, local_tbl, "foo.bar")
## Error: org.apache.spark.sql.AnalysisException: 
##    It is not allowed to add database prefix `foo` for the TEMPORARY view name.
spark_tbl <- copy_to(spark_conn, local_tbl, "foo.1")
##  Error: org.apache.spark.sql.catalyst.parser.ParseException: 
##    extraneous input '.1' expecting <EOF>(line 1, pos 3)

It would be nice if the table names were checked before they were sent to Spark. I'm not sure if the best behavior is to throw a helpful error or to fix the name.

@kevinushey
Copy link
Contributor

Good idea -- we currently sanitize column names (https://github.com/rstudio/sparklyr/blob/7b1c6d8ff864612d8f5becacbc24fcd9d131ef21/R/utils.R#L116); it might make sense to use a similar transformation for table names.

@chezou
Copy link
Contributor

chezou commented Apr 15, 2017

I guess database name handling affects this problem. Is it usual thing in R world to name a table with dot? In SQL, dot is a separater syntax between database and table name.

@richierocks
Copy link
Contributor Author

@chezou Dots in variable names are allowed in R, and it wasn't clear to me that they weren't allowed in Spark.

If I've understood the DBI package documentation correctly, I think that sparklyr just needs an improved method for dbQuoteIndentifier().

@chezou
Copy link
Contributor

chezou commented Apr 17, 2017

Thanks for the explanation. To be honest, I sometimes saw variables with dots in R, but I was not sure it is usual or not.

I heard Hadley is trying to decide how to handle DB identifier. r-dbi/DBI#24 It would solve this problem. In the SQL world (not only Spark world), it is very rare to use dots for a table name, and I didn't realize that problem. Not unlike usual DB, such as PostgreSQL, we usually use Spark to join with other tables.

@chezou
Copy link
Contributor

chezou commented Jun 21, 2017

With dplyr 0.7.0, we now have in_schema() for handling table name appropriately. We should change the way to hande table names.

@yitao-li
Copy link
Contributor

yitao-li commented Mar 30, 2021

Closing this issue because I don't think there is any sensible way to move forward with any of the proposed changes:

  • Possibility 1: replacing "." with some other valid character in view name -- I don't think that's a good idea, because if the name of the view is supposed to be some pre-determined string, and the name has "." in it, then it is better to fail (as Spark won't accept "." in the name) rather than replacing "." with something else and pretend the predetermined table/view name is acceptable for Spark

  • Possibility 2: improving quoting to make it work -- I don't think this is possible with Spark at the moment. See examples below:

library(sparklyr)

sc <- spark_connect(master = "local")

DBI::dbGetQuery(sc, "CREATE VIEW `foobar` AS SELECT 1") #works
DBI::dbGetQuery(sc, "CREATE VIEW `foo.bar` AS SELECT 1") # will throw an error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants