Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling column names directly? #6

Closed
adamroyjones opened this issue Dec 8, 2018 · 17 comments
Closed

Handling column names directly? #6

adamroyjones opened this issue Dec 8, 2018 · 17 comments

Comments

@adamroyjones
Copy link

Typically, a csv file will have a header row (e.g. carat,cut,color,clarity,depth,table,price,x,y,z for the diamonds data set from ggplot2).

Is there a chance of adding an option to perform RBQL queries directly with the columns (e.g. carat) instead of having to reference them by their index (e.g. a1)?

@mechatroner
Copy link
Owner

mechatroner commented Dec 9, 2018

Hi @Docbiz !

There will be a lot of corner cases:

  1. What if column names are long?
  2. What if column names contain spaces or other non-alphanumeric chars? OK we can handle this by adding a rule that all non-alphanumeric should be replaced with an underscore.
  3. Some tables do not have a header row.
  4. Many result set tables wouldn't have a header row even if the source table had it.

BTW maybe the root problem why we may need this feature is because it is hard to associate a1,a2... vars with columns? I mean it should be trivial for VSCode version since there is an alligned table, but for standalone rbql console apps it is really hard to tell what is the number of a specific column? In this case it is possible to add some terminal user interface to rbql-js and rbql-py.
So basically, what is your use case? Why using column names from header for you is more convenient than a1,a2,... vars?

@adamroyjones
Copy link
Author

It's certainly true that always and only referencing columns by their header row (where one exists) would be very error-prone. But I think it would be a nice option—I stress 'option'—to permit queries that reference columns by their header name. (The header names could be restricted to, say, not include the file's separator or `; in this way, you could reference column names that contained spaces by wrapping it in ` characters (much like in R).)

The advantage of queries that use the names of the columns is that they can be understood on their own without the user having to keep flicking back and forth between the query and the table(s). That is: yes, it can be a bit frustrating to associate a1, a2, etc. with specific columns!

(My personal use case is wanting to casually reshape data without firing up something like R, but I think this would be a generally useful feature.)

If you want, I'll try to find some time over the next few weeks to come up with a pull request?

@mechatroner
Copy link
Owner

I like your idea with wrapping column names in backtick chars, it is better than underscore replacement which I had in mind.
OK, go ahead with a pull request. Please implement it just for one backend (Python or JS whichever one you like more) and then we will merge it into the other one.

@RobertGiesecke
Copy link

Please don't use some weird mysql-specific stuff.
In SQL, you quote identifiers by enclosing them in double quotes, which also makes them case-sensitive.

@mechatroner
Copy link
Owner

There are actually some aspects of this problem that I didn't think about before.

  1. If column names are used in query, header row should be skipped by record iterator.
  2. Probably combining column indexes (a1, a2 etc) and column names should be disallowed in a single query because column indexes have header-agnostic semantic (do not suppose to skip the header row)
  3. Switching between column names and column indexes can be done with a special flag in RBQL user interface.
  4. There should be a way to distinguish between column names in the main table A and join table B when these tables share columns with the same names.
  5. Column names should be easily and reliable detectable by a simple parser. I.e. it should be impossible to mix them up with JS or Python string literals and/or reserved keywords. Maybe scheme like a.columnname and b.columname can be used where columname part is a column name with all non-alphanumeric characters removed. Or it can be a."Colum name" or something like that.

Anyway it looks like any possible implementation of this feature would be very tricky because of all these considerations.

@mechatroner
Copy link
Owner

Maybe there is actually an interesting trick to achieve this, currently RBQL uses a1, a2, ... notation. But if we declare special a and b objects in Python and JS templates and initialize them with the current record, then the following 3 valid Python/JS syntaxes would be automatically available with these special objects if properly implemented:

  1. a[1], a[2], ....,b[1], .... - array notation, looks ugly for SQL
  2. a.name, a.age, a.columnname, ... , b.phonenumber, ... - attribute notation, looks like valid SQL. Downside is that not all column names can be used as attributes, but a simple rule can be used - just discard all non-alphanumeric chars.
  3. a["name"], a["age"], a["Column name"], .... , b["Phone number"] - dictionary notation, looks ugly for SQL

@peheje
Copy link

peheje commented Jun 29, 2019

Another solution could be to translate the query string with column names into a query string with a1, a2, an. Then feed that into the current engine.

select name, age where age > 2
->
select a1, a2 where a2 > 2

By reading header you can build dictionary {name:1,age:2}

No core engine requirements changed.

@mechatroner
Copy link
Owner

@peheje This will not work because we can't reliably "translate" the query string. Think about name collisions between column names and built-in python/js keywords, identifiers and popular modules e.g. columns named "if", "sys", "os", "or", "and" etc. So we can't just replace all tokens in a query that match column names in the csv file. One way to work around this is to require all column names to be wrapped in backticks as @Docbiz proposed, then we can make the translation work reliably. But I think it is possible to totally avoid the translation by combining notations 2 and 3.

@peheje
Copy link

peheje commented Jun 29, 2019 via email

@mechatroner
Copy link
Owner

OK, That's how I understand your proposal (and problems associated with it):

let's say we have a table:

name,age,JSON
Andy,30,"[100, 150]"
Maria,20,"[200, 125]"

And the query (in javascript) is:

SELECT JSON.stringify({'person_name': name, 'person_age': age})
  1. we read the header and build the following dictionary:
    {name: a1, age: a2, JSON: a3}

  2. we patch the query using the dictionary and get:

SELECT a3.stringify({'person_name': a1, 'person_age': a2})

which is obviously an error, a3. stringify(...) will throw an exception because a3 is a string that doesn't have stringify() function.

So how do you suggest to alter that algorithm to avoid the problem?

@peheje
Copy link

peheje commented Jun 30, 2019

I admit, I didn't think of that. However the user could be forced to specify where the column is coming from: a or b. Required in the case of join anyway.

Maybe this could be a solution?

String.prototype.replaceAll = function(search, replacement) {
    return this.replace(new RegExp(search, 'g'), replacement);
};

var dic = { "a.name": "a1", "a.age": "a2", "a.JSON": "a3" };
var query = "SELECT JSON.stringify({'person_name': a.name, 'person_age': a.age}) order by a.JSON";
var patched = query;

for (prop in dic) {
    patched = patched.replaceAll(prop, dic[prop]);
}

console.log(patched);
// SELECT JSON.stringify({'person_name': a1, 'person_age': a2}) order by a3

I'm know sure how to handle spaces in column names though, maybe unsupported?

@mechatroner
Copy link
Owner

Yep, I think that a.name, a.age, b.name, etc notations are the way to go.
But there is no need to do replaceAll, because it is unreliable even with "a." and "b." prefixes. Original user query should be altered as little as possible. One case where it will definetely fail are string literals: should we replace a.name inside string literals? In case of SQL keywords ("SELECT", "WHERE", etc) the answer is no, and RBQL isolates string literals before it starts searching for the keywords. But in case of a.name the answer is not so obvious. Here is an absolutely artificial and contrived example, but it will fail with replaceAll:

SELECT `a.name = ${a.name}`

The output would be a1 = Andy instead of a.name = Andy. I agree that such situation would probably never happen in practice, but my general position still holds: original query should be altered as little as possible. This is also important for exceptions/errors that contain failed query expression, if the query was significantly altered it wouldn't be recognizable by the user, and it would be harder to correct that error.

So to implement this "a.name" etc notations I suggest to initialize special "a" and "b" objects (if either "a." or "b." were detected somewhere in the query). This should be relatively easy to do - almost the same code that currently initializes a1, a2, b1 etc variables. "a" and "b" can also support square bracket access like a["person name"]. Those who think that this is super ugly and non-SQLish can pretend that this feature doesn't exist and only columns without whitespaces/special characters are supported. I don't have 100% guarantee that this approach would work, but I currently don't see why it wouldn't, and the only way to be sure is to try and implement this.

@fahnzmode
Copy link

Doesn't address this issue directly, but something that might be worth considering is to simply allow skipping of the header row when the query is performed.

@mechatroner
Copy link
Owner

@fahnzmode I was planning to do this at some point. Implementing an additional "--skip-header" option should be easy, but it has to be supported in all user interfaces - this is the hard part. Also I would like to add support to NL variable simultaneously - which would be at least NR + 1 in this case (or even greater if some records span multiple lines). BTW until it's implemented you can use where NR > 1 to skip the header line.

@fahnzmode
Copy link

Thanks for the where NR > 1 tip @mechatroner !

@mechatroner
Copy link
Owner

So, I wrote the code to support a.good_alphanumeric_column_name and a["arbitrary column name!"] notation, it's available in the master branch. Took me almost 150 commits [sigh]
Soon I will start deploying it to the apps.
And I would really appreciated any feedback, guys! Please, let me know what you think.

PS I rewrote JS implementation: replaced all callbacks with async/await so it should be much easier to read now.

@mechatroner
Copy link
Owner

Column name support was added for all RBQL apps that I am supporting.
For VSCode version and rbql.org I also added variable autocomplete feature - when user types "a." the app would show all possible column names. This autocomplete is in early development stage: e.g. typing "b." prefix would not show anything which is obviously wrong.

Many thanks to everyone who participated in this discussion!

I am not sure if RBQL would ever support column names without "a." and "b." prefixes, still we can open a new issue for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants