string interpolation, juxtaposition & multiline strings #2301

Closed
5 of 6 tasks
StefanKarpinski opened this issue Feb 14, 2013 · 47 comments
Labels
breaking This change will break code needs decision A decision on this change is needed

@StefanKarpinski
Member

Here's what I think we should do:

  • Keep string interpolation. It's just too convenient and taking away popular features for language purity is obnoxious.
  • Implement interpolation in the parser. Currently, it's implemented as a macro. This leads to problems like Quotes and string interpolation #455. Instead, we should implement string interpolation for normal strings in the parser, which will allow handling this just fine since it will know when a string is or isn't finished.
  • Non-standard string literals are on their own. This means that non-standard string literals cannot fully emulate the behavior of standard string literals. They can use interp_parse to get close, but they won't be able to handle string literals in interpolated expressions [Quotes and string interpolation #455]. That's fine. You can either have full-blown parser-supported string interpolation or you can have custom string literals, but not both at the same time. This is an acceptable compromise.
  • Allow string literal juxtaposition for concatenation. Merge this branch and make string literal juxtaposition the syntax for string concatenation. If you want to concatenate two variables as strings, you can do foo "" bar, which parses as string(foo,bar) rather than string(foo,"",bar). If you need to reference the string concatenation operation as a function, you use string. This is not really any different than having special syntax for e.g. indexing.
  • Deprecate * for string concatenation. This was a cute experiment, but it's bad. The biggest problem is the Char*Char issue, which we should not deprecate, but rather just remove (and perhaps make Char a proper integer type again?).
  • Parse multiline string literals as macro calls. When an unprefixed multiline string literal (i.e. triple-quoted string) is encountered, handle interpolation in the parser and emit it as a macro call (maybe @mstr?) for further processing. When a prefixed multiline string literal is encountered, no interpolation is done, but the string is passed off to an appropriate macro.

Regarding the last point, consider the following:

# note the indentation
  """
  Four score and $n years ago...
  """

Should be equivalent to:

@mstr("\n  Four score and ", n, " years ago...\n  ")

The @mstr macro can handle indentation stripping a la #70 by looking at the trailing whitespace of the last string literal macro argument and stripping that from the indentation of all the string literal arguments.
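
The stripping rule just described can be sketched as an ordinary function. This is only an illustration, not the actual @mstr implementation: strip_indent is a hypothetical name, it is written against a recent Julia 1.x, and it treats every string argument as a literal, whereas a real macro could tell literals apart from interpolated values.

```julia
# Hypothetical helper (not part of Base): mimics the described @mstr rule
# on arguments that have already been split at interpolation points.
function strip_indent(args...)
    # Collect the string pieces; the last one determines the indentation.
    strs = [a for a in args if a isa AbstractString]
    isempty(strs) && return string(args...)

    # Indentation = the run of spaces/tabs after the final newline.
    m = match(r"\n([ \t]*)\z", last(strs))
    indent = m === nothing ? "" : String(m.captures[1])

    # Strip that indentation after every newline in the string pieces;
    # non-string (interpolated) values pass through unchanged.
    stripped = map(args) do a
        a isa AbstractString ? replace(a, "\n" * indent => "\n") : a
    end
    return string(stripped...)
end

n = 87
strip_indent("\n  Four score and ", n, " years ago...\n  ")
# returns "\nFour score and 87 years ago...\n"
```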

Prefixed triple-quoted strings should emit macro calls just like prefixed single-quoted strings do, although I'm not sure whether they should call a different macro or the same one. Having to define different macros to support both normal and triple-quoted strings is annoying, but on the other hand, you might want to handle triple-quoted literals specially.

@aviks
Member

aviks commented Feb 14, 2013

There was a point, I think, where string concatenation created RopeString types. Now string(a,b) creates a bytestring (ASCII or UTF8 as the case may be). Is this the intended behaviour? Is RopeString meant to be used only explicitly?

@StefanKarpinski
Member Author

@JeffBezanson hates rope strings ;-). Currently, yes, I believe RopeStrings are only created explicitly. The string stuff is going to need an overhaul at some point for greater performance, so I wouldn't depend on the precise representation of strings in general. Fortunately, you should be able to write code to generic strings, no?

@aviks
Member

aviks commented Feb 14, 2013

That's fine, just wanted to clarify the intended behavior. The only time I'd care is when building up large strings dynamically in many parts. In such situations, it is fine to create a rope explicitly.

@JeffBezanson
Member

This is a good proposal. All I'd say is we should also get rid of interp_parse, since otherwise we have two different forms of string interpolation that work slightly differently, which seems awful.

RopeStrings are the wrong default, since if you use them to concatenate small strings performance is terrible. In the worst case where each RopeString node has 1 character, it uses 30x more memory.

@nolta
Member

nolta commented Feb 14, 2013

What? Who is this? "This was a cute experiment, but it's bad."? Like the @RealStefanKarpinski would ever say something like that, please.

"You'll pry * for string cat out of my cold, dead, algebraically correct hands", that's what @RealStefanKarpinski would say.

@JeffBezanson
Member

I think he's going to have to create that github account now :)

@pao
Member

pao commented Feb 14, 2013

...I was liking * (and its extension ^ for repetition). And a "" b is even weirder than that, and doesn't give us an obvious string repetition operator. How is it better?

@StefanKarpinski
Member Author

That's a fair point, @pao. But repetition isn't very common. There are a number of bad problems with * for strcat, which I'll get into in a bit if @JeffBezanson doesn't beat me to it...

@pao
Member

pao commented Feb 14, 2013

Yeah, I've seen some of the problems with extension to Char in other issues, so I know it's not all roses. But we already catch people out on our unusual choice of concatenation operator. At least we can say "hey look at this theory!" and what we're using looks like an operator. This would make "" the de facto concatenation operator, which has certain usability issues in that it's even less obvious what it should mean.

@StefanKarpinski
Member Author

So one of the major issues is 'a'*'b'. You can make this do string concatenation, but it's at odds with the very useful interpretation of characters as integers via their code point, as in 'b'-'a' or '0'+3. It's weird to have both + and * defined for characters but not with the usual relationship that + and * have. But the trouble goes deeper and lack of associativity seems to be at the heart of it. Now string concatenation is associative and numerical multiplication is associative (for numbers or matrices or whatever), so what's the issue? The problem arises when you mix the two possible meanings of * as in 2*3*"x" – what should that do? Currently, it's actually a no method error, which is really the best you can do, but it seems like it could either mean "23x" or "6x". And can you guess what 2*'x' does? Does it produce "2x" or 240? Yeah, so this is really kind of a disaster currently. The fact that the best behavior for 2*3*"x" is a no method error is a pretty strong hint that we're trying to cram two very different meanings into a single operator.

The a "" b is a bit odd, but it actually just falls out naturally just from allowing juxtaposition with a string literal to imply concatenation, which C also does, so it's not really without precedent. The fact that a "" b can be parsed as string(a,b) instead of string(a,"",b) is really just an implementation detail since they both produce the same string in the end. Better suggestions for a string concatenation operator would be welcomed, but so far I haven't been able to come up with any. Please no one suggest + because that's got even more problems than * – think about adding chars.
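
The competing readings above can be spelled out with explicit conversions. This is a small sketch written against a recent Julia 1.x, where the behavior of * on Char is not the same as it was when this issue was opened; to stay independent of that, it only uses string and Int, so each line states one interpretation unambiguously.

```julia
# Characters as code points: differences and offsets.
@assert 'b' - 'a' == 1
@assert '0' + 3 == '3'

# Two readings of 'a'*'b': concatenation vs. code-point product.
@assert string('a', 'b') == "ab"
@assert Int('a') * Int('b') == 9506      # 97 * 98

# Two readings of 2*'x'.
@assert string(2, 'x') == "2x"
@assert 2 * Int('x') == 240              # 2 * 120

# Two readings of 2*3*"x": concatenate everything, or multiply first.
@assert string(2, 3, "x") == "23x"
@assert string(2 * 3, "x") == "6x"
```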

@JeffBezanson
Member

We might want to go the rest of the way and make Char not a subtype of Integer. Interestingly, this patch seems to cause no problems:

--- a/base/char.jl
+++ b/base/char.jl
@@ -44,7 +44,7 @@ promote_rule(::Type{Char}, ::Type{Uint128}) = Uint128
 ## character operations & comparisons ##

 -(x::Char, y::Char) = int(x)-int(y)
-+(x::Char   , y::Char   ) = char(int(x)+int(y)) # TODO: delete me
++(x::Char   , y::Char   ) = error("no method")

But if we stopped using * for stringcat we could go the other way and make it a fully functional integer.

@StefanKarpinski
Member Author

@JeffBezanson: I'm fine with deleting interp_parse although I must confess I will be a little sad to see that bit of my handiwork go. It has served its purpose, which was to sneak string interpolation into the language despite your deep aversion to it :-)

@StefanKarpinski
Member Author

Insisting that Char is not an integer just seems really wrong to me. It's annoying and causes all sorts of problems. The basic problem is that string concatenation and multiplication are very different things. You might want to multiply the values of two characters or you might want to concatenate two characters into a string.

@nolta
Member

nolta commented Feb 14, 2013

Insisting that Char is not an integer just seems really wrong to me.

Why? 'b' - 'a' or '0' + 3 or 'a' < ',' etc. just seem like gibberish to me.

@JeffBezanson
Member

True. My point is more that Char should either fully be an integer or fully not be. Right now it is a subtype of Integer but doesn't act that way, which is just buggy.

Having Char be a subtype of Ordinal (along with Ptr) is perfectly defensible; 'a'*'b' would just be an error like it is for pointers, which is fine.

@diegozea
Contributor

If Char is not an Integer any more, maybe + would be an option? And what about using nothing more than simple juxtaposition of strings?

Wiki: http://en.wikipedia.org/wiki/Concatenation

In many programming languages, string concatenation is a binary infix operator. 
The "+" operator is often overloaded to denote concatenation for string arguments: 
"Hello, " + "World"; has the value "Hello, World".

Concatenation of sets of strings
...
In this definition, the string vw is the ordinary concatenation of strings v and w as defined in the introductory section.
In this context, sets of strings are often referred to as formal languages.
There is typically no explicit concatenation operator, simply juxtaposition (as with multiplication).

@diegozea
Contributor

http://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29#Concatenation
+ looks to be the most popular.
What about using a & b?

@pao
Member

pao commented Feb 14, 2013

...but it actually just falls out naturally just from allowing juxtaposition with a string literal to imply concatenation, which C also does, so it's not really without precedent.

I've never seen anyone use "" to concatenate string variables in C. It may fall out, but I'd be curious to know if it is in fact intuitive. My intuition is it isn't, but I'd be happy to be shown wrong here.

@StefanKarpinski
Member Author

Why? 'b' - 'a' or '0' + 3 or 'a' < ',' etc. just seem like gibberish to me.

These aren't gibberish at all. They get used all the time in C-style string parsing and printing code. See here or here for example. Comparison of characters as integers is implicit in all string ordering.

@nolta
Member

nolta commented Feb 14, 2013

I'm just saying that Char's integral behavior is a side-effect of the implementation. They're not intrinsically integers. By "gibberish", i mean that the result of 'a' < ';' is encoding dependent -- there's no right answer.

@StefanKarpinski
Member Author

That's not true at all – Chars don't have an encoding, they are code points, which are completely encoding-independent – that's the whole point. Unicode code points are intrinsically integers.

@toivoh
Contributor

toivoh commented Feb 14, 2013

+1 to string interpolation in the parser.
+1 to chars as ordinals, too! You could also view chars as something like an affine space, I guess that amounts to pretty much the same thing. I think that all the meaningful integer operations involving chars, and definitely all the ones in your example above, Stefan, go under this heading. I also don't see a real problem with continuing to use * and ^ as we do now (and I do like them!), since they are not meaningful for ordinals.

Also, it seems nice and consistent that when juxtaposition is meaningful, it coincides with multiplication.

@GunnarFarneback
Contributor

I dislike the a "" b concatenation approach and would be equally happy with * or +. What I don't understand is the expectation that you should be able to concatenate chars in the same way as strings. 'a'*'b' should either be integer multiplication or undefined. Require an explicit string('a','b') if you need to concatenate chars or turn them into strings separately first. Likewise 2*'a' should either be integer multiplication or undefined. 2*3*"x" should be "6x" in accordance with operators.jl line 44.

@pao
Member

pao commented Feb 14, 2013

Unicode code points are intrinsically integers.

Unicode code points serve to enumerate characters. The ordering of Unicode characters is a convention, and in the least significant bits doesn't necessarily mean anything.

However, it is also true that the values of code points have some meaning, particularly in the most significant bits (separating code planes). So Char is conceptually overloaded.

We may want access to both interpretations, but which should be the default? And should it have anything at all to do with how we deal with String? (I am truly unsure of how best to answer either question.)

@kmsquire
Member

Regarding *, I complained vociferously about it in a very early discussion... but now I kind of like it (sheepish grin). It's just one of the things in Julia that took a little getting used to.

I also don't think it's necessary to handle every combination of * of different types. 2 * 3 * 'a' is nonsensical, unless maybe you take it to mean "aaaaaa", as Python does (but we already have ^ for that, if * keeps its current meaning). If you want to treat 'a' as an integer, cast it. It should be rare enough that inconveniencing the use of Char as an integer isn't a big deal.

(Of course, an alternative is just to break down and use + for concatenation (ducks) and * for repeat, and arbitrarily ignore, disallow, or work around all of the problems.)

@diegozea
Contributor

( I like the python way of + for concatenation and * for repeat, I found it very intuitive. )

@StefanKarpinski
Member Author

For now I'm retracting the deprecation of * for string concatenation. It's largely incidental to this issue, most of which can be done without deprecating that usage of *. We can work the concatenation thing out later.

nolta added a commit that referenced this issue Mar 2, 2013
Partially implements #2301. Also 'x"""..."""' now maps to
'@x_mstr """..."""' instead of '@x_str'. Slightly annoying, but it
allows 'L"""..."""' to not strip whitespace.
@nolta
Member

nolta commented Mar 3, 2013

Ok, this should all be finished, except for string juxtaposition.

I had to create separate @*_mstr macros for prefixed triple-quoted strings, since whitespace stripping is handled by these macros. Do we want this? One alternative would be to move the stripping code into the parser.

In order to remove interp_parse, I"...", b"...", and B"..." interpolation is handled by the parser. I don't mind creating an exception for I, but b and B might be pushing it. Should we interpolate b strings? I only see one use of this (in extras/image.jl), and it doesn't really require interpolation:

    ss = sort(b"$s")

@JeffBezanson
Member

I'm not even sure what B"..." is for. Its only use seems to be making invalid utf-8 strings. And multiline b"""...""" literals are a bit odd, since that feature is clearly for text and not binary data. It's fine if it works, but no need to bend over backwards to support it.

@JeffBezanson
Member

I once more attempted the exercise of making Char not an Integer. What happens is you need to duplicate a lot of the scalar definitions in number.jl. Then there is a lot of code that does things like 7 <= c <= 13, and c & 0x3F. It might make sense to have some kind of Scalar type with Number and Ordinal below it, but there are just too many cases where a Char is treated like an integer. So now I feel the most convenient thing is to make it a proper integer, and not use * for concatenating Chars.

@JeffBezanson
Member

I will also add that I think *, string juxtaposition, and interpolation is too many syntaxes.

@StefanKarpinski
Member Author

We should probably have a bikeshedding session about the non-standard string literal prefixes in Base. There are too many of them now and they're too hard to remember and the behavior of L"..." is a bit questionable since there are certain things you simply can't express (although it's also fairly handy).

@StefanKarpinski
Member Author

I agree that making Char not an Integer feels really contrived and awkward. I would be cool with discarding * for string concatenation in general at this point and embracing juxtaposition, interpolation and explicit use of the string function for concatenation. That's plenty of ways to skin this cat.

@pao
Member

pao commented Mar 6, 2013

Just make sure we don't lose a replacement for string^n. I do actually use that.

@pao
Member

pao commented Mar 6, 2013

Although I still haven't figured out why Char creates a problem for string concatenation. And juxtaposition and * are equivalent elsewhere in Julia.

@StefanKarpinski
Member Author

Yes, repeating strings is definitely necessary. I think some kind of rep function that repeats vectors or strings or whatever iterable in general makes sense.
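
For comparison, recent Julia 1.x versions ship a repeat function that covers roughly this idea for strings and arrays, alongside ^ for strings. The lines below are a sketch against those versions, not the rep function as it might have been designed at the time of this thread.

```julia
# Repeating a string, via the function and via the ^ operator.
@assert repeat("ab", 3) == "ababab"
@assert "ab"^3 == "ababab"

# The same function repeats the contents of a vector.
@assert repeat([1, 2], 3) == [1, 2, 1, 2, 1, 2]
```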

@StefanKarpinski
Member Author

@pao: the issue is whether 'a'*'b' is "ab" or 9506. If Char is an Integer, then the latter is the correct answer but if * is the string concatenation operator then it should also work for Chars, making "ab" the correct answer.

@nolta
Member

nolta commented Mar 6, 2013

Shall we get rid of B"..." strings?

@JeffBezanson
Member

Yes I'd say so.

nolta added a commit that referenced this issue Mar 6, 2013
Per the discussion in #2301.
@GunnarFarneback
Contributor

I still don't understand why the string concatenation operator must also work for Chars. Why is it so bad to require that you first make strings out of your Chars if you want to concatenate them?

@JeffBezanson
Member

Not a problem any more. Char now behaves like an integer, * and all.

@pao
Member

pao commented Mar 6, 2013

if * is the string concatenation operator then it should also work for Chars

That's the assertion I'm challenging (as @GunnarFarneback notes).

@JeffBezanson
Member

Well now * is only string cat, and does not concatenate Chars, so that's where we ended up.
My personal preference would be to use only string() and string juxtaposition, and eliminate * and interpolation. But I'm outvoted on that, and having 3 syntaxes for it is crazy, so here we are.

@kmsquire
Member

kmsquire commented Mar 7, 2013

String juxtaposition with "" just looks weird.

... of course, I said the same thing when I saw * used for concatenation.

I do agree with Jeff that 3 syntaxes is silly. I propose that, having heard everyone's views, Stefan and/or Jeff just make an executive decision and let everyone just get used to things.

(Please just make a good choice. ;-) )

@StefanKarpinski
Member Author

Now that we have call overloading, it occurred to me that we can do this:

julia> Base.call(s::String, args...) = join(args, s)
call (generic function with 855 methods)

julia> a, b, c = "foo", "bar", "baz"
("foo","bar","baz")

julia> ", "(a, b, c)
"foo, bar, baz"

I'm not saying we should necessarily do this, but we could. The reason I was thinking this is that if we used juxtaposition for string concatenation, then you would write a "" b to concatenate a and b. But that leaves one wondering how to pass the concatenation operation as an object, say to a higher order function. But the empty string could serve that purpose:

julia> words = [a, b, c]
3-element Array{ASCIIString,1}:
 "foo"
 "bar"
 "baz"

julia> reduce("", words)
"foobarbaz"

Slightly weird but it does have a certain internal consistency.
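
The same idea can be tried without adding a call method to String itself. The sketch below assumes a recent Julia 1.x, where callable objects are defined with the (x::T)(args...) syntax rather than Base.call, and it uses a hypothetical wrapper type named Sep so that no method is attached to a Base type.

```julia
# Hypothetical wrapper type (not part of Base) holding a separator.
struct Sep
    s::String
end

# Calling a Sep joins its arguments with the wrapped separator.
(sep::Sep)(args...) = join(args, sep.s)

a, b, c = "foo", "bar", "baz"

@assert Sep(", ")(a, b, c) == "foo, bar, baz"

# With an empty separator the object acts as a concatenation function
# that can be handed to a higher-order function such as reduce.
@assert reduce(Sep(""), [a, b, c]) == "foobarbaz"
```

Wrapping the separator keeps the experiment local; defining the call directly on String would change the behavior of every string in the session.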

@andyferris
Member

I was surprised that string literal juxtaposition doesn't work. I was trying to copy-paste some C++ code that had

std::string points =
        "248258.441322 7417253.63825 44.2832223546\n"
        "248258.909841 7417253.42727 44.066906061\n"
        "248258.985642 7417253.11483 44.5358143357\n"
...
        "248267.816489 7417238.83666 44.6165076596\n";

and thought that might work in Julia (with some bracketing).

Should we support this? It's the same as multiplication juxtaposition, no? Unlike numbers, it "looks" more like the true result than 6 = 2 3 (which is also unsupported).

@freeboson
Contributor

freeboson commented Nov 3, 2017

@andyferris I also think it's weird that you can't concatenate string literals at the parser level. If you do

points = "248258.441322 7417253.63825 44.2832223546\n" *
        "248258.909841 7417253.42727 44.066906061\n" *
        "248258.985642 7417253.11483 44.5358143357\n" * #...
        "248267.816489 7417238.83666 44.6165076596\n"

you are actually creating machine instructions for each *(::String, ::String). Obviously in this case you can use """ since you want newlines, but it's typical to use juxtaposition to break up long literals (without newlines) in C, C++, and Python.
