Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace function should support multiple replacements for strings #29849

Closed
s-celles opened this issue Oct 30, 2018 · 15 comments
Closed

replace function should support multiple replacements for strings #29849

s-celles opened this issue Oct 30, 2018 · 15 comments
Labels
strings "Strings!"

Comments

@s-celles
Copy link
Contributor

Hello,

It will be nice if

replace("The quick fox is beautiful.", Dict("quick" => "slow", "fox" => "turtle"))

could be done instead of

replace(replace("The quick fox is beautiful.", "quick" => "slow"), "fox" => "turtle")

unfortunately replace function currently doesn't support AbstractDict

Kind regards

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Oct 30, 2018

If replace supported multiple replacement pairs (I thought it already did, but it doesn't seem to; it only supports a single replacement pair), then this would just work:

replace("The quick fox is beautiful.", Dict("quick" => "slow", "fox" => "turtle")...)

I think that's preferable to starting to support various data structures to describe replacements.

@fredrikekre
Copy link
Member

If replace supported multiple replacement pairs (I thought it already did, but it doesn't seem to; it only supports a single replacement pair),

ref #25732

@s-celles s-celles changed the title replace function for strings should also support AbstractDict as replacement parameter replace function should support multiple replacement for strings Oct 30, 2018
@s-celles s-celles changed the title replace function should support multiple replacement for strings replace function should support multiple replacements for strings Oct 30, 2018
@StefanKarpinski
Copy link
Member

Ah, I had forgotten that #25732 never got merged.

@o314
Copy link
Contributor

o314 commented Dec 22, 2018

That's simply work :

Base.replace(s::String, oldnews::Pair...) = foldl(replace, oldnews, init=s)

using Test
@test Base.replace("a & b > c", "&"=>"&",">"=>">") == "a & b > c"

PS: edited to use foldl instead of reduce

@antoine-levitt
Copy link
Contributor

There seems to be a couple of closed PR (#25732, #35414) about this, the main objection being that it's tricky to implement in a performant way. This is very frustrating for people like me who really don't care about performance in strings but like easy convenience functions to do their scripting, and for whom this really feels like an API oversight. Is it really that bad to have an unoptimal implementation in Base? People serious about string performance can use a profiler and figure out that's the bottleneck, and work around it if needed.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented May 26, 2020

I agree that it would be good to have this as an API even if the implementation isn't the fastest possible and then once that exists, people can use it and if we improve the performance their code gets faster automatically. Seems like a fine situation here. The main question is just the semantics. Presumably replace(thing, a =>b, c => d) should be equivalent to doing replace(replace(thing, a => b), c => d) but ideally only doing a single pass.

@vcanaran
Copy link

vcanaran commented Aug 2, 2020

A simple use cause is DNA to RNA Transcription where codes are replaced

RNAT3(s)=replace(replace(replace(replace(s,"G" => "C"),"C" => "G"),"T" => "A"),"A" => "U")

would become

RNAT4=replace(s, "G"=>"C", "C"=>"G", "T"=>"A", "A"=>"U")

@jakewilliami
Copy link

jakewilliami commented Dec 1, 2020

Is this implemented yet? It doesn't quite seem to work with my use case.

function _latex_escape_string(str::AbstractString)
	subs = Dict{AbstractString, AbstractString}(
		"&" => "\\&",
		" - " => "---",
		"_" => "\\_",
		"#" => "\\#"
	)
	
        return  replace(lstrip(str), subs...)
end
ERROR: LoadError: MethodError: no method matching similar(::SubString{String}, ::Type{Any})

@StefanKarpinski
Copy link
Member

It is not — the issue would be closed if it was implemented.

@jakewilliami
Copy link

@StefanKarpinski good point, my bad. FWIW, +1 regarding @antoine-levitt's comment.

@dgkf
Copy link

dgkf commented Dec 19, 2020

This topic came up in the Julia slack recently, and I just wanted to add my crack at an implementation in case it helps someone think about the problem:

function replace(x::AbstractString, old_new::Pair...; count::Union{Int, Nothing} = nothing)
    count isa Nothing && (count = Inf)

    # find all matches of "old" within x; Array of "new" => replacement range
    locs = vcat([new .=> findall(old, x) for (old, new) = old_new]...)

    # sort replacement locations by start and length
    locs = sort(locs, by = i -> (first(i.second), -last(i.second)))
    
    # reduce replacement locations, only applies replacement if the match
    #   - is found in the original string "x"
    #   - starts before any other overlapping matches
    #   - is the longest of any other overlapping matches
    #   - is matched by the first positional argument of all identical matches
    str, n = reduce(locs; init = ("", 1)) do (str, n), (new, loc)
        if count > 0 && first(loc) >= n
            count -= 1
            (str * x[n:first(loc)-1] * new, last(loc) + 1)
        else
            (str, n)
        end
    end

    return convert(typeof(x), str * x[n:end])
end

What I like about this implementation is that it only applies replacements found in the original string, not intermediate strings as replacements are sequentially applied. In my opinion, applying replacements to these intermediate results is best done via reduce, where it's clear that they're being applied sequentially. Restricting the behavior to applying only to the original string serves a unique purpose that isn't just a shorthand for a reduce call.

@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 21, 2020

Since this has come up a few times already on both Slack and Discourse, I want to add that I think both behaviors (only replacing on the original string and replacing on intermediates) make intuitive sense and seem equal in usefulness to me. Favoring one over the other seems like a weird choice to me.

Another point that should be decided is what to do when two replacements match at the same position in the input string - which should prevail? E.g. replace("aabbaa", "aabb" => "cc", "aa" => "zz") could result in either ccaa or zzbbaa. The result is implementation dependent, which immediately leaks the abstraction of the loop and would break with a potential docstring of "replace each occurence of the first element of each pair with the second element of the same pair".

However, if the API were restricted to replace(s::AbstractString, Vararg{Pair{Char,Char}}), no such conflict could occur, as Char can't span multiple indices of Chars (just as it works in the multi-arg replace on arrays, which also only ever works on one index at a time).

@oscardssmith
Copy link
Member

Looking back at one of the issues that started this, I would like to point out that we have already defined an API here for non-strings: replace([1, 2, 1, 3], 1=>0, 0=>4)===[0, 2, 0, 3]. As such, I think there is only 1 plausible api and that we should implement
it since we are getting about 1 new issue per month on this topic.

@oscardssmith oscardssmith added the strings "Strings!" label Dec 31, 2020
@Seelengrab
Copy link
Contributor

Seelengrab commented Jan 1, 2021

The existing API for arrays makes sense though, since each replacement only modifies one index, not multiple as is the case for strings. There, replacements of the form "aa" => 'B' would span over multiple indices.

As mentioned above, limiting it to Char => Char replacements would keep the same behaviour, though it'd probably lead to "why can't I replace substrings"-style questions 🤷‍♂️

@oscardssmith
Copy link
Member

I think that the limited version of Char => Char is unambiguous so we should do that now. We'd still get questions, but the questions we'd get would be about the non-obvious case which is an improvement.

vtjnash added a commit to vtjnash/julia that referenced this issue Apr 14, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
JeffBezanson pushed a commit that referenced this issue Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
shirodkara pushed a commit to shirodkara/julia that referenced this issue Jun 9, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
johanmon pushed a commit to johanmon/julia that referenced this issue Jul 5, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

10 participants