unicode/utf16: add AppendRune #51896

Closed
qmuntal opened this issue Mar 23, 2022 · 7 comments

Comments

@qmuntal
Member

qmuntal commented Mar 23, 2022

Background

utf16.Encode always allocates a new []uint16 large enough to hold the UTF-16 encoded sequence. That is ergonomic, but it forces one allocation per call.

Proposal

Update, May 27 2022: The proposed API has changed (see #51896 (comment)) to:

// AppendRune appends the UTF-16 encoding of the Unicode code point r
// to the end of p and returns the extended buffer. If the rune is not
// a valid Unicode code point, it appends the encoding of U+FFFD.
func AppendRune(p []uint16, r rune) []uint16
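
A minimal usage sketch of the signature above, assuming Go 1.20 or later (where utf16.AppendRune is available). The helper name encodeRunes is hypothetical and only used for illustration; it is not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeRunes is a hypothetical helper (not part of unicode/utf16).
// It appends the UTF-16 encoding of each rune in rs to buf and
// returns the extended slice.
func encodeRunes(buf []uint16, rs ...rune) []uint16 {
	for _, r := range rs {
		buf = utf16.AppendRune(buf, r)
	}
	return buf
}

func main() {
	// A capacity of 8 is an arbitrary choice for this sketch; append
	// only reallocates if the encoded output outgrows it.
	buf := make([]uint16, 0, 8)

	// 'a' needs one code unit, U+1F600 needs a surrogate pair
	// (0xD83D, 0xDE00), and -1 is invalid, so U+FFFD is appended.
	buf = encodeRunes(buf, 'a', '\U0001F600', -1)
	fmt.Printf("%#x\n", buf)
}
```

Because the result is returned like the built-in append, the caller decides where the backing storage comes from and can reuse it across calls.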

Update, May 27 2022: The following functions have been superseded by the AppendRune function above.

For cases where the extra allocation matters, unicode/utf16 could provide an additional encoding function which accepts a pre-allocated (and large enough) backing slice.

The signature would look like this:

// EncodeInto writes into a (which must be large enough) the UTF-16 encoding
// of the Unicode code point sequence s.
func EncodeInto(a []uint16, s []rune) []uint16

Optionally, in order to know the minimum size of the backing array, unicode/utf16 could provide an additional function which counts the number of code units in a code point sequence.

It would look something like this:

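// Count returns the number of UTF-16 code units needed to encode s.
// Runes that need a surrogate pair count as two code units; all other
// runes, including invalid ones (encoded as U+FFFD), count as one.
func Count(s []rune) int {
	n := len(s)
	for _, v := range s {
		if surrSelf <= v && v <= maxRune {
			n++
		}
	}
	return n
}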

It is worth noting that utf16.Encode could then be implemented using utf16.Count and utf16.EncodeInto.
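
A rough sketch of that composition, written outside the package. The functions count, encodeInto and encode are hypothetical stand-ins for the proposed Count and EncodeInto (neither exists in unicode/utf16), the constants mirror the package's internal values, and matchesStdlib only exists to compare the result against utf16.Encode.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

const (
	surrSelf        = 0x10000      // first code point that needs a surrogate pair
	maxRune         = '\U0010FFFF' // largest valid Unicode code point
	replacementChar = '\uFFFD'     // emitted for invalid code points
)

// count is a stand-in for the proposed Count: the number of UTF-16
// code units needed to encode s.
func count(s []rune) int {
	n := len(s)
	for _, r := range s {
		if surrSelf <= r && r <= maxRune {
			n++
		}
	}
	return n
}

// encodeInto is a stand-in for the proposed EncodeInto. It writes the
// encoding of s into a and returns the written prefix. It panics with
// an index-out-of-range error if a is too small.
func encodeInto(a []uint16, s []rune) []uint16 {
	n := 0
	for _, r := range s {
		switch {
		case 0 <= r && r < surrSelf && !utf16.IsSurrogate(r):
			a[n] = uint16(r)
			n++
		case surrSelf <= r && r <= maxRune:
			r1, r2 := utf16.EncodeRune(r)
			a[n], a[n+1] = uint16(r1), uint16(r2)
			n += 2
		default: // negative, lone surrogate, or above maxRune
			a[n] = replacementChar
			n++
		}
	}
	return a[:n]
}

// encode sizes the buffer with count and fills it with encodeInto.
func encode(s []rune) []uint16 {
	return encodeInto(make([]uint16, count(s)), s)
}

// matchesStdlib reports whether encode agrees with utf16.Encode for s.
func matchesStdlib(s []rune) bool {
	got, want := encode(s), utf16.Encode(s)
	if len(got) != len(want) {
		return false
	}
	for i := range want {
		if got[i] != want[i] {
			return false
		}
	}
	return true
}

func main() {
	s := []rune{'a', '\U0001F600', 0xD800, -1, maxRune + 1}
	fmt.Println(count(s), matchesStdlib(s))
}
```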

Examples

My specific use case is to allow x/sys/windows/mkwinsyscall to generate syscall wrappers which accept string arguments without allocating, at least for short strings. Check this comment for more context.

If I had utf16.EncodeInto I could implement a non-allocating wrapper as follows:

func Foo(s string) {
	p := []rune(s + "\x00")
	l := utf16.Count(p)
	var a []uint16
	if l < 32 {
		a = make([]uint16, 32)
	} else {
		a = make([]uint16, l)
	}
	a = utf16.EncodeInto(a, p)
	syscall.Syscall6(fnAddr(), 6, 0, uintptr(unsafe.Pointer(&a[0])), 0, 0, 0, 0)
	return
}
@gopherbot gopherbot added this to the Proposal milestone Mar 23, 2022
@randall77
Contributor

Maybe instead we should have Append versions, like #50601 and #51644?
If that were the case, maybe Count wouldn't be necessary? You could rely on append's growth to size the buffer correctly (across multiple calls to Encode).

@qmuntal
Member Author

qmuntal commented Mar 24, 2022

Maybe instead we should have Append versions, like #50601 and #51644? If that were the case, maybe Count wouldn't be necessary? You could rely on append's growth to size the buffer correctly (across multiple calls to Encode).

I like your suggestion a lot. It would nicely match utf8.AppendRune, and there is no need for Count: setting an initial buffer capacity is enough to avoid some allocations.

My syscall example would look like this:

func Foo(s string) {
	p := []rune(s + "\x00")
	a := make([]uint16, 0, 32)
	for _, r := range p {
		a = utf16.AppendRune(a, r)
	}
	syscall.Syscall6(procBCryptGetProperty.Addr(), 6, 0, uintptr(unsafe.Pointer(&a[0])), 0, 0, 0, 0)
	return
}
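
The []rune conversion in the example above still allocates for non-trivial strings. One possible variation, sketched below, ranges over the string directly, which decodes runes without building a []rune first. The helper name appendUTF16Z is hypothetical, utf16.AppendRune requires Go 1.20 or later, and whether the 32-element array really stays on the stack depends on escape analysis at the call site. Embedded NUL characters in s are not rejected here.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// appendUTF16Z is a hypothetical helper. It appends the UTF-16
// encoding of s, followed by a terminating NUL code unit, to buf and
// returns the extended slice. Invalid UTF-8 in s is decoded by the
// range loop as U+FFFD.
func appendUTF16Z(buf []uint16, s string) []uint16 {
	for _, r := range s {
		buf = utf16.AppendRune(buf, r)
	}
	return append(buf, 0)
}

func main() {
	// Fixed-size backing array; 32 matches the capacity used in the
	// example above. Longer inputs make append allocate a larger buffer.
	var backing [32]uint16
	a := appendUTF16Z(backing[:0], "héllo")
	fmt.Println(len(a), cap(a))
}
```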

@rsc rsc changed the title from "proposal: unicode/utf16: encoding without allocations" to "proposal: unicode/utf16: add AppendRune" May 25, 2022
@gopherbot
Contributor

Change https://go.dev/cl/409054 mentions this issue: unicode/utf16: add AppendRune

@qmuntal
Member Author

qmuntal commented May 27, 2022

Prototyped the proposal in CL 409054.

@rsc
Contributor

rsc commented Jun 1, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jun 8, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jun 15, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title from "proposal: unicode/utf16: add AppendRune" to "unicode/utf16: add AppendRune" Jun 15, 2022
@rsc rsc modified the milestones: Proposal, Backlog Jun 15, 2022
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@golang golang locked and limited conversation to collaborators Aug 19, 2023
@rsc rsc removed this from Proposals Aug 30, 2023

4 participants