printf %ls support - C++ standard compatibility #572

Closed
vgalka-sl opened this issue Sep 24, 2017 · 16 comments

@vgalka-sl
Contributor

vgalka-sl commented Sep 24, 2017

Hi,

The current library version does not allow mixing wchar_t* arguments into char* format strings. The following code does not compile (at least not on MS VS2017):

fmt::printf("%ls", L"foo");

However, according to the C++11 standard, it should be valid for printf-like functions.

C++11 includes the C library as described by the 1999 ISO C standard and its Technical Corrigenda 1, 2 and 3 (ISO/IEC 9899:1999 and ISO/IEC 9899:1999/Cor.1,2,3), plus <uchar.h> (as by ISO/IEC 19769:2004).

Looking into ISO/IEC 9899:TC2, it describes the %ls format specifier as follows (page 279):

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte).
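
As a rough illustration of the conversion the quoted paragraph describes, here is a minimal sketch that converts a wide string one character at a time with std::wcrtomb, starting from a zero-initialized mbstate_t. The helper name wide_to_multibyte is hypothetical and not part of fmt; the result depends on the current C locale, and throwing on an unconvertible character is an assumption of this sketch, not something the standard text prescribes.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper (not fmt API): convert a null-terminated wide string
// to a multibyte string in the current C locale, following the %ls wording.
std::string wide_to_multibyte(const wchar_t* ws) {
    std::mbstate_t state = std::mbstate_t();  // zero-initialized conversion state
    std::string out;
    char buf[MB_LEN_MAX];
    for (;;) {
        const wchar_t wc = *ws++;
        const std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n == static_cast<std::size_t>(-1)) {
            // Assumption of this sketch: report unconvertible input by throwing.
            throw std::runtime_error("wide character not representable in the current locale");
        }
        if (wc == L'\0') {
            // The terminator is converted too (it may emit a shift sequence),
            // but the final null byte itself is not written.
            out.append(buf, n - 1);
            break;
        }
        out.append(buf, n);
    }
    return out;
}
```

In the default "C" locale only basic characters are guaranteed to convert, so a caller that expects non-ASCII output would first have to select a suitable locale with std::setlocale.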

It would be nice to have this standard compatibility :-)

Best regards,
Vasili

@vitaut
Contributor

vitaut commented Sep 27, 2017

Thanks for raising this, I agree that it should be supported. Would you be willing to contribute a fix by any chance?

@vgalka-sl
Contributor Author

I haven't dived into the internals yet. Made a workaround for now...
But I'll try when I find time :-)

@pulsa

pulsa commented Dec 23, 2017

This sounds very useful. I started looking at how to do it.
My ideal implementation would ignore the 'l' in "%ls" and automatically convert all string arguments to match the format string encoding, for both (s)printf() and format(). This simplifies the format string requirements and avoids a source of run-time exceptions, which is important because...

If narrow format strings can accept wide arguments (and vice versa), we lose compile-time safety checks like #606. But if we can always do the right thing with the input, this is not a problem.

However, conversion errors can happen if the input is invalid UTF-8, or if it contains characters that aren't in the output code set (non-Unicode only). We could throw an exception, replace with some other character or truncate the string. Should this be configurable somehow?

One more complication: setlocale() can't use UTF-8 on Windows/MSVC, so std::wcrtomb/mbrtowc() won't work here. I'm guessing most users want UTF-8, but console output uses the system code page by default so if they do not explicitly set up for UTF-8 they will get mojibake or worse when writing non-ASCII. This points to another new option: use the locale setting or force UTF-8. Or we could say fmtlib always uses UTF-8 and if you want some other char encoding you have to convert it yourself. I would be happy with this restriction.

BasicWriter::write_str() currently uses std::uninitialized_copy() to convert from char to wchar_t. This only works if the input is ASCII. I think this would have to change.
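
To make the ASCII-only limitation concrete, here is a small sketch of element-wise widening. The function widen_bytewise is a hypothetical stand-in for the copy described above, not fmt's actual code: each char becomes one wchar_t, so a multi-byte UTF-8 sequence turns into several unrelated wide characters instead of one.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical stand-in for an element-wise char -> wchar_t copy.
// Every byte is widened on its own; no decoding takes place.
std::wstring widen_bytewise(const std::string& s) {
    std::wstring out;
    std::copy(s.begin(), s.end(), std::back_inserter(out));
    return out;
}
```

For example, widen_bytewise("abc") equals L"abc", but the two-byte UTF-8 encoding of U+00E9 ("\xC3\xA9") yields two wide characters rather than the single character U+00E9.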

I'm assuming wchar_t* always means UTF-16 on Windows and UTF-32 elsewhere. Hope that's OK.

@vitaut, are you ready to dive into Unicode madness? 😃

@vitaut
Contributor

vitaut commented Dec 24, 2017

> We could throw an exception, replace with some other character or truncate the string. Should this be configurable somehow?

I'd go with an exception by default, but it would be nice to make this configurable. The std branch introduced error handlers to make error behavior configurable, maybe they can be used here.
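
One way to picture the three behaviors under discussion (throw, replace, truncate) is a UTF-8 encoder that takes the policy as a parameter. This is only a sketch under stated assumptions: the enum on_error and the function encode_utf8 are hypothetical names, it works on char32_t code points to sidestep the wchar_t width question, and it is not how fmt's error handlers are actually wired up.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical policy names; illustrative only, not fmt API.
enum class on_error { throw_exception, replace, truncate };

// Encode Unicode code points as UTF-8, handling invalid input per `policy`.
std::string encode_utf8(const std::u32string& in,
                        on_error policy = on_error::throw_exception) {
    std::string out;
    for (char32_t cp : in) {
        // Surrogates and values above U+10FFFF are not valid code points.
        const bool invalid = cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF);
        if (invalid) {
            if (policy == on_error::throw_exception)
                throw std::invalid_argument("invalid code point");
            if (policy == on_error::truncate)
                break;    // keep what was converted so far
            cp = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Defaulting the parameter to throw_exception mirrors the "exception by default" preference, while callers that prefer lossy output can opt into replace or truncate explicitly.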

> This points to another new option: use the locale setting or force UTF-8.

fmt::*printf should probably use the locale setting for compatibility with system printf. For the new format functions I'd go with UTF-8 since it has become the de-facto standard pretty much everywhere.

> BasicWriter::write_str() currently uses std::uninitialized_copy() to convert from char to wchar_t. This only works if the input is ASCII. I think this would have to change.

Yes, this is a pre-#606 artefact.

> I'm assuming wchar_t* always means UTF-16 on Windows and UTF-32 elsewhere. Hope that's OK.

Why is it necessary? If we use wcrtomb then it shouldn't matter.

> @vitaut, are you ready to dive into Unicode madness?

Yes, sounds exciting =).

@pulsa

pulsa commented Dec 29, 2017

I have a very rough proof of concept using wcrtomb/mbrtowc which works great on Linux, but with MSVC we get ANSI code pages and UCS-2 instead of UTF-8 and UTF-16. I'm working on an alternative method for proper Unicode support, using nowide::utf.

This work overlaps with #628, but if I tried to handle that at the same time I would never finish anything, so I'm ignoring it for now. This modern C++ stuff is still new to me.

@aetchevarne

Just commenting that Boost.Locale and https://tzlaine.github.io/text/doc/html/index.html provide tools to work with Unicode; maybe they are useful for fmt?

@vitaut
Contributor

vitaut commented Jul 1, 2018

Thanks, @aetchevarne.

@matt77hias
Contributor

matt77hias commented Sep 16, 2018

Is there a cheap workaround to use wchar_t*/std::wstring/std::wstring_view arguments for std::string_view format strings (similar to the use of %ls in std::printf)? Or the dual: using char*/std::string/std::string_view arguments for std::wstring_view format strings (similar to the use of %s in std::wprintf). Converting std::wstring to std::string and vice versa on the fly is pretty expensive (i.e. it requires allocations).

On Windows these ANSI<>UTF-16 differences are really an issue. std::filesystem::path for instance uses wchar_t (UTF-16) and is quite useful to output in addition to errors while parsing files.

@vitaut
Contributor

vitaut commented Sep 16, 2018

You could use utf16_to_utf8 which will not do dynamic allocations for strings smaller than inline_buffer_size (500 chars).

@matt77hias
Contributor

matt77hias commented Sep 17, 2018

But does that mean that I need to link against Microsoft's complete C++ Rest SDK?

_ASYNCRTIMP std::string __cdecl utf16_to_utf8(const utf16string &w);

How does this returned string's content outlive the call to utf16_to_utf8?


As a side note, sometimes I need to perform conversions between std::string<>std::wstring (e.g., reading a std::string from a file and using it as a filename for another file), then I use the following:

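#include <string>
#include <AtlBase.h>
#include <atlconv.h>

// Narrow -> wide via ATL, using ATL's default conversion code page
// (normally the ANSI code page).
[[nodiscard]]
std::wstring StringToWString(const std::string& str) {
    // The CA2W temporary dies at the end of the statement, so copy its result.
    return std::wstring(CA2W(str.c_str()));
}

// Wide -> narrow via ATL; characters outside the code page may be lost.
[[nodiscard]]
std::string WStringToString(const std::wstring& str) {
    return std::string(CW2A(str.c_str()));
}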

@matt77hias
Contributor

matt77hias commented Sep 17, 2018

CA2W and CW2A use fixed-size buffers of length 128. This is, however, in most cases insufficient for a complete file path while developing in Visual Studio. The MAX_PATH macro is set to 260.

In the above example, however, allocation will always happen, since the CA2W and CW2A temporaries are destroyed upon returning. Alternatively, one can keep the CA2W and CW2A objects alive while using fmt, but this is not very transparent: the boilerplate has to be written at every occurrence, and it is not always needed, for example when redefining assert using fmt (NDEBUG).

A more transparent way consists of partially specializing fmt::formatter, but fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx) returns the format and destroys all local buffers upon doing so.

Or alternatively why does fmt not just perform the char<>wchar_t conversion on the fly (Windows: WideCharToMultiByte and MultiByteToWideChar)? There is always the possibility of data loss, but fmt's output is pretty visible to the programmer ;-)

@vitaut
Contributor

vitaut commented Sep 17, 2018

> But does that mean that I need to link against Microsoft's complete C++ Rest SDK?

No, I was talking about fmt's utf16_to_utf8: https://github.com/fmtlib/fmt/blob/master/include/fmt/format.h#L1131. It gives you a temporary buffer and only allocates on large strings.

@matt77hias
Contributor

matt77hias commented Sep 17, 2018

> No, I was talking about fmt's utf16_to_utf8: https://github.com/fmtlib/fmt/blob/master/include/fmt/format.h#L1131. It gives you a temporary buffer and only allocates on large strings.

Ah, ok that seems like a more appropriate choice. Seems like ATL's CA2W and CW2A, but with exceptions.

> CA2W and CW2A use fixed-size buffers of length 128. This is, however, in most cases insufficient for a complete file path while developing in Visual Studio. The MAX_PATH macro is set to 260.

Correction: the size of the buffer is a template argument set to 128 by default. So I can increase this to 256 or 512 as well.


So can I call these utf16_to_utf8/utf8_to_utf16 inside a partial specialization of fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx)?

Sample: Godbolt
Update

Without knowing much about the internals of fmt, this could only work if fmt::format_to is not lazy and directly evaluates the arguments (i.e. no capture beyond the lifetime of the fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx) call).

Thanks for the support.

@vitaut
Contributor

vitaut commented Sep 19, 2018

> So can I call these utf16_to_utf8/utf8_to_utf16 inside a partial specialization of fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx)?

Sure, if it compiles =).

You can safely pass temporaries to fmt::format_to. It is not "lazy".

@vitaut
Contributor

vitaut commented Mar 15, 2019

Looks like there is not enough interest in this feature so closing, but PRs are still welcome.

@vitaut vitaut closed this as completed Mar 15, 2019
@jovibor

jovibor commented Jan 5, 2020

Could you please reconsider implementing this?
This is a very useful feature, and it's very inconvenient to first convert char*<->wchar_t* to be able to use it with format.
Thanks.
