Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perspective on std::string_view support #626

Closed
zeux opened this issue Sep 5, 2024 · 14 comments
Closed

Perspective on std::string_view support #626

zeux opened this issue Sep 5, 2024 · 14 comments

Comments

@zeux
Copy link
Owner

zeux commented Sep 5, 2024

I think it would be helpful to outline my current perspective on std::string_view; it is different from what it has been. Doing this in a separate issue to make this more accessible, but this was originally requested many years ago in #73; also a few related PRs have been filed over the years, #441, #495, #588.

This was originally requested before std::string_view became standard and common-place. At the time, my hope was that the library interface can eventually be reworked to switch from char* in the interface towards some sort of wrapper like pugi::string_arg. This wrapper could hide the complexity of dispatch and take std::string, std::string_view, const char*, and whatever else when constructed. Without this, support for string_view would be limited to either pre-release (when requested originally) or very recent (when requested again 😅) C++ standards; since these are not always universally available, I was concerned that a significant expansion of the interface (every function that takes const char* needs to be duplicated!) would be rarely used. For C++ standards that didn't have std::string_view, this was also unfortunate because std::string also exists but having 3 overloads would be even more onerous.

So the mental model at the time was "string_arg will solve this". However, string_arg would be an ABI-breaking change (also in some cases an API-breaking change for people who rely on implicit conversions to const char*, but hopefully this is rare!), so something that would happen after 1.x. But 1.x lasted for way longer than I thought because I'm averse to major compat breaking changes anyway, and this alone seemingly didn't justify it.

Meanwhile, people started increasingly adopting std::string_view, so what used to be a minor annoyance if you had a std::string (just call .c_str()!) became more of a hurdle. Exposing pugixml via an interface to any language that doesn't use null terminated strings was similarly problematic.

A lot of pugixml methods like child(const char*) are mere helpers, and can be reimplemented without the use of any method that takes a string, by iterating through children externally. Some, like set_name(), can't be reimplemented so they would require an allocation-and-copy at the call site, which is problematic. To address that, over the last two releases (v1.13 in 2022 and v1.14 in 2023) additional overloads for all setters that take const char*, size_t pair were added. This doesn't solve ergonomics, but at least it solves the fundamental inability to use string_view like types... the problem of course is you need to stop using some methods like child(const char*), which reduces the ease of use - which is valuable for pugixml.

At this point, C++17 is the default C++ standard version for current versions of gcc and clang. MSVC, notably, still defaults to C++14, so I don't think I am very late in recognizing that the world has moved on ;) but I do expect MSVC to switch to it at some point. At the same time, this keeps being likely the most annoying issue that people run into with pugixml from time to time; it's not a deal-breaker, but it is a problem. And the only way to fix this problem for good would be to add std::string_view support. Because of implicit conversions, this should automatically make it so that std::string is supported as an argument.

Needless to say, pugixml is still used in environments where C++17 is not practical to enable, and in environments where even STL itself is disabled. So this needs to be an optional feature. But whereas before I was pretty strongly on the side of not adding this feature until 2.0, I've now changed my mind - it is entirely possible that pugixml keeps releasing minor version updates that maintain compatibility for years to come, and it is time to make this ergonomic change even at the cost of an API surface increase.

Concretely, I would like to make this change in the next release as an opt-in feature. Meaning, if PUGIXML_STRING_VIEW is defined (and PUGIXML_NO_STL isn't), we #include string_view and define overloads for methods that take const char*. Methods that return const char* can stay as is; pugixml doesn't track (and will not track) length of strings internally, so converting this to string_view at call site is transparent and comes with no additional overhead that we would eliminate anyway.

The intention would be to enable this if C++17 is enabled automatically, but I'd like to have one release where this feature is available opt-in, to reduce the compatibility risk.

Right now pugixml exposes a large set of functions that take const char* arguments:

bool xml_attribute::set_name(const char_t* rhs);
bool xml_attribute::set_value(const char_t* rhs);
xml_attribute& xml_attribute::operator=(const char_t* rhs);

xml_node xml_node::child(const char_t* name) const;
xml_attribute xml_node::attribute(const char_t* name) const;
xml_node xml_node::next_sibling(const char_t* name) const;
xml_node xml_node::previous_sibling(const char_t* name) const;
xml_attribute xml_node::attribute(const char_t* name, xml_attribute& hint) const;
const char_t* xml_node::child_value(const char_t* name) const;
bool xml_node::set_name(const char_t* rhs);
bool xml_node::set_value(const char_t* rhs);
xml_attribute xml_node::append_attribute(const char_t* name);
xml_attribute xml_node::prepend_attribute(const char_t* name);
xml_attribute xml_node::insert_attribute_after(const char_t* name, const xml_attribute& attr);
xml_attribute xml_node::insert_attribute_before(const char_t* name, const xml_attribute& attr);
xml_node xml_node::append_child(const char_t* name);
xml_node xml_node::prepend_child(const char_t* name);
xml_node xml_node::insert_child_after(const char_t* name, const xml_node& node);
xml_node xml_node::insert_child_before(const char_t* name, const xml_node& node);
bool xml_node::remove_attribute(const char_t* name);
bool xml_node::remove_child(const char_t* name);
xml_node xml_node::find_child_by_attribute(const char_t* name, const char_t* attr_name, const char_t* attr_value) const;
xml_node xml_node::find_child_by_attribute(const char_t* attr_name, const char_t* attr_value) const;
xml_node xml_node::first_element_by_path(const char_t* path, char_t delimiter = '/') const;
xpath_node xml_node::select_node(const char_t* query, xpath_variable_set* variables = PUGIXML_NULL) const;
xpath_node_set xml_node::select_nodes(const char_t* query, xpath_variable_set* variables = PUGIXML_NULL) const;
xml_object_range<xml_named_node_iterator> xml_node::children(const char_t* name) const;

bool xml_text::set(const char_t* rhs);
xml_text& xml_text::operator=(const char_t* rhs);

bool xpath_variable::set(const char_t* value);

xpath_variable* xpath_variable_set::add(const char_t* name, xpath_value_type type);
bool xpath_variable_set::set(const char_t* name, bool value);
bool xpath_variable_set::set(const char_t* name, double value);
bool xpath_variable_set::set(const char_t* name, const char_t* value);
bool xpath_variable_set::set(const char_t* name, const xpath_node_set& value);
xpath_variable* xpath_variable_set::get(const char_t* name);
const xpath_variable* xpath_variable_set::get(const char_t* name) const;

xpath_query::xpath_query(const char_t* query, xpath_variable_set* variables = PUGIXML_NULL);

std::basic_string<char> as_utf8(const wchar_t* str);
std::basic_string<wchar_t> as_wide(const char* str);

That's 40 functions; I have omitted xml_document::load/save for brevity. This is not a small number - pugixml exposes ~346 function symbols right now but that includes service functions like constructors, destructors and comparison operators, removing these gets us to ~250.

So initially I would remove XPath from this set; also, xml_node::childen(const char*) is unsafe to implement with string view because it returns an object that keeps a reference to the string, and using string_view here has lifetime implications. child_value is a semi-legacy function because it's replaced by text(), just not deprecated; find_child_by_attribute and first_element_by_path are fairly special case as well. as_utf8/as_wide are probably not as useful either, they work with std::strings anyway so the caller could eat an allocation cost and convert the input to std::string.

Pruning this, we end up with this set that would be cloned with overloads for string_view_t:

bool xml_attribute::set_name(const char_t* rhs);
bool xml_attribute::set_value(const char_t* rhs);
xml_attribute& xml_attribute::operator=(const char_t* rhs);

xml_node xml_node::child(const char_t* name) const;
xml_attribute xml_node::attribute(const char_t* name) const;
xml_node xml_node::next_sibling(const char_t* name) const;
xml_node xml_node::previous_sibling(const char_t* name) const;
xml_attribute xml_node::attribute(const char_t* name, xml_attribute& hint) const;
bool xml_node::set_name(const char_t* rhs);
bool xml_node::set_value(const char_t* rhs);
xml_attribute xml_node::append_attribute(const char_t* name);
xml_attribute xml_node::prepend_attribute(const char_t* name);
xml_attribute xml_node::insert_attribute_after(const char_t* name, const xml_attribute& attr);
xml_attribute xml_node::insert_attribute_before(const char_t* name, const xml_attribute& attr);
xml_node xml_node::append_child(const char_t* name);
xml_node xml_node::prepend_child(const char_t* name);
xml_node xml_node::insert_child_after(const char_t* name, const xml_node& node);
xml_node xml_node::insert_child_before(const char_t* name, const xml_node& node);
bool xml_node::remove_attribute(const char_t* name);
bool xml_node::remove_child(const char_t* name);

bool xml_text::set(const char_t* rhs);
xml_text& xml_text::operator=(const char_t* rhs);

That's 22 functions; a ~10% increase in the API surface is probably reasonable for something like this, given that it's been a pain point for a while and no clean alternatives exist short of a breaking change that I don't want to do.

Given the reasoning above, at this point I feel like I would be fine with accepting a clean, minimal, tested PR that adds these. For a new contributor who wants to try this, it would make sense to split into two PRs, one PR that adds string_view define, includes, typedefs for wchar_mode, and exposes this, for example, just as overloads for set_name/set_value/set that already have the necessary underlying code, and a separate PR, once the first is merged, that adds the rest. This would streamline the PR feedback necessary here. It is critical that the PRs are high quality, either initially or can get there quickly.

I currently don't have time to dedicate to the actual implementation myself but hopefully it should not be onerous or complicated; if nobody picks it up, it might be that I will, but it also might be that the next release will ship without this.

Feedback on this (including "this is a bad idea") is welcome, of course. I reserve the right to change my mind about this if more feedback is presented :)

@zeux zeux pinned this issue Sep 6, 2024
@zeux zeux unpinned this issue Sep 6, 2024
@brandl-muc
Copy link
Contributor

brandl-muc commented Sep 9, 2024

I'd suggest adding a feature check like so: (probably the names don't match pugi's naming schemes)

#if defined(__cpp_lib_string_view) || __cplusplus >= 201703L || defined(_MSVC_LANG) && _MSVC_LANG >= 201703L
#   define PUGI_STRING_VIEW_SUPPORTED 1
#endif

Then we can do

#if defined(PUGI_ENABLE_STRING_VIEW) && !defined(PUGIXML_NO_STL)
#   ifdef PUGI_STRING_VIEW_SUPPORTED
#       define PUGI_HAS_STRING_VIEW
#   else
#       warning "Support for std::string_view was requested but is not supported by your C++ standard"
#   endif
#endif

With the following release the feature check can be repurposed to enable support automatically. (and probably replace the opt-in with an opt-out macro PUGI_NO_STRING_VIEW?)

@brandl-muc
Copy link
Contributor

I'm definitely interested in contributing, I just don't know if I can muster the time. So if anybody else wants to give it a try, don't hold back. (probably let me know to avoid duplicate work)

@brandl-muc
Copy link
Contributor

@zeux Quick question, the static library could be compiled with different settings than the client which is including the header and linking the lib. One could implement the new functions inline header-only and delegate to the (const char_t *, size_t) overload.
Or one could implement them like the other overloads in the cpp if string_views are supported. This would however always add the symbols and simply hide them from the compiler of the client.
Which would you prefer? (or maybe there is another path I'm not seeing?)

@zeux
Copy link
Owner Author

zeux commented Sep 9, 2024

I think the implementation should be in the .cpp. This is not different from other defines like PUGIXML_NO_STL et al that should be consistent between compilation and uses. In the future, this would just be automatically determined based on the supported standard version anyway.

If the configurations mismatch now or in the future (eg in the future, you could compile pugixml.cpp with -std=c++11 but the rest of the code with -std=c++17 for some reason), this should not present issues if you don't use the newly added overloads, and if you do use them you should just get a linker error, so I don't think the mismatch here is particularly problematic either.

@brandl-muc
Copy link
Contributor

Thanks for your quick reply, I created #627 as draft to get quick feedback on the implementation.
Tests are still missing, I will add them soon and then publish the PR.

@dantargz
Copy link
Contributor

dantargz commented Oct 23, 2024

I am in the process of adding the remaining functions, but I wanted to double check the intended behavior when the string_view or char*+size contains a null in the middle (e.g. a string view constructed from std::string("AB\0C", 4)).

  • When invoking a setter (set_name, set_value, set, operator=), should the pugi internal datastructures contain just the null terminated string "AB", or the full string "AB\0C"? If I'm reading the current code correctly, I think it currently stores the full string "AB\0C"
  • When invoking a query function (child(), attribute(), *_sibling(), etc.), should it match against the full string view, or only the prefix up to the first null -- e.g. should string_view("AB\0C", 4) count nodes/attributes named "AB" as a match?

My sense is that the string_view APIs and the char*+size APIs should probably act as if they were truncated to their first null terminator for consistency with the rest of the APIs and to match the general invariants for the code -- so long as that behavior can be implemented without sacrificing performance.

@dantargz
Copy link
Contributor

When invoking a setter, having any flow that retains the "extra" data after the first null may be a footgun for users -- users may write code assuming that their data-with-null-in-the-middle will always round trip through pugi, but then one day they invoke some action that causes pugi to copy something up to its first null under the hood and the end user's code breaks.

When invoking a query function with a size-delimited input, the caller is probably expecting to only match things whose name has the same length as the provided size: so if the data contains a "middle" null, it should probably not be considered equal to anything stored in pugi.

@zeux
Copy link
Owner Author

zeux commented Oct 23, 2024

Existing setters with size store the full span but since the length isn’t stored, the user visible behavior is the same (modulo a slightly higher memory consumption). Note that XML has no valid way to store null bytes, so it shouldn’t be expected that they are preserved; not searching for null is thus faster.

For lookups, ideally the lookups would not return a node with name “n” when looking for “n\0pe”. This would prevent accidental exploits and would be the same as if string_view was compared with the value returned by name()/value() as a char*.

Note that per above that means adding a node with internal nul and searching for it wouldn’t find it, but that’s probably preferable to alternatives. If this becomes an issue, setters could begin asserting the lack of interior nul bytes which is probably benign wrt compatibility.

@zeux
Copy link
Owner Author

zeux commented Oct 24, 2024

This is now implemented on master (thanks @brandl-muc and @dantargz!); the following code compiles and works when PUGIXML_STRING_VIEW define is set and latest clang/gcc are used (also demonstrating transparent support for std::string):

#include "pugixml.hpp"
#include <iostream>

using namespace std::literals;

int main() {
    
    pugi::xml_document doc;
    pugi::xml_node node = doc.append_child("test"sv);
    node.text() = "hello"s;
    doc.print(std::cout);

    return 0;
}

I'll do a little more local testing and see if anything else needs to be adjusted before closing this and updating or closing #73 (might keep #73 open until this define is on by default, we'll see).

@dantargz
Copy link
Contributor

The change to pugixmlconfig.hpp accidentally broke the VERSION variable in the Makefile, I stumbled into this when attempting to update documentation for the new overloads. I have a draft PR in my fork with documentation updates and a fix for the Makefile, let me know if you would like me to open a clean PR just with the Makefile change (or a different Makefile fix): dantargz#3

@zeux
Copy link
Owner Author

zeux commented Oct 26, 2024

Ah, fun. It's fine to combine them in one PR if that's easier but either way works.

@zeux
Copy link
Owner Author

zeux commented Oct 29, 2024

I think we're all set with this; I've tested this on gcc/Linux, clang/Linux, clang/macOS and msvc/Windows and everything seems to function as expected, as long as the C++ standard is set to 17 (alongside the define). Removing the define will have to come with some CMake changes, and probably changes to VS2022 vcxproj and perhaps Xcode project if that even works (as it hasn't been touched in 14 years...).

I need to think about the release timing; normally it would happen in October but there aren't too many other changes and I'd probably want to wait a bit after these changes landed. But either way this can be closed now.

@zeux
Copy link
Owner Author

zeux commented Oct 29, 2024

The only addendum is that the change ended up being more straightforward (in the sense that the only issues I've discovered in testing were of the "I'd think it would be enabled, why isn't it?" variety) than I expected; given that the release so far only contains a bunch of warning/build fixes, it may make sense to delay the release by an extra month or two and remove the define sooner. I'll think about whether this makes more sense.

@zeux
Copy link
Owner Author

zeux commented Nov 4, 2024

I've gone ahead and merged a change that enables string_view by default whenever C++17 is detected, as this, on further reflection, is probably as safe as opt-in. The only actually potentially problematic part is enablement of C++17 itself, but this was bound to happen anyway, so we might as well catch any issues here early.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants