-
Notifications
You must be signed in to change notification settings - Fork 736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perspective on std::string_view support #626
Comments
I'd suggest adding a feature check like so: (probably the names don't match pugi's naming schemes)
Then we can do
With the following release the feature check can be repurposed to enable support automatically. (and probably replace the opt-in with an opt-out macro |
I'm definitely interested in contributing, I just don't know if I can muster the time. So if anybody else wants to give it a try, don't hold back. (probably let me know to avoid duplicate work) |
@zeux Quick question, the static library could be compiled with different settings than the client which is including the header and linking the lib. One could implement the new functions inline header-only and delegate to the |
I think the implementation should be in the If the configurations mismatch now or in the future (eg in the future, you could compile pugixml.cpp with |
Thanks for your quick reply, I created #627 as draft to get quick feedback on the implementation. |
I am in the process of adding the remaining functions, but I wanted to double check the intended behavior when the string_view or char*+size contains a null in the middle (e.g. a string view constructed from
My sense is that the string_view APIs and the char*+size APIs should probably act as if they were truncated to their first null terminator for consistency with the rest of the APIs and to match the general invariants for the code -- so long as that behavior can be implemented without sacrificing performance. |
When invoking a setter, having any flow that retains the "extra" data after the first null may be a footgun for users -- users may write code assuming that their data-with-null-in-the-middle will always round trip through pugi, but then one day they invoke some action that causes pugi to copy something up to its first null under the hood and the end user's code breaks. When invoking a query function with a size-delimited input, the caller is probably expecting to only match things whose name has the same length as the provided size: so if the data contains a "middle" null, it should probably not be considered equal to anything stored in pugi. |
Existing setters with size store the full span but since the length isn’t stored, the user visible behavior is the same (modulo a slightly higher memory consumption). Note that XML has no valid way to store null bytes, so it shouldn’t be expected that they are preserved; not searching for null is thus faster. For lookups, ideally the lookups would not return a node with name “n” when looking for “n\0pe”. This would prevent accidental exploits and would be the same as if string_view was compared with the value returned by name()/value() as a char*. Note that per above that means adding a node with internal nul and searching for it wouldn’t find it, but that’s probably preferable to alternatives. If this becomes an issue, setters could begin asserting the lack of interior nul bytes which is probably benign wrt compatibility. |
This is now implemented on master (thanks @brandl-muc and @dantargz!); the following code compiles and works when #include "pugixml.hpp"
#include <iostream>
using namespace std::literals;
int main() {
pugi::xml_document doc;
pugi::xml_node node = doc.append_child("test"sv);
node.text() = "hello"s;
doc.print(std::cout);
return 0;
} I'll do a little more local testing and see if anything else needs to be adjusted before closing this and updating or closing #73 (might keep #73 open until this define is on by default, we'll see). |
The change to pugixmlconfig.hpp accidentally broke the VERSION variable in the Makefile, I stumbled into this when attempting to update documentation for the new overloads. I have a draft PR in my fork with documentation updates and a fix for the Makefile, let me know if you would like me to open a clean PR just with the Makefile change (or a different Makefile fix): dantargz#3 |
Ah, fun. It's fine to combine them in one PR if that's easier but either way works. |
I think we're all set with this; I've tested this on gcc/Linux, clang/Linux, clang/macOS and msvc/Windows and everything seems to function as expected, as long as the C++ standard is set to 17 (alongside the define). Removing the define will have to come with some CMake changes, and probably changes to VS2022 vcxproj and perhaps Xcode project if that even works (as it hasn't been touched in 14 years...). I need to think about the release timing; normally it would happen in October but there aren't too many other changes and I'd probably want to wait a bit after these changes landed. But either way this can be closed now. |
The only addendum is that the change ended up being more straightforward (in the sense that the only issues I've discovered in testing were of the "I'd think it would be enabled, why isn't it?" variety) than I expected; given that the release so far only contains a bunch of warning/build fixes, it may make sense to delay the release by an extra month or two and remove the define sooner. I'll think about whether this makes more sense. |
I've gone ahead and merged a change that enables string_view by default whenever C++17 is detected, as this, on further reflection, is probably as safe as opt-in. The only actually potentially problematic part is enablement of C++17 itself, but this was bound to happen anyway, so we might as well catch any issues here early. |
I think it would be helpful to outline my current perspective on
std::string_view
; it is different from what it has been. Doing this in a separate issue to make this more accessible, but this was originally requested many years ago in #73; also a few related PRs have been filed over the years, #441, #495, #588.This was originally requested before
std::string_view
became standard and common-place. At the time, my hope was that the library interface can eventually be reworked to switch fromchar*
in the interface towards some sort of wrapper likepugi::string_arg
. This wrapper could hide the complexity of dispatch and takestd::string
,std::string_view
,const char*
, and whatever else when constructed. Without this, support for string_view would be limited to either pre-release (when requested originally) or very recent (when requested again 😅) C++ standards; since these are not always universally available, I was concerned that a significant expansion of the interface (every function that takesconst char*
needs to be duplicated!) would be rarely used. For C++ standards that didn't havestd::string_view
, this was also unfortunate becausestd::string
also exists but having 3 overloads would be even more onerous.So the mental model at the time was "string_arg will solve this". However,
string_arg
would be an ABI-breaking change (also in some cases an API-breaking change for people who rely on implicit conversions toconst char*
, but hopefully this is rare!), so something that would happen after 1.x. But 1.x lasted for way longer than I thought because I'm averse to major compat breaking changes anyway, and this alone seemingly didn't justify it.Meanwhile, people started increasingly adopting
std::string_view
, so what used to be a minor annoyance if you had astd::string
(just call.c_str()
!) became more of a hurdle. Exposing pugixml via an interface to any language that doesn't use null terminated strings was similarly problematic.A lot of pugixml methods like
child(const char*)
are mere helpers, and can be reimplemented without the use of any method that takes a string, by iterating through children externally. Some, likeset_name()
, can't be reimplemented so they would require an allocation-and-copy at the call site, which is problematic. To address that, over the last two releases (v1.13 in 2022 and v1.14 in 2023) additional overloads for all setters that takeconst char*, size_t
pair were added. This doesn't solve ergonomics, but at least it solves the fundamental inability to use string_view like types... the problem of course is you need to stop using some methods likechild(const char*)
, which reduces the ease of use - which is valuable for pugixml.At this point, C++17 is the default C++ standard version for current versions of gcc and clang. MSVC, notably, still defaults to C++14, so I don't think I am very late in recognizing that the world has moved on ;) but I do expect MSVC to switch to it at some point. At the same time, this keeps being likely the most annoying issue that people run into with pugixml from time to time; it's not a deal-breaker, but it is a problem. And the only way to fix this problem for good would be to add
std::string_view
support. Because of implicit conversions, this should automatically make it so thatstd::string
is supported as an argument.Needless to say, pugixml is still used in environments where C++17 is not practical to enable, and in environments where even STL itself is disabled. So this needs to be an optional feature. But whereas before I was pretty strongly on the side of not adding this feature until 2.0, I've now changed my mind - it is entirely possible that pugixml keeps releasing minor version updates that maintain compatibility for years to come, and it is time to make this ergonomic change even at the cost of an API surface increase.
Concretely, I would like to make this change in the next release as an opt-in feature. Meaning, if
PUGIXML_STRING_VIEW
is defined (andPUGIXML_NO_STL
isn't), we#include string_view
and define overloads for methods that takeconst char*
. Methods that returnconst char*
can stay as is; pugixml doesn't track (and will not track) length of strings internally, so converting this tostring_view
at call site is transparent and comes with no additional overhead that we would eliminate anyway.The intention would be to enable this if C++17 is enabled automatically, but I'd like to have one release where this feature is available opt-in, to reduce the compatibility risk.
Right now pugixml exposes a large set of functions that take
const char*
arguments:That's 40 functions; I have omitted
xml_document::load/save
for brevity. This is not a small number - pugixml exposes ~346 function symbols right now but that includes service functions like constructors, destructors and comparison operators, removing these gets us to ~250.So initially I would remove XPath from this set; also,
xml_node::childen(const char*)
is unsafe to implement with string view because it returns an object that keeps a reference to the string, and using string_view here has lifetime implications.child_value
is a semi-legacy function because it's replaced bytext()
, just not deprecated;find_child_by_attribute
andfirst_element_by_path
are fairly special case as well.as_utf8
/as_wide
are probably not as useful either, they work with std::strings anyway so the caller could eat an allocation cost and convert the input tostd::string
.Pruning this, we end up with this set that would be cloned with overloads for
string_view_t
:That's 22 functions; a ~10% increase in the API surface is probably reasonable for something like this, given that it's been a pain point for a while and no clean alternatives exist short of a breaking change that I don't want to do.
Given the reasoning above, at this point I feel like I would be fine with accepting a clean, minimal, tested PR that adds these. For a new contributor who wants to try this, it would make sense to split into two PRs, one PR that adds string_view define, includes, typedefs for wchar_mode, and exposes this, for example, just as overloads for
set_name
/set_value
/set
that already have the necessary underlying code, and a separate PR, once the first is merged, that adds the rest. This would streamline the PR feedback necessary here. It is critical that the PRs are high quality, either initially or can get there quickly.I currently don't have time to dedicate to the actual implementation myself but hopefully it should not be onerous or complicated; if nobody picks it up, it might be that I will, but it also might be that the next release will ship without this.
Feedback on this (including "this is a bad idea") is welcome, of course. I reserve the right to change my mind about this if more feedback is presented :)
The text was updated successfully, but these errors were encountered: