Proposal - Update XPath to (at least) v2.0 #903

Open
WebReflection opened this issue Oct 13, 2020 · 34 comments
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest

Comments

@WebReflection

WebReflection commented Oct 13, 2020

While the latest recommendation is v3.1, most questions related to XPath seem to be missing regular expressions, introduced in v2.0, which is a nearly 10-year-old recommendation.

However, all browsers support only XPath v1.0 from 1999.

Background

Widely adopted in 2007 by popular frameworks such as Dojo, Prototype, or MooTools, the XPath language is an extremely powerful tool to query and crawl the DOM along all its axes, hence superior to CSS, and even in its version 1 it can already express proposed selectors such as :has(...).

// CSS container:has(child)
// XPath since 1999
'.//container[count(.//child) > 0]'

But this only scratches the surface of what XPath can do, as opposed to querying via CSS, checking surrounding DOM nodes via JS, and checking that results are valid (e.g. if (child.closest('container'))); plus, with CSS there's no way to target text nodes or even comments.
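As a rough sketch of the :has(...) emulation described above, the helper below builds an XPath 1.0 expression for a given pair of element names. The function name `hasDescendant` and the element names are hypothetical; the predicate `[.//child]` is a shorter equivalent of `[count(.//child) > 0]`, since a non-empty node-set is truthy in a predicate. The browser-only part is guarded so the snippet also loads outside a browser.

```javascript
// Build an XPath 1.0 expression roughly equivalent to CSS `container:has(child)`.
// Both names are hypothetical element names supplied by the caller.
function hasDescendant(containerName, childName) {
  // A node-set used as a predicate is true when it is non-empty.
  return `.//${containerName}[.//${childName}]`;
}

// Browser-only usage: evaluate the expression against the current document.
if (typeof document !== "undefined") {
  const result = document.evaluate(
    hasDescendant("ul", "a"), document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
  );
  for (let i = 0; i < result.snapshotLength; i++) {
    console.log(result.snapshotItem(i));
  }
}
```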

Proposal

Provide at least the method matches(RegExp, flag) on top of the current XPath 1.0 (let's call it 1.1), or provide at least v2 of this old-but-gold standard for crawling any DOM tree. If it's still updated and useful for back-end crawlers, it's unclear why JS, a first-class citizen, should not benefit from its potential, which is way superior to CSS selectors and less error-prone, as filtering and complex searches can be done directly through document.evaluate.

Thanks in advance for considering this improvement, as I'm sure that once RegExp is in, the usage of XPath for complex SPA/PWA pages would flourish again in libraries, web components, and the Web in general.
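For context, a minimal sketch of the workaround that is needed today without matches: XPath 1.0 selects candidate nodes and JS applies the regular expression afterwards. The helper name `filterByRegExp` and the sample pattern are assumptions made for illustration, not part of any existing API.

```javascript
// Keep only the nodes whose text content matches a regular expression.
// `nodes` can be any iterable of objects exposing `textContent`.
function filterByRegExp(nodes, pattern, flags = "") {
  // Drop the stateful "g" and "y" flags so repeated test() calls stay independent.
  const re = new RegExp(pattern, flags.replace(/[gy]/g, ""));
  return Array.from(nodes).filter((node) => re.test(node.textContent));
}

// Browser-only usage: XPath 1.0 picks the candidates, JS applies the RegExp.
if (typeof document !== "undefined") {
  const snapshot = document.evaluate(
    "//td", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
  );
  const cells = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    cells.push(snapshot.snapshotItem(i));
  }
  console.log(filterByRegExp(cells, "^-\\d+"));
}
```

With a native matches, the second step would fold into the XPath expression itself, which is the point of the proposal.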

@annevk annevk added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest labels Oct 13, 2020
@annevk
Member

annevk commented Oct 13, 2020

Per https://www.chromestatus.com/metrics/feature/popularity it does seem that about 1-2% of page views end up using XPath, so maybe it's worth considering, but I wouldn't really want to do anything here until #67 is fully settled, including tests. XPath has been a long neglected part of the platform, we should standardize what we have first before considering additions.

@WebReflection
Author

@annevk thanks for pointing me to #67 ... I think I searched in the HTML repo and not here; otherwise, bumping the XPath version might be part of #67 too, imho, as once there's agreement on settling it, there might be agreement on what should run underneath, right? If you feel like that's the case, feel free to close this issue, and I'll keep watching/following the other one.

@annevk
Member

annevk commented Oct 13, 2020

It's been like a decade so I might remember wrongly, but I don't think XPath 2.0 is backwards compatible. That doesn't mean we couldn't do compatible extensions to 1.0, but I'm not sure what the appetite is for that.

@WebReflection
Author

I'm really after having matches(RegExp, flag) which is currently the biggest missing feature in XPath 1.0, available since XPath 2.0 ... however, if we focused on a new XPath API, I don't see backward compatibility as an issue.

@domenic
Member

domenic commented Oct 13, 2020

Chrome is not interested in this. The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them, or at least replace them with something that generates less security bugs. Increasing the capabilities of XML in the browser runs counter to that goal.

@WebReflection
Author

or at least replace them with something that generates less security bugs

if replaced, since work would need to be done regardless, what are the security implications of having matches in, if I might ask?

@WebReflection
Author

Also worth mentioning that usage has increased in recent years, so removing it looks indeed like a breaking change ... we just started using XPath extremely successfully on many occasions, and having that fully removed would break many things, so I hope there's room for changes but no deprecation ... it's super powerful as a query language and it can provide things CSS might never have for perf or other reasons.

@yamahito

yamahito commented Oct 13, 2020

"I don't think XPath 2.0 is backwards compatible."

This is not true, at least in the sense I would understand it, i.e. that an XPath 2.0 (or indeed 3.1, there is no XPath 2 :) ) processor will happily run an XPath 1.0 statement, and return the same nodes as an XPath 1.0 processor.

@yamahito

yamahito commented Oct 13, 2020

"The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them, or at least replace them with something that generates less security bugs. Increasing the capabilities of XML in the browser runs counter to that goal."

Assuming for a moment that increasing XML capabilities "generates [...] security bugs" (I am not convinced), this is a proposal for querying the HTML DOM with XPath, not XML. (Thanks to @gimsieke for pointing this out!)

@domenic
Member

domenic commented Oct 13, 2020

By "XML parts of our pipeline" I mean "everything implemented using libxml and libxslt".

@yamahito

Deprecating/replacing libxml and libxslt would be a prerequisite of updating support to XPath v >1.

So Chrome should be behind such a move, right?

@domenic
Member

domenic commented Oct 13, 2020

I can tell this is not going to be a productive conversation, as folks are intent on playing word games to try and pretend Chrome has a different stance than we do. As such, I won't be participating in this thread further. I think I've made our position clear.

@shawnz

shawnz commented Oct 13, 2020

If supporting XPath >=2.0 would mean everyone needs a completely new implementation, then wouldn't it be less work overall to just continue to improve CSS selectors to support the missing features?

@domenic I don't think you are being fair to @yamahito's point. Just because XPath has "X" in the name doesn't necessarily mean it needs to have anything to do with XML.

@yamahito

Sorry to make you feel I am playing word games, and that my joke has made you throw your toys out of the pram, but there was a serious point here, which I don't think you've addressed.

You and the OP sort of have the same problem: libxml and libxslt have not been updated to work with updated specifications for a very long time.

If you want a productive suggestion, how about the Saxon-HE/C library as a potential alternative?
https://www.saxonica.com/saxon-c/index.xml

@bryanrasmussen

The products Saxon-PE/C and Saxon-EE/C are commercial products, and require a license key.

I mean maybe Michael Kay would have some idea whereby this could be doable and he would find it reasonable, but I think this makes it difficult for some browsers.

I personally would love if I saw XPath getting some love, so don't take my comment as negative.

@yamahito

yamahito commented Oct 13, 2020

However, Saxon-HE/C is open source: you wouldn't have support for all features (e.g. schema awareness), but I don't think those would be missed for this purpose.

Of course, there may be other reasons why it's not doable (licensing issues), and I'm not qualified to comment on implementation. I certainly don't want to talk for Chrome, despite aspersions to the contrary. I just want to point out that the underlying issue is the use of a library many years out of date, but that said library does not reflect on XPath as a technology.

@WebReflection
Author

@domenic I am not sure that "folks" included me (but I guess so ...)

I can tell this is not going to be a productive conversation, as folks are intent on playing word games

I honestly had the feeling there was no room for any conversation, after your first reply:

Chrome is not interested in this. The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them

although this sentence is both not exactly about what I've proposed and also scary, 'cause SVG, as far as I know, is still part of the XML namespace/pipeline, and announcing that anything XML is going to be deprecated and removed is concerning, imo.

I also think it's clear that developers who know XPath and its potential are probably not using it daily due to its lack of improvements since 1999, so asking why, where, or what looks like a normal conversation to me, while "dropping the bomb and the mic" at the same time feels a bit "off", imho. But if there's anything I've said that made you put me in the "folks that play word games" category, I apologize, 'cause even if I'm not sure where I gave you that feeling, it surely wasn't my intent.


I hope that the idea of improving XPath, to let Web developers fulfill any requirement not satisfied by what CSS currently offers, will be considered at least by other vendors, especially after reading that XPath apparently has security implications while it's still a W3C recommendation ... it took much less to deprecate SQLite, and no security issue was obvious at that time; it's weird that something known as insecure has been kept in the platform for 20 years and never got a chance to be updated.

@liamquin

@annevk XPath 2 (and, more to the point these days, 3.1) are highly backward compatible with XPath 1. There are some differences. Example: in XPath 1, the string value of a sequence is the string value of the first item in the sequence; that was crazy and caused lots of bugs in people's XPath expressions.

The XPath 2 and 3 specs include notes for people implementing XPath 2 and 3 on how to handle those cases. They are very small edge cases & many are unlikely to apply to Web browser usage anyway.

Possible implementation approaches include (1) make a standard API that includes the desired XPath version; this is badly needed in any case... (2) use a JavaScript-based implementation (see e.g. frameless.io), (3) write or reuse a C/Rust/C++ one, most likely starting with an XQuery implementation as that's an extension of XPath (XQuery 1 extends XPath 2, confusingly; XQuery 3.1 extends XPath 3.1).

Where XPath 1 was based on node lists, XPath 2 moved to being based on sequences; it's much more powerful for users, and a lot of things that were tricky became a lot clearer, but the underlying code is likely very different.

A CSS xpath('expr' [, version]) function would be super useful e.g. in the content property, as it can do string processing on text in the document - even if only in the "slow" profile of CSS.

@WebReflection the security issues in XPath are that there are functions (starting in XPath 1) that allow file access. The same security issues that XMLHttpRequest has apply. There are also common extensions in XPath implementations to allow extended file access, but those make no sense in a Web browser - see e.g. expath.org. In XPath 3 it's possible to write recursive functions, as with JavaScript, so you could create infinite loops, and an implementation needs to detect this. There's also the possibility - again as with JavaScript - of building up variables, e.g. with the string concatenation operator || like this:
let $a := "socks socks socks socks", $b := $a || $a || $a || $a, $c := $b || $b || $b || $b return $c || $c || $c || $c
which makes lots of socks. Or you can write
string-join( (1 to 99999999), ", ")
to make "1, 2, 3, ..."
As with JavaScript, a sensible workaround is limits on variable size & sequence length. So the security issues are known and manageable.
But that's different from what @domenic meant, which is that there were security issues in the XML pipeline - that is, in the C libraries they have been using, which are large, complex, and hard to fix.
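To make the growth in the "socks" example concrete, here is a small JavaScript mirror of it, assuming the starting string is "socks socks socks socks" (23 characters). Each step concatenates four copies of the previous value, so three steps give 4 ** 3 = 64 copies. The guard function `concatWithLimit` is a hypothetical illustration of the "limits on variable size" workaround, not something taken from any XPath implementation.

```javascript
// Mirror of the "socks" example: each step concatenates four copies of the previous value.
const a = "socks socks socks socks";
const b = a + a + a + a;
const c = b + b + b + b;
const result = c + c + c + c;
console.log(a.length, result.length); // 23 and 1472 (23 * 4 ** 3)

// Hypothetical guard: refuse to build strings beyond a fixed character budget.
function concatWithLimit(parts, maxLength) {
  let total = 0;
  for (const part of parts) {
    total += part.length;
    if (total > maxLength) {
      throw new RangeError(`string exceeds ${maxLength} characters`);
    }
  }
  return parts.join("");
}
```

The check runs before any joined string is allocated, which is the property an engine-level limit would want as well.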

Yes, CSS could be extended to be comparable - e.g. to be able to do string matching & processing on text content, date/time arithmetic, joins, union/intersection, and so forth. It'd be a lot of work, although just adding matches() and replace() would go a long way -
td.matches("^-\d+") { color: red; }
(to invent a syntax in selectors)
although,
span.price.xpath(. gt 0 and . lt 100 and not(preceding-sibling::span[. = 'special'])) { color: green; }
would go further. I'd guess that in the next 10 or 15 years CSS will get there; in the meantime, custom CSS functions and selectors may give a way to do some of the things you can do with XPath, albeit more slowly.

@WebReflection
Author

WebReflection commented Oct 13, 2020

@liamquin thanks a lot for the clarification, and yes, that makes sense. However, if XPath 3 is more problematic than 2 in terms of possible footguns within the parsing and features, I think having v2 available in JS would already be a killer feature compared to 1. Since nobody wants new footguns in JS, upgrading to the least problematic version that provides matches and replace (which is also in v2, IIRC) would enable a whole new world of possibilities that fit into a well-known selector, instead of spanning some CSS selector, plus JS checks, plus anything else that might result in more errors than features for the platform.

@DrRataplan

Personally, as one of the authors of a free open source XPath 3.1 implementation (https://github.com/FontoXML/fontoxpath), I do not really see the point in shipping XPath 3.x or 2.0 in the browser.

Rather, I would prefer to see a way to plug into the CSS engine to use XPath in CSS, so that we can do what @liamquin described, but in a more flexible way. There will be many performance concerns over there, but those must be manageable in some way.

@WebReflection
Author

@DrRataplan unless you are thinking about exposing XPath through querySelector/All, I am not sure how that would cover the crawling/addressing use case, but as your solution would still mean updating XPath and hooking it into CSS, I don't think your idea would take less time than simply updating XPath.

Also worth remembering that updating XPath, as proposed here, has nothing to do with styling, as any live styling through XPath would likely make pages very slow; otherwise we would already have the :has(...) selector widely implemented.

@sergeykish

sergeykish commented Oct 15, 2020

I do not think @domenic meant removal of XML API

Deprecate, and consider removing, XSLT

The consensus last time we considered this was that xml and xslt are too important for enterprise and we cannot remove them from the platform. Closing this bug to match that reality. We'll open a new bug if we ever decide to do this. [1] (Feb 22, 2019)

https://bugs.chromium.org/p/chromium/issues/detail?id=514995

That said, I have one example of the state of the XML APIs. Did you know that DOMParser parsing text/xml is slower than text/html?

We have querySelector/querySelectorAll, over the years it adopted many XPath selectors, yet there is no queryXPath/queryXPathAll and its polyfill is just a few lines:

XPathResult.prototype[Symbol.iterator] = function* () {
  // iterateNext() only works on iterator-type results (the default for node-sets).
  let next;
  while ((next = this.iterateNext())) {
    yield next;
  }
};
Document.prototype.queryXPathAll = function (expression, ...args) {
  // Spread the optional resolver/type/result arguments instead of passing one array.
  return [...this.evaluate(expression, this, ...args)];
};
Element.prototype.queryXPathAll = function (expression, ...args) {
  return [...this.ownerDocument.evaluate(expression, this, ...args)];
};

There was a proposal for this; it was closed, pending #67.

jQuery popularized CSS selectors. Somehow there is not much XPath, XPath 2.0, XPath 3.0 activity on the web. It would be great if its proponents described how it helps them. Personally I use XPath to query text nodes and as :has replacement

//text()[last()]
//a[text() = 'foo']
//a[img]

I do not think Web developers know and use count, etc. XPath 2.0 extends it, feels a lot like SQL:

//*[tokenize(@class, ' ') = 'foo']
//time[fn:year-from-date(xs:date(@datetime)) = 2020]
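For comparison, the class check written with tokenize above can be approximated in XPath 1.0 with the well-known concat/normalize-space idiom. The sketch below only builds the expression string; the helper name `classCheck` is hypothetical, and it assumes the class name contains no quotes or whitespace.

```javascript
// Build an XPath 1.0 expression matching elements whose class list contains `className`.
// Padding both sides with spaces avoids matching "foobar" when looking for "foo".
function classCheck(className) {
  return `//*[contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')]`;
}

console.log(classCheck("foo"));
```

The date example has no comparably short XPath 1.0 form, since 1.0 has no date type and would need string functions such as substring.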

I would prefer Invisible XML approach

//*[id = 'foo']
//p[class = 'bar']
//p[lang/en/us]
//date[datetime/year = '2020']
//a[href/host = 'example.com']
//span[xstyle/color = 'blue']

emulated with

<p><id>foo</id></p>
<p><class>bar</class></p>
<p><lang><en><us></us></en></lang></p>
<date><datetime><year>2020</year></datetime></date>
<a><href><host>example.com</host></href></a>
<span><xstyle><color>blue</color></xstyle></span>

(<style> is CDATA, I use <xstyle> instead)

Each node knows its type, parses the underlying mini language, and presents it as if it were nodes.

@WebReflection
Author

WebReflection commented Oct 15, 2020

@sergeykish with XPath I can select even attribute nodes and/or text nodes, and this is gold for libraries based on template literals ... as an example, this single query //*/@*[.="${uid}"]|//*/text()[contains(.,"${uid}")] lets me remove a tree walker with checks all over for the attribute content or text content. XPath does that in one line, because it's a language born to query the tree, not to style it. I wouldn't mind having queryXPath and/or queryXPathAll, if that helps adding matches, but I really hope matches can be added as an amendment to XPath 1.

The rest of the functions are also well known, and there are cheatsheets that help with it too:
https://devhints.io/xpath#class-check
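The single-query approach described above could be wrapped roughly as follows. This is a sketch only: the helper name `placeholderQuery` and the sample uid are hypothetical, and the quote check is an added assumption, because XPath 1.0 string literals have no escape syntax. The browser-only part is guarded so the snippet also loads outside a browser.

```javascript
// Build the XPath 1.0 query that finds attribute nodes equal to `uid`
// and text nodes containing `uid`.
function placeholderQuery(uid) {
  // XPath 1.0 string literals cannot escape quotes, so reject them up front.
  if (uid.includes('"')) {
    throw new Error("uid must not contain double quotes");
  }
  return `//*/@*[.="${uid}"]|//*/text()[contains(.,"${uid}")]`;
}

// Browser-only usage: list every attribute or text node carrying the placeholder.
if (typeof document !== "undefined") {
  const result = document.evaluate(
    placeholderQuery("uid-123"), document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
  );
  for (let i = 0; i < result.snapshotLength; i++) {
    const node = result.snapshotItem(i);
    const kind = node.nodeType === Node.ATTRIBUTE_NODE ? "attribute" : "text";
    console.log(kind, node.nodeValue);
  }
}
```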

@sirinath

Why 2.0? Why not the latest version?

@WebReflection
Author

WebReflection commented Oct 31, 2020

@sirinath apparently there's an agreement among XPath users that v1.0 is the right version to use and eventually new features should be implemented on top of v1.0, and to be honest, the only feature I really miss, and so do others, is the RegExp functionality, which together with current XPath 1.0 offer, would be already a huge upgrade in possibilities.

As apparently nobody wants to touch this part of the Web anyhow, we should try to understand if bringing just that would be possible, or if we could just close this proposal as not accepted and move forward.

@rhdunn

rhdunn commented Nov 1, 2020

I don't think that using Hacker News comments is good for determining that there is a consensus that XPath 1.0 is the right version to use/build on. If you wanted to do something like that you would need to do a survey of companies and hobbyists to see what stacks they are using and if they would use XPath 2.0/3.0/3.1 features if they were available on those stacks (including on web browsers, e.g. when testing via Selenium). Personally, I like the changes that XPath 2.0 made to the language, as it tidied up several things like not being able to do *:hr or processing-instruction(name) in XPath 1.0.

FWIW, I have recreated the XPath 1.0 grammar using the XPath 2.0 names and structures at https://rhdunn.github.io/xquery-intellij-plugin/specifications/XPath%201.0%20as%202.0%20EBNF%20Grammar.html. It is 47 EBNF symbols, compared to XPath 2.0's 82, 3.0's 108 and XPath 3.1's 126. That document also describes the grammar differences between XPath 1.0 and XPath 2.0.

@WebReflection
Author

@rhdunn

I don't think that using Hacker News comments is good for determining that there is a consensus that XPath 1.0 is the right version to use/build on.

absolutely, and I haven't used the word consensus; I've just found interesting comments from various people actually using, and appreciating, what XPath brings to the plate. Many said 2 or even 3 are too much to implement and possibly problematic, but a few said it should be relatively easy to add RegExp on top of the current implementation only, which is 1.0.

As this issue mentions upgrading to 2.0, that's the ideal dream/goal, but since vendors already stated they don't think this would ever happen, they have no interest, or it's complicated, I'm just saying I personally miss RegExp, as I think that'd already be a huge step forward in scraping and querying possibilities.

@benjamingr
Member

Hey, this issue is trending in HN https://news.ycombinator.com/item?id=24959588 - probably a good idea to lock it for a bit to reduce the amount of noise.

@benjamingr
Member

Also, knowing some of the people involved - I think discussion here isn't too great:

  • Having a lot of people show up to downvote/upvote things isn't great. It's a shame GitHub doesn't let repo owners turn that feature on/off. It creates a feeling of maintainers being attacked.
  • "Chrome is not interested in this" is a perfectly fine way to respond. You may not like the fact Domenic said that nor that it's Chrome's position but Chrome is allowed to have that position. I think a productive follow up would have been "what would it take for you to reconsider?" or "how can we help with Chrome's concerns?" or something similar.
  • Telling Chrome about libraries ( like saxon-c ) doesn't help. It's asking them to allocate a significant amount of work to refactor a feature they are not interested in maintaining to begin with but have to. It'll be a very hard sell.
  • Props to Andrea for trying to engage constructively and actually explain why XPath2 would be useful for Chrome and what capabilities it adds to the web platform that may ask Chrome to support it. Also props to Liam on explaining why that's good.

I am not sure why no one brought up the fact that XPath 3 implementations exist in userland (this seems to be the most popular one) but they are not popular. So XPath 3 does not add capabilities to the web platform since it's possible in userland and is not popular and does not fix the issues with the existing APIs since it can't replace it because of compatibility.

If you want to engage (constructively) with Chrome on this - you need to look at their perspective and explain how an investment into XPath 3.1 aligns with their goals. For example - get someone to sponsor work that reduces the existing technical debt significantly while adding the API. TBH if I were Chrome I'd likely still not go for it because of their perspective.

@bathos

bathos commented Nov 1, 2020

Before the window here closes, as someone who isn’t here because of HN :), I thought it might be helpful to add a data point. I’ve also used XPath for the exact same purpose as @WebReflection. It seems very suited to this. (And — far more niche — I’ve also employed it when processing WOFF metadata.) I found the existing implementation adequate for both tasks, but figured it might still be useful for implementers to know that the xpath-for-template-substitutions pattern isn’t a one-off.

@benjamingr
Member

@bathos I work in test automation and xpath is very popular and pervasive in that space. It is much more popular than CSS selectors for automation:

  • The vast majority of browser automation is in Selenium (at least 5 times more than puppeteer, playwright and cypress combined).
  • The vast majority of browser automation is in Java/Python.
  • XPath is the most popular way to select things.

I think that if you want to "prove" this it's pretty easy to ask grid vendors (like sauce labs) and I'm sure they'll be happy to help (I'm happy to ask them if this is called into question).

Note that "popularity" isn't the reason Chrome is objecting to this.

@liamquin

liamquin commented Nov 1, 2020

Maybe a short-term (or medium-term) way forward would be to standardize an API for calling JavaScript functions from XPath. People could then either use the existing regular expression engine in JavaScript directly from XPath, or could implement fn:replace(). I know Tom Hodgins (innovati) has experimented with XPath in CSS selectors, but the escaping you have to do makes it essentially untenable - https://codepen.io/tomhodgins/pen/KxOOzZ

@benjamingr thanks for the kind mention. The argument about popularity is like the story of a town separated from another town by a great ravine - people asked the mayor for a bridge and his response was, there's very little demand: hardly anyone swims across the rapids today, so we should not build a bridge. The install instructions for fontoxpath assume an experienced JavaScript developer; frameless.io requires registration and has usage restrictions. The others I know of are commercial products, so for sure there is demand. And as you say, out of the browser, XPath is widely used in specific areas - along with XSLT for that matter. Any spec with an X in its name has an uphill struggle in browserland these days, and it's not an easy change that's proposed.

@WebReflection
Author

WebReflection commented Nov 2, 2020

@liamquin a JS hook into XPath 1.0 would surely open many possibilities but it would still require an update to the current implementation, which is what Chrome would like to avoid.

@benjamingr there are userland libraries, but these are huge and slow compared to the current, native XPath 1.0, which is why we're not considering adopting them. We can use a bit of JS to crawl axes and apply RegExp, yet the dance is awkward and always a bit ad hoc, but at least we don't have license issues, bundling issues, foreign code to watch out for, etc.

Of course something that can be written in JS will be written in JS, but at the same time, we all know as soon as something is available natively, it's better for everyone, and no unnecessary bloat or slower perf are needed.

I would still like to understand whether there is any room for improvement or none at all, though; in the latter case I think we should close this issue as "won't fix" and move forward.
