Feature request: integrate regex-based extraction mechanisms #46

aborruso · 2024-06-02T19:06:45Z

Hi,
on this site, whose HTML is not very good, I use the awesome rsspls to generate an RSS feed:
https://bagheria.trasparenza-valutazione-merito.it/web/trasparenza/papca-ap/-/papca/igrid/39908/24264

I use these settings:

[rsspls]
output = "../docs/c_a546"
[[feed]]
title = "Albo del comune di Bagheria"
filename = "feed.xml"
[feed.config]
url = "https://bagheria.trasparenza-valutazione-merito.it/web/trasparenza/papca-ap/-/papca/igrid/39908/24264"
item = "tbody tr:nth-child(n+2)"
heading = "td.oggetto.text"
link = "a[title=\"Apri Dettaglio\"]"

I would also like to extract the publication date, which is the first date that appears in the td tag with class “date”:

<td tabindex="0" role="cell" class="periodo-pubblicazione date">31/05/2024<br>
  15/06/2024</td>

But I don't think there is any way to select only the first one with a CSS selector.

So if I apply the CSS selector and set format = "[day padding:none]/[month padding:none]/[year]", I get these errors, because I can't find a way to tell it to disregard the second date.

 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
...

It would be great to be able to set some magic character or regex to strip out some of the text. For example:

regex="^(.+[0-9]) .+$"
format = "[day padding:none]/[month padding:none]/[year]"

to extract only 31/05/2024 from 31/05/2024 15/06/2024, and then apply the format.
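As a rough illustration of the behaviour being requested (this is not rsspls code, and rsspls has no `regex` option at the time of this thread), the Python sketch below applies the proposed pattern to the text reported in the warnings and then parses the captured group. The function name `extract_then_parse` is hypothetical, and `strptime` with `%d/%m/%Y` stands in for the time format description as an assumption of the example.

```python
import re
from datetime import date, datetime


def extract_then_parse(text: str, pattern: str, fmt: str) -> date:
    """Keep only capture group 1 of `pattern`, then parse it with `fmt`."""
    match = re.match(pattern, text)
    if match is None:
        raise ValueError(f"pattern did not match {text!r}")
    return datetime.strptime(match.group(1), fmt).date()


# Text exactly as it appears in the WARN lines above.
raw = "31/05/2024  15/06/2024"

# Hypothetical "regex first, format second" step.
published = extract_then_parse(raw, r"^(.+[0-9]) .+$", "%d/%m/%Y")
print(published.isoformat())
```

The pattern only matches when a digit is followed by a space, so it depends on the two dates being separated by spaces rather than by a line break.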

Thank you


wezm · 2024-06-02T23:05:29Z

Hmm, that is a tricky one. I will admit that I'm feeling a bit hesitant to throw regexes into the mix. In your specific case it seems the page is using a fixed-width format for the dates, so I think you can use ignore:

This works in my testing:

format = "[day padding:none]/[month padding:none]/[year][ignore count:12]"

However, it does feel a bit fragile. A nicer option would be to just stop parsing the date after the year. time already has an end component, but it only succeeds if there is no further input after it. I wonder if they would be open to an option on that to allow terminating parsing when it's encountered, even if there is input remaining.
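To make the trade-off concrete, here is a small Python sketch (an emulation, not the time crate's parser) contrasting the two approaches: skipping a fixed number of trailing characters versus stopping after the year. It assumes the leading date is always ten characters long, as on this site, and both function names are hypothetical.

```python
from datetime import date, datetime

DATE_LEN = len("31/05/2024")  # assumption: the leading date is always 10 characters


def parse_with_fixed_ignore(text: str, ignore_count: int) -> date:
    """Parse a leading dd/mm/yyyy date, then require exactly `ignore_count` trailing characters."""
    head, tail = text[:DATE_LEN], text[DATE_LEN:]
    if len(tail) != ignore_count:
        raise ValueError(
            f"expected {ignore_count} trailing characters, found {len(tail)}"
        )
    return datetime.strptime(head, "%d/%m/%Y").date()


def parse_prefix_only(text: str) -> date:
    """Parse a leading dd/mm/yyyy date and disregard whatever follows."""
    return datetime.strptime(text[:DATE_LEN], "%d/%m/%Y").date()


logged = "31/05/2024  15/06/2024"   # two spaces + ten characters = 12 to skip
changed = "31/05/2024 15/06/2024"   # hypothetical: the site drops one space

print(parse_with_fixed_ignore(logged, 12))  # works for the current layout
print(parse_prefix_only(changed))           # still works if the tail changes
```

In this emulation, calling `parse_with_fixed_ignore(changed, 12)` raises `ValueError`, which is the kind of fragility mentioned above: any change in the trailing text breaks the fixed count.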

aborruso · 2024-06-03T06:14:44Z

Thank you @wezm. I have a lot of difficulty understanding that documentation,
and I didn't realize, for example, that it was possible to use this:

format = "[day padding:none]/[month padding:none]/[year][ignore count:12]"

It may be fragile, but it seems stable for that site.

But I want to try again to propose the introduction of regexes, as an option for all fields to be extracted:

- so many pages have bad HTML;
- a CSS selector alone may not be enough;
- regular expressions are a standard;
- you could optionally introduce them after extraction by CSS selector.

graph TD
    A[Webpage] --> B[CSS Selector]
    B --> C{Regex}
    C -- Filter --> D[RSS]
    C -- No filter --> D[RSS]
    style C stroke-dasharray: 5, 5;
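The optional step in the diagram could look roughly like the Python sketch below. It is only an illustration of the idea: the function name `post_process`, the choice to pass unmatched text through unchanged, and the sample heading are all assumptions, not anything rsspls does.

```python
import re
from typing import Optional


def post_process(selected_text: str, pattern: Optional[str] = None) -> str:
    """Optional regex step applied to text already picked out by a CSS selector.

    With no pattern the text passes through unchanged. With a pattern, the
    first capture group replaces the text (or the whole match when the
    pattern has no groups). Text that does not match is passed through.
    """
    if pattern is None:
        return selected_text
    match = re.search(pattern, selected_text)
    if match is None:
        return selected_text
    return match.group(1) if match.re.groups else match.group(0)


# "No filter" branch: a made-up heading goes straight through.
heading = post_process("Example heading")

# "Filter" branch: keep only the first date from the selected cell text.
first_date = post_process("31/05/2024  15/06/2024", r"(\d{2}/\d{2}/\d{4})")
print(heading, "|", first_date)
```

Whether unmatched text should pass through or be reported as an error is a design decision this sketch does not try to settle.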

Unfortunately I am not a developer and I can't help you with the code. However, I think it could be a very convenient feature, and maybe it is quite straightforward to introduce.

Once again thank you very much.

wezm · 2024-06-03T07:37:20Z

> Thank you @wezm. I have a lot of difficulty understanding that documentation.

That feels like something I can try to improve too. Was it the rsspls documentation or the time library's date parsing documentation that tripped you up (or both)?

> But I want to try again to propose the introduction of regexes, as an option for all fields to be extracted:

I will keep it in mind, but I'm not ready to add it yet. Regular expressions are a big hammer to solve a problem with, so before incorporating them I'd like to have a decent collection of real-world examples of where they would help solve a problem. I'll leave this issue open to see if such cases come up.

aborruso · 2024-06-03T13:31:41Z

> That feels like something I can try to improve too. Was it the rsspls documentation or the time library's date parsing documentation that tripped you up (or both)?

I like the documentation of rsspls. The time documentation, however, seems to be written for those who already know how it works. It needs examples, from the simplest to the most complex (if for example this ... then you have to ...).

> Regular expressions are a big hammer

But couldn't you do it as you did with time? Choose a Rust regex library and tell users to refer to it?
And enable only a simple regex match. For my example: ^(\d{2}/\d{2}/\d{4}).
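For comparison, the following Python sketch runs this stricter pattern and the earlier `^(.+[0-9]) .+$` pattern over a few inputs. Only the first sample comes from the rsspls log; the other two are hypothetical variations, and the helper `first_group` is made up for the example.

```python
import re

LOOSE = r"^(.+[0-9]) .+$"           # pattern from the first comment
STRICT = r"^(\d{2}/\d{2}/\d{4})"    # pattern proposed in this comment


def first_group(pattern: str, text: str):
    """Return capture group 1, or None when the pattern does not match."""
    match = re.match(pattern, text)
    return match.group(1) if match else None


samples = [
    "31/05/2024  15/06/2024",    # as logged by rsspls
    "31/05/2024\n  15/06/2024",  # hypothetical: line break kept from the <br>
    "31/05/2024",                # hypothetical: only one date present
]

for text in samples:
    print(repr(text), first_group(LOOSE, text), first_group(STRICT, text))
```

The stricter pattern describes the date itself rather than the separator after it, so it does not depend on what follows the first date.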

Thank you again

wezm · 2024-06-03T23:27:59Z

> I like the documentation of rsspls. The time documentation, however, seems to be written for those who already know how it works. It needs examples, from the simplest to the most complex (if for example this ... then you have to ...).

Yes, the time docs I'm linking to are the developer-oriented docs. I think we can do better for end users.

> But couldn't you do it as you did with time? Choose a Rust regex library and tell users to refer to it?

Yes, I'm not saying it's hard to do, I'm saying that I'm not certain it's how I want to solve the problem. I'd like to gather more data points before committing to it.

wezm mentioned this issue Jun 5, 2024

parsing: option to [end] to terminate parsing even if there is further input time-rs/time#684

Open
