Feature request: integrate regex-based extraction mechanisms #46

Open
aborruso opened this issue Jun 2, 2024 · 5 comments

Comments


aborruso commented Jun 2, 2024

Hi,
on this site, which has fairly poor HTML, I use the awesome rsspls to generate an RSS feed:
https://bagheria.trasparenza-valutazione-merito.it/web/trasparenza/papca-ap/-/papca/igrid/39908/24264

I use these settings:

[rsspls]
output = "../docs/c_a546"
[[feed]]
title = "Albo del comune di Bagheria"
filename = "feed.xml"
[feed.config]
url = "https://bagheria.trasparenza-valutazione-merito.it/web/trasparenza/papca-ap/-/papca/igrid/39908/24264"
item = "tbody tr:nth-child(n+2)"
heading = "td.oggetto.text"
link = "a[title=\"Apri Dettaglio\"]"

I would also like to extract the publication date, which is the first date that appears in the td tag with class "date":

<td tabindex="0" role="cell" class="periodo-pubblicazione date">31/05/2024<br>
  15/06/2024</td>

But I don't think a CSS selector can select only the first one.

So if I apply the CSS selector and set format = "[day padding:none]/[month padding:none]/[year]", I get these errors, because I can't find a way to tell it to disregard the second date.

 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
 WARN  rsspls::feed > unable to parse date '31/05/2024  15/06/2024'
...

It would be great to be able to set some magic character or regex to strip out part of the text. For example:

regex="^(.+[0-9]) .+$"
format = "[day padding:none]/[month padding:none]/[year]"

to extract only 31/05/2024 from 31/05/2024 15/06/2024, and then apply the format.
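
To make the request concrete, here is a minimal Python sketch of the two-step idea (regex first, then the date format). It is only an illustration: rsspls is written in Rust and has no regex option, `extract_first_date` is a hypothetical helper, and `%d/%m/%Y` is Python's `strptime` counterpart of the format string above, not rsspls syntax.

```python
import re
from datetime import date, datetime

# Text of the cell as it appears in the rsspls warning above.
RAW = "31/05/2024  15/06/2024"


def extract_first_date(text: str, pattern: str = r"^(.+[0-9]) .+$") -> date:
    """Keep only capture group 1 of `pattern`, then parse it as day/month/year."""
    match = re.match(pattern, text)
    if match is None:
        raise ValueError(f"pattern did not match {text!r}")
    # Step 2: apply the date format to the extracted text only.
    return datetime.strptime(match.group(1), "%d/%m/%Y").date()


first = extract_first_date(RAW)
```

The second date never reaches the date parser, so the "unable to parse date" situation would not arise for this input.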


Thank you

Owner

wezm commented Jun 2, 2024

Hmm, that is a tricky one. I will admit that I'm feeling a bit hesitant to throw regexes into the mix. In your specific case it seems the page is using a fixed-width format for the dates, so I think you can use ignore:


This works in my testing:

format = "[day padding:none]/[month padding:none]/[year][ignore count:12]"
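
Why 12: in the text from the warning above, the first date is followed by two spaces and a second 10-character date, so 12 characters remain after the year. A rough Python analogy of the fixed-width approach, assuming that `[ignore count:12]` consumes exactly 12 characters; the helper name and the strict length check are illustrative assumptions, not the time crate's implementation:

```python
from datetime import date, datetime


def parse_fixed_width(text: str, date_width: int = 10, ignore_count: int = 12) -> date:
    """Parse the first `date_width` characters and require exactly
    `ignore_count` characters of ignored input after them."""
    head, tail = text[:date_width], text[date_width:]
    if len(tail) != ignore_count:
        # Any change in spacing or padding on the page breaks the layout assumption.
        raise ValueError(f"expected {ignore_count} trailing characters, got {len(tail)}")
    return datetime.strptime(head, "%d/%m/%Y").date()


parsed = parse_fixed_width("31/05/2024  15/06/2024")
```

The length check is where this approach is sensitive: the same cell with a single space between the dates would be rejected.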

However, it does feel a bit fragile. A nicer option would be to just stop parsing the date after the year. time already has an end component, but it only succeeds if there is no further input after it. I wonder if they would be open to an option on that component to allow terminating parsing when it's encountered, even if there is input remaining.
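
The "stop after the year" behaviour described here can be sketched in Python as a prefix parser that returns the date plus whatever input is left over. This is a hypothetical illustration of the semantics only, not an existing option in the time crate or rsspls; all function names are made up for the example.

```python
from datetime import date
from typing import Tuple


def _take_digits(text: str, pos: int, max_len: int) -> Tuple[int, int]:
    """Read 1..max_len digits starting at pos; return (value, new position)."""
    end = pos
    while end < len(text) and end - pos < max_len and text[end].isdigit():
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at position {pos}")
    return int(text[pos:end]), end


def _expect(text: str, pos: int, char: str) -> int:
    """Require `char` at pos; return the position after it."""
    if pos >= len(text) or text[pos] != char:
        raise ValueError(f"expected {char!r} at position {pos}")
    return pos + 1


def parse_date_prefix(text: str) -> Tuple[date, str]:
    """Parse day/month/year at the start of text and stop there."""
    day, pos = _take_digits(text, 0, 2)
    pos = _expect(text, pos, "/")
    month, pos = _take_digits(text, pos, 2)
    pos = _expect(text, pos, "/")
    year, pos = _take_digits(text, pos, 4)
    # Remaining input is returned to the caller instead of causing an error.
    return date(year, month, day), text[pos:]


published, rest = parse_date_prefix("31/05/2024  15/06/2024")
```

Unlike the fixed-width variant, this does not depend on how many characters follow the first date.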

Author

aborruso commented Jun 3, 2024

Thank you @wezm, I have a lot of difficulty understanding that documentation.
And I didn't realize, for example, that it was possible to use this:

format = "[day padding:none]/[month padding:none]/[year][ignore count:12]"

It may be fragile, but it seems stable for that site.

But I want to try again to propose the introduction of regexes, as an option for all fields to be extracted:

  • so many pages have bad HTML;
  • a CSS selector alone may not be enough;
  • regular expressions are a standard;
  • you could optionally introduce them after extraction by CSS selector.

graph TD
    A[Webpage] --> B[CSS Selector]
    B --> C{Regex}
    C -- Filter --> D[RSS]
    C -- No filter --> D[RSS]
    style C stroke-dasharray: 5, 5;
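
The optional step in the diagram could look roughly like the following Python sketch. It assumes the CSS selector has already produced a text string; `refine` is a hypothetical name, and falling back to the unfiltered text when the pattern does not match is an assumption of this sketch, not a decided behaviour.

```python
import re
from typing import Optional


def refine(text: str, pattern: Optional[str] = None) -> str:
    """Optional post-processing of text extracted by a CSS selector.

    With no pattern the text passes through unchanged ("No filter").
    With a pattern, the first capture group is returned ("Filter").
    """
    if pattern is None:
        return text
    match = re.search(pattern, text)
    if match is None or match.lastindex is None:
        # No match, or the pattern has no capture group: keep the original text.
        return text
    return match.group(1)


unfiltered = refine("31/05/2024  15/06/2024")
filtered = refine("31/05/2024  15/06/2024", r"^(.+[0-9]) .+$")
```

Because the pattern is optional, existing configurations without it would behave exactly as before.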

Unfortunately, I am not a developer and can't help you with the code. However, I think it could be a very convenient feature, and maybe it is quite straightforward to introduce.

Once again thank you very much.

Owner

wezm commented Jun 3, 2024

> Thank you @wezm, I have a lot of difficulty understanding that documentation.

That feels like something I can try to improve too. Was it the rsspls documentation or the time library's date parsing documentation that tripped you up (or both)?

> But I want to try again to propose the introduction of regexes, as an option for all fields to be extracted:

I will keep it in mind, but I'm not ready to add it yet. Regular expressions are a big hammer to solve a problem with, so before incorporating them I'd like to have a decent collection of real-world examples where they would help. I'll leave this issue open to see if such cases come up.

Author

aborruso commented Jun 3, 2024

> That feels like something I can try to improve too. Was it the rsspls documentation or the time library's date parsing documentation that tripped you up (or both)?

I like the rsspls documentation. The time documentation seems to be written for those who already know how it works. It needs examples, starting with the simplest and moving to the more complex (for example: if this ... then you have to ...).

> Regular expressions are a big hammer

But couldn't you do it the same way as with time? Choose a Rust regex library and tell users to refer to it?
And enable only simple regex matching. For my example: ^(\d{2}/\d{2}/\d{4}).
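
For comparison, here is a small Python sketch of how this simpler pattern behaves next to the first proposal. Python's `re` is used purely for illustration (a Rust regex library may differ in details), and `first_group` is a hypothetical helper.

```python
import re
from typing import Optional

FIRST_PROPOSAL = r"^(.+[0-9]) .+$"       # pattern from the opening comment
SIMPLE_MATCH = r"^(\d{2}/\d{2}/\d{4})"   # pattern suggested here


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return capture group 1 if the pattern matches at the start, else None."""
    match = re.match(pattern, text)
    return match.group(1) if match else None


two_dates = "31/05/2024  15/06/2024"
one_date = "31/05/2024"

both_a = first_group(FIRST_PROPOSAL, two_dates)  # "31/05/2024"
both_b = first_group(SIMPLE_MATCH, two_dates)    # "31/05/2024"
single_a = first_group(FIRST_PROPOSAL, one_date) # None: needs a space and more text
single_b = first_group(SIMPLE_MATCH, one_date)   # "31/05/2024"
```

The simpler pattern also works when a row contains only one date, but it requires zero-padded days and months.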

Thank you again

Owner

wezm commented Jun 3, 2024

> I like the rsspls documentation. The time documentation seems to be written for those who already know how it works. It needs examples, starting with the simplest and moving to the more complex (for example: if this ... then you have to ...).

Yes, the time docs I'm linking to are the developer-oriented docs. I think we can do better for end users.

> But couldn't you do it the same way as with time? Choose a Rust regex library and tell users to refer to it?

Yes, I'm not saying it's hard to do, I'm saying that I'm not certain it's how I want to solve the problem. I'd like to gather more data points before committing to it.
