Feature request: integrate regex-based extraction mechanisms #46
Comments
Hmm, that is a tricky one. I will admit that I'm feeling a bit hesitant to throw regexes into the mix. In your specific case it seems the page uses a fixed-width format for the dates, so I think you can use the ignore component. This works in my testing:
format = "[day padding:none]/[month padding:none]/[year][ignore count:12]"
However, it does feel a bit fragile. A nicer option would be to just stop parsing the date after the year. time already has an end-of-input component, but it only succeeds if there is no further input after it. I wonder if they would be open to an option on that to allow terminating parsing when it's encountered, even if there is input remaining.
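To make the "stop parsing the date after the year" idea concrete, here is a minimal Python sketch. It is not rsspls or time code; `parse_date_prefix` and its helpers are hypothetical names used only for illustration, and the day/month/year order is assumed from the format string above.

```python
from datetime import date

DIGITS = "0123456789"


def parse_date_prefix(text: str) -> date:
    """Parse D/M/YYYY at the start of `text` and ignore anything after the year."""
    pos = 0

    def read_int(max_digits: int) -> int:
        # Consume between 1 and max_digits ASCII digits (no zero padding required).
        nonlocal pos
        start = pos
        while pos < len(text) and pos - start < max_digits and text[pos] in DIGITS:
            pos += 1
        if pos == start:
            raise ValueError(f"expected digits at offset {start}")
        return int(text[start:pos])

    def expect(ch: str) -> None:
        # Consume one literal separator character.
        nonlocal pos
        if pos >= len(text) or text[pos] != ch:
            raise ValueError(f"expected {ch!r} at offset {pos}")
        pos += 1

    day = read_int(2)
    expect("/")
    month = read_int(2)
    expect("/")
    year = read_int(4)
    # Parsing stops here on purpose: remaining input is simply not inspected.
    return date(year, month, day)


print(parse_date_prefix("31/05/2024 15/06/2024"))  # 2024-05-31
```

Unlike a fixed ignore count, this approach does not depend on how many characters follow the first date.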
Thank you, @wezm. I have a lot of difficulty understanding that documentation.
It may be fragile, but it seems stable for that site. But I want to try again to propose the introduction of regexes, as an option for all fields to be extracted:
graph TD
A[Webpage] --> B[CSS Selector]
B --> C{Regex}
C -- Filter --> D[RSS]
C -- No filter --> D[RSS]
style C stroke-dasharray: 5, 5;
Unfortunately, I am not a developer and I can't help you with the code; however, I think it could be a very convenient feature, and maybe it is quite straightforward to introduce. Once again, thank you very much.
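The proposed flow (CSS selector, then an optional regex, then the feed) could look roughly like the sketch below. This is an illustration in Python, not rsspls code: `filter_field` is a hypothetical name, and falling back to the unfiltered text when the pattern does not match is an assumption, not something decided in this thread.

```python
import re
from typing import Optional


def filter_field(selected_text: str, pattern: Optional[str] = None) -> str:
    """Optional regex step applied to the text a CSS selector returned."""
    if pattern is None:
        # "No filter" branch: the selected text goes to the feed unchanged.
        return selected_text
    match = re.search(pattern, selected_text)
    if match is None:
        # Assumption: a non-matching pattern leaves the text untouched.
        return selected_text
    # "Filter" branch: keep only the first match.
    return match.group(0)


cell = "31/05/2024 15/06/2024"
print(filter_field(cell))                            # 31/05/2024 15/06/2024
print(filter_field(cell, r"\d{1,2}/\d{1,2}/\d{4}"))  # 31/05/2024
```

Because the step is optional per field, existing configurations without a pattern would behave exactly as before.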
That feels like something I can try to improve too. Was it the
I will keep it in mind, but I'm not ready to add it yet. Regular expressions are a big hammer to solve a problem with, so before incorporating them I'd like to have a decent collection of real-world examples where they would help. I'll leave this issue open to see if such cases come up.
I like the documentation of
But you can't do it like with
Thank you again
Yes the time docs I'm linking to are the developer oriented docs. I think we can do better for end-users.
Yes, I'm not saying it's hard to do, I'm saying that I'm not certain it's how I want to solve the problem. I'd like to gather more data points before committing to it.
Hi,
for this site, whose HTML is not very good, I use the awesome rsspls to generate an RSS feed:
https://bagheria.trasparenza-valutazione-merito.it/web/trasparenza/papca-ap/-/papca/igrid/39908/24264
I use these settings
I would also like to extract the publication date, which is the first date that appears in the td tag with class “date”.
But via CSS selector I don't think there is any way to select the first one.
So if I apply the CSS selector, and set
format = "[day padding:none]/[month padding:none]/[year]"
I get these errors, because I can't find a way to tell it to disregard the second date as well.
It would be great to be able to set some magic character or regex to skip some text. For example, to extract only
31/05/2024
from
31/05/2024 15/06/2024
and then apply the format.
Thank you
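As a sketch of the requested behaviour (extract the first date, then apply the format), the Python below uses a regex followed by an ordinary date parse. It is only an illustration, not an rsspls feature: `first_date` and `DATE_PATTERN` are hypothetical names, and `%d/%m/%Y` is assumed to be the Python counterpart of the format string above.

```python
import re
from datetime import date, datetime

# One or two digits for day and month (no padding required), four for the year.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def first_date(cell_text: str) -> date:
    """Return the first D/M/YYYY date found in the cell text."""
    match = DATE_PATTERN.search(cell_text)
    if match is None:
        raise ValueError(f"no date found in {cell_text!r}")
    # Apply the date format only to the extracted text, not to the whole cell.
    return datetime.strptime(match.group(0), "%d/%m/%Y").date()


print(first_date("31/05/2024 15/06/2024"))  # 2024-05-31
```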