Add "split" xpath in post-processing, newlines in replace support #579
Added support for newlines when they are added with Replace with:
Tested mainly with the scraper above and the isthisreal.com site, and also with already existing scrapers, to make sure no regression occurs.
Tests ok. Just a minor change.
name: "AngelaWhite"
sceneByURL:
  - action: scrapeXPath
    url:
      - angelawhite.com/tour/trailers/
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $info: //div[@class="video-details-inner"]
      $search: //div[@class="video-details-inner"]/h2/text()
    scene:
      Title: $info/h2/text()
      Date:
        selector: $info/span/text()
        parseDate: Jan 2, 2006
      Details: $info//p/text()
      Performers:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="models-list"]/a/text()
            concat: ","
          split: ","
      Tags:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="categories-list"]/a/text()
            concat: ","
          split: ","
      Image:
        selector: $search
        replace:
          - regex: \s
            with: "+"
          - regex: ^
            with: "http://angelawhite.com/tour/search?query=\""
          - regex: $
            with: "\""
        subScraper:
          selector: //img/@src0
          replace:
            - regex: ^
              with: http://angelawhite.com
# Last Updated June 17, 2020
The above is another sample scraper that demonstrates/uses the
I think the newline issue will need to be revisited further in future. I was unable to get a unit test exercising it to work.
I added the following to the scene test html:
<div class="description">
This<br>
is<br>
a<br>
multi-line<br>
description.
</div>
Added the following to makeSceneXPathConfig:
detailsConfig := make(map[interface{}]interface{})
detailsConfig["selector"] = `//div[@class="description"]`
var detailsReplace []interface{}
detailsReplace = append(detailsReplace, makeReplaceRegex(`<br>`, "\n"))
detailsReplace = append(detailsReplace, makeReplaceRegex(` +`, " "))
detailsConfig["replace"] = detailsReplace
config["Details"] = detailsConfig
And the following to TestApplySceneXPathConfig:
const details = "This\nis\na\nmulti-line\nfield."
verifyField(t, details, scene.Details, "Details")
And got the following results:
--- FAIL: TestApplySceneXPathConfig (0.00s)
xpath_test.go:792: Expected Details to be set to This
is
a
multi-line
field., instead got This is a multi-line description.
The call to NodeText() in process seems to be converting the <br> tags to newlines already, then commonPostProcess removes the newlines, so the regex never actually gets hit.
All that said, it is working for the scrapers that use the newline in the replacement as far as I can tell, so I'll put this through for now.
Don't forget to update the wiki once this is merged.
Adds support for split, which is the opposite of concat.
It splits a string on the separator it is given; the separator itself is removed.
If some of the resulting strings are empty, e.g. when splitting
tag1,,tag2,tag3
with , then they are ignored. The result should be
0: tag1
1: tag2
2: tag3
It is executed last in order of processing: concat -> replace -> parseDate -> split
The community scraper for AdultTimeStudios for example can be written like this
EDIT: the scraper is still not complete; not everything works as it should yet. It's mainly to demonstrate the split and "\n" usage.