Add "split" xpath in post-processing , newlines in replace support #579

bnkai · 2020-05-25T11:53:31Z

Adds support for split, which is the opposite of concat.
It splits a string on the separator it is given and removes the separator from the results.
Empty strings in the result are ignored, e.g. when splitting tag1,,tag2,tag3 with ,
the result should be 0:tag1 1:tag2 2:tag3

It is executed last in order of processing: concat -> replace -> parseDate -> split

For example, the community scraper for AdultTimeStudios can be written like this:

name: "AdultTimeStudios"
sceneByURL:
  - action: scrapeXPath
    url:
      - puretaboo.com/
      - isthisreal.com/en/video/
      - 21sextury.com/en/video/

    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $videoscript: //script[contains(text(), 'ScenePlayerId = "player"')]/text()
      $datascript: //script[contains(text(), 'sceneDetails')]/text()
      $imagescript: //script[contains(text(), 'picPreview')]/text()
    scene:
      Title:
        selector: $videoscript
        replace:
          - regex: .+(?:"sceneTitle":")([^"]+).+
            with: $1
          - regex: .+(?:"sceneTitle":"").+
            with:
      Date:
        selector: $videoscript
        replace:
          - regex: .+(?:"sceneReleaseDate":")([^"]+).+
            with: $1
        parseDate: 2006-01-02
      Details:
        selector: $datascript
        replace:
          - regex: .+(?:sceneDescription":")(.+)(?:","sceneActors).+
            with: $1
          - regex: .+(?:"sceneDescription":"").+
            with:
          - regex: <\\\/br>|<br\s\\\/>|<br>
            with: "\n"
      Tags:
        # Section still being worked on
        Name:
          selector: $datascript
          replace:
            - regex: .+(?:sceneCategories":\[)(.+)(?:\],"sceneViews").+
              with: $1
            - regex: \"
              with:
          split: ","
      Performers:
        # Section still being worked on
        Name:
          selector: $datascript
          replace:
            - regex: .+(?:"sceneActors":)(.+)(?:,"sceneCategories").+
              with: $1
            - regex: \d+|actorId|actorName|\[|\]|"|\{|:|\}
              with:
          split: ","
      Image:
        selector: $imagescript
        replace:
          - regex: .+(?:picPreview":")([\w:]+)(?:[\\\/]+)([\w-\.]+)(?:[\\\/]+)(\w+)(?:[\\\/]+)(\d+)(?:[\\\/]+)([\d_]+)(?:[\\\/]+)(\w+)(?:[\\\/]+)(\d+)(?:[\\\/]+)(\d+)(?:[\\\/]+)([\w]+)(?:[\\\/]+)([\w.]+).+
            with: $1//$2/$3/$4/$5/$6/$7/$8/$9/$10
            # if using the transport subdomain, parameters need to be passed
            # otherwise a cropped square image is returned by default
          - regex: (https:\/\/transform.+)
            with: $1?width=960&height=543&enlarge=true

EDIT: The scraper is still incomplete and not everything works as it should yet. It is mainly here to demonstrate the split and "\n" usage.

bnkai · 2020-06-06T10:25:40Z

Added support for newlines that are introduced through a replace with: clause.
If you add a regex with a with: "\n" clause, those newlines should be preserved.
It works by replacing "\n" with "\r" before commonPostprocessing is applied and restoring the newlines afterwards.
Because commonPostprocessing is applied in several places, this was the simplest way I could implement it.
The only edge case is a field that already contains "\r", but I didn't encounter that. It could be altered to use some other special character instead of "\r" if needed.

Tested mainly with the scraper above on isthisreal.com, and also with existing scrapers to make sure no regression occurs.

WithoutPants

Tests ok. Just a minor change.

pkg/scraper/xpath.go

bnkai · 2020-06-17T16:23:23Z

name: "AngelaWhite"
sceneByURL:
  - action: scrapeXPath
    url:
      - angelawhite.com/tour/trailers/
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $info: //div[@class="video-details-inner"]
      $search: //div[@class="video-details-inner"]/h2/text()
    scene:
      Title: $info/h2/text()
      Date:
        selector: $info/span/text()
        parseDate: Jan 2, 2006
      Details: $info//p/text()
      Performers:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="models-list"]/a/text()
            concat: ","
          split: ","
      Tags:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="categories-list"]/a/text()
            concat: ","
          split: ","
      Image:
        selector: $search
        replace: 
          - regex: \s
            with: "+"
          - regex: ^
            with: "http://angelawhite.com/tour/search?query=\""
          - regex: $
            with: "\""
        subScraper:
          selector: //img/@src0
          replace:
            - regex: ^
              with: http://angelawhite.com
# Last Updated June 17, 2020

The above is another sample scraper that uses the split function.

WithoutPants

I think the newline issue will need to be revisited in the future. I was unable to get a unit test that exercises it to work.

I added the following to the scene test html:

<div class="description">
    This<br>
    is<br>
    a<br>
    multi-line<br>
    description.
</div>

Added the following to makeSceneXPathConfig:

	detailsConfig := make(map[interface{}]interface{})
	detailsConfig["selector"] = `//div[@class="description"]`
	var detailsReplace []interface{}
	detailsReplace = append(detailsReplace, makeReplaceRegex(`<br>`, "\n"))
	detailsReplace = append(detailsReplace, makeReplaceRegex(`  +`, " "))
	detailsConfig["replace"] = detailsReplace
	config["Details"] = detailsConfig

And the following to TestApplySceneXPathConfig:

const details = "This\nis\na\nmulti-line\nfield."
verifyField(t, details, scene.Details, "Details")

And got the following results:

-- FAIL: TestApplySceneXPathConfig (0.00s)
    xpath_test.go:792: Expected Details to be set to This
        is
        a
        multi-line
        field., instead got This is a multi-line description.

The call to NodeText() in process seems to be converting the <br> tags to newlines already, then commonPostProcess removes the newlines, so the regex never actually gets hit.

All that said, it is working for the scrapers that use the newline in the replacement as far as I can tell, so I'll put this through for now.

WithoutPants · 2020-06-18T00:26:52Z

Don't forget to update the wiki once this is merged.

bnkai added 2 commits May 25, 2020 14:34

Add split helper function for xpath

a40a4b1

* typo fix

6c03261

bnkai added the feature Pull requests that add a new feature label May 25, 2020

bnkai added 3 commits May 26, 2020 11:01

* fix typo

e9b11cf

Merge remote-tracking branch 'upstream/develop' into issues/573

747b5ae

* keep newlines when they are added through "with:"

7311e6e

bnkai changed the title ~~Add "split" xpath post-processing support~~ Add "split" xpath in post-processing , newlines in replace support Jun 6, 2020

bnkai mentioned this pull request Jun 6, 2020

[Bug Report] XPath Scraper shouldn't remove newlines for Detail fields #591

Open

WithoutPants mentioned this pull request Jun 17, 2020

Refactor xpath scraper code. Add fixed and map #616

Merged

WithoutPants added this to the Version 0.3.0 milestone Jun 17, 2020

WithoutPants requested changes Jun 17, 2020

bnkai added 2 commits June 17, 2020 15:23

Merge remote-tracking branch 'upstream/develop' into issues/573

10c93a2

* refactor, switch from \r to \a

e42591c

bnkai mentioned this pull request Jun 17, 2020

Adds AngelaWhite xPath scraper stashapp/CommunityScrapers#77

Merged

WithoutPants approved these changes Jun 18, 2020

WithoutPants added 2 commits June 18, 2020 10:27

Merge remote-tracking branch 'upstream/develop' into prs/579

d2ce136

Update changelog

6622d62

WithoutPants merged commit 9d0522f into stashapp:develop Jun 18, 2020

Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021

Add "split" xpath in post-processing , newlines in replace support (s…

82509bc

…tashapp#579)
