Add "split" xpath in post-processing , newlines in replace support #579

Merged (9 commits) on Jun 18, 2020


Collaborator

@bnkai bnkai commented May 25, 2020

Adds support for split, which is the opposite of concat.
It splits a string on the given separator and removes the separator from the results.
Empty strings in the result are ignored. For example, splitting tag1,,tag2,tag3 on ,
should give 0:tag1 1:tag2 2:tag3

It is executed last in order of processing: concat -> replace -> parseDate -> split

The community scraper for AdultTimeStudios, for example, can be written like this:

name: "AdultTimeStudios"
sceneByURL:
  - action: scrapeXPath
    url:
      - puretaboo.com/
      - isthisreal.com/en/video/
      - 21sextury.com/en/video/

    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $videoscript: //script[contains(text(), 'ScenePlayerId = "player"')]/text()
      $datascript: //script[contains(text(), 'sceneDetails')]/text()
      $imagescript: //script[contains(text(), 'picPreview')]/text()
    scene:
      Title:
        selector: $videoscript
        replace:
          - regex: .+(?:"sceneTitle":")([^"]+).+
            with: $1
          - regex: .+(?:"sceneTitle":"").+
            with:
      Date:
        selector: $videoscript
        replace:
          - regex: .+(?:"sceneReleaseDate":")([^"]+).+
            with: $1
        parseDate: 2006-01-02
      Details:
        selector: $datascript
        replace:
          - regex: .+(?:sceneDescription":")(.+)(?:","sceneActors).+
            with: $1
          - regex: .+(?:"sceneDescription":"").+
            with:
          - regex: <\\\/br>|<br\s\\\/>|<br>
            with: "\n"
      Tags:
        # Section still being worked on
        Name:
          selector: $datascript
          replace:
            - regex: .+(?:sceneCategories":\[)(.+)(?:\],"sceneViews").+
              with: $1
            - regex: \"
              with:
          split: ","
      Performers:
        # Section still being worked on
        Name:
          selector: $datascript
          replace:
            - regex: .+(?:"sceneActors":)(.+)(?:,"sceneCategories").+
              with: $1
            - regex: \d+|actorId|actorName|\[|\]|"|\{|:|\}
              with:
          split: ","
      Image:
        selector: $imagescript
        replace:
          - regex: .+(?:picPreview":")([\w:]+)(?:[\\\/]+)([\w-\.]+)(?:[\\\/]+)(\w+)(?:[\\\/]+)(\d+)(?:[\\\/]+)([\d_]+)(?:[\\\/]+)(\w+)(?:[\\\/]+)(\d+)(?:[\\\/]+)(\d+)(?:[\\\/]+)([\w]+)(?:[\\\/]+)([\w.]+).+
            with: $1//$2/$3/$4/$5/$6/$7/$8/$9/$10
            # if using the transport subdomain, parameters need to be passed
            # otherwise a cropped square image is returned by default
          - regex: (https:\/\/transform.+)
            with: $1?width=960&height=543&enlarge=true

EDIT: The scraper is still incomplete and not everything works as it should yet. It is mainly here to demonstrate split and "\n" usage.

@bnkai bnkai added the feature Pull requests that add a new feature label May 25, 2020
Collaborator Author

bnkai commented Jun 6, 2020

Added support for newlines that are introduced through a replace with: clause.
If you add a regex with a with: "\n" clause, those newlines should now be preserved.
It works by replacing "\n" with "\r" before commonPostprocessing is applied and restoring the newlines afterwards.
Because commonPostprocessing is applied in several places, this was the simplest way I could implement it.
The only edge case is a field that already contains "\r", but I didn't encounter that. It could be changed to use some other special character instead of "\r" if needed.
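
A rough Go sketch of the placeholder trick, not the PR's code. The stripNewlines function is an assumed stand-in for commonPostprocessing (trim, drop "\n", collapse runs of spaces), and keepNewlines is a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var multiSpace = regexp.MustCompile(`  +`)

// stripNewlines is an assumed stand-in for the common post-processing step:
// it trims the value, removes "\n" and collapses runs of spaces.
func stripNewlines(s string) string {
	s = strings.TrimSpace(s)
	s = strings.ReplaceAll(s, "\n", "")
	return multiSpace.ReplaceAllString(s, " ")
}

// keepNewlines is a hypothetical wrapper: it hides "\n" behind "\r" so the
// post-processing step leaves it alone, then restores it afterwards.
func keepNewlines(s string) string {
	protected := strings.ReplaceAll(s, "\n", "\r")
	cleaned := stripNewlines(protected)
	return strings.ReplaceAll(cleaned, "\r", "\n")
}

func main() {
	in := "First line.\nSecond  line."
	fmt.Printf("%q\n", stripNewlines(in)) // "First line.Second line."
	fmt.Printf("%q\n", keepNewlines(in))  // "First line.\nSecond line."
}
```

In this sketch a pre-existing "\r" comes back as "\n", which is the edge case mentioned above, and protected newlines at the very start or end of the value are still trimmed.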

Tested mainly with the scraper above against the isthisreal.com site, and also with existing scrapers to make sure no regressions occur.

@bnkai bnkai changed the title Add "split" xpath post-processing support Add "split" xpath in post-processing , newlines in replace support Jun 6, 2020
@WithoutPants WithoutPants added this to the Version 0.3.0 milestone Jun 17, 2020
Collaborator

@WithoutPants WithoutPants left a comment

Tests ok. Just a minor change.

pkg/scraper/xpath.go (outdated, resolved)
Collaborator Author

bnkai commented Jun 17, 2020

name: "AngelaWhite"
sceneByURL:
  - action: scrapeXPath
    url:
      - angelawhite.com/tour/trailers/
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    common:
      $info: //div[@class="video-details-inner"]
      $search: //div[@class="video-details-inner"]/h2/text()
    scene:
      Title: $info/h2/text()
      Date:
        selector: $info/span/text()
        parseDate: Jan 2, 2006
      Details: $info//p/text()
      Performers:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="models-list"]/a/text()
            concat: ","
          split: ","
      Tags:
        Name:
          selector: $search
          replace:
            - regex: (^\d+)\s.+
              with: "http://angelawhite.com/tour/search?query=$1"
          subScraper:
            selector: //div[@class="categories-list"]/a/text()
            concat: ","
          split: ","
      Image:
        selector: $search
        replace: 
          - regex: \s
            with: "+"
          - regex: ^
            with: "http://angelawhite.com/tour/search?query=\""
          - regex: $
            with: "\""
        subScraper:
          selector: //img/@src0
          replace:
            - regex: ^
              with: http://angelawhite.com
# Last Updated June 17, 2020

The above is another sample scraper that uses the split function.

Collaborator

@WithoutPants WithoutPants left a comment

I think the newline issue will need to be revisited further in future. I was unable to get a unit test exercising it to work.

I added the following to the scene test html:

<div class="description">
    This<br>
    is<br>
    a<br>
    multi-line<br>
    description.
</div>

Added the following to makeSceneXPathConfig:

	detailsConfig := make(map[interface{}]interface{})
	detailsConfig["selector"] = `//div[@class="description"]`
	var detailsReplace []interface{}
	detailsReplace = append(detailsReplace, makeReplaceRegex(`<br>`, "\n"))
	detailsReplace = append(detailsReplace, makeReplaceRegex(`  +`, " "))
	detailsConfig["replace"] = detailsReplace
	config["Details"] = detailsConfig

And the following to TestApplySceneXPathConfig:

const details = "This\nis\na\nmulti-line\nfield."
verifyField(t, details, scene.Details, "Details")

And got the following results:

--- FAIL: TestApplySceneXPathConfig (0.00s)
    xpath_test.go:792: Expected Details to be set to This
        is
        a
        multi-line
        field., instead got This is a multi-line description.

The call to NodeText() in process seems to be converting the <br> tags to newlines already, then commonPostProcess removes the newlines, so the regex never actually gets hit.
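
The effect can be reproduced in isolation with a small Go sketch. It does not call the real NodeText() or commonPostProcess; the extracted string below is an assumed approximation of what text extraction returns for the test HTML, and stripNewlines is an assumed stand-in for the post-processing step.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	brTag      = regexp.MustCompile(`<br>`)
	multiSpace = regexp.MustCompile(`  +`)
)

// stripNewlines is an assumed stand-in for the post-processing step:
// trim, remove "\n", collapse runs of spaces.
func stripNewlines(s string) string {
	s = strings.TrimSpace(s)
	s = strings.ReplaceAll(s, "\n", "")
	return multiSpace.ReplaceAllString(s, " ")
}

// process applies the two replace rules from the test config, then the
// assumed post-processing, to already extracted text.
func process(extracted string) string {
	v := brTag.ReplaceAllString(extracted, "\n") // no-op: no tags are left to match
	v = multiSpace.ReplaceAllString(v, " ")
	return stripNewlines(v)
}

func main() {
	// Assumed shape of the extracted text: tags are gone, line breaks remain.
	extracted := "\n    This\n    is\n    a\n    multi-line\n    description.\n"
	fmt.Println(brTag.MatchString(extracted)) // false
	fmt.Printf("%q\n", process(extracted))    // "This is a multi-line description."
}
```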

All that said, it is working for the scrapers that use the newline in the replacement as far as I can tell, so I'll put this through for now.

Collaborator

Don't forget to update the wiki once this is merged.

@WithoutPants WithoutPants merged commit 9d0522f into stashapp:develop Jun 18, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
Labels
feature Pull requests that add a new feature