Weasyl Extractor #977

Merged (3 commits) on Sep 25, 2020

Conversation


@Korvox Korvox commented Sep 4, 2020

Closes #813

Supports individual posts, user galleries, folders, a single journal, or all of a user's journals. Uses the Weasyl API for everything except journals, since there is no endpoint for those.

  • Currently user extraction is an alias for submissions extraction, but it could be made its own extractor covering both submissions and journals. I feel like most people would just want the gallery if they fed in a username, though.
  • There might be someone who wants to download a favorites collection, which is not a folder or gallery. There is no API for this, so it would require iterating over the pages of the collection and grabbing the submissions. If someone actually wants this, it's pretty trivial to implement, but it's a lot of non-API requests, so I left it out for now.
  • There is no way to filter by file type - if someone has text files, music, videos, and pictures in their submissions you get all of them. If someone wants filtering it can be added.

docs/supportedsites.rst (outdated review comment, resolved)
@kattjevfel
Contributor

Decided to give this one a go, and other than the supportedsites.py issue, the only feedback I've got is to perhaps set a different default filename format. Other extractors tend to prefix the filenames with the category.

My suggestion would be something like {category}_{submitid}_{title}.{extension} as the default, as that seems to be most common one.

2020-09-22_darkchibishadow-solanaceae-prologue-chapter-2-5-page-1.png --> weasyl_1948673_Solanaceae - Prologue Chapter 2.5 - Page 1.png

Other than that, seems to work great!

@Korvox
Author

Korvox commented Sep 23, 2020

My suggestion would be something like {category}_{submitid}_{title}.{extension} as the default, as that seems to be most common one.

Most extractors format title to remove spaces, lowercase, etc. They don't just stick the raw title in a filename. The {filename} part is nice because Weasyl already does this formatting in its own filenames.

I can get adding {category} for consistency, but I do like keeping the date around since it's data that isn't carried anywhere else (it isn't in the image metadata) and can be useful for sorting. Most extractors just don't have the luxury of getting it via an API.

How's {category}_{filename}_{date}? That respects the folder hierarchy containing it. The submitid is an internal detail the user shouldn't really care about.

@kattjevfel
Contributor

I was mainly thinking of the imgur, gfycat and furaffinity extractors, which do in fact just use the title; perhaps others do too. I personally find the date really useless; you get the same thing with the ID (which is why it goes after {category}, for sorting).

filename_fmt = "{category}_{id}{title:?_//}.{extension}"

filename_fmt = "{category}_{gfyName}{title:?_//}.{extension}"

filename_fmt = "{id} {title}.{extension}"

Though I see now that the furaffinity one is also lacking {category}, so idk how important consistency is.
Another reason for using {title} is that the {filename} on Weasyl is quite useless for identification and is not unique.
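
For readers unfamiliar with the `{title:?_//}` part of the format strings above: as far as I understand gallery-dl's format syntax, `?<before>/<after>/` wraps the value in the given prefix and suffix only when the value is non-empty, and otherwise the whole field collapses to an empty string. The sketch below imitates that behavior in plain Python. The helper names `optional_field` and `build_filename` are hypothetical and this is not gallery-dl's implementation.

```python
def optional_field(value, before="", after=""):
    """Return before + value + after, or "" when the value is empty or None."""
    if not value:
        return ""
    return f"{before}{value}{after}"


def build_filename(category, post_id, title, extension):
    # Same shape as "{category}_{id}{title:?_//}.{extension}":
    # the "_" separator only appears when there is a title to separate.
    return f"{category}_{post_id}{optional_field(title, before='_')}.{extension}"


if __name__ == "__main__":
    print(build_filename("imgur", "abc123", "Sunset", "png"))
    print(build_filename("imgur", "abc123", "", "png"))
```

With a title the result keeps the underscore separator; without one, no dangling underscore is left before the extension.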

@mikf
Owner

mikf commented Sep 23, 2020

I think it is generally a good idea for sites with 1 file per post and many posts per user to try and replicate the general structure of the furaffinity module, but it might be a bit too late for that.

  • As for the filename_fmt, {submitid} {title}.{extension} sounds like a good idea
  • Usernames can contain -. Replace all ([\w\d]+) with ([\w-]+) (\d is included in \w)
  • There is a text.parse_datetime() function, which should be used to parse dates into datetime objects
  • Why do journals have a completely different filename_fmt than everything else?
  • Why doesn't retrieve_journal(self) have a journalid parameter, but instead uses self.journalid?
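
To illustrate the username point in the list above, here is a small sketch comparing the two character classes. The URL patterns are simplified stand-ins written for this example (assuming profile URLs of the form `weasyl.com/~name`), not the extractor's actual patterns.

```python
import re

# Hypothetical, simplified patterns for illustration only.
OLD_PATTERN = re.compile(r"weasyl\.com/~([\w\d]+)")
NEW_PATTERN = re.compile(r"weasyl\.com/~([\w-]+)")


def match_username(pattern, url):
    """Return the captured username, or None if the URL does not match."""
    match = pattern.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = "https://www.weasyl.com/~some-user"
    # The old class stops at the first "-", truncating the name.
    print(match_username(OLD_PATTERN, url))
    # The new class accepts "-" and captures the whole name.
    print(match_username(NEW_PATTERN, url))
    # \w already matches digits, so \d inside the class is redundant.
    print(re.fullmatch(r"\w+", "user123") is not None)
```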

@Korvox
Author

Korvox commented Sep 23, 2020

There is a text.parse_datetime() function, which should be used to parse dates into datetime objects

If I'm not encoding date in the filename_fmt anymore all the date stuff can be dropped. Mind if I take a stab at #374 to embed this stuff? Everything else is in the latest revision. Unrelated question but is there a way to run tests on just the one extractor?

@mikf
Owner

mikf commented Sep 25, 2020

Ok then, thanks a lot @Korvox. Time to merge this.

If I'm not encoding date in the filename_fmt anymore all the date stuff can be dropped

The filename_fmt value of each extractor is just the default. Anyone can change it to their own taste with the filename option. More metadata fields are usually better.
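
As a rough illustration of overriding the default, the sketch below builds a configuration fragment that sets the filename option for the weasyl category and prints it as JSON. The format string is the one suggested earlier in this thread; whether every field in it is available depends on the metadata the extractor provides, so treat it as an assumption.

```python
import json

# Per-extractor override of the default filename format.
# The format string is taken from the suggestion earlier in this thread.
config = {
    "extractor": {
        "weasyl": {
            "filename": "{category}_{submitid}_{title}.{extension}"
        }
    }
}

config_text = json.dumps(config, indent=4)

if __name__ == "__main__":
    # Paste the printed JSON into your gallery-dl configuration file.
    print(config_text)
```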

Mind if I take a stab at #374 to embed this stuff?

Sure. Do you know of a good (cross-platform) library that can do this?
(I'd recommend implementing this as a postprocessor module, by the way)

Unrelated question but is there a way to run tests on just the one extractor?

Running test/test_results.py with the category value you want to test as an argument is the closest you can get, but you could also modify this script as needed.

$ python test/test_results.py weasyl

@mikf mikf merged commit ebb7737 into mikf:master Sep 25, 2020
@tux93
Contributor

tux93 commented Sep 25, 2020

First of all thank you very much for implementing this @Korvox !

  • There might be someone who wants to download a favorites collection, which is not a folder or gallery. There is no API for this, so it would require iterating over the pages of the collection and grabbing the submissions. If someone actually wants this, it's pretty trivial to implement, but it's a lot of non-API requests, so I left it out for now.

Should I open a follow-up issue for this? I think it would be worth having.

@mikf
Owner

mikf commented Sep 25, 2020

@tux93 #1032

@God-damnit-all
Contributor

Is there any way to log in or use cookies? As it stands, without a way to log in, only SFW submissions can be downloaded.

Being able to use an API key generated in my account settings would be good too.

@Korvox
Author

Korvox commented Oct 11, 2020

You can generate an API key here: https://www.weasyl.com/control/apikeys

API calls want the X-Weasyl-API-Key header set to it.

I'll look into hooking it into extractor.config. Tumblr does basically the same thing.
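
A minimal sketch of what setting that header looks like with the standard library, outside of gallery-dl. The API root and the `whoami` path are assumptions made for this example, and `build_api_request` is a hypothetical helper; the request is only constructed here, not sent.

```python
from urllib.request import Request

# Assumed API root for illustration.
API_ROOT = "https://www.weasyl.com/api"


def build_api_request(path, api_key):
    """Build (but do not send) a request carrying the API key header."""
    request = Request(f"{API_ROOT}/{path.lstrip('/')}")
    request.add_header("X-Weasyl-API-Key", api_key)
    return request


if __name__ == "__main__":
    req = build_api_request("whoami", "your-api-key-here")
    print(req.full_url)
    print(req.header_items())
```

Passing the result to `urllib.request.urlopen` would perform the actual call with the key attached.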

@God-damnit-all
Contributor

You can generate an API key here: https://www.weasyl.com/control/apikeys

API calls want the X-Weasyl-API-Key header set to it.

I'll look into hooking it into extractor.config. Tumblr does basically the same thing.

Does gallery-dl even have a way of using custom header values other than User-Agent?

@God-damnit-all
Contributor

@Korvox There's another problem I ran into. Usernames are allowed to have tildes in their name, which means two tildes in the URL. This messes up the pattern matching for extraction.

@Korvox
Author

Korvox commented Oct 11, 2020

Can you link me an example of it?

It's a weird edge case if it exists, because all the API requests use login_names: "A user’s username, lowercase, and omitting all non-alphanumeric, non-ASCII characters."
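
Read literally, that definition suggests a normalization along the following lines. This is only a sketch of my reading of the quoted sentence (lowercase, then keep ASCII letters and digits); `to_login_name` is a hypothetical helper, and Weasyl's actual rules may differ in details.

```python
def to_login_name(username):
    """Lowercase the name and keep only ASCII letters and digits."""
    return "".join(
        ch for ch in username.lower()
        if ch.isascii() and ch.isalnum()
    )


if __name__ == "__main__":
    # Tildes, hyphens and spaces are dropped under this reading.
    print(to_login_name("Some~User-Name 42"))
```

Under this reading a tilde in a display name would never reach the API, which is why a second tilde in a URL would be surprising.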

Development

Successfully merging this pull request may close these issues.

[Site Support Request] Weasyl
5 participants