Text posts download archive (archive for postprocessors) #2421

Closed
AlttiRi opened this issue Mar 18, 2022 · 13 comments

@AlttiRi

AlttiRi commented Mar 18, 2022

I would like to have the archive for text posts.

Currently, a metadata postprocessor, for example this one for Twitter:

        "twitter":
        {
            "retweets": "original",
            "directory": ["[gallery-dl]", "[{category}] {author[name]}"],
            "filename": "[{category}] {author[name]}—{date:%Y.%m.%d}—{retweet_id|tweet_id}—{filename}.{extension}",
            "text-tweets": true,
            "postprocessors": [{
                "name": "mtime",
                "event": "post"
            }, {
                "directory": "metadata",
                "filename": "[{category}] {author[name]}—{date:%Y.%m.%d}—{retweet_id|tweet_id}.html",
                "name": "metadata",
                "event": "post",
                "mtime": true,
                "mode": "custom",
                "format": "<div id='{retweet_id|tweet_id}'><h4><a href='https://twitter.com/{author[name]}/status/{retweet_id|tweet_id}'>{retweet_id|tweet_id}</a> by <a href='https://twitter.com/{author[name]}'>{author[name]}</a></h4><div class='content'>{content}</div><hr><div>{date:%Y.%m.%d %H:%M:%S}</div><hr></div><br>"
            }]
        },

will create text files on every run.

It's very inconvenient if you use the --download-archive option.
I need to delete a lot of tiny duplicate metadata files every time.

I think it should look like this (in the second postprocessor):

    "archive": "~/gallery-dl-postprocessors.sqlite",
    "archive-format": "twitter_postprocessor1_{retweet_id|tweet_id}",
  • "archive" specifies where to store entries (a SQLite DB file).
  • "archive-format" is the entry format.
@mikf
Owner

mikf commented Mar 19, 2022

The problem with implementing this feature is that files generated by post processors do not go through the same pipeline as regular, downloaded files do, and that --download-archive only handles single files and not entire posts. It would be a lot easier and cleaner to implement if that weren't the case, but I'll see what I can do with the current code base.


            "format": "<div id='{retweet_id|tweet_id}'><h4><a href='https://twitter.com/{author[name]}/status/{retweet_id|tweet_id}'>{retweet_id|tweet_id}</a> by <a href='https://twitter.com/{author[name]}'>{author[name]}</a></h4><div class='content'>{content}</div><hr><div>{date:%Y.%m.%d %H:%M:%S}</div><hr></div><br>"

You could move that long format string to an external file and reference it here with "format": "\fT /path/to/file.txt"

mikf added a commit that referenced this issue Mar 21, 2022
'archive', 'archive-format', and 'archive-prefix'
@mikf
Owner

mikf commented Mar 21, 2022

The metadata post processor now has the same archive options as regular files do: 9bd27b1.
It worked during my limited testing, but there are always bugs with things like this. Let me know if you find any.
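
To illustrate the idea behind an archive for post-level files, here is a minimal sketch, not gallery-dl's actual implementation. It assumes a SQLite table with one row per entry ID; the helper names `open_archive` and `should_write`, the table layout, and the sample IDs are made up for this example. The entry format is the one that comes up later in this thread.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a SQLite archive holding one row per processed entry."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")
    return conn


def should_write(conn, entry_format, metadata):
    """Return True the first time an entry ID is seen, False on later runs."""
    entry = entry_format.format_map(metadata)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO archive (entry) VALUES (?)", (entry,))
    except sqlite3.IntegrityError:
        # The primary key already exists: this post was handled before.
        return False
    return True


archive = open_archive()  # in-memory here; a file path would persist between runs
post = {"tweet_id": 1505000000000000000, "retweet_id": 0}  # made-up sample values
fmt = "{tweet_id}_{retweet_id}_p1"

first_run = should_write(archive, fmt, post)
second_run = should_write(archive, fmt, post)
print(first_run, second_run)
```

With a check like this in front of the file write, a second run over the same post would skip regenerating the metadata file.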

@Hrxn
Contributor

Hrxn commented Mar 21, 2022

Based on the description for metadata.mtime, i.e.

For example, a metadata post processor for "event": "post" will not be able to set its file's modification time unless an mtime post processor with "event": "post" runs before it.

Seems like this does apply to this example here, no?
Let's see, setting up such a postprocessor config (including the new "archive" option) is basically something like this, right?

(A working example of such a postprocessor setup is here):
#2421 (comment)

Now, I'll only have to activate this postprocessor config for the Twitter extractor...

        "twitter":
        {
            "<Rest of Twitter options here, yada yada>"

            "postprocessors": ["modtime", "twitter-thread"]
        },

@mikf
Owner

mikf commented Mar 21, 2022

That should work, except you will also need a custom archive-format for "event": "post". If that is not given, it tries to use the default per-file archive ID format string with a slight modification, which contains a {num} field that is not defined for posts.
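
A small sketch of why the per-file default cannot be reused for posts, assuming plain `str.format_map` semantics and made-up sample values. It uses the per-file default quoted elsewhere in this thread and ignores the "slight modification" mentioned above, since the thread does not spell it out. The helper `try_format` is hypothetical.

```python
# Per-file default quoted in this thread; {num} only exists for individual files.
DEFAULT_FILE_FORMAT = "{tweet_id}_{retweet_id}_{num}"
# A post-level alternative that only uses fields available for whole posts.
POST_FORMAT = "{tweet_id}_{retweet_id}"

# Sample post-level metadata (made-up values); note there is no "num" key.
post_metadata = {"tweet_id": 123, "retweet_id": 0}


def try_format(fmt, metadata):
    """Format an archive ID, returning the error text instead of raising."""
    try:
        return fmt.format_map(metadata)
    except KeyError as exc:
        return f"KeyError: {exc}"


print(try_format(DEFAULT_FILE_FORMAT, post_metadata))
print(try_format(POST_FORMAT, post_metadata))
```

The first call reports the missing `num` field, while the second produces a usable post-level ID.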

@AlttiRi
Author

AlttiRi commented Mar 21, 2022

As far as I understand, "archive-format" is mandatory, since it is not defined by default (for post processors). And it should be chosen wisely.


For

"archive": "~/gallery-dl-postprocessors.sqlite",
"archive-format": "twitter_p1_{retweet_id|tweet_id}",

I get

KeyError: 'retweet_id|tweet_id'

UPD.

Ok, anyway I can use twitter_p1_{tweet_id}_{retweet_id}. Maybe it will be even better.

UPD2.

Checked the database. I don't need to add the twitter string prefix manually, so:
{tweet_id}_{retweet_id}_p1

Here _p1 is just in case, to indicate that it is an entry for a certain postprocessor.

@Hrxn
Contributor

Hrxn commented Mar 21, 2022

Twitter text/metadata extraction postprocessor:

    "postprocessor":
    {
        "modtime":
        {
            "name": "mtime",
            "event": "post"
        },

        "twitter-thread":
        {
            "name": "metadata",
            "directory": "metadata",
            "filename": "[{category}] {author[name]}—{date:%Y.%m.%d}—{retweet_id|tweet_id}.html",
            "event": "post",
            "mtime": true,
            "mode": "custom",
            "format": "<div id='{retweet_id|tweet_id}'><h4><a href='https://twitter.com/{author[name]}/status/{retweet_id|tweet_id}'>{retweet_id|tweet_id}</a> by <a href='https://twitter.com/{author[name]}'>{author[name]}</a></h4><div class='content'>{content}</div><hr><div>{date:%Y.%m.%d %H:%M:%S}</div><hr></div><br>",
            "archive": "E:\\Archives\\Downloads\\gallery-dl\\gldl-archive-twitter-metadata.db",
            "archive-format": "{tweet_id}_{retweet_id}"
        }
    }

(This is the default archive-format from the Twitter extractor, minus {num}.)

@mikf
Owner

mikf commented Mar 21, 2022

@AlttiRi archive-format strings are just regular Python format strings without any "special" functionality. Alternatives with | are not supported there, which is why the default uses "{tweet_id}_{retweet_id}_{num}". I should probably just "upgrade" them.
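
The difference can be reproduced with plain Python. This is only a sketch: `first_truthy` is a hypothetical helper that mimics what the `|` alternative syntax appears to do in directory and filename format strings (pick the first non-empty field), and the sample values assume that `retweet_id` is 0 for tweets that are not retweets.

```python
def first_truthy(metadata, *keys):
    """Hypothetical helper: return the first non-empty value among keys."""
    for key in keys:
        value = metadata.get(key)
        if value:
            return value
    return None


tweet = {"tweet_id": 111, "retweet_id": 0}      # original tweet (sample values)
retweet = {"tweet_id": 222, "retweet_id": 333}  # retweet (sample values)

# A regular Python format string treats "retweet_id|tweet_id" as ONE field name.
error = None
try:
    "twitter_p1_{retweet_id|tweet_id}".format_map(tweet)
except KeyError as exc:
    error = f"KeyError: {exc}"

# Emulating the alternative by hand before formatting.
ids = [
    "twitter_p1_{}".format(first_truthy(meta, "retweet_id", "tweet_id"))
    for meta in (tweet, retweet)
]

# Using both fields side by side needs no special syntax at all.
flat = "{tweet_id}_{retweet_id}".format_map(retweet)

print(error)
print(ids)
print(flat)
```

Listing both IDs, as the default does, avoids the need for alternatives and still yields a unique entry per post.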

@Hrxn yeah that should work like that. The post processor order does matter; mtime needs to run before metadata if the generated metadata file should have its modification time set.

@Hrxn
Contributor

Hrxn commented Mar 21, 2022

Okay, so..

This does not work:
"postprocessors": ["twitter-thread", "modtime"]
This does work:
"postprocessors": ["modtime", "twitter-thread"]

yes? (I'm removing that part from my comment above...)

@mikf
Owner

mikf commented Mar 21, 2022

["twitter-thread", "modtime"] does not set the metadata file's mtime.
That only happens with ["modtime", "twitter-thread"].

I should really update metadata.mtime to not require this and just fetch the correct mtime value on its own.
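
Why the order matters can be shown with a simplified model. This is a sketch under assumptions, not gallery-dl's code: the step functions, the shared `_mtime` and `_path` keys, and the sample post are all hypothetical. One step stores a timestamp in the shared metadata and the other writes the file and applies the timestamp only if it is already there.

```python
import os
import tempfile
from datetime import datetime, timezone


def mtime_step(meta, directory):
    """Store the post date as a timestamp in the shared metadata."""
    meta["_mtime"] = meta["date"].timestamp()


def metadata_step(meta, directory):
    """Write the metadata file and apply the stored timestamp, if present."""
    path = os.path.join(directory, "{tweet_id}.html".format_map(meta))
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(meta["content"])
    timestamp = meta.get("_mtime")
    if timestamp is not None:
        os.utime(path, (timestamp, timestamp))
    meta["_path"] = path


def run(steps, meta, directory):
    """Run the steps in the given order and return the file's final mtime."""
    for step in steps:
        step(meta, directory)
    return os.path.getmtime(meta["_path"])


def sample_post(tweet_id):
    # Made-up post data for the demonstration.
    return {"tweet_id": tweet_id, "content": "hello",
            "date": datetime(2022, 3, 21, tzinfo=timezone.utc)}


with tempfile.TemporaryDirectory() as tmp:
    post_date = sample_post(0)["date"].timestamp()
    good = run([mtime_step, metadata_step], sample_post(111), tmp)
    bad = run([metadata_step, mtime_step], sample_post(222), tmp)

print(abs(good - post_date) < 1)  # timestamp was available when the file was written
print(abs(bad - post_date) < 1)   # timestamp arrived too late; file keeps "now"
```

In this model, running the timestamp step second has no effect on the file, because nothing revisits the file after it has been written.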

@AlttiRi
Author

AlttiRi commented Mar 21, 2022

It looks like it's working.


Off-topic about Twitter with "text-tweets": true:

Is it possible not to download images from the link previews of text posts?

It's mostly unimportant information:
image

(I got this "spam" after enabling "text-tweets": true. More than 300 such thumbnails for that artist.)

@mikf
Owner

mikf commented Mar 21, 2022

I don't think this "spam" is due to "text-tweets": true, but because of the cards option, which got enabled by default fairly recently (f2e8aed).

"text-tweets" only has an effect on tweets without any downloadable media.

@Hrxn
Contributor

Hrxn commented Mar 21, 2022

["twitter-thread", "modtime"] does not set the metadata file's mtime. That only happens with ["modtime", "twitter-thread"].

I should really update metadata.mtime to not require this and just fetch the correct mtime value on its own.

Yes, that's what I meant. Thanks! 😄

@Hrxn
Contributor

Hrxn commented Mar 29, 2022

Can confirm, seems to work for me as well!
