Export content between date #25

Closed
gagarine opened this issue Apr 3, 2018 · 4 comments

Comments

@gagarine

gagarine commented Apr 3, 2018

I want to export soc.culture.soviet, but it's big. In fact, I'm only interested in a five-month period. I didn't see a way to export only the content from a specific period.

@icy
Owner

icy commented Apr 4, 2018

Hi @gagarine,

Do you know the page range? Google Groups pagination works on numbers (e.g., 30 posts per page) and has nothing to do with dates. It would help if you navigated the group contents to find an approximate range, e.g.,

https://groups.google.com/forum/?_escaped_fragment_=forum/archlinuxvn%5B21-40%5D

However, this doesn't work all the time. Users can post to a very old thread, and what we see from, e.g., the link above is the date of the last post in each thread, not the date of the first post. If you'd really like to work this way, I can write a simple patch.
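To make the numeric pagination concrete, here is a minimal Python sketch that splits a topic range into page-sized chunks and builds URLs in the same shape as the link above. The helper name `page_urls` and the default page size of 20 (inferred from the `[21-40]` example) are assumptions for illustration; they are not part of the crawler.

```python
from typing import Iterator

# URL prefix copied from the example link in this thread.
BASE = "https://groups.google.com/forum/?_escaped_fragment_=forum/"


def page_urls(group: str, first: int, last: int, page_size: int = 20) -> Iterator[str]:
    """Yield one URL per page covering topics first..last (inclusive).

    page_size is an assumed value; the real page size may differ.
    """
    if first < 1 or last < first or page_size < 1:
        raise ValueError("expected 1 <= first <= last and page_size >= 1")
    start = first
    while start <= last:
        end = min(start + page_size - 1, last)
        # %5B and %5D are the URL-encoded forms of '[' and ']'.
        yield f"{BASE}{group}%5B{start}-{end}%5D"
        start = end + 1
```

Note that a range selected this way is ordered by topic position, so it only approximates a date window, for the reason given above.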

The group soc.culture.soviet that you mentioned has about 28k topics (the number of messages is of course larger), and fetching all 28k topics would take a few hours (assuming that Google doesn't apply any kind of throttling). I think that's reasonable...

@gagarine
Author

gagarine commented Apr 4, 2018

Mmmmh, I understand. In the Google Groups web interface, I filtered on "first post" and looked between January 1991 and December 1991. My primary interest is around August, so that would be

https://groups.google.com/forum/?_escaped_fragment_=forum/soc.culture.soviet[27630-27720]

Yeah, I saw the import was faster than I thought. I tried a full import, but it seems that Google killed my connection.

I will play with it a bit more to see if I can do a full import. Perhaps that's easier. Mainly, I don't want to miss a message just because someone posted to the thread later on.

@icy
Owner

icy commented Apr 5, 2018

It's interesting to hear that Google killed your connection :) I am not sure if adding some sleep to the hook can help (https://github.com/icy/google-group-crawler#the-hook)
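The crawler's hook is described in the linked README, so the following is not that hook. It is only a generic Python sketch of the idea of pausing between requests. The names `fetch_all`, `fetch`, and `delay` are hypothetical, and whether a pause actually prevents Google from dropping the connection is untested here.

```python
import time
from typing import Callable, Iterable, List


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, sleeping `delay` seconds between requests.

    `fetch` is supplied by the caller (hypothetical downloader);
    `sleep` is injectable so the pause can be stubbed out in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay)
        results.append(fetch(url))
    return results
```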

I will try to add range support to the script, so that you can specify 27630-27720 as input.
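As an illustration of what accepting such an input could involve, here is a small Python sketch that parses a `first-last` string into two validated integers. The function name `parse_range` and the validation rules are assumptions; this is not code from the script.

```python
from typing import Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a 'first-last' topic range such as '27630-27720'.

    Hypothetical helper: returns (first, last) with 1 <= first <= last,
    and raises ValueError for anything else.
    """
    first_text, separator, last_text = spec.partition("-")
    if not separator:
        raise ValueError(f"expected 'first-last', got {spec!r}")
    # int() raises ValueError on empty or non-numeric parts.
    first, last = int(first_text), int(last_text)
    if first < 1 or last < first:
        raise ValueError(f"invalid range {spec!r}: need 1 <= first <= last")
    return first, last
```

For example, `parse_range("27630-27720")` returns `(27630, 27720)`, while a reversed range such as `"40-21"` is rejected.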

@icy
Owner

icy commented Apr 13, 2020

Someone was able to download 350k messages from a group (#32), so maybe this isn't much of an issue. As I didn't intend to have pagination support, I will close this ticket. Feel free to reopen it if there is a better idea for supporting this feature.

Thanks a lot.

icy closed this as completed Apr 13, 2020