-
-
Notifications
You must be signed in to change notification settings - Fork 7.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to exclude suspended domains/subdomains from tootctl domains crawl #11454
Conversation
This new option ignores any instances suspended server-wide as well as their associated subdomains. This queries all domain blocks up front, then runs a regexp on each domain. This improves performance over what may be the obvious implementation, which is to ask `DomainBlocks.blocked?(domain)` for each domain -- this hits the DB many times, slowing things down considerably.
lib/mastodon/domains_cli.rb
Outdated
|
||
pool = Concurrent::ThreadPoolExecutor.new(min_threads: 0, max_threads: options[:concurrency], idletime: 10, auto_terminate: true, max_queue: 0) | ||
|
||
work_unit = ->(domain) do | ||
next if stats.key?(domain) | ||
next if blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose you could pre-compose a regex with all the domains, dunno which way is faster off the top of my head, it would be a big regex but you would save on an array iteration and initializing a new regex on each item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiling a giant regexp does seem significantly faster, I'll rewrite.
require 'Benchmark'
domains_to_test = []
blocked_domains = []
200000.times do blocked_domains.push(rand(400).to_s) end
15000.times do domains_to_test.push(rand(400).to_s) end
puts "Regexp.new"
puts Benchmark.measure {
domains_to_test.each do |domain|
blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) }
end
}
puts "precompiled giant Regexp"
puts Benchmark.measure {
reg = Regexp.new('\\.?' + blocked_domains.join('|') + '$')
domains_to_test.each do |domain|
domain.match(reg)
end
}
yields
Regexp.new
8.581110 0.078617 8.659727 ( 8.678938)
precompiled giant Regexp
0.231612 0.009794 0.241406 ( 0.241989)
lib/mastodon/domains_cli.rb
Outdated
failed = Concurrent::AtomicFixnum.new(0) | ||
start_at = Time.now.to_f | ||
seed = start ? [start] : Account.remote.domains | ||
blocked_domains = options[:exclude_suspended] ? Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$') : '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are Ruby regexes thread-safe? I think you might need to make this a thread-local (and it would be easier to read, since you could put it into a memoized method and wouldn't have to include the weird options[:exclude_domain]
conditional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the conditional there so the DB doesn't get hit if the --exclude-suspended
option is false. But you're right, this code does work if I simply remove the ternary like so:
blocked_domains = Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')
It just means that the DB gets hit on this line every time even if this feature isn't being used. But it's only one query per run so perhaps that's okay.
And once we get into issues of concurrent programming in Ruby I'm afraid I'm in over my head. In most languages, regular expressions are thread-safe because they are immutable, but I can't say 100% for Ruby's case. The code does work, and there is no Concurrent::Regexp
, which I think means it's probably the case that it's thread safe.
That aside, if the modification above is to your liking I'm happy to include it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm saying that by extracting the code into a lazy function:
def blocked_domains
@blocked_domains ||= Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')
end
we could avoid both compiling the regex unnecessarily and still making the code cleaner. I was also saying that this sort of refactoring would be required if we needed to maintain a new copy of the regex per thread, instead of one regex instance shared across all of the threads. (but using thread-local variables instead). However, now that I consider it, the lazy instance-variable approach would introduce thread-unsafety itself, so that's probably right out.
on regex thread safety—many regex implementations use stateful caches, adaptive optimizations, etc and need some amounts of thread-local space to work in, see https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#using-a-regex-from-multiple-threads for an example. JRuby apparently had a thread safety bug in their Regexes once: jruby/jruby#3670. However, I can't find any discussion of the thread safety of mri ruby Regex instances, so i'm fine assuming they're thread-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all that said, I think it probably makes the most sense to just compile the regex no matter what. conditioning the compile on the option is probably premature optimization.
Okay, per @nightpool's comments I removed the ternary operator. |
LGTM! Thanks for bearing with me as I thought through all this thread safety stuff out loud |
…ns crawl (mastodon#11454) * Add "--exclude-suspended" to tootctl domains crawl This new option ignores any instances suspended server-wide as well as their associated subdomains. This queries all domain blocks up front, then runs a regexp on each domain. This improves performance over what may be the obvious implementation, which is to ask `DomainBlocks.blocked?(domain)` for each domain -- this hits the DB many times, slowing things down considerably. * cleaning up code style * Compiling regex * Removing ternary operator
…ns crawl (mastodon#11454) * Add "--exclude-suspended" to tootctl domains crawl This new option ignores any instances suspended server-wide as well as their associated subdomains. This queries all domain blocks up front, then runs a regexp on each domain. This improves performance over what may be the obvious implementation, which is to ask `DomainBlocks.blocked?(domain)` for each domain -- this hits the DB many times, slowing things down considerably. * cleaning up code style * Compiling regex * Removing ternary operator
I encountered an issue on crawling the fediverse via
tootctl domains crawl
where there are about 185,000 spam instances of the formatI'd like more accurate stats. This new option ignores any instances suspended server-wide as well as their associated subdomains. So as an admin, I simply add
gab.best
to my domain blocks, and then runtootctl domains crawl --exclude-suspended
to get stats excluding all 185k of those domains. This also significantly improves execution time for the crawl because it doesn't have to make three GET requests per each of those 185k domains.
Implementation notes
This queries all domain suspensions up front, then runs a regexp on each domain to see if it matches the subdomain. This improves performance over what may be the obvious implementation, which is to ask
DomainBlocks.blocked?(domain)
for each domain -- this method hits the DB once per domain checked, slowing things down considerably.