# Optimize costs #1149
related: #1152 (but I expect outbox is a small fraction of overall cost)
Looking at https://cloud.google.com/firestore/docs/audit-logging. I've turned on audit logs for the Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated.
Other leads:
Got ~20h of datastore logs, digging into them in log analytics now: https://console.cloud.google.com/logs/analytics
...maybe promising? Eg this query breaks down by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
limit 200
```

I can't aggregate (group by) fields inside proto_payload, though.
It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.
Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
Back to looking at this.
Some useful datastore API usage analytics below, and already finding some useful results. Queries are ~40% of datastore API calls, and the vast majority of those are looking up original

Query:

```sql
SELECT DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM `bridgy-federated.logs._AllLogs`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
```

Query:

```sql
SELECT string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
group by kind, num_composite_filters
order by count(*) desc
```

Query:

```sql
SELECT string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
  and proto_payload.audit_log.request.query.filter.propertyFilter is not null
group by kind, filter
order by count(*) desc
```
^ should hopefully help with queries. Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29
Now looking at lookups aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query:

```sql
SELECT array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys, count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
group by num_keys
order by count(*) desc
```

Query:

```sql
SELECT string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind, count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
group by kind
order by count(*) desc
```

Object lookups by id scheme:

```sql
SELECT split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme, count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
group by scheme
order by count(*) desc
```
I don't understand why the DID

Query:

```sql
SELECT JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name, count(*)
FROM `bridgy-federated.logs._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by name
order by count(*) desc
limit 100
```
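Since DID doc lookups dominate, an obvious direction is to serve them from memcache before touching the datastore. A minimal read-through sketch, purely for illustration; the function, server address, and TTL are all assumptions, not what bridgy-fed does:

```python
import json
from pymemcache.client.base import PooledClient

memcache = PooledClient(('localhost', 11211), allow_unicode_keys=True)

def cached_did_doc(did, load_from_datastore):
    """Hypothetical read-through cache for stored DID docs.

    load_from_datastore is a stand-in for the existing datastore lookup.
    """
    cached = memcache.get(did)
    if cached:
        return json.loads(cached)

    doc = load_from_datastore(did)
    if doc:
        memcache.set(did, json.dumps(doc), expire=60 * 60 * 2)  # 2h, arbitrary TTL
    return doc
```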
Now looking at optimizing log storage. We were doing 250GB/mo a bit ago; we're now down to ~150GB/mo or so. The main cost here is initial ingest: we get 50GB/mo free, then $0.50/GB after that. That includes 30d retention, and our current retention period is set to 30d, so reducing retention wouldn't help.
- Pricing: https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing
- Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
- Logs in BigQuery: https://console.cloud.google.com/bigquery?invt=AbjFiQ&project=bridgy-federated&inv=1&ws=%211m0
- Logs dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated
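Since ingest is the billable step, one generic lever from those tips is dropping low-value records client-side before the Cloud Logging handler ever ships them. A stdlib-only sketch; the level threshold and logger name are assumptions, not what's deployed here:

```python
import logging

class DropNoisyRecords(logging.Filter):
    """Drop records we don't want to pay to ingest."""
    def filter(self, record):
        if record.levelno < logging.INFO:
            return False  # skip DEBUG entirely
        if record.name.startswith('urllib3'):  # hypothetical noisy logger
            return False
        return True

# Attach to whatever handlers ship logs to Cloud Logging (eg the handler
# installed by google.cloud.logging's setup_logging()). Filtered records
# are never sent, so they're never ingested or billed.
for handler in logging.getLogger().handlers:
    handler.addFilter(DropNoisyRecords())
```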
The next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs
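For reference, per https://cloud.google.com/profiler/docs the Python agent just needs to be started once at process startup, roughly like this; the service name and version here are placeholders:

```python
import googlecloudprofiler

try:
    # Start the Cloud Profiler agent early in process startup.
    googlecloudprofiler.start(
        service='router',        # placeholder service name
        service_version='1.0',   # placeholder version
        verbose=3,               # 0=error, 1=warning, 2=info, 3=debug
    )
except (ValueError, NotImplementedError) as e:
    # eg missing project id, or running on an unsupported platform
    print(e)
```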
Next step for datastore reads: snarfed/arroba#30
^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we want to run the atproto-hub threads in just one of them. I had trouble running all the threads in a single router worker a while back, but that may have been a red herring; it may actually work OK after all, and I doubt we're compute-bound in a way where we'd care about the GIL. Worth trying.
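A sketch of one way to pin the hub threads to a single worker, assuming gunicorn is the WSGI server and `start_atproto_hub()` is a stand-in for whatever starts those threads. `worker.age` counts workers in spawn order, so this runs the threads only in the first of the four workers (caveat: a respawned worker gets a new, higher age):

```python
# gunicorn.conf.py (sketch)
workers = 4

def post_fork(server, worker):
    # Runs in each worker process right after fork. Only the first worker
    # spawned (age 1) starts the atproto-hub threads.
    if worker.age == 1:
        from atproto_hub import start_atproto_hub  # hypothetical module/function
        start_atproto_hub()
```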
Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc) in the datastore entirely. Just use memcache for them.
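A rough sketch of that idea with pymemcache; the helper names, server address, TTL, and type check are made up for illustration, not the actual bridgy-fed code paths:

```python
import json
from pymemcache.client.base import PooledClient

memcache = PooledClient(('localhost', 11211), allow_unicode_keys=True)

TRANSIENT_TYPES = {'create', 'update', 'delete', 'undo', 'accept', 'reject'}
TRANSIENT_EXPIRE = 60 * 60 * 24  # 1 day, arbitrary

def store_activity(id, activity):
    """Keep transient activities out of the datastore entirely."""
    if activity.get('type', '').lower() in TRANSIENT_TYPES:
        # note: real ids (URLs) may need hashing to fit memcache's key limits
        memcache.set(id, json.dumps(activity), expire=TRANSIENT_EXPIRE)
    else:
        ...  # existing datastore path for everything else

def load_activity(id):
    cached = memcache.get(id)
    if cached:
        return json.loads(cached)
    ...  # fall back to the datastore for non-transient activities
```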
Added some logging on datastore lookups; we're doing tons without caching, as expected. A few stack traces:
router:
No good reason, but I vaguely suspect our code at lines 65 to 101 in b4270a5.
Still struggling to instrument datastore lookups and understand why they're not getting cached. Got logging on the source side of ndb lookups, via
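For illustration, one blunt way to get that kind of logging (not necessarily what was done here) is to wrap `ndb.Key.get` so every lookup logs a short stack trace showing the call site:

```python
import logging
import traceback

from google.cloud import ndb

_orig_key_get = ndb.Key.get

def _logged_get(self, *args, **kwargs):
    # Log the key plus the last few frames so uncached call sites show up.
    logging.info('datastore get %s\n%s', self,
                 ''.join(traceback.format_stack(limit=8)))
    return _orig_key_get(self, *args, **kwargs)

ndb.Key.get = _logged_get
```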
Here's everything I can think of that I've done to get ndb caching in memcache working, or to even check that it's working:
I still don't see
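For context, the basic setup being checked here: google-cloud-ndb only consults memcache when requests run inside a context created with a memcache-backed global cache, roughly like this (host/port and the decorator are assumptions):

```python
import pymemcache.client.base
from google.cloud import ndb
from google.cloud.ndb.global_cache import MemcacheCache

memcache_client = pymemcache.client.base.PooledClient(
    ('localhost', 11211), allow_unicode_keys=True)
global_cache = MemcacheCache(memcache_client)
ndb_client = ndb.Client()

def with_ndb_context(fn):
    """Hypothetical decorator: run fn in an ndb context that uses memcache."""
    def wrapper(*args, **kwargs):
        with ndb_client.context(global_cache=global_cache):
            return fn(*args, **kwargs)
    return wrapper
```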
Finally got ndb logs working! The GCP Python SDKs disable propagation for their excluded loggers:

```python
for logger_name in all_excluded_loggers:
    # prevent excluded loggers from propagating logs to handler
    logger = logging.getLogger(logger_name)
    logger.propagate = False
```

Fixed that with ... and that led to the big find here: we're overusing transactions, which skip the cache and always go to the datastore. I removed the transactions where we're fine with last writer wins, which was most of them, and datastore lookups are down by half!
Did this in 6388aab.
...because transactions don't read or write memcache. :/ Fortunately we don't really depend on atomicity for much, last writer wins is pretty much always fine. for #1149
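To illustrate the shape of that change with a made-up model and field, the before/after is roughly:

```python
from google.cloud import ndb

# Before: transactional read-modify-write. Inside a transaction, lookups
# bypass the global cache and always hit the datastore.
@ndb.transactional()
def add_label_txn(key, label):
    obj = key.get()
    obj.labels.append(label)
    obj.put()

# After: plain read-modify-write, last writer wins. key.get() can now be
# served from the memcache-backed global cache.
def add_label(key, label):
    obj = key.get()
    obj.labels.append(label)
    obj.put()
```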
…no mf2: this hopefully stops us from re-fetching them repeatedly in Protocol.for_id or Web.load/fetch. for #1149
I ended up caching
Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.
I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.