Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize costs #1149

Open
snarfed opened this issue Jun 24, 2024 · 60 comments
Open

Optimize costs #1149

snarfed opened this issue Jun 24, 2024 · 60 comments

Comments

@snarfed
Copy link
Owner

snarfed commented Jun 24, 2024

Don't want to draw attention to this, I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low hanging fruit here. Biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. Biggest blocker right now is that I'm not sure what's driving the datastore read load, esp since I added a memcached instance a while back. Hrm.

month

year

@snarfed snarfed added the now label Jun 25, 2024
@snarfed
Copy link
Owner Author

snarfed commented Jun 26, 2024

related: #1152 (but I expect outbox is a small fraction of overall cost)

@snarfed
Copy link
Owner Author

snarfed commented Jun 26, 2024

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .

@snarfed
Copy link
Owner Author

snarfed commented Jun 28, 2024

Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics

@snarfed
Copy link
Owner Author

snarfed commented Jun 28, 2024

...maybe promising? Eg this query breaks down by API method:

SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
LIMIT 1000

and this one samples the actual contents of queries:

SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
limit 200

I can't aggregate (group by) by fields inside proto_payload though, I get Grouping by expressions of type JSON is not allowed. Next step, try BigQuery to see if it can get around that.

@snarfed
Copy link
Owner Author

snarfed commented Jun 28, 2024

It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.

@snarfed
Copy link
Owner Author

snarfed commented Jun 28, 2024

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.

@snarfed
Copy link
Owner Author

snarfed commented Jul 25, 2024

Back to looking at this.

@snarfed
Copy link
Owner Author

snarfed commented Jul 30, 2024

Some useful datastore API usage analytics below, and already finding some useful results. Queries are ~40% of datastore API calls, and the vast majority of thoose are looking up original Objects and users for a given copy.

Query
SELECT
  DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM bridgy-federated.logs._AllLogs
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  group by method_name
  order by count desc
method_name count
Lookup 4374961
RunQuery 3596350
Commit 548773
BeginTx 353552
Rollback 23279
...
Query
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
group by kind, num_composite_filters
order by count(*) desc
kind num_composites count
Object 767016
ATProto 764103
MagicKey 755585
ActivityPub 754627
UIProtocol 468614
Follower 2 49023
AtpBlock 17479
AtpRemoteBlob 6890
AtpRepo 3566
Follower 3147
AtpRepo 2
ATProto 2
...
Query
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
  and proto_payload.audit_log.request.query.filter.propertyFilter is not null
group by kind, filter
order by count(*) desc
kind filter count
ActivityPub copies.uri EQUAL 753344
MagicKey copies.uri EQUAL 753290
Object copies.uri EQUAL 753245
ATProto copies.uri EQUAL 753243
UIProtocol copies.uri EQUAL 468614
AtpBlock seq GREATER_THAN_OR_EQUAL 17479
Object users EQUAL 12535
AtpRemoteBlob cid EQUAL 6890
ATProto handle EQUAL 6516
MagicKey manual_opt_out EQ 2295
Follower from EQUAL 1575
Follower to EQUAL 1572
ATProto enabled_protocols NOT_EQUAL 1232
ActivityPub enabled_protocols NOT_EQUAL 1231
...

@snarfed
Copy link
Owner Author

snarfed commented Jul 30, 2024

@snarfed
Copy link
Owner Author

snarfed commented Jul 30, 2024

Now looking at lookups aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
group by num_keys
order by count(*) desc
num_keys count
1 4371422
2 999
3 519
4 365
5 229
6 171
7 101
8 83
9 74
100 68
12 67
...
Query
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
group by kind
order by count(*) desc
kind count
Object 4080965
ATProto 112141
AtpBlock 97120
MagicKey 38829
ActivityPub 29996
AtpRepo 7222
AtpSequence 3551
AtpRemoteBlob 1574
Cursor 24

Object lookups by id scheme:

Query
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
group by scheme
order by count(*) desc
scheme count
did 3434021
at 413000
https 235446
http 85
...

@snarfed
Copy link
Owner Author

snarfed commented Jul 31, 2024

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window. Almost all of them should have been cached. Why weren't they?

Query
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by name
order by count(*) desc
limit 100
name count
did:plc:ayutykgvyf4x7ev5ornltyzz 176211
did:plc:4mjwxpnhoeaknxqabwhf2n6i 64569
did:plc:p2ygpwluon3vrk5yecjq7wc5 56364
did:plc:hm5cxb2g2q4ma4ucsks73tex 53051
did:plc:b5glolrcdfnaxwc4dbe4zbty 43557
did:plc:3rxcq5wacdop5thjoh3sny3p 35103
did:plc:6zfgmvghpidjvcn3cqtitnx5 28309
did:plc:5syae7p7dr6zsfdoe4k6buky 26882
did:plc:dqgdfku26vpxdktkzne5x2xj 26240
did:plc:sese4fb6luojywziva3s7zjo 24876
...

snarfed added a commit that referenced this issue Aug 1, 2024
@snarfed
Copy link
Owner Author

snarfed commented Aug 1, 2024

Progress here! Managed to cut datastore lookups and queries both way down, 5-10x each.

image image

snarfed added a commit to snarfed/webutil that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
@snarfed
Copy link
Owner Author

snarfed commented Aug 2, 2024

Now looking at optimizing log storage. We were doing 250G/mo a bit ago, we're now down to ~150G/mo or so.

The main cost here is initial ingest. We get 50G/mo for free, then $.50/G after that. That includes 30d retention, and our current retention period is set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging

Logs in BigQuery: https://console.cloud.google.com/bigquery?invt=AbjFiQ&project=bridgy-federated&inv=1&ws=%211m0

Logs dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

image

@snarfed
Copy link
Owner Author

snarfed commented Aug 2, 2024

The other next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

image

@snarfed
Copy link
Owner Author

snarfed commented Aug 3, 2024

Next step for datastore reads: snarfed/arroba#30

@snarfed
Copy link
Owner Author

snarfed commented Aug 9, 2024

Results from the datastore reads optimization here have been disappointing. I cut them by ~3x, but spend on datastore read API calls only went down maybe 1/4-1/3, from $10-12/d to $7-8/d. Hrmph.

image image

(Datastore reads are the top blue part of the cost bars above.)

@snarfed
Copy link
Owner Author

snarfed commented Aug 13, 2024

Another opportunity here could be reducing router and atproto-hub allocated CPU. Router CPU is down to mostly under two cores, atproto-hub to maybe half a core. Could either drop router from 4 to 2, or leave it at 4 and merge atproto-hub into it.

image

@snarfed
Copy link
Owner Author

snarfed commented Aug 15, 2024

^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we want to run the atproto-hub threads on just one of them. I had trouble running all the threads in just one worker on router a while back, but that may have been a red herring, it may actually work ok after all, and I doubt we're compute-bound in a way that we'd care about the GIL. Worth trying.

@snarfed
Copy link
Owner Author

snarfed commented Aug 15, 2024

Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc) in the datastore entirely. Just use memcache for them.

@snarfed
Copy link
Owner Author

snarfed commented Aug 15, 2024

...but looking at our incoming activity types, that might not make a big difference. The bulk of incoming activities are likes, follows, and reposts, in that order.

Measure first, then optimize.

image

(I should double check to make sure that count isn't after various filtering though!)

@snarfed
Copy link
Owner Author

snarfed commented Dec 12, 2024

Started caching queries on User/Object.copies last night ^. It helped, but not as much as I'd hoped, only maybe 25-30%, down from ~70qps to ~50qps. 😕

image

snarfed added a commit that referenced this issue Dec 13, 2024
snarfed added a commit that referenced this issue Dec 15, 2024
sad, it's useful, but it's too expensive. for #1501, #1149
@snarfed
Copy link
Owner Author

snarfed commented Dec 16, 2024

Looking back at the last 2w, we've made a bit more progress here on datastore calls than I thought. Feeling a bit better now. Still lots more to do though!

image

@snarfed
Copy link
Owner Author

snarfed commented Dec 19, 2024

Added some logging on datastore lookups, we're doing tons without caching, as expected. A few stack traces:

default (frontend): did:plc:ljb6ugearscroyatvai5kkfr:

  File "/workspace/activitypub.py", line 1007, in actor
    user = _load_user(handle_or_id, create=True)
  File "/workspace/activitypub.py", line 985, in _load_user
    user = proto.get_or_create(id) if create else proto.get_by_id(id)
  File "/workspace/models.py", line 379, in get_or_create
    user = _run()
  File "/workspace/models.py", line 302, in _run
    user = cls.get_by_id(id, allow_opt_out=True)
  File "/workspace/models.py", line 270, in get_by_id
    user = cls._get_by_id(id, **kwargs)
  File "/workspace/activitypub.py", line 1007, in actor
    user = _load_user(handle_or_id, create=True)
  File "/workspace/activitypub.py", line 985, in _load_user
    user = proto.get_or_create(id) if create else proto.get_by_id(id)
  File "/workspace/models.py", line 379, in get_or_create
    user = _run()
  File "/workspace/models.py", line 302, in _run
    user = cls.get_by_id(id, allow_opt_out=True)
  File "/workspace/models.py", line 276, in get_by_id
    elif user.status and not allow_opt_out:
  File "/workspace/models.py", line 475, in status
    if not self.obj or not self.obj.as1:
  File "/workspace/models.py", line 973, in as1
    handle = ATProto(id=owner).handle
  File "/workspace/atproto.py", line 257, in handle
    return did_to_handle(self.key.id())
  File "/workspace/atproto.py", line 183, in did_to_handle
    if did_obj := ATProto.load(did, did_doc=True, remote=remote):
  File "/workspace/atproto.py", line 746, in load
    return super().load(id, **kwargs)
  File "/workspace/protocol.py", line 1566, in load
    obj = Object.get_by_id(id)

router: did:plc:rvu67iewv37qcskwd22qe7i4:

  File "/workspace/protocol.py", line 1699, in receive_task
    return PROTOCOLS[obj.source_protocol].receive(
  File "/workspace/protocol.py", line 909, in receive
    obj = Object.get_or_create(id, authed_as=actor, **orig.to_dict())
  File "/workspace/models.py", line 1114, in get_or_create
    obj.put()
  File "/workspace/models.py", line 1047, in _pre_put_hook
    if self.as1 and self.as1.get('objectType') == 'activity':
  File "/workspace/models.py", line 973, in as1
    handle = ATProto(id=owner).handle
  File "/workspace/atproto.py", line 257, in handle
    return did_to_handle(self.key.id())
  File "/workspace/atproto.py", line 183, in did_to_handle
    if did_obj := ATProto.load(did, did_doc=True, remote=remote):
  File "/workspace/atproto.py", line 746, in load
    return super().load(id, **kwargs)
  File "/workspace/protocol.py", line 1566, in load
    obj = Object.get_by_id(id)
  File "/workspace/protocol.py", line 1699, in receive_task
    return PROTOCOLS[obj.source_protocol].receive(
  File "/workspace/protocol.py", line 935, in receive
    Object.get_or_create(inner_obj_id, our_as1=inner_obj_as1,
  File "/workspace/models.py", line 1083, in get_or_create
    orig_as1 = obj.as1
  File "/workspace/models.py", line 973, in as1
    handle = ATProto(id=owner).handle
  File "/workspace/atproto.py", line 257, in handle
    return did_to_handle(self.key.id())
  File "/workspace/atproto.py", line 183, in did_to_handle
    if did_obj := ATProto.load(did, did_doc=True, remote=remote):
  File "/workspace/atproto.py", line 746, in load
    return super().load(id, **kwargs)
  File "/workspace/protocol.py", line 1566, in load
    obj = Object.get_by_id(id)

@snarfed
Copy link
Owner Author

snarfed commented Dec 19, 2024

No good reason, but I vaguely suspect our Key has already been set in this batch patch from googleapis/python-ndb#743 (comment) , below. I'll try disabling it now, along with our tasklets that hydrate objects and authors in pages.serve_feed, to see what happens.

bridgy-fed/flask_app.py

Lines 65 to 101 in b4270a5

# https://github.com/googleapis/python-ndb/issues/743#issuecomment-2067590945
#
# fixes "RuntimeError: Key has already been set in this batch" errors due to
# tasklets in pages.serve_feed
from logging import error as log_error
from sys import modules
from google.cloud.datastore_v1.types.entity import Key
from google.cloud.ndb._cache import (
_GlobalCacheSetBatch,
global_compare_and_swap,
global_set_if_not_exists,
global_watch,
)
from google.cloud.ndb.tasklets import Future, Return, tasklet
GLOBAL_CACHE_KEY_PREFIX: bytes = modules["google.cloud.ndb._cache"]._PREFIX
LOCKED_FOR_READ: bytes = modules["google.cloud.ndb._cache"]._LOCKED_FOR_READ
LOCK_TIME: bytes = modules["google.cloud.ndb._cache"]._LOCK_TIME
@tasklet
def custom_global_lock_for_read(key: str, value: str):
if value is not None:
yield global_watch(key, value)
lock_acquired = yield global_compare_and_swap(
key, LOCKED_FOR_READ, expires=LOCK_TIME
)
else:
lock_acquired = yield global_set_if_not_exists(
key, LOCKED_FOR_READ, expires=LOCK_TIME
)
if lock_acquired:
raise Return(LOCKED_FOR_READ)
modules["google.cloud.ndb._cache"].global_lock_for_read = custom_global_lock_for_read

@snarfed
Copy link
Owner Author

snarfed commented Dec 19, 2024

Nope. (Dashed line marks when this ^ was deployed.)

image

@snarfed
Copy link
Owner Author

snarfed commented Dec 20, 2024

Still struggling to instrument datastore lookups and understand why they're not getting cached. Got logging on the source side of ndb lookups, via google.cloud.ndb._datastore_api definitely seems to confirm they're not getting cached right, in either the frontend or router. Also added logging to google.cloud.ndb.global_cache in snarfed/python-ndb@d1d399f, seeing it locally, but not in prod at all. 😕 Maybe the build isn't picking up that branch from GitHub right?

@snarfed
Copy link
Owner Author

snarfed commented Dec 21, 2024

Here's everything I can think of that I've done to get ndb caching in memcache working, or to even check that it's working:

  • Configured ndb to use memcache for its global cache:
    MEMCACHE_HOST: '10.126.144.3'

    bridgy-fed/common.py

    Lines 108 to 118 in 382b30e

    if memcache_host := os.environ.get('MEMCACHE_HOST'):
    logger.info(f'Using real memcache at {memcache_host}')
    memcache = pymemcache.client.base.PooledClient(
    memcache_host, timeout=10, connect_timeout=10, # seconds
    allow_unicode_keys=True)
    pickle_memcache = pymemcache.client.base.PooledClient(
    memcache_host, timeout=10, connect_timeout=10, # seconds
    serde=PickleSerde(), allow_unicode_keys=True)
    # ideally we'd use MemcacheCache.from_environment, but it doesn't let us
    # pass kwargs like serde to the pymemcache client constructor
    global_cache = MemcacheCache(memcache, strict_read=True)

    bridgy-fed/common.py

    Lines 484 to 491 in 382b30e

    NDB_CONTEXT_KWARGS = {
    # limited context-local cache. avoid full one due to this bug:
    # https://github.com/googleapis/python-ndb/issues/888
    'cache_policy': cache_policy,
    'global_cache': global_cache,
    'global_cache_policy': global_cache_policy,
    'global_cache_timeout_policy': global_cache_timeout_policy,
    }
  • Enabled ndb debug logging with NDB_DEBUG=true and logging.getLogger('google.cloud.ndb.global_cache').setLevel(logging.DEBUG).
  • Checked the build to confirm that it's using my ndb fork and branch with global_cache logging.
  • Ran unit tests locally with eg env MEMCACHE_HOST=localhost NDB_DEBUG=true python -m unittest -v ..., confirmed that I see google.cloud.ndb.global_cache output.

I still don't see google.cloud.ndb.global_cache log lines in prod, and I still see repeated datastore lookups for the same key.

@snarfed
Copy link
Owner Author

snarfed commented Jan 13, 2025

Finally got ndb logs working! The GCP Python SDKs disable propagation at the google.cloud level for some reason:

    for logger_name in all_excluded_loggers:
        # prevent excluded loggers from propagating logs to handler
        logger = logging.getLogger(logger_name)
        logger.propagate = False

Fixed that with logging.getLogger('google.cloud').propagate = True.

...and that led to the big find here, we're overusing transactions, which skip the cache and always go to the datastore. I removed the transactions where we're fine with last writer wins, which was most of them, and datastore lookups are down by half!

image

@snarfed
Copy link
Owner Author

snarfed commented Jan 13, 2025

Another idea here, cache follower counts in memcache instead of instance memory. Not sure how much that would matter, I'll see when I dig into datastore usage.

Did this in 6388aab.

@snarfed
Copy link
Owner Author

snarfed commented Jan 13, 2025

Next idea: use #552 to optimize Protocol.for_id, #1341. Maybe with Web.status == 'webfinger' entities in datastore?

snarfed added a commit that referenced this issue Jan 14, 2025
...because transactions don't read or write memcache. :/ Fortunately we don't really depend on atomicity for much, last writer wins is pretty much always fine.

for #1149
snarfed added a commit that referenced this issue Jan 15, 2025
snarfed added a commit that referenced this issue Jan 15, 2025
…no mf2

this hopefully stops us from re-fetching them repeatedly in Protocol.for_id or Web.load/fetch

for #1149
@snarfed
Copy link
Owner Author

snarfed commented Jan 15, 2025

I ended up caching Protocol.for_id in memcache instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants