Application crashes with RocksIOError #184

Closed
2 tasks done
hartmutbehrens opened this issue Oct 16, 2018 · 5 comments


hartmutbehrens commented Oct 16, 2018

Checklist

  • I have included information about relevant versions
  • I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

  • Create two source topics in Kafka
  • Create two Faust topics: topic1 = app.topic('source_topic_1'...) and topic2 = app.topic('source_topic_2'...)
  • Create two Faust tables: table1 = app.Table('table1'...) and table2 = app.Table('table2'...)
  • Process the topics with separate @app.agent(topic1), @app.agent(topic2) in Faust, populating tables in the process.
  • Use rocksdb:// store
  • App uses the medium/large project layout described here (https://faust.readthedocs.io/en/latest/userguide/application.html#medium-large-projects):
hartmut@hartmutvrbox:services (feature/config_service)$ tree opm
opm
├── opm
│   ├── app.py
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── __main__.py
│   └── pmdb_service
│       ├── agents.py
│       ├── __init__.py
│       ├── models.py
│       └── views.py
├── Pipfile
├── Pipfile.lock
└── setup.py

The app is configured as follows:

app = faust.App(
    'opm',
    autodiscover=True,
    broker='kafka://127.0.0.1',
    origin='opm',
    store='rocksdb://',
    topic_partitions=1,
    version=21,
    key_serializer='json',
    value_serializer='json'
)

I have two agents, processing two Kafka source topics:

import faust
from opm.app import app
from opm.pmdb_service.models import PMObjectsRecord, PMPlantsRecord

plants_topic = app.topic('opm.pmdb.PMPlants',
                         value_type=PMPlantsRecord)
objects_topic = app.topic('opm.pmdb.PMObjects',
                          value_type=PMObjectsRecord)

plants_table = app.Table('opm.pmdb_service.plants',
                        default=dict,
                        key_type=str,
                        value_type=dict)
objects_table = app.Table('opm.pmdb_service.objects',
                          default=dict,
                          key_type=str,
                          value_type=dict)

@app.agent(plants_topic)
async def process_plants(plants: faust.Stream[PMPlantsRecord]):
    async for plant in plants:
        plants_table[plant.payload.ID] = plant.payload

@app.agent(objects_topic)
async def process_objects(objects: faust.Stream[PMObjectsRecord]):
    async for object in objects:
        objects_table[object.payload.ID] = object.payload

As far as I can see, I only have one process running, so there shouldn't be multiple processes contending for the lock:

hartmut@hartmutvrbox:~$ ps -ef | grep Faust
hartmut   8244  7515  2 09:14 pts/1    00:00:02 [Faust:Worker] -init- opm-v21 -p 6066 /home/hartmut/intellisenzo_microservices/services/opm/opm-data
hartmut   8309  8115  0 09:16 pts/2    00:00:00 grep --color=auto Faust

and

hartmut@hartmutvrbox:~$ pstree 8244
[Faust:Worker] ───8*[{[Faust:Worker] }]

I've also tried the following to resolve the problem:

  • building rocksdb, and python-rocksdb bindings from source
  • change number of partitions on source topics to 4
  • resetting the datadir

Expected behavior

Application does not crash.

Actual behavior

Application crashes due to the error: reason=RocksIOError(b'IO error: lock : opm-data/v21/tables/opm.pmdb_service.db/LOCK: No locks available',)

Full traceback

[2018-10-16 09:17:22,161: INFO]: [^Worker]: Starting...
[2018-10-16 09:17:22,164: INFO]: [^-App]: Starting...
[2018-10-16 09:17:22,164: INFO]: [^--MonitorService]: Starting...
[2018-10-16 09:17:22,164: INFO]: [^CacheBackend]: Starting...
[2018-10-16 09:17:22,164: INFO]: [^--WebService]: Starting...
[2018-10-16 09:17:22,167: INFO]: [^---ServerThread]: Starting...
[2018-10-16 09:17:22,167: INFO]: [^--Web]: Serving on http://localhost:6066/
[2018-10-16 09:17:22,168: INFO]: [^--Producer]: Starting...
[2018-10-16 09:17:22,200: INFO]: [^--Consumer]: Starting...
[2018-10-16 09:17:22,211: INFO]: [^--LeaderAssignor]: Starting...
[2018-10-16 09:17:22,211: INFO]: [^--Producer]: Creating topic opm-v21-__assignor-__leader
[2018-10-16 09:17:22,211: INFO]: [^--ReplyConsumer]: Starting...
[2018-10-16 09:17:22,211: INFO]: [^AgentManagerService]: Starting...
[2018-10-16 09:17:22,212: INFO]: [^--Agent: opm.pmdb_servic[.]process_plants]: Starting...
[2018-10-16 09:17:22,214: INFO]: [^---OneForOneSupervisor]: Starting...
[2018-10-16 09:17:22,215: INFO]: [^--Agent: opm.pmdb_servi[.]process_objects]: Starting...
[2018-10-16 09:17:22,215: INFO]: [^---OneForOneSupervisor]: Starting...
[2018-10-16 09:17:22,215: INFO]: [^--Conductor]: Starting...
[2018-10-16 09:17:22,215: INFO]: [^--TableManager]: Starting...
[2018-10-16 09:17:24,217: INFO]: [^--Producer]: Creating topic opm-v21-__assignor-__leader
[2018-10-16 09:17:24,218: INFO]: [^--Producer]: Creating topic opm-v21-opm.pmdb_service.objects-changelog
[2018-10-16 09:17:24,218: INFO]: [^--Producer]: Creating topic opm-v21-opm.pmdb_service.plants-changelog
[2018-10-16 09:17:24,219: INFO]: Updating subscribed topics to: frozenset({'opm.pmdb.PMPlants', 'opm-v21-opm.pmdb_service.plants-changelog', 'opm-v21-__assignor-__leader', 'opm-v21-opm.pmdb_service.objects-changelog', 'opm.pmdb.PMObjects'})
[2018-10-16 09:17:24,222: INFO]: Subscribed to topic(s): {'opm.pmdb.PMPlants', 'opm-v21-opm.pmdb_service.plants-changelog', 'opm-v21-__assignor-__leader', 'opm-v21-opm.pmdb_service.objects-changelog', 'opm.pmdb.PMObjects'}
[2018-10-16 09:17:24,234: INFO]: Discovered coordinator 1001 for group opm-v21
[2018-10-16 09:17:24,235: INFO]: Revoking previously assigned partitions set() for group opm-v21
[2018-10-16 09:17:24,235: INFO]: [^--Fetcher]: Stopping...
[2018-10-16 09:17:24,235: INFO]: [^--Fetcher]: -Stopped!
[2018-10-16 09:17:24,273: INFO]: (Re-)joining group opm-v21
[2018-10-16 09:17:24,275: INFO]: Joined group 'opm-v21' (generation 3) with member_id faust-1.2.1-31699158-ed24-4829-b2e1-14a1c56f75f4
[2018-10-16 09:17:24,275: INFO]: Elected group leader -- performing partition assignments using faust
[2018-10-16 09:17:24,279: INFO]: Successfully synced group opm-v21 with generation 3
[2018-10-16 09:17:24,279: INFO]: Setting newly assigned partitions 
┌Topic Partition Set─────────────────────────┬────────────┐
│ topic                                      │ partitions │
├────────────────────────────────────────────┼────────────┤
│ opm-v21-__assignor-__leader                │ {0}        │
│ opm-v21-opm.pmdb_service.plants-changelog  │ {0}        │
│ opm-v21-opm.pmdb_service.objects-changelog │ {0}        │
│ opm.pmdb.PMPlants                          │ {0}        │
│ opm.pmdb.PMObjects                         │ {0}        │
└────────────────────────────────────────────┴────────────┘ for group opm-v21
[2018-10-16 09:17:24,280: INFO]: [^--Fetcher]: Stopping...
[2018-10-16 09:17:24,280: INFO]: [^--Fetcher]: -Stopped!
[2018-10-16 09:17:24,280: INFO]: [^--TableManager]: Triggered recovery in background
[2018-10-16 09:17:24,280: INFO]: [^--TableManager]: New assignments found
[2018-10-16 09:17:24,293: INFO]: [^---Store: opm.pmdb_service.objects]: DB for partition 0 is locked! Retry in 1s...
[2018-10-16 09:17:25,296: INFO]: [^---Store: opm.pmdb_service.objects]: DB for partition 0 is locked! Retry in 1s...
[2018-10-16 09:17:26,299: INFO]: [^---Store: opm.pmdb_service.objects]: DB for partition 0 is locked! Retry in 1s...
[2018-10-16 09:17:27,302: INFO]: [^---Store: opm.pmdb_service.objects]: DB for partition 0 is locked! Retry in 1s...
[2018-10-16 09:17:28,305: ERROR]: [^--TableManager]: Crashed reason=RocksIOError(b'IO error: lock : opm-data/v21/tables/opm.pmdb_service.db/LOCK: No locks available',)
Traceback (most recent call last):
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/stores/rocksdb.py", line 210, in _db_for_partition
    return self._dbs[partition]
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/mode/services.py", line 700, in _execute_task
    await task
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/tables/manager.py", line 196, in _recover
    await table.on_partitions_assigned(assigned)
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/tables/base.py", line 409, in on_partitions_assigned
    await self.data.on_partitions_assigned(self, assigned)
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/stores/rocksdb.py", line 273, in on_partitions_assigned
    self._db_for_partition(tp.partition)
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/stores/rocksdb.py", line 212, in _db_for_partition
    db = self._dbs[partition] = self._open_for_partition(partition)
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/stores/rocksdb.py", line 216, in _open_for_partition
    return self.options.open(self.partition_path(partition))
  File "/home/hartmut/.local/share/virtualenvs/opm-JvQpWqh_/lib/python3.6/site-packages/faust/stores/rocksdb.py", line 104, in open
    return rocksdb.DB(str(path), self.as_options(), read_only=read_only)
  File "rocksdb/_rocksdb.pyx", line 1437, in rocksdb._rocksdb.DB.__cinit__
  File "rocksdb/_rocksdb.pyx", line 84, in rocksdb._rocksdb.check_status
rocksdb.errors.RocksIOError: b'IO error: lock : opm-data/v21/tables/opm.pmdb_service.db/LOCK: No locks available'
[2018-10-16 09:17:28,310: INFO]: [^--Web]: Closing server
[2018-10-16 09:17:28,310: INFO]: [^--Web]: Waiting for server to close handle
[2018-10-16 09:17:28,310: INFO]: [^--Web]: Shutting down web application
[2018-10-16 09:17:28,310: INFO]: [^--Web]: Waiting for handler to shut down
[2018-10-16 09:17:28,310: INFO]: [^--Web]: Cleanup
[2018-10-16 09:17:28,310: INFO]: [^Worker]: Stopping...
[2018-10-16 09:17:28,310: INFO]: [^-App]: Stopping...
[2018-10-16 09:17:28,310: INFO]: [^--TableManager]: Stopping...
[2018-10-16 09:17:28,310: INFO]: [^--Fetcher]: Stopping...
[2018-10-16 09:17:28,310: INFO]: [^--Fetcher]: -Stopped!
[2018-10-16 09:17:28,310: INFO]: [^--TableManager]: Aborting ongoing recovery to start over
[2018-10-16 09:17:28,310: INFO]: [^--Table: opm.pmdb_service.plants]: Stopping...
[2018-10-16 09:17:28,310: INFO]: [^---Store: opm.pmdb_service.plants]: Stopping...
[2018-10-16 09:17:28,311: INFO]: [^---Store: opm.pmdb_service.plants]: -Stopped!
[2018-10-16 09:17:28,311: INFO]: [^--Table: opm.pmdb_service.plants]: -Stopped!
[2018-10-16 09:17:28,311: INFO]: [^--Table: opm.pmdb_service.objects]: Stopping...
[2018-10-16 09:17:28,311: INFO]: [^---Store: opm.pmdb_service.objects]: Stopping...
[2018-10-16 09:17:28,311: INFO]: [^---Store: opm.pmdb_service.objects]: -Stopped!
[2018-10-16 09:17:28,311: INFO]: [^--Table: opm.pmdb_service.objects]: -Stopped!
[2018-10-16 09:17:28,311: INFO]: [^--TableManager]: -Stopped!
[2018-10-16 09:17:28,311: INFO]: [^--Conductor]: Stopping...
[2018-10-16 09:17:28,311: INFO]: [^--Conductor]: Terminating cancelled task: <coroutine object Conductor._subscriber at 0x7efdfe237570>
[2018-10-16 09:17:28,312: INFO]: [^--Conductor]: -Stopped!
[2018-10-16 09:17:28,312: INFO]: [^AgentManagerService]: Stopping...
[2018-10-16 09:17:28,312: INFO]: [^--Agent: opm.pmdb_servic[.]process_plants]: Stopping...
[2018-10-16 09:17:28,312: INFO]: [^---OneForOneSupervisor]: Stopping...
[2018-10-16 09:17:28,312: INFO]: [^---OneForOneSupervisor]: Terminating cancelled task: <coroutine object SupervisorStrategy._supervisor at 0x7efdfe237d00>
[2018-10-16 09:17:28,312: INFO]: [^---OneForOneSupervisor]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--Agent: opm.pmdb_servic[.]process_plants]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--Agent: opm.pmdb_servi[.]process_objects]: Stopping...
[2018-10-16 09:17:28,313: INFO]: [^---OneForOneSupervisor]: Stopping...
[2018-10-16 09:17:28,313: INFO]: [^---OneForOneSupervisor]: Terminating cancelled task: <coroutine object SupervisorStrategy._supervisor at 0x7efdfe237e60>
[2018-10-16 09:17:28,313: INFO]: [^---OneForOneSupervisor]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--Agent: opm.pmdb_servi[.]process_objects]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^AgentManagerService]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--ReplyConsumer]: Stopping...
[2018-10-16 09:17:28,313: INFO]: [^--ReplyConsumer]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--LeaderAssignor]: Stopping...
[2018-10-16 09:17:28,313: INFO]: [^--LeaderAssignor]: -Stopped!
[2018-10-16 09:17:28,313: INFO]: [^--Consumer]: Stopping...
[2018-10-16 09:17:28,315: INFO]: LeaveGroup request succeeded
[2018-10-16 09:17:28,316: INFO]: [^--Consumer]: -Stopped!
[2018-10-16 09:17:28,316: INFO]: [^--Producer]: Stopping...
[2018-10-16 09:17:28,316: INFO]: [^--Producer]: -Stopped!
[2018-10-16 09:17:28,316: INFO]: [^--WebService]: Stopping...
[2018-10-16 09:17:28,316: INFO]: [^---ServerThread]: Stopping...
[2018-10-16 09:17:28,316: INFO]: [^---ServerThread]: -Stopped!
[2018-10-16 09:17:28,316: INFO]: [^--WebService]: -Stopped!
[2018-10-16 09:17:28,316: INFO]: [^CacheBackend]: Stopping...
[2018-10-16 09:17:28,317: INFO]: [^CacheBackend]: -Stopped!
[2018-10-16 09:17:28,317: INFO]: [^--MonitorService]: Stopping...
[2018-10-16 09:17:28,317: INFO]: [^--MonitorService]: -Stopped!
[2018-10-16 09:17:28,317: INFO]: [^-App]: -Stopped!
[2018-10-16 09:17:28,317: INFO]: [^Worker]: -Stopped!
[2018-10-16 09:17:28,317: INFO]: [^Worker]: Gathering service tasks...
[2018-10-16 09:17:28,317: INFO]: [^Worker]: Gathering all futures...
^C-INT- -INT- -INT- -INT- -INT- -INT-
[2018-10-16 09:17:28,864: INFO]: [^Worker]: Stopping on signal received...
[2018-10-16 09:17:29,419: INFO]: [^Worker]: Closing event loop
[2018-10-16 09:17:29,420: CRITICAL]: [^Worker]: We experienced a crash! Reraising original exception...

Versions

  • Python version = v3.6.6
  • Faust version = v1.2.1
  • Operating system = Ubuntu 18.04
  • Kafka version = 2.0.0-cp1
  • RocksDB version (if applicable) = 5.8
ask commented Oct 17, 2018

If you really have only one process using the --datadir then it should be safe to remove the lock:

$ find opm-data/v21/tables/ -name "LOCK" | xargs rm

Usually the process will clean this up when exiting, and I haven't had any problems even after performing a kill -9 on a worker, but from experience flock is a rather tricky beast, so there could still be edge cases I guess ¯\_(ツ)_/¯
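The lock behaviour described above can be demonstrated with a small stdlib-only sketch (this illustrates advisory flock() semantics in general, not Faust or RocksDB internals, and the path is made up; whether RocksDB uses flock() or fcntl() locks under the hood, the contention pattern is similar):

```python
import fcntl
import os
import tempfile

# RocksDB guards each DB directory with an advisory lock on a LOCK file.
# flock() locks belong to the open file description, so even two handles
# in the *same* process contend for the same LOCK file.
lock_path = os.path.join(tempfile.mkdtemp(), "LOCK")

holder = open(lock_path, "w")
fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)  # first opener takes the lock

contender = open(lock_path, "w")
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_attempt = "acquired"
except BlockingIOError:
    second_attempt = "denied"   # what a second store opening the same path hits

holder.close()                  # a clean exit releases the lock...
fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)  # ...so a retry succeeds
print(second_attempt)           # denied
```

Note the difference in errno: a held flock surfaces as EWOULDBLOCK, while "No locks available" in the traceback is ENOLCK, so simply deleting stale LOCK files may indeed not be the whole story here.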

hartmutbehrens commented Oct 17, 2018

@ask I tried that, just before starting up the worker, but still no joy.
I've even nuked the entire opm-data datadir before startup, but still no success.

Do you have any other recommendations?

@hartmutbehrens (Author)

Some additional information:
If I remove one of the topics and its associated table (e.g. plants_table and plants_topic in the example above) AND set topic_partitions=1, then I no longer get the No locks available error.


hartmutbehrens commented Oct 17, 2018

If it helps, I start the worker with: python -m opm worker -l info. This results in only one python/faust process appearing in the output of ps -ef.


ask commented Oct 18, 2018

Thanks!

I managed to reproduce this, and strangely the datadir is populated with a single
table directory when there should be two:

(def) /opt/devel/faust(master)$> ls -l opm-data/v21/tables/
drwxr-xr-x - ask 18 Oct  6:41 opm.pmdb_service.db

If I rename them to something else:

# was opm.pmdb_service.plants
plants_table = app.Table('plants',
                        default=dict,
                        key_type=str,
                        value_type=dict)

# was opm.pmdb_service.objects
objects_table = app.Table('objects',
                          default=dict,
                          key_type=str,
                          value_type=dict)

it works - no lock errors and the expected number of table directories:

drwxr-xr-x - ask 18 Oct  6:49 objects-0.db
drwxr-xr-x - ask 18 Oct  6:49 plants-0.db

At first I thought it had something to do with . in the table names, but it still happens
if I change the dots to underscores. Will investigate further.
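One plausible mechanism for the collision (a guess, not confirmed anywhere in this thread): if the store derives the per-partition directory name using pathlib's with_suffix(), everything after the last dot in the table name is treated as a file extension and replaced, so both dotted names collapse onto the single opm.pmdb_service.db directory listed above. This would not explain the underscore variant still failing, so it is at most part of the story. A stdlib-only sketch:

```python
from pathlib import Path

# Hypothetical reconstruction: building "<table>-<partition>.db" via
# Path.with_suffix() drops everything after the last dot in the name,
# so two dotted table names map to one and the same DB directory.
plants = Path("opm-data/v21/tables/opm.pmdb_service.plants-0").with_suffix(".db")
objects = Path("opm-data/v21/tables/opm.pmdb_service.objects-0").with_suffix(".db")

print(plants)             # opm-data/v21/tables/opm.pmdb_service.db
print(plants == objects)  # True: whichever table opens it second hits the LOCK
```

If this is what happens, appending the suffix with plain string concatenation (path.parent / (path.name + ".db")) instead of with_suffix() would keep the two tables in separate directories.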
