
Sporadic iRODS errors in CI after iRODS 4.3 env updates #2029

Open
mikkonie opened this issue Oct 16, 2024 · 7 comments
Labels: bug (Something isn't working), environment (Issues of dependencies, CI, deployment etc.), ongoing (Ongoing issue, needs observation or pending on other projects)
mikkonie commented Oct 16, 2024

Earlier today I upgraded to python-irodsclient==2.2.0 (#2023), mainly because the release was supposed to fix the redirect problem I was previously experiencing.

The initial test run worked fine. However, a subsequent commit containing only documentation changes had a large number (possibly all) of iRODS tests failing with connection errors. Example in the comments below.

My guesses:

I'll try to gather more information and possibly contact iRODS support.

@mikkonie mikkonie added bug Something isn't working environment Issues of dependencies, CI, deployment etc. labels Oct 16, 2024
@mikkonie mikkonie added this to the v1.0.0 milestone Oct 16, 2024
@mikkonie mikkonie self-assigned this Oct 16, 2024
mikkonie commented:

Example of one of the many identical test failures:

======================================================================
ERROR: test_update_parent_override (taskflowbackend.tests.test_plugins.TestPerformRoleModify.test_update_parent_override)
Test updating overriding parent assignment
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 326, in server_version
    return tuple(ast.literal_eval(reported_vsn))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 64, in literal_eval
    node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<unknown>", line 0
    
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 332, in __server_version
    conn = next(iter(self.pool.active))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 70, in get_connection
    conn = self.idle.pop()
           ^^^^^^^^^^^^^^^
KeyError: 'pop from an empty set'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 470, in setUp
    super().setUp()
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 42, in setUp
    super().setUp()
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/base.py", line 265, in setUp
    self.irods = self.irods_backend.get_session_obj()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 473, in get_session_obj
    return self._init_irods()
           ^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 115, in _init_irods
    raise ex
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 102, in _init_irods
    irods.collections.exists(
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 87, in exists
    self.get(path)
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 22, in get
    query = self.sess.query(Collection).filter(*filters)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 290, in query
    return Query(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/query.py", line 45, in __init__
    if self.sess.server_version >= col.min_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 328, in server_version
    return self.__server_version()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 335, in __server_version
    conn = self.pool.get_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 17, in method_
    ret = method(self,*s,**kw)
          ^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 87, in get_connection
    conn = Connection(self, self.account)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 77, in __init__
    auth_module.login(self)
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/auth/native.py", line 3, in login
    conn._login_native()
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 617, in _login_native
    self.recv()
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 152, in recv
    exc = get_exception_by_code(msg.int_info, err_msg)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/exception.py", line 186, in get_exception_by_code
    exc_class = iRODSExceptionMeta.codes[ rounded_code( code ) ]
                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: -167000
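
Two of the chained errors in this traceback can be reproduced with the standard library alone, which helps separate client-side symptoms from the actual connection failure. A minimal sketch (the variable names are illustrative):

```python
import ast

# 1) The innermost SyntaxError: ast.literal_eval cannot parse an empty string.
#    If the server's reported version string ends up empty after a failed
#    handshake, the version lookup fails exactly like this.
try:
    tuple(ast.literal_eval(""))
except SyntaxError:
    print("empty version string -> SyntaxError")

# 2) The pool KeyError: popping from an empty set of idle connections.
idle_connections = set()
try:
    idle_connections.pop()
except KeyError as exc:
    print(exc)  # 'pop from an empty set'
```

Neither of these is the root cause; they only show that once a connection attempt fails, the client's version lookup and pool bookkeeping produce these secondary errors on top of it.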

mikkonie commented Oct 16, 2024

I totally forgot to check the container output in the CI job shutdown phase. It tells us that iRODS is initialized correctly, but we get a bunch of "too many clients" errors from Postgres:

FATAL:  sorry, too many clients already

This suggests something is opening too many connections to Postgres. I've seen something similar when I had bugs in closing python-irodsclient connections and iRODS sessions were left open. However, those bugs have long since been fixed, so this should not be an issue; and if it were, it would show up consistently.

I don't think the Django server would end up in any situation where it would flood postgres with open connections. At least I have never encountered that, even with heavy use in production.

It is also possible that this is a symptom rather than the cause: something else goes wrong and Postgres ends up flooded with open connections?

@mikkonie mikkonie changed the title Sporadic iRODS errors in CI after latest env updates Sporadic iRODS errors in CI after iRODS 4.3 env updates Oct 17, 2024
mikkonie commented:

Full log of the failed job here: ci_irods_crash.txt

I'll have to look into whether some clues can be derived from this. It appears that none of the taskflow tests succeed. Perhaps they fail in a way which also leaves Postgres connections open.

Another thing of note is the following error, which I don't recall encountering before: irods.message.Bad_AVU_Field: AVU value ('') is zero-length.

That could be a symptom and not the problem though. I'll have to look at this log file in detail later, currently too busy with other stuff.
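
If the hypothesis is that tests failing in setUp() leave sessions (and thus Postgres-backed connections) open, the standard fix pattern is cleanup in a finally block. A minimal sketch with a stand-in session class (FakeSession and its cleanup() are illustrative, not SODAR or python-irodsclient code):

```python
class FakeSession:
    """Stand-in for an iRODS session that holds a server-side connection."""
    def __init__(self):
        self.closed = False

    def cleanup(self):
        self.closed = True  # release the connection

def set_up(session, fail=False):
    """Simulate a test setUp() that may die mid-way."""
    try:
        if fail:
            raise RuntimeError("setUp failed mid-way")
    finally:
        session.cleanup()  # runs even when setUp raises

session = FakeSession()
try:
    set_up(session, fail=True)
except RuntimeError:
    pass
print(session.closed)  # True: the connection was released despite the failure
```

With this pattern, a test that dies in setUp() still releases its connection instead of leaving it hanging until Postgres hits its client limit.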

mikkonie commented Oct 22, 2024

I'm looking into the full crash log in detail, writing down some facts and thoughts here.

Facts and Observations

  • All irodsadmin tests succeed
    • These contain iRODS actions (although no file actions, in case that's relevant)
    • This implies the iRODS server initialization works OK and the server is responsive, at least initially
    • The tearDown() method in these tests looks OK at first glance
      • However it must be noted that these tests do NOT use the usual Taskflow base test classes
  • Tests in irodsbackend.tests.test_api succeed
    • These do not connect to iRODS, so that is to be expected
  • The first failing test case is irodsbackend.tests.test_api_taskflow.TestIrodsBackendAPITaskflow.test_get_delete_ticket
    • This is the first iRODS-using test after the irodsadmin tests (which all succeeded)
    • It should be noted that this test already fails in setUp(), so what is in the actual test case doesn't matter
    • My first hunch was that the iRODS server had somehow crashed after the initial successful tests. However...
  • After this, all iRODS-connecting tests fail
  • Many of the remaining tests log CRITICAL:irods.connection:Read 0 bytes from socket instead of expected 4 bytes
    • Presumably residue from those failed connections
  • Also many setUpClass errors, which I haven't seen before

Error Types Raised

  • projectroles.models.Project.DoesNotExist: Project matching query does not exist.
    • This means make_project_taskflow() failed, in this case probably due to not being able to initiate iRODS session.
  • irods.exception.UserDoesNotExist
    • Raised by e.g. irodsbackend.tests.test_signals_taskflow.TestCreateIrodsUser.test_create
    • It would seem user creation in the login signal fails silently, again because of a problem with iRODS session init, so we end up with no user
  • KeyError: -1213000
    • When calling conn.recv(), an exception is raised with a code that python-irodsclient apparently doesn't recognize
    • Probably just another side-effect of the session failing unexpectedly
    • TODO: Search if there is a reference to something similar and this error code in particular
  • taskflowbackend.api.TaskflowAPI.FlowSubmitException: Error running flow: -1213000
    • See above
  • KeyError: -167000
    • This one is weird because it actually causes a SyntaxError to be raised from python-irodsclient
    • It seems there is something wrong with the exception handling within the package, although that isn't the root cause here
  • django.db.utils.OperationalError: connection to server at "127.0.0.1", port 5432 failed: FATAL: sorry, too many clients already
    • This (along with the rest of the errors) is self-explanatory: too many postgres sessions have been left hanging, most likely from iRODS, so postgres denies further attempts at connections
    • Usually this happens if iRODS projects hang or we don't cleanup sessions in python-irodsclient
    • In normal conditions, cleanup does always happen
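
The bare KeyError: -167000 suggests the code-to-exception lookup has no fallback for codes it doesn't know, so the lookup itself blows up instead of reporting the server error. A defensive lookup would degrade into a generic exception. This is an illustrative sketch only, not python-irodsclient's actual internals (the registry contents and names here are made up):

```python
class UnknownIRODSError(Exception):
    """Fallback for error codes missing from the registry (illustrative)."""

# Hypothetical registry; the real client maps rounded iRODS codes to
# exception classes, and raises KeyError when a code is absent.
KNOWN_CODES = {-818000: "CAT_NO_ACCESS_PERMISSION"}

def exception_for(code):
    # dict.get avoids the bare KeyError seen in the traceback
    name = KNOWN_CODES.get(code)
    return Exception(name) if name else UnknownIRODSError(code)

print(type(exception_for(-167000)).__name__)  # UnknownIRODSError
```

With a fallback like this, an unrecognized code would still surface as an iRODS-related error carrying the numeric code, rather than a confusing KeyError deep inside the client.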

Conclusions

  • iRODS initialization works fine initially
  • iRODS tests succeed in the beginning
  • At some point, iRODS sessions start failing and are not closed properly
    • Why?
  • Too many hanging failed iRODS sessions cause the postgres client limit to be reached, denying all further connections

Since I currently have no way of reproducing this, I'll have to wait for this to happen again and compare the log to this one.

Another possibility would be to try running this in dev until the problem appears and then look at the irods-test logs. However, I haven't yet encountered this in dev. Maybe echo the iRODS log into the CI output?

mikkonie commented:

Possibly related: my latest local test run resulted in a number of timeout messages being logged while running tests in landingzones.tests.test_views_taskflow.TestZoneMoveView:

ERROR:irods.connection:Unable to send message. Connection to remote host may have closed. Releasing connection from pool.

However, all the tests executed successfully. After running the tests, the expected number of sessions was open in irods-test, and there was nothing in irods.log about any errors.

I should probably set the log level on the test server to debug (trace would most likely be overkill) and see if I can spot anything when this does fail.

mikkonie commented:

I'm setting this as ongoing for now, as it seems to be very rare (observed only twice so far) and hasn't happened in a while.

If similar server freezeups are observed in staging or production, this will of course get re-prioritized higher.

@mikkonie mikkonie removed this from the v1.0.0 milestone Nov 21, 2024
@mikkonie mikkonie added the ongoing Ongoing issue, needs observation or pending on other projects label Nov 21, 2024
mikkonie commented:

I have not noticed these errors for a while with the current build of our image. Still monitoring the situation, but I'm hopeful.
