
The "Found an invalid row for peer" error is obscure #303

Closed
bhalevy opened this issue Feb 28, 2024 · 3 comments · Fixed by #305

bhalevy commented Feb 28, 2024

I'm seeing this error in a dtest that adds 2 new nodes (in parallel) to a 3-node cluster.

08:56:37,665 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.4', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.4', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,667 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.4', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.4', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.4', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla-jmx-1.0.jar']
08:56:37,972 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.5', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,973 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.5', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.5', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.5', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla-jmx-1.0.jar']
08:56:43,620 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:43,620 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered
08:56:45,521 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:46,585 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:436  | test_tablets_add_two_nodes_in_parallel: Check that nodes started successfully
08:56:46,586 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:441  | test_tablets_add_two_nodes_in_parallel: Verify tablet load-balancing after 10 seconds
08:56:48,206 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,590 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.3:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.2:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered

It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.
It would help if we printed more information to facilitate debugging in case something is wrong on the Scylla server side.

@avelanarius

The relevant code that checks the "validity" of a system.peers row:

def _is_valid_peer(row):
    return bool(_NodeInfo.get_broadcast_rpc_address(row) and row.get("host_id") and
                row.get("data_center") and row.get("rack") and
                ('tokens' not in row or row.get('tokens')))

and the warning is printed here:

if not self._is_valid_peer(row):
    log.warning(
        "Found an invalid row for peer (%s). Ignoring host." %
        _NodeInfo.get_broadcast_rpc_address(row))
    continue

It should print more details (the entire row?) and which of the conditions caused the invalidity of the row.
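A minimal sketch of that direction: each condition from the check above is tested separately so the warning can name the exact reason, and the whole row is logged when there is no usable address. This is an illustration, not the actual #305 patch; _broadcast_rpc_address is a made-up stand-in for the driver's _NodeInfo.get_broadcast_rpc_address, and the functions are module-level here only to keep the example self-contained:

import logging

log = logging.getLogger(__name__)

def _broadcast_rpc_address(row):
    # Stand-in for _NodeInfo.get_broadcast_rpc_address; the real helper
    # picks the address from version-dependent columns.
    return row.get("rpc_address") or row.get("native_transport_address")

def _is_valid_peer(row):
    # Check each required field separately so the warning can name the
    # exact reason the row was rejected (sketch; the real patch may differ).
    addr = _broadcast_rpc_address(row)
    if not addr:
        # No usable address at all: log the entire row.
        log.warning("Found an invalid row for peer: %r. Ignoring host.", row)
        return False
    for field in ("host_id", "data_center", "rack"):
        if not row.get(field):
            log.warning("Found an invalid row for peer (%s): missing %s. "
                        "Ignoring host.", addr, field)
            return False
    if "tokens" in row and not row.get("tokens"):
        log.warning("Found an invalid row for peer (%s): empty tokens. "
                    "Ignoring host.", addr)
        return False
    return True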

avelanarius self-assigned this Mar 1, 2024
avelanarius added a commit to avelanarius/python-driver that referenced this issue Mar 1, 2024
Before this change, when the driver received an invalid system.peers row
it would log a very general warning:

    Found an invalid row for peer (127.0.73.5). Ignoring host.

A system.peers row can be invalid for a multitude of reasons and that
warning message did not describe the specific reason for the failure.

Improve the warning message by adding a specific reason why the row
is considered invalid by the driver. The message now also includes
the host_id or the entire row (in case the driver received a row
without even the basic broadcast_rpc).

It might be a bit inelegant to introduce a side effect (logging) to
the _is_valid_peer static method, however the alternative solution seemed
even worse - adding that code to the already big
_refresh_node_list_and_token_map.

Fixes scylladb#303
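
Using the sketch above, a toy run of the hypothetical check (the row values are invented for illustration) shows the intended difference in the warning:

import logging
logging.basicConfig(level=logging.WARNING)

# A made-up peers row for a node that is still joining: everything is
# present except the token set, which is empty.
row = {
    "rpc_address": "127.0.73.5",
    "host_id": "made-up-host-id",
    "data_center": "datacenter1",
    "rack": "rack1",
    "tokens": set(),
}
assert not _is_valid_peer(row)
# Logs: Found an invalid row for peer (127.0.73.5): empty tokens. Ignoring host.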
@avelanarius

> It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.

I addressed the "why the row in system.peers was considered invalid" part in #305; I did not necessarily address the "on which node it was read from" part, but that can be deduced from other logs.


bhalevy commented Mar 1, 2024

> > It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.
>
> I addressed the "why the row in system.peers was considered invalid" part in #305; I did not necessarily address the "on which node it was read from" part, but that can be deduced from other logs.

Thanks, hopefully it will be enough.
Heads up @kbr-scylla
