
The "Found an invalid row for peer" error is obscure #303

Closed
bhalevy opened this issue Feb 28, 2024 · 3 comments · Fixed by #305

bhalevy commented Feb 28, 2024

I'm seeing this error in a dtest that adds 2 new nodes (in parallel) to a 3-node cluster.

08:56:37,665 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.4', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.4', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,667 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node4: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.4', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.4', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.4', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node4/bin/scylla-jmx-1.0.jar']
08:56:37,972 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla', '--options-file', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/conf/scylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.73.5', '--smp', '2', '--memory', '1024M', '--developer-mode', 'true', '--default-log-level', 'info', '--overprovisioned', '--prometheus-address', '127.0.73.5', '--unsafe-bypass-fsync', '1', '--kernel-page-cache', '1', '--commitlog-use-o-dsync', '0', '--max-networking-io-control-blocks', '1000'] wait_other_notice=False wait_for_binary_proto=False
08:56:37,973 1736154 ccm                            DEBUG    cluster.py          :762  | test_tablets_add_two_nodes_in_parallel: node5: Starting scylla-jmx: args=['/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.73.5', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.73.5', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.73.5', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/bhalevy/.dtest/dtest-u7u3ay78/test/node5/bin/scylla-jmx-1.0.jar']
08:56:43,620 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:43,620 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered
08:56:45,521 1736154 cassandra.cluster              WARNING  thread.py           :58   | test_tablets_add_two_nodes_in_parallel: Found an invalid row for peer (127.0.73.5). Ignoring host.
08:56:46,585 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:436  | test_tablets_add_two_nodes_in_parallel: Check that nodes started successfully
08:56:46,586 1736154 update_cluster_layout_tests    DEBUG    update_cluster_layout_tests.py:441  | test_tablets_add_two_nodes_in_parallel: Verify tablet load-balancing after 10 seconds
08:56:48,206 1736154 cassandra.cluster              INFO     thread.py           :58   | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,590 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.5:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.3:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.2:9042 datacenter1> discovered
08:56:56,591 1736154 cassandra.cluster              INFO     dtest_setup.py      :493  | test_tablets_add_two_nodes_in_parallel: New Cassandra host <Host: 127.0.73.4:9042 datacenter1> discovered

It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.
It would help if we printed more information to facilitate debugging in case something is wrong on the Scylla server side.

@avelanarius

The relevant code that checks the "validity" of a system.peers row:

def _is_valid_peer(row):
    return bool(_NodeInfo.get_broadcast_rpc_address(row) and row.get("host_id") and
                row.get("data_center") and row.get("rack") and
                ('tokens' not in row or row.get('tokens')))

and the warning is printed here:

if not self._is_valid_peer(row):
    log.warning(
        "Found an invalid row for peer (%s). Ignoring host." %
        _NodeInfo.get_broadcast_rpc_address(row))
    continue

It should print more details (the entire row?) and which of the conditions caused the invalidity of the row.
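A minimal sketch of that direction: each condition from the check above is tested separately so the warning can name the exact reason, and the whole row is logged when there is no usable address. This is an illustration, not the actual #305 patch; _broadcast_rpc_address is a made-up stand-in for the driver's _NodeInfo.get_broadcast_rpc_address, and the functions are module-level here only to keep the example self-contained:

import logging

log = logging.getLogger(__name__)

def _broadcast_rpc_address(row):
    # Stand-in for _NodeInfo.get_broadcast_rpc_address; the real helper
    # picks the address from version-dependent columns.
    return row.get("rpc_address") or row.get("native_transport_address")

def _is_valid_peer(row):
    # Check each required field separately so the warning can name the
    # exact reason the row was rejected (sketch; the real patch may differ).
    addr = _broadcast_rpc_address(row)
    if not addr:
        # No usable address at all: log the entire row.
        log.warning("Found an invalid row for peer: %r. Ignoring host.", row)
        return False
    for field in ("host_id", "data_center", "rack"):
        if not row.get(field):
            log.warning("Found an invalid row for peer (%s): missing %s. "
                        "Ignoring host.", addr, field)
            return False
    if "tokens" in row and not row.get("tokens"):
        log.warning("Found an invalid row for peer (%s): empty tokens. "
                    "Ignoring host.", addr)
        return False
    return True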

avelanarius self-assigned this Mar 1, 2024
avelanarius added a commit to avelanarius/python-driver that referenced this issue Mar 1, 2024
Before this change, when the driver received an invalid system.peers row
it would log a very general warning:

    Found an invalid row for peer (127.0.73.5). Ignoring host.

A system.peers row can be invalid for a multitude of reasons and that
warning message did not describe the specific reason for the failure.

Improve the warning message by adding a specific reason why the row
is considered invalid by the driver. The message now also includes
the host_id or the entire row (in case the driver received a row
without even the basic broadcast_rpc).

It might be a bit inelegant to introduce a side effect (logging) to
the _is_valid_peer static method, however the alternative solution seemed
even worse - adding that code to the already big
_refresh_node_list_and_token_map.

Fixes scylladb#303
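
Using the sketch above, a toy run of the hypothetical check (the row values are invented for illustration) shows the intended difference in the warning:

import logging
logging.basicConfig(level=logging.WARNING)

# A made-up peers row for a node that is still joining: everything is
# present except the token set, which is empty.
row = {
    "rpc_address": "127.0.73.5",
    "host_id": "made-up-host-id",
    "data_center": "datacenter1",
    "rack": "rack1",
    "tokens": set(),
}
assert not _is_valid_peer(row)
# Logs: Found an invalid row for peer (127.0.73.5): empty tokens. Ignoring host.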
@avelanarius

> It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.

I addressed the "why the row in system.peers was considered invalid" part in #305; I did not necessarily address the "on which node it was read from" part, but that can be deduced from other logs.


bhalevy commented Mar 1, 2024

> > It looks like the error heals itself eventually, but it's unclear why the row in system.peers was considered invalid, and on which node it was read from.
>
> I addressed the "why the row in system.peers was considered invalid" part in #305; I did not necessarily address the "on which node it was read from" part, but that can be deduced from other logs.

Thanks, hopefully it will be enough.
Heads up @kbr-scylla
