Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: crmd: Prevent use-after-free when an unexpected remote client takes over #928

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions crmd/crmd_lrm.h
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ lrm_state_t *lrm_state_find_or_create(const char *node_name);
* Regular resources go to the lrmd, and remote connection resources are
* handled locally in the crmd.
*/
void lrm_state_disconnect_only(lrm_state_t * lrm_state);
void lrm_state_disconnect(lrm_state_t * lrm_state);
int lrm_state_ipc_connect(lrm_state_t * lrm_state);
int lrm_state_remote_connect_async(lrm_state_t * lrm_state, const char *server, int port,
Expand Down
12 changes: 11 additions & 1 deletion crmd/lrm_state.c
Original file line number Diff line number Diff line change
Expand Up @@ -327,7 +327,7 @@ remote_proxy_disconnect_by_node(const char * node_name)
}

void
lrm_state_disconnect(lrm_state_t * lrm_state)
lrm_state_disconnect_only(lrm_state_t * lrm_state)
{
int removed = 0;

Expand All @@ -344,6 +344,16 @@ lrm_state_disconnect(lrm_state_t * lrm_state)
removed = g_hash_table_foreach_remove(lrm_state->pending_ops, fail_pending_op, lrm_state);
crm_trace("Synthesized %d operation failures for %s", removed, lrm_state->node_name);
}
}

void
lrm_state_disconnect(lrm_state_t * lrm_state)
{
if (!lrm_state->conn) {
return;
}

lrm_state_disconnect_only(lrm_state);

lrmd_api_delete(lrm_state->conn);
lrm_state->conn = NULL;
Expand Down
7 changes: 5 additions & 2 deletions crmd/remote_lrmd_ra.c
Original file line number Diff line number Diff line change
Expand Up @@ -492,8 +492,11 @@ remote_lrm_op_callback(lrmd_event_data_t * op)
if (ra_data->migrate_status == expect_takeover) {
ra_data->migrate_status = takeover_complete;
} else {
crm_err("Unexpected pacemaker_remote client takeover. Disconnecting");
lrm_state_disconnect(lrm_state);
crm_err("Unexpected pacemaker_remote client takeover for %s. Disconnecting", op->remote_nodename);
/* In this case, lrmd_tls_connection_destroy() will be called under the control of mainloop. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is the source removed from the mainloop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source is removed by lrm_state_disconnect() -> lrmd_api_disconnect() -> lrmd_tls_disconnect() -> lrmd_tls_disconnect() -> mainloop_del_ipc_client() -> mainloop_del_fd().

But after the control was returned to mainloop, crmd dumped core:

#0 0x00007fc8ac999187 in raise () from /lib64/libc.so.6
#1 0x00007fc8ac99a538 in abort () from /lib64/libc.so.6
#2 0x00007fc8ac9d6804 in __libc_message () from /lib64/libc.so.6
#3 0x00007fc8ac9dc06e in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fc8ac9dcd86 in _int_free () from /lib64/libc.so.6
#5 0x00007fc8acd0fee5 in lrmd_tls_connection_destroy (userdata=) at lrmd_client.c:547
#6 0x00007fc8ad16903d in mainloop_gio_destroy (c=0x271f220) at mainloop.c:744
#7 0x00007fc8ac6a9228 in ?? () from /usr/lib64/libglib-2.0.so.0
#8 0x00007fc8ac6ac272 in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
#9 0x00007fc8ac6ac4b8 in ?? () from /usr/lib64/libglib-2.0.so.0
#10 0x00007fc8ac6ac8ba in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#11 0x000000000040715e in crmd_init () at main.c:154
#12 0x0000000000406f34 in main (argc=1, argv=0x7ffd1c433858) at main.c:121

/* Do not free lrm_state->conn yet. */
/* It'll be freed in the following stop action. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know there will be a following stop action?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disconnecting will introduce a monitor failure, which will cause it to be restarted, and actually eventually trigger a fencing.

lrm_state_disconnect_only(lrm_state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not freeing lrm_state here and expecting a stop action on it later, I'm thinking we should set migrate_status to takeover_complete unconditionally. That would prevent the stop from announcing the remote node as down. (If I'm following what's happening here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it makes sense for an unexpected takeover though. It can be produced:

  1. Connect to the remote node from a corosync node
  2. Connect to the remote node somehow again, for example from another cluster or with a tool like:
    netcat -t localhost 3121

See also:
http://bugs.clusterlabs.org/show_bug.cgi?id=5269

  1. crmd on the corosync node receives lrmd_event_new_client and crashes.

}
return;
}
Expand Down