
Channel refactor and optimize #8322

Merged: 2 commits into crystal-lang:master on Nov 20, 2019

Conversation

@firejox (Contributor) commented Oct 13, 2019

Several changes are applied:

  1. use an internal storage list to reduce heap allocations
  2. move cleanup operations out of the lock section during close
  3. remove a redundant lock by making the internal send/receive return a state
  4. because of 3, a non-blocking select can be decided from that state (see the sketch after this list)
  5. the wait/unwait of a select action can be done in O(1)
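For readers not familiar with the channel internals, here is a loose sketch of what points 3 and 4 mean (simplified, hypothetical names; not the PR's actual code): the internal send/receive report a delivery state, so a non-blocking select can decide what to do from that state alone instead of taking another lock.

```crystal
# Hypothetical, simplified sketch of "internal send/receive return a state".
enum DeliveryState
  None      # nothing could be delivered; a blocking caller would have to wait
  Delivered # a value was handed over
  Closed    # the channel is closed
end

class MiniChannel(T)
  def initialize(@capacity : Int32)
    @queue = Deque(T).new
    @closed = false
  end

  def close : Nil
    @closed = true
  end

  # Internal receive: reports the state together with the value (if any),
  # so the caller decides whether to wait, raise, or return nil.
  def try_receive : {DeliveryState, T?}
    return {DeliveryState::Closed, nil} if @closed && @queue.empty?
    return {DeliveryState::None, nil} if @queue.empty?
    {DeliveryState::Delivered, @queue.shift}
  end

  def try_send(value : T) : DeliveryState
    return DeliveryState::Closed if @closed
    return DeliveryState::None if @queue.size >= @capacity
    @queue.push(value)
    DeliveryState::Delivered
  end
end

ch = MiniChannel(Int32).new(1)
ch.try_send(42)   # => Delivered
ch.try_receive    # => {Delivered, 42}
ch.try_receive    # => {None, nil}: a non-blocking select would simply move on
```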

@lribeiro

Hi firejox, how does this affect performance? Do you have any benchmarks you can show?

@asterite (Member)

I don't know if anyone told you, but the PRs you are sending related to concurrency are great!

@firejox (Contributor Author) commented Oct 14, 2019

@lribeiro Here are the benchmarks:

channel close benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-close-bench.cr -Dpreview_mt --release -- 10000
Using compiled compiler at `.build/crystal'
00:54:15.125022741

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-close-bench.cr -Dpreview_mt --release -- 10000
Using compiled compiler at `.build/crystal'
00:40:58.541846658

channel select unwait benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-select-unwait-bench.cr --release -- 100000
Using compiled compiler at `.build/crystal'
00:00:00.450562818

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-select-unwait-bench.cr --release -- 100000
Using compiled compiler at `.build/crystal'
00:00:00.210562496

channel ping pong benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-ping-pong-bench.cr -Dpreview_mt --release -- 500000
Using compiled compiler at `.build/crystal'
00:00:20.045008184

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-ping-pong-bench.cr -Dpreview_mt --release -- 500000
Using compiled compiler at `.build/crystal'
00:00:16.457026952

@firejox force-pushed the channel-refactor-and-optimize branch from d5a39ab to 5747de8 on October 14, 2019 07:56
@jkthorne (Contributor)

Thank you @firejox for all the work on concurrency!!

@@ -0,0 +1,98 @@
# :nodoc:
macro container_of(ptr, type, member)
Member

Even though this is :nodoc:, it leaks into the top-level namespace, so in my code I can call container_of and it will invoke this macro. It should probably be moved inside Crystal::StaticList and called as Crystal::StaticList.container_of. Same for the other macro.
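For illustration, a hedged sketch of the suggested namespacing (the macro body here is a plausible reconstruction, not necessarily the PR's exact code): nesting the macro inside the struct keeps it out of the top-level namespace, so callers have to qualify it.

```crystal
# :nodoc:
struct Crystal::StaticList
  # :nodoc:
  macro container_of(ptr, type, member)
    # Step back from the pointer to the embedded list node to the start of the
    # enclosing object, then reinterpret it as a pointer to that object.
    (({{ptr}}).as(UInt8*) - offsetof({{type}}, {{member}})).as(Pointer({{type}}))
  end
end

# Hypothetical call site, qualified with the owning type:
# receiver_ptr = Crystal::StaticList.container_of(node_ptr, Receiver(Int32), @list_node)
```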

Contributor Author

OK, I will move these two macros inside the struct. Thank you.

Contributor Author

@asterite Sorry to disturb you. I ran into a problem when I moved them into Crystal::StaticList: the macros cannot see Sender and Receiver because of their visibility. I see three ways to resolve it:

  1. Remove the private visibility of Sender and Receiver so that Crystal::StaticList.container_of can be called. The problem is that this leaks Sender and Receiver into the public API.

  2. Pack the macros into a module such as Crystal::StaticList::Reflect and include that module in Channel. Callers would then invoke container_of instead of Crystal::StaticList.container_of. This way is ugly and hard to understand.

  3. Use an annotation and a macro to generate a corresponding method, so the call site becomes something like Receiver(T).from_{{ member_name }}(ptr). This could increase compile time because of the extra computation.

Here are the code pieces for these changes:
https://gist.github.com/firejox/a6f2193a5e0820eef7e7e36c937c2330

Maybe there are other clever ways that I didn't come up with. What do you think?

Member

Instead of private visibility you can just put :nodoc: in a doc comment.
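In other words, something like this minimal sketch (reusing the Sender/Receiver names from this thread; in the PR they live inside Channel and carry more state):

```crystal
# :nodoc:
struct Sender(T)
  # Still reachable by Crystal::StaticList.container_of, but hidden from the
  # generated API docs instead of being marked `private`.
end

# :nodoc:
struct Receiver(T)
end
```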

Contributor Author

OK, I will use :nodoc:. Thank you.

@@ -352,19 +405,18 @@ class Channel(T)

ops_locks.each &.lock

# Check that no channel is closed
unless ops.all?(&.available?)
Member

If this is removed, then available? is no longer used.

But I don't follow where we now check that none of the channels is closed up front. If one channel is ready and another is closed, I think the current code in this PR will yield and not raise.

Contributor Author

Because op.execute returns the proper state, I think there is no need to check the channel twice. The case you mentioned is a new one, and the current specs don't cover it. Consider the following code:

# ch1 is ready and ch2 is closed

# a.cr
select
when m1 = ch1.receive?
when m2 = ch2.receive
end

# b.cr
select
when m1 = ch1.receive
when m2 = ch2.receive?
end

What are the expected behaviors of these two code pieces?

Member

OK. We can avoid the double check and accept that there is no guarantee about which channel, the closed one or the ready one, will fulfill the select.

  • a.cr will either execute ch1's when with m1 = nil, or ch2's when with data.
  • b.cr will either raise because ch1 is closed, or execute ch2's when with data.

There are no specs mixing receive and receive? though.

If ch2 were not ready, I would expect b.cr to always raise, on both a blocking and a non-blocking select. That is a spec that could be added. But I don't expect receive and receive? to be mixed.

So, this change is good.

Contributor Author

I see. Although your precondition is different from mine, I will add specs for mixing receive and receive?.

@@ -0,0 +1,97 @@
# :nodoc:
struct Crystal::StaticList
Member

I would like to find a more Crystal-like way to build this static list.
I think it should be possible to use a generic struct that enforces typing, and maybe some macros to hide the boilerplate as done here. But all the pointers, the casting, and the final interface make me doubtful.

Also, although it is :nodoc:, I think there could be some specs for the implementation.

The main problem is building a linked list whose data acts as the nodes, while letting them be stored on the stack, right?

Contributor Author

I think this is the difference between internal storage and external storage. External storage enforces the type explicitly, while internal storage requires some way to know which type contains the list node. That's why Crystal::StaticList.container_of is needed. Because we store the data on the stack, we cannot avoid handling pointers.

And I will add some specs for the StaticList.

Member

What do you think about something like this https://gist.github.com/bcardiff/4327823bb43cbb93734fbb6240315dac ?

  • there are fewer casts
  • the pointer trickery is hidden in the module
  • the API is a bit cleaner: sometimes methods, sometimes macro calls

There might be some subtle differences regarding the representation of the list, and this solution can only be used with structs. I didn't bother to make it work with classes since I think it is only used with structs.

In case we need to extend it, in the worst case macros can be injected into the type that includes the module, to specialize whether next/prev need to be Pointer(T) or T.

Contributor Author

Sorry for the late reply; I was busy today. In my opinion, this change is a bit worse than mine. Listify.prepend(list, node) makes the code read like C, not Crystal. Although I currently use node.append_to pointerof(list) in the code, that implementation can easily be replaced by list.append node.self_pointer. list.append and list.shift would be better than Listify.append and Listify.shift.

Besides, the each operation should not hide the pointer. If you hide the pointer, the object you get is a copy of the object the pointer points to, so any changes will not apply to the original object. This would be a problem.

Including a module is good, but I don't want to restrict the name of the member variable in the structs/classes that use the list. (If possible, they may need lists under different names for future refactoring/optimization.)
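A small demonstration of the copy concern above, using a hypothetical Counter struct: when iteration yields struct values instead of pointers, the block mutates copies and the stored elements are left untouched.

```crystal
struct Counter
  property hits = 0
end

counters = StaticArray(Counter, 2).new(Counter.new)

# Yields copies: the stored structs stay at 0.
counters.each { |c| c.hits += 1 }
p counters.map(&.hits) # => StaticArray[0, 0]

# Working through pointers reaches the stored structs.
counters.size.times do |i|
  ptr = counters.to_unsafe + i # pointer into the array's storage
  updated = ptr.value
  updated.hits += 1
  ptr.value = updated          # write the modified copy back
end
p counters.map(&.hits) # => StaticArray[1, 1]
```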

Member

I acknowledge that the loop in my gist should return pointers.

But I would argue that, for me, the API you are proposing:

  1. seems more similar to what I would find/write in C.
  2. has more features than needed, considering it will be hidden inside the std-lib:
    a. it is only used for structs, but classes are supported
    b. it allows elements to exist in multiple lists at the same time (by passing the member), but at the cost of adding repetition in callers.

I still think an API like the one proposed in my gist is simpler and can do the job.

I would let others from the core team weigh in here.

Member

Look at the way Thread::LinkedList does it; I'm pretty sure you can generalise that to work on raw pointers. Then you don't need init_empty_list either: just store the head and tail and use null pointers to signify the start/end of the list.

@firejox force-pushed the channel-refactor-and-optimize branch from b5fffa6 to 288a200 on October 20, 2019 09:50
@firejox (Contributor Author) commented Oct 20, 2019

Updates:

  1. add specs for Crystal::StaticList
  2. add specs for select mixing receive and receive?

@bcardiff (Member) left a comment

I left a couple more comments.

FYI, I am waiting on this PR before asking for a rebase of #8006.

Thanks for all the effort @firejox


@firejox force-pushed the channel-refactor-and-optimize branch from 288a200 to 031c4ca on October 22, 2019 11:53
@asterite (Member) left a comment

I'd like to review this but I currently don't have much time, plus concurrency internals is not my strong point (at the moment).

@bcardiff We can wait for a review from at least one other core-team member, but if you think this is good (code is readable, does the same thing, doesn't have hacks that later on we won't understand unless we add more comments or docs, etc.) then I'd merge this.

@RX14 (Member) left a comment

The lists are going to need a better API. I can barely follow the specs, and it seems quite easy to slip up using them.

end
end

describe "Crysta::StaticList" do
Member

Crystal.

Member

This typo can be avoided by removing the quotes.
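That is, passing the constant itself so the compiler resolves it (a one-line sketch):

```crystal
require "spec"

describe Crystal::StaticList do
  # A misspelled constant such as Crysta::StaticList would now fail to compile
  # instead of silently becoming the spec's description string.
end
```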

src/channel.cr Outdated
# Because channel#close may clean up a long list, `select_context.try_trigger` may
# be called after the select return. In order to prevent invalid address access,
# the state is allocated in the heap.
state_ptr = Pointer(Atomic(SelectState)).malloc(1, Atomic(SelectState).new(SelectState::Active))
Member

You can probably just do Pointer.malloc here; the compiler will infer the type from the Atomic(SelectState).new(SelectState::Active).
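A small illustration of the suggestion (the SelectState enum here is a stand-in so the snippet is self-contained): Pointer.malloc can infer its type parameter from the initial value.

```crystal
# Stand-in for the PR's enum, just to make the example compile on its own.
enum SelectState
  Active
  Done
end

# Explicit form, as in the diff above:
explicit = Pointer(Atomic(SelectState)).malloc(1, Atomic(SelectState).new(SelectState::Active))

# Inferred form; the result is still a Pointer(Atomic(SelectState)).
inferred = Pointer.malloc(1, Atomic(SelectState).new(SelectState::Active))
```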


@firejox force-pushed the channel-refactor-and-optimize branch 2 times, most recently from 432003f to 8916ea6 on November 1, 2019 07:29
@firejox requested a review from RX14 on November 1, 2019 10:12
@RX14 (Member) commented Nov 1, 2019

I still don't understand the reason why you need container_of. Could you explain it for me?

@RX14 (Member) commented Nov 1, 2019

Like, why not just make your list generic and assume that T already has the next and prev pointers, just like Thread::LinkedList does? You don't have to do all this awful macro offsetof casting; the types are just... correct.

Thread::LinkedList already works with Reference types; you could rename Thread::LinkedList and specialize that type using macros to also work with Value. Just look at the way Atomic(T) changes its implementation based on T.
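For illustration only, a rough sketch of that direction (IntrusiveList and Waiter are made-up names; this is not Thread::LinkedList's actual code): a generic list that simply assumes T exposes next/previous accessors, so no offsetof or casting is needed.

```crystal
# Hypothetical node type: it carries its own links, Thread::LinkedList-style.
class Waiter
  property next : Waiter?
  property previous : Waiter?
  getter value : Int32

  def initialize(@value)
  end
end

# Hypothetical generic list: it assumes T has `next`/`previous` accessors.
class IntrusiveList(T)
  @head : T?
  @tail : T?

  # O(1) append; the links live inside the node itself, so nothing is allocated here.
  def push(node : T) : Nil
    node.previous = @tail
    node.next = nil
    if tail = @tail
      tail.next = node
    else
      @head = node
    end
    @tail = node
  end

  def each
    node = @head
    while node
      yield node
      node = node.next
    end
  end
end

list = IntrusiveList(Waiter).new
list.push(Waiter.new(1))
list.push(Waiter.new(2))
list.each { |w| puts w.value } # prints 1 then 2
```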

@firejox force-pushed the channel-refactor-and-optimize branch from 8916ea6 to b540ce5 on November 1, 2019 14:55
@firejox (Contributor Author) commented Nov 1, 2019

@RX14 I want to keep some flexibility, so I use container_of and do not assume that T has next and prev pointers. It is all for flexibility.

You cannot do it the way Atomic(T) does; I tried before I made this PR. The main problem shows up in the design of the push operation: it needs to accept a Pointer type, not a Value type.

Now I have made the code generic. It seems that my implementation is not widely accepted by the core team.

@asterite (Member) commented Nov 1, 2019

It seems that my implementation is not widely accepted by the core team

I personally like the performance improvements but I think the code becomes much harder to follow, with many low-level tricks. My fear is that in a couple of months we won't be able to understand how it all works.

I wish for the performance improvements to happen without introducing that much of a code change.

@RX14 (Member) commented Nov 8, 2019

I want to keep some flexibility, so I use container_of and do not assume that T has next and prev pointers. It is all for flexibility.

This flexibility - when the type will never be used outside of the stdlib - is not worth the drop in readability and the complication of the list API.

I personally like the performance improvements but I think the code becomes much harder to follow, with many low-level tricks. My fear is that in a couple of months we won't be able to understand how it all works.

I think the only really complicated tricks are in the list implementation. If that gets fixed, then I'm perfectly happy to merge this, pending @bcardiff's input.

@bcardiff (Member)

@firejox I made another pass over this PR.

I applied a couple of refactors: a linked list written specifically for this use case, and moving the shared state to a class instead of a manual Pointer.malloc. These changes are in their own commits at https://github.com/bcardiff/crystal/tree/refactor/linked-list

The underlying behavior is the same.

Then I ran the benchmarks you posted in #8322 (comment), with -Dpreview_mt and without (note that not all of them in your original posting use -Dpreview_mt), and I am not able to reproduce the improvements locally for channel-close-bench.cr.

| bench | base | firejox | bcardiff |
| --- | --- | --- | --- |
| MT channel-close-bench.cr | 00:10:42.358743895 | 00:13:56.305350595 | 00:14:03.869554707 |
| MT channel-select-unwait-bench.cr | 00:00:00.686825065 | 00:00:00.635335868 | 00:00:00.684841458 |
| MT channel-ping-pong-bench.cr | 00:00:07.174520691 | 00:00:05.834123856 | 00:00:05.903892141 |
| ST channel-close-bench.cr | 00:03:55.545640555 | 00:03:26.212166600 | 00:03:20.613478228 |
| ST channel-select-unwait-bench.cr | 00:00:00.216716945 | 00:00:00.048762337 | 00:00:00.051219858 |
| ST channel-ping-pong-bench.cr | 00:00:02.652404693 | 00:00:02.439848184 | 00:00:02.458077994 |

Which platform/hardware are you using?
Mine is Darwin Kernel Version 18.6.0 MacBookPro14,3 Intel Core i7 2.9 GHz

Given the refactor and the fix for the potential invalid memory access, I don't mind the slowdown in channel-close-bench.cr.

Do you want to add those commits here? Or we can open another PR based on this one and pick it up from there. As you prefer. Again, thanks for all the effort.

@firejox (Contributor Author) commented Nov 12, 2019

@bcardiff Interesting. I don't know why it does not show the improvement there; it does show a difference on my computer.

Which platform/hardware are you using?
Mine is Darwin Kernel Version 18.6.0 MacBookPro14,3 Intel Core i7 2.9 GHz

My platform is Linux 5.1.7-arch1-1, x86_64, AMD Athlon X2 II 270 3.4 GHz

Do you want to add those commits here? Or we can open another PR based on this one and pick it up from there. As you prefer. Again, thanks for all the effort.

The changes are great. It is more readable. I will merge them later. Thank you.

EDIT: Sorry, I didn't notice that there are two commits; I only looked at the last one before. The other commit can be improved: because the implementation has changed, there is no need to use append_to. A simple swap achieves the same result.

I will pick the last one and make some changes to the other one.

@bcardiff (Member)

Sure thing @firejox, thanks for continuing to iterate.

@firejox force-pushed the channel-refactor-and-optimize branch from b540ce5 to ba92974 on November 13, 2019 04:44
@firejox force-pushed the channel-refactor-and-optimize branch from ba92974 to fdb3167 on November 13, 2019 16:42
firejox and others added 2 commits November 14, 2019 12:43
Several changes are applied:
1. use internal storage list to reduce heap allocation
2. move cleaning operations out of the lock section during close
3. remove redundant lock by making internal send/receive return state
4. because of 3, non blocking select can be determined by state
5. the wait/unwait of select action can be done in O(1)
@firejox force-pushed the channel-refactor-and-optimize branch from fdb3167 to ba33bee on November 14, 2019 04:43
@bcardiff (Member) left a comment

🎉 I have no more feedback. LGTM!

@bcardiff added this to the 0.32.0 milestone on Nov 14, 2019
@bcardiff merged commit 7b46a78 into crystal-lang:master on Nov 20, 2019
Comment on lines +79 to +81
def compare_and_set(cmp : SelectState, new : SelectState) : {SelectState, Bool}
@state.compare_and_set(SelectState::Active, SelectState::Done)
end
@Sija (Contributor) commented Nov 21, 2019

Hmm, it seems these arguments (cmp and new) aren't used anywhere... 🤔
Shouldn't the method body be @state.compare_and_set(cmp, new)?
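For reference, the suggested fix is simply to forward the arguments; a sketch, assuming the surrounding wrapper type from the PR:

```crystal
def compare_and_set(cmp : SelectState, new : SelectState) : {SelectState, Bool}
  # Forward the caller's values instead of hard-coding Active -> Done.
  @state.compare_and_set(cmp, new)
end
```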

Member

You're right. Fortunately, it is only ever called with those values 🙈.
