Proper support for distributed computing, parallelism and concurrency #1868
Comments
Wow, you should write a full master's thesis around this topic one day ;)
I already did, many years ago (wow, time flies) - but on a (very) different topic 😉.
I'd say let's keep things simple and actually rather provide no locks, no semaphores etc., to "force" programmers to stick with the builtin parallelism. Of course, there will always be the
There shall be just one construct - imagine
There is no need for this - if one wanted to offload stuff, she would just use the
Actually rather not. There shall be no need for this - that's the responsibility of the programmer who will define her own cooperative/parallel scheduler (as I outlined in my first post). This might be a simple scheduler, but also - as you pointed out - something super complex adding tens/hundreds of new nodes, many new multiplexers and demultiplexers to the huge graph of go/v routines with queues scaling over heterogeneous clusters (for a moderately complex scheduler see internals of Python Dask). So this can't be part of V.
@dumblob I see the "go" keyword as a builtin for multi-threading only (lightweight processes) - V could however try to load-balance the threads automatically among available CPU cores? Fully agree that higher layers of parallelism should be spun off to the standard libs (or even separate lib projects written in V)
My proposal is about changing it to similar semantics as in Go lang - i.e. the
If you refer to operating system threads, then there is no need for V to load-balance them, as the operating system does it. If you refer to go/v routines, then sure, V will need a scheduler which shall be aware of the number of physical CPU cores available (disregarding whether they're tiny ones or fast ones, as e.g. in big.LITTLE), spawn this number of threads, and schedule go/v routines among them.
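To make that concrete, here is a minimal Go sketch of the shape such a scheduler takes - one worker per core reported by the OS, with work multiplexed onto that fixed set of workers over a queue. This only illustrates the model being discussed, not proposed V code; sizing and names are arbitrary:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	n := runtime.NumCPU() // logical cores reported by the OS
	jobs := make(chan int, 64)
	var wg sync.WaitGroup

	// Spawn exactly one worker per core; "jobs" stand in for
	// go/v routines multiplexed onto this fixed set of workers.
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := range jobs {
				fmt.Printf("worker %d processed job %d\n", id, j)
			}
		}(w)
	}

	for j := 0; j < 100; j++ {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}
```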
As @aprell pointed out in nim-lang/RFCs#160 (comment), there is a nice and very current write-up describing the golang scheduler. It's a must-read for those trying to implement an efficient scheduler for the mix of parallelism & concurrency. Some downsides of the golang scheduler are the following:
Otherwise golang implements pretty closely what I described in my initial post above - so it would make sense to "port" all the parallelism & concurrency logic (which is not exactly trivial code) to V and start improving on top of it.
Thanks a lot @dumblob This is the last unresolved thing in V for me, so all this info is very helpful. I'll start working on this in early October, but if anyone is willing to do this earlier, they are more than welcome.
I hope you mean with: For me these are two distinct use cases, where in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second case I want control over which parts of code have higher priority to be executed (co-routines), even if I do not know a priori how many CPU cycles or how much wait-time this particular instruction block will use.

I hope one day we can state about V as follows (taken from the Pony project): "It's data-race free. V doesn't have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong. It's deadlock free. This one is easy, because V has no locks at all! So they definitely don't deadlock, because they don't exist."

That's all from my side for now.
Sure, I do.
I don't understand why these 2 should be 2 different use cases. For me it's definitely just one use case - I want things to run purely in parallel (concurrency won't help me at all). But because parallelism is super expensive (memory consumption as well as context switching of operating system threads doesn't scale into the millions), I have to combine it with concurrency, leveraging all the extra knowledge I have from the programmer (unlike an operating system kernel, which doesn't know what the programmer intended and thus has to waste memory and context switches on fully featured operating system threads) to make go/v routines (i.e. concurrent threads) behave like truly parallel operating system threads - thus combining the preemptiveness of operating system threads with the memory and context-switching efficiency of go/v routines. I really don't see any point in having concurrency if we have true parallelism with the same amount of resources and comfort. Mind that my proposal gives one full power over scheduling by exposing the scheduler to the programmer.
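The cost asymmetry is easy to demonstrate: in Go, whose goroutines are exactly the kind of cheap user-space thread meant here, spawning a million concurrent routines is unremarkable, while a million OS threads would exhaust memory. A small illustrative sketch:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 1_000_000 // one million lightweight routines
	var wg sync.WaitGroup
	results := make(chan int, 1024)

	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(x int) { // a few KB of stack each, grown on demand
			defer wg.Done()
			results <- x * 2
		}(i)
	}

	go func() { wg.Wait(); close(results) }()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("sum:", sum)
}
```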
I'm pretty sure this will not be possible (see the latest research). At least not with the current V semantics. Such guarantees can be achieved by constraining the programmer (see e.g. Rust, as already mentioned many times here in V's discussions) or by making the language so incapable that it gets very impractical 😢. What can be achieved in V, though, is to make races highly improbable (the syntax & semantics will guide the programmer into writing more or less race-free code) and on top of that provide e.g. a race analyzer as golang does (and run it automatically as part of compilation?). This is the difference between a guarantee and a high probability, and it's enormous. Besides, if one utterly needs plain concurrency, it's easy to do it yourself (there are many implementations on GitHub, as it's really an easy high-school exercise). Even on embedded systems I always try to use one big event loop first, and if that's not enough, I resort directly to interrupts (which are costly, but provide true preemptiveness) instead of using any coroutines, because they won't bring me anything more than an event loop and they're about as complex as true interrupts with preemptiveness.
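For reference, this is the kind of bug such a race analyzer catches dynamically (Go's is enabled with the `-race` flag, e.g. `go run -race main.go`) rather than excluding at compile time; a minimal example:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter++ // unsynchronized read-modify-write: a data race
		}()
	}
	wg.Wait()
	fmt.Println(counter) // without -race it silently prints a wrong count
}
```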
upvote for parallelism + SIMD vectorization as requested here #1079 |
@medvednikov some implementation ideas:
@medvednikov a possible implementation of highly efficient preemptive scheduling:
Note also that we can easily analyze at compile time the maximum stack size needed for each V parallel routine, so we don't need dynamic stacks (see also the cactus stack problem and an innovative solution mratsim/weave#24 - worth reading!), but rather a fixed-size array of a different size (known at compile time) for each V parallel routine. This saves a lot of trouble and allows for the best possible performance (yeah, for indirect & direct recursion, for FFI, and for
There seems to be an implementation effort with very similar goals (maybe except for the reentrancy, which I'd definitely like to see in V as mentioned above) and with an extensive survey of existing implementations (Go is mentioned as an interesting example as well).
Thanks @dumblob I'll check these out. |
@dumblob Could you please take a look at Chapel, a parallel programming language (https://chapel-lang.org/ https://github.com/chapel-lang/) designed by CRAY engineers? Maybe you can borrow some design ideas from Chapel.
@cristian-ilies-vasile I already did that. Chapel has a large set of very nice features (e.g. regions, multi-paradigm, etc.), but first, they don't seem to match V's goals (especially "there is only one way to do a thing"), and second, they're not generic enough (read: "Chapel is a pile of best practices & optimized principles for a - moderately large - selected finite number of use cases"). Due to this, Chapel programs do squeeze the maximum out of a given supercomputer, but can't be run efficiently enough on non-supercomputer hardware (which is the majority nowadays - see the table in the initial post above) or on the hardware of tomorrow (yeah, stuff like several layers of remote memories with different access speeds instead of just

I have to admit I couldn't find nearly anything in Chapel useful for V (the above proposal is generic enough to support quite efficiently nearly any of the Chapel abstractions implemented later in specific V libraries - note especially the scheduler "callback"). If you have anything specific in mind, feel free to discuss it here though. I'll be more than happy to learn something or clarify or modify the proposal.
@dumblob Thank you for the nice words.
//1 - framework to eliminate bugs
//2 - Function as a Service at run-time
[1] https://songlh.github.io/paper/go-study.pdf
Regards,
@cristian-ilies-vasile YES to //1 but NOOOO to //2 - V shall have no "runtime" or even VM in my opinion - you are free to build your own VM using V though
@gslicer I feel that you did not get it. It's not like Java and the JVM; my suggestion is related to one specific case: the coder could decide that a long-running function would run automatically on a different server/environment. The V ecosystem should provide tools and mechanisms for such an approach in the year 2020. Think about C, it is still used after 50 years... so I presume that V's lifetime will be the same. In this case let's prepare the foundations to sustain this type of weight. Regards,
Well, my proposal diverges from Go lang in that go routines won't be cooperatively concurrent but preemptively parallel. This is more in line with the "standard threading" experience one gets on Windows or Unix (pthreads and alike) than with Go lang, so one can reuse the knowledge, tooling etc. to reason about the program more easily than one can about Go programs. My proposal goes further though - I envision builtin support (e.g. a dedicated syntax or even a new keyword) for "reentrancy" (a scheduler can spawn an arbitrary number of instances of a go routine at any time and can kill and remove an arbitrary number of them, also at any time). Second, the programmer shall get full power over scheduling (this might be achieved by many means, so I don't have any particular opinion on implementation - imagine e.g. the most error-prone but widely known:
This sounds somewhat similar to the reentrancy & scheduler "callback" I'm proposing, but my proposal is about the primitives needed (the language syntax & semantics) rather than about any built-in "high level" stuff like VMs. My goal is to offer the very minimum set of primitives to make it possible to efficiently implement something like you described in custom libraries, but not as part of V. |
@dumblob Related to |
@cristian-ilies-vasile sounds like something easily doable in the app itself - in other words, I don't see any obstacles in my proposal which would prevent implementing such TTL functionality efficiently in your app without any cooperation from V's side. It also doesn't sound like a default behavior of V's parallelism, so I currently don't see any reason to add explicit support directly in V. Or am I missing something? Please elaborate 😉.
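To illustrate "doable in the app itself": in Go the whole per-task TTL pattern is a few lines of standard library, with no language support at all. A sketch of the pattern (not a proposed V API):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func slowTask(ctx context.Context) (string, error) {
	select {
	case <-time.After(2 * time.Second): // simulated long computation
		return "done", nil
	case <-ctx.Done():
		return "", ctx.Err() // TTL expired: context.DeadlineExceeded
	}
}

func main() {
	// Give the task a 500 ms time-to-live.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if res, err := slowTask(ctx); err != nil {
		fmt.Println("task killed by TTL:", err)
	} else {
		fmt.Println(res)
	}
}
```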
@dumblob Indeed, one can implement Timing Wheels logic on their own, but wouldn't it be more helpful for coders to use a rock-solid language abstraction with TTL included out of the box? Regards, P.S.
This seems more needed for purely non-preemptive scheduling. Otherwise it seems to defeat the whole point of determinism, especially in latency-critical situations like games, etc. - an area where Go has historically had (and still has) issues. Btw. Weave doesn't have this problem, and if fully strict ordering of processing requests is needed, then Weave even offers a "gather it all in one place" queue for such scenarios. I envision something similar in V. Btw. I'm slowly reading all the newly referenced materials and will hopefully get to this THE Fuzzer of Fuzzy Latency for User Space Threads soon.
@dumblob
// 8 |
I found a paper where the influence of processor frequency on task completion times is studied. Based on this, one can think of a mechanism for allocating threads, including "overloading" at the processor level and dispatching more user-space threads on fast CPUs. Paper: Fork/Wait and Multicore Frequency Scaling: a Generational Clash
@dumblob Well, it's not so easy. If inside a `go fn()` there is a piece of code using recursive calls to functions, the static measurement will fail.
@dumblob
Answering in a hurry, bear with me.
I'm not sure I understand, but it sounds like you're asking about the basic idea behind work-stealing 😉. If so, then my answer is: "See any good work-stealing implementation (the best I've seen is in Weave; another good implementation is the Go lang scheduler)." Nonetheless, there is no work-stealing implementation I know of which handles dynamic removal and addition of physical CPU cores efficiently (i.e. changing N dynamically at runtime - e.g. in response to power saving). I envision the work-stealing scheduler should spawn OS threads or stop (join) running OS threads as a response (but first stop stealing work in the designated OS threads - e.g. those having the highest identification number /because this condition is detectable in each OS thread separately without stop-the-world/ or those having the shortest queues - and wait for their work queues to empty). Btw. https://github.com/krinkinmu/userspace-threads/ (pure preemptive scheduling in user space) could be merged with code from Weave (pure non-preemptive scheduling in user space) and we should be good to go. Actually, rather incorporate preemptiveness (inspired by https://github.com/krinkinmu/userspace-threads/) into Weave, as Weave is formally verified to be correct and at the same time it's already quite polished and, as you wrote, hella fast (it scales nearly linearly with the number of physical cores! There is no other such capable library in the world I'd know of!).
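For readers new to the idea, here is a toy Go sketch of the work-stealing core loop. It is heavily simplified - real implementations like Weave or the Go runtime use lock-free deques, randomized victim selection, steal-half policies and thread parking instead of spinning - but it shows the basic mechanics:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type Task func()

// worker drains its own queue first and steals from the others when idle.
func worker(id int, queues []chan Task, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case t, ok := <-queues[id]: // prefer local work
			if !ok {
				return // own queue closed and drained: shut down
			}
			t()
		default:
			stolen := false
			for v := range queues { // pick a victim (real schedulers randomize)
				if v == id {
					continue
				}
				select {
				case t, ok := <-queues[v]:
					if ok {
						t()
						stolen = true
					}
				default:
				}
				if stolen {
					break
				}
			}
			if !stolen {
				runtime.Gosched() // nothing anywhere: yield (a real scheduler would park)
			}
		}
	}
}

func main() {
	const nWorkers = 4
	queues := make([]chan Task, nWorkers)
	for i := range queues {
		queues[i] = make(chan Task, 128)
	}

	// Deliberately imbalanced load: all tasks land on queue 0,
	// so the other workers only make progress by stealing.
	var done sync.WaitGroup
	done.Add(100)
	for j := 0; j < 100; j++ {
		j := j
		queues[0] <- func() { defer done.Done(); fmt.Println("ran task", j) }
	}

	var wg sync.WaitGroup
	wg.Add(nWorkers)
	for i := 0; i < nWorkers; i++ {
		go worker(i, queues, &wg)
	}

	done.Wait() // all 100 tasks finished
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```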
@dumblob wrote:
[…]
As I mentioned in my post in the related issues thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?
Contention leads to a hornet's nest of Heisenbugs. Lockfree is the only way to go, please. Lift the semantics out of the low-level gutter. Or I mean at least strongly discourage the use of low-level thread synchronization primitives.
Reentrancy is a flag a go routine gets from the programmer, serving as a covenant with the meaning: "I, the programmer, hereby certify that the following go routine can be spawned 1+ times at any time, and any running instance, if it's not the last one, can be stopped and discarded at any time." It's just an optimization giving the scheduler a huge amount of freedom to make scheduling fully dynamic in time (i.e. spawning an "unlimited" number of instances of the same go routine, or stopping/discarding running instances of a go routine if throughput is no longer a bottleneck) and in general more efficient.
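A rough Go sketch of what this covenant buys the scheduler: because the consumer below keeps no instance-local state across items, a monitor may spawn additional copies under back-pressure (and could retire them via the quit channel). All names are hypothetical, and a real scheduler would also cap and retire instances rather than only spawning:

```go
package main

import (
	"fmt"
	"time"
)

// consumer is "reentrant": any number of copies may run concurrently.
func consumer(in <-chan int, out chan<- int, quit <-chan struct{}) {
	for {
		select {
		case <-quit: // scheduler retires this instance
			return
		case v, ok := <-in:
			if !ok {
				return
			}
			time.Sleep(time.Millisecond) // simulated work
			out <- v * 2
		}
	}
}

func main() {
	in := make(chan int, 64)
	out := make(chan int, 64)
	quit := make(chan struct{}) // unused here; shows the retire path

	go consumer(in, out, quit) // the one guaranteed instance

	// Crude stand-in for the scheduler's monitor thread.
	go func() {
		for range time.Tick(10 * time.Millisecond) {
			if len(in) > cap(in)/2 { // sustained back-pressure
				go consumer(in, out, quit) // spawn another instance
			}
		}
	}()

	go func() {
		for i := 0; i < 1000; i++ {
			in <- i
		}
		close(in)
	}()

	for i := 0; i < 1000; i++ {
		<-out
	}
	fmt.Println("all items processed")
}
```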
Hm, I hope you don't confuse lock-free and lockless. The proposal above is fully preemptive, and thus what the language (V) itself offers is always lock-free. The programmer though can freely build lock-free as well as non-lock-free algorithms (V has

Note though that I'm not sure V is already in a state where this lock-free guarantee holds in all cases - V is still alpha software, so keep that in mind.

But if you meant lockless, then I'd disagree. Please google for lockless variants of efficient lock-based algorithms and you'll be surprised that lockless is only sometimes more efficient in practice (on top of this, lockless quite often suffers from other symptoms). It's by far not as easy as saying "lockless is best"...
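For the terminology-inclined, here are the two flavors side by side in Go - a mutex-protected counter versus a lock-free atomic one. Which is faster depends on contention and hardware, which is exactly why "lockless is best" oversimplifies:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const n = 1000

	var mu sync.Mutex
	locked := 0        // guarded by mu
	var lockfree int64 // updated with a single hardware atomic add

	var wg sync.WaitGroup
	wg.Add(2 * n)
	for i := 0; i < n; i++ {
		go func() { // lock-based variant
			defer wg.Done()
			mu.Lock()
			locked++
			mu.Unlock()
		}()
		go func() { // lock-free variant
			defer wg.Done()
			atomic.AddInt64(&lockfree, 1)
		}()
	}
	wg.Wait()
	fmt.Println(locked, lockfree) // both print 1000
}
```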
That's what we also advocate for - mainly through
Oh if it is optional then that seems more acceptable. The programmer could set the flag if for example they don’t mutate any resources which are shared between all copies of the thread. It is very difficult otherwise to prove compliance with the invariant — Rust’s exclusive write ownership is another means to prove reentrancy but I personally find that to be an unpleasant paradigm to code in. And I explained why I think Pony’s model is more flexible than Rust’s.
That seems reasonable to me iff (if and only if) thread synchronization primitives will only be accessible from
There seems to be no salient distinction between the two when I distill their definitions down, because afaics in theory any code which employs locks will fail to meet the stated requirement of lock-free: https://stackoverflow.com/questions/20467247/whats-the-difference-between-lockless-and-lockfree Seems someone has in their mind that the terms should be applied in different contexts? So per my edit to my post in the #3814 thread, I understand now that Dmitry was referring to the scheduler not being lock-free, which is not what I was concerned with. I am referring to the APIs that the safety-conscious Vlang programmer has access to, not the implementation of the runtime or
I’m already aware of that. I mean to suggest avoiding the need to lock by avoiding contention, e.g. by only allowing access to immutable data when accessing shared memory, for code that is not marked

This is a facet of Go that really disappoints me. Unfortunately there is no construct in the Go language for declaring and enforcing data to be immutable. Note, as you may know, immutable data structures have the advantage that they can’t contain circular references, which is important if employing ARC instead of a tracing GC, because ARC can’t collect circular references (although there exist probabilistic designs which attempt to do some localized tracing along with ARC).

One possible model for mutating shared immutable data is for the exclusive owner to create a new mutated, immutable copy (perhaps via an efficient data structure that only copies the changed part, which afaik is what Clojure employs), then send a message to threads to tell them to update the copy of the immutable data they are referencing.

As I think you all know (because I’ve seen it referenced already), the Pony language has a reference-capabilities model for tracking which resources can be consumed to an immutable and then shared between threads. And there are other possible models, such as Arena/Actor-like partitions, which I have contemplated. I had posited in 2018 that immutable shared data will scale better to the inevitable future massively multicore shift away from universal hardware cache coherency, because the cache invalidation can be unified with the message between threads to update their copy.

Of course there will still be locking issues at some higher-level semantics layer, such as shared access to a shared database from concurrent threads, but afaics that is out-of-scope of Vlang’s design concerns.

You do seem very knowledgeable about concurrency, and more so than myself. I found myself upvoting most of your comment posts. Thanks.
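A compact Go sketch of that publish-a-new-copy model. Note Go cannot enforce the immutability - which is exactly the complaint above - so here the convention is carried by comments only:

```go
package main

import "fmt"

type Config struct { // treated as immutable after construction
	Workers int
	Name    string
}

func reader(id int, updates <-chan *Config, done chan<- struct{}) {
	current := <-updates // initial snapshot
	for c := range updates {
		current = c // swap the reference; the old copy becomes garbage
	}
	fmt.Printf("reader %d final config: %+v\n", id, *current)
	done <- struct{}{}
}

func main() {
	const readers = 3
	chans := make([]chan *Config, readers)
	done := make(chan struct{})
	for i := range chans {
		chans[i] = make(chan *Config, 4)
		go reader(i, chans[i], done)
	}

	cfg := &Config{Workers: 2, Name: "v0"}
	for _, ch := range chans {
		ch <- cfg
	}

	// The exclusive owner "mutates" by copying, never in place,
	// then messages the new snapshot to every reader.
	next := *cfg
	next.Workers = 8
	next.Name = "v1"
	for _, ch := range chans {
		ch <- &next
		close(ch)
	}

	for i := 0; i < readers; i++ {
		<-done
	}
}
```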
For those interested, I posted a report "On the way to an ultra low overhead equal-priority hard real time preemptive scheduler" in the V Concurrency Considerations thread.
Just an example for eBPF non-believers 😉 of how the authors of picoev (which is bundled with V) implemented eBPF in their h2o HTTP/1,2,3 server (btw. the server itself seems to be a pretty good piece of SW despite still being experimental; it's also not bulky 😮 but quite the opposite).
@UweKrueger @medvednikov @spytheman @cristian-ilies-vasile @JalonSolov and others - feel free to read the new mini-series about current state-of-the-art memory models (which exist only because of multicore architectures sharing the same physical memory) and their major flaws (led by the fact that they're mostly not formally defined 😮). They're written by the Go lang lead dev - Russ Cox (
For V I think this is the most important point of the series:
There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.
Note, I didn't want to write this to #561 as the scope I'm describing is broader.
I'm confident every new language must provide first-class support for both parallelism and concurrency: concurrency to allow one physical CPU thread to do cooperative scheduling, and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what Go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads), except that it always spawns just one go routine, disregarding the number of physical CPU cores (which is where V should improve on it - see below).
Why both concurrency and parallelism and not either of them?
In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close-vicinity networks versus optical fiber networks, etc., it's inevitable to dynamically schedule computation directly at runtime in each application (not just operating-system-wide, which is too coarse and thus inefficient, and not just cooperatively, which would use just one physical CPU core per application and makes sense only for power-saving scenarios, but nowhere else).
We have instruction-level parallelism covered (it's not yet present in V - see e.g. my rant about `restrict` - but it's a "solved problem" in today's compilers). The rest of the parallelism story is just about "how close can we get to pure dataflow programming", which appears to be kind of difficult. Because instruction-level parallelism is solved, the best approach nowadays seems to be to:
write as much serial code as can be highly optimized by instruction-level parallelism at once in one go/v routine (typically a tight loop, one tree of nested tight loops, a slow I/O, offloading to GPU/FPGA, or similar) - the outermost construct of this go routine will always be an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations
then schedule these go/v routines in a cooperative manner on one physical CPU core, with a fixed-size queue (single-producer-single-consumer, "SPSC" - for performant implementations see the discussion below) between these go/v routines (this queue is called a channel in Go lang); the queue size might be between the size of L1 and 2*L2 cache, as suggested in paragraph 3.3 of Analyzing Efficient Stream Processing on Modern Hardware.
and if any go/v routine becomes a bottleneck during computation (i.e. the consumer is slower than the producer for some time - this can easily be monitored by the cooperative/parallel scheduler via the metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core), including all the plumbing work (e.g. spawning an additional go routine acting as a multiplexer, taking the producer's outgoing data and multiplexing it to the original as well as the new instance of the consumer; and spawning a demultiplexer go routine to join the outgoing data from both consumers) - all of this the cooperative/parallel scheduler would do (see the sketch after this list)
and if the queues become almost empty for some time, remove the multiplexers and demultiplexers.
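Here is the promised toy sketch of this elastic pipeline in Go (illustrative only: the mux/demux plumbing is collapsed into a single fan-in channel, ordering is not preserved, and the "scheduler" is a bare polling loop):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func producer(q chan<- int) {
	for i := 0; i < 500; i++ {
		q <- i
	}
	close(q)
}

// consumer is deliberately slow so the queue backs up.
func consumer(id int, q <-chan int, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for v := range q {
		time.Sleep(2 * time.Millisecond) // simulated work: the bottleneck
		out <- fmt.Sprintf("consumer %d handled %d", id, v)
	}
}

func main() {
	q := make(chan int, 64)      // the bounded SPSC-style queue
	out := make(chan string, 64) // fan-in stands in for the demultiplexer
	var wg sync.WaitGroup

	go producer(q)

	wg.Add(1)
	go consumer(1, q, out, &wg) // step 2: one cooperative instance

	// Step 3, minimally: a "scheduler" that watches queue fullness and
	// spawns a second consumer instance once under sustained pressure.
	go func() {
		for range time.Tick(5 * time.Millisecond) {
			if len(q) > 3*cap(q)/4 {
				wg.Add(1)
				go consumer(2, q, out, &wg)
				return
			}
		}
	}()

	go func() { wg.Wait(); close(out) }()

	n := 0
	for range out {
		n++ // NB: global ordering across instances is NOT guaranteed here
	}
	fmt.Println("items processed:", n)
}
```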
The cooperative/parallel scheduler referenced above shall be a user-definable routine (if not present, a default built-in scheduler working as described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and the corresponding queues between them, a pointer to the built-in scheduler (in case the user just wanted to wrap it), and a pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there etc.). See e.g. MultiQueues.
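Purely as a strawman for discussion, that scheduler "callback" could look something like the following Go-style interface. None of these names exist anywhere in V; the point is only to show what data the hook would receive:

```go
// Hypothetical API sketch only - nothing here exists in V.
package sched

import "time"

type RoutineID int
type ThreadHandle uintptr

// QueueStats describes one channel/queue between two go/v routines.
type QueueStats struct {
	Len, Cap int           // current fill level and capacity
	Stale    time.Duration // how long it has been near-full or near-empty
}

// State is what the scheduler hook receives on every invocation.
type State struct {
	Threads  []ThreadHandle           // OS threads currently in use
	Routines []RoutineID              // all live go/v routines
	Queues   map[RoutineID]QueueStats // queues feeding each routine
	UserData any                      // user-defined metadata/statistics
}

// Scheduler runs preemptively (e.g. in its own thread); it may spawn or
// kill reentrant routines, insert mux/demux nodes, or defer to Builtin.
type Scheduler interface {
	Schedule(s *State)
}

// Builtin is the default scheduler a custom one can wrap.
var Builtin Scheduler
```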
This way one can write an application once and deploy it to your grandma's wristwatch as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang, which supports even an online update of the whole application in memory while serving all clients under full load without interruption!). Also, as you may have noticed, Go lang does not scale, first because there is no spawning of redundant go routines (those which will be bottlenecks) to other physical CPU cores, and second because Go doesn't have a dynamic user-influenceable scheduler (taking into account advanced statistics, power consumption, etc.).
There is one catch though. If implemented naively, such elasticity doesn't guarantee global ordering among channels/queues (ordering works only inside one channel), and real-life scenarios in IT more often than not rely on ordering. This has to be accounted for and will require user intervention (i.e. it must be syntactically explicit): either in the form of knowledge of where to (a) insert "tagging" of items before they'll be spilled among different channels and (b) remove the "tagging" and use it to reassemble the original ordering; or in the form of assigning a strict priority to each channel; or using some other scheme.
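A small Go sketch of variant (a), the tagging scheme: items receive a sequence number before fan-out, and a reorder buffer restores global ordering at fan-in:

```go
package main

import "fmt"

type Tagged struct {
	Seq int
	Val string
}

// reorder emits tagged items in Seq order no matter how the parallel
// stage interleaved them.
func reorder(in <-chan Tagged, out chan<- string) {
	pending := map[int]string{}
	next := 0
	for t := range in {
		pending[t.Seq] = t.Val
		for v, ok := pending[next]; ok; v, ok = pending[next] {
			out <- v
			delete(pending, next)
			next++
		}
	}
	close(out)
}

func main() {
	in := make(chan Tagged)
	out := make(chan string)
	go reorder(in, out)

	// Simulate out-of-order arrival from two consumer instances.
	go func() {
		in <- Tagged{1, "b"}
		in <- Tagged{0, "a"}
		in <- Tagged{3, "d"}
		in <- Tagged{2, "c"}
		close(in)
	}()

	for v := range out {
		fmt.Println(v) // prints a, b, c, d in the original order
	}
}
```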
See also lock-free and wait-free data structures (state of the art as of now), which might be handy for an efficient implementation of parallelism.
IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving of concurrency and parallelism outlined above), implement the queue and a basic scheduler, and only then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.
Table of parallelism possibilities we have nowadays (just for reference; in practice these might overlap and/or be used simultaneously):
P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine-grained distributed computing". And if anything from the above proposal sounds crazy to you, then I can assure you that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first-class fine-grained distributed computing - Unison.
Other things to consider and/or track:
- `rand` (a7c8483#r39632493)
- `g_str_buf` with parallel access (`map` etc.)