Alternatives to Python #343
Comments
My opinion is that, because Python is a popular, powerful, and high-level language, teaching the bare minimum with it is time well spent. If we choose to iterate, revisiting the π calculator for performance comparisons with another language, having this as a baseline is valuable. Teaching programming languages is such an involved topic that each one would be a lesson unto itself.
All of the above languages have some notion of distributed memory parallelism. When teaching this, I would go through one in detail. As an exercise, I would ask participants to run one of the other languages and report the execution time. The current Python version of this embarrassingly parallel problem uses significantly more memory than is needed. Perhaps this can be adapted? Amdahl's law can be mentioned briefly, but it might fit better into the next section. Mentioning it may help people understand that not all problems are embarrassingly parallel and that they should do a scaling study to determine appropriate resources for their problem.
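For readers who want the Amdahl's law point made concrete, here is a minimal sketch. It is not part of the lesson; the function name `amdahl_speedup` and the 90% parallel fraction are illustrative assumptions. It evaluates the usual form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of runtime that parallelizes and n is the number of workers.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; the parallel part is divided among workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical code that is 90% parallel: the speedup flattens out
    # well below the worker count and can never exceed 1 / 0.1 = 10.
    for workers in (1, 2, 4, 8, 16, 64):
        print(f"{workers:3d} workers: {amdahl_speedup(0.9, workers):5.2f}x")
```

A measured scaling study replaces the assumed fraction with timings from real runs, which is what tells a learner how many cores are worth requesting.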
This comment is moved here from the PR, where it doesn't really belong -- sorry for the extra e-mail traffic! WRT this issue more broadly, I fear that the introduction of parallel examples in additional languages, while interesting and valuable, detracts from the focus of HPC Intro as a whole, which is primarily about how to execute code in a cluster environment with shared resources. The primary reason for including parallel code at all (which, if I recall, was not uncontested) is that access to parallel resources is one of the principal features of cluster HPC environments, and is a big part of the motivation for learners in undertaking this lesson in the first place. The absence of a parallel example undercuts this motivation. The choice of a Python example (over the previous C "cpi" example) was motivated primarily by the relative simplicity of the code, and the relatively high level of the abstractions used, with the hope that it would not be too much of a distraction from the focal point of the lesson. Parallel programming and parallel models (especially MPI and OpenMP) are obviously valuable to the community, but I remain concerned that taking learners down that particular rabbit-hole in this lesson will leave them confused about both operational issues with clusters and parallel programming. I would enthusiastically endorse a more separable module (or entire lesson?) on parallel programming specifically, possibly as a successor to both "parallel novice" and "hpc intro".
Python is readable, but without knowing other languages, codes written only in Python are quite slow; see for example:
It should also be noted that the current programming example can be improved: it scales poorly and uses more memory than needed to solve this particular problem. There are more effective methods of calculating pi, but if this one is to be used, it should also be well programmed. If demonstrating memory considerations and using arrays is important, then another example may be better, for example Game of Life or Vector difference described at https://www.software-carpentry.org/blog/2010/06/the-cowichan-problems.html
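To illustrate the memory remark, here is a minimal sketch of a Monte Carlo π estimator that keeps only a running count instead of storing every sample. It is not the lesson's code; the name `estimate_pi` and its `seed` parameter are hypothetical, and it uses only the standard library so that it runs anywhere.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from random points in the unit square, using O(1) memory."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x = rng.random()
        y = rng.random()
        # Count points that fall inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi / 4.
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    print(estimate_pi(200_000, seed=1))
```

A pure-Python loop like this is typically slower than a vectorized array version, so the trade-off is memory against speed; processing the samples in fixed-size array chunks would be one middle ground.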
What's the reward here for including other languages? As @reid-a said, it doesn't really further the teaching objectives of the HPC intro lesson, and it incurs a large maintenance burden. Introducing compiled languages also means introducing compilers, of which there are many and each system has good reason to recommend one or the other. Sticking to Python allows us to delay the introduction of that, which particularly helps when we know there are a lot of learners out there who will never compile code directly themselves. Perhaps the Python example could be improved, but I would do that to further the teaching goals, not to get a faster (but still poor) estimation of pi. One thing that we have right now is that each MPI process may be using the same random seed, to me this is a far more important issue than performance...and also an opportunity to teach something that the learners have a chance of grasping even at this stage. |
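The seeding concern can be shown without MPI. The sketch below is an illustration only, not the lesson's code: `rank_rng` and `simulate_ranks` are hypothetical helpers, and the rank is passed in as a plain integer, whereas in an mpi4py program it would come from `MPI.COMM_WORLD.Get_rank()`. Combining a shared base seed with the rank is a simple way to give each process its own stream; it is not a rigorous parallel random number generator, and NumPy's `SeedSequence.spawn` is one more robust option.

```python
import random


def rank_rng(base_seed: int, rank: int) -> random.Random:
    """Return a generator whose seed depends on both the run and the rank."""
    # A string seed is accepted by random.Random; each (base_seed, rank)
    # pair produces a different string and therefore a different seed.
    return random.Random(f"{base_seed}:{rank}")


def simulate_ranks(base_seed: int, size: int, n_draws: int = 3) -> list[list[float]]:
    """Mimic `size` processes, each drawing `n_draws` numbers from its own stream."""
    streams = []
    for rank in range(size):
        rng = rank_rng(base_seed, rank)
        streams.append([rng.random() for _ in range(n_draws)])
    return streams


if __name__ == "__main__":
    for rank, draws in enumerate(simulate_ranks(base_seed=42, size=4)):
        print(rank, draws)
```

If every rank instead used the same seed, every row would be identical, and adding processes would repeat the same samples rather than improve the estimate.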
It is good that we agree the Python example should be improved. Changing the seed may be a good thing to show - this can be done in a language-independent manner. However, parallel random number generation can be quite involved, see for example: There are also many people who will not need to or should not use Python for their computational problems. Forcing them to use it by not giving instructors some choice is poor pedagogy. Recognizing this, the Carpentries has typically tried to offer options in R and Python, though for HPC, R is not commonly used. I am happy to maintain other languages since I expect to also use other languages. The changes still allow instructors to use Python if they want to and it is most appropriate for the people they are teaching, but they allow for choice when Python is not most appropriate.
We need to keep our teaching goals in mind. There are a million details we would all like to teach the Learners about how to craft a better parallel program: better algorithm, smaller memory footprint, higher-performance compiled language, superior PRNGs, accelerator architectures, ... the list goes on forever. The challenge for us, as curriculum developers and teachers, is to meet our learners where they are and feed them knowledge at a rate they can process. The Carpentries does this by leading the learners through the material one keystroke at a time, in brief windows of time, which means that each learner keystroke should be executed in service of the learning goals. If a learner has never seen a programming language, I would absolutely not choose C or Fortran as their introduction. Even python-novice-... acknowledges this:
Point the Fourth is key here, even for HPC. This set of reasons is probably also why TensorFlow and Keras -- each of which does massive computational work -- are Python packages (which call C and Fortran libraries under the hood). Forcing an instructor to use Python for a course on Fortran is poor pedagogy. Teaching Python first in a scientific computing context is one of the best pedagogical decisions available: Python is thoroughly supported by The Carpentries, meaning our learners have a better chance of having seen it before, so the parallel features we introduce can be the focus, rather than the language itself. I disagree that there are people who should not use Python: it is a Turing complete programming language with remarkable utility as a high-level glue for computation, statistics, and visualization. I would like to add that this discussion reminds me of similar ... disagreements ... between @psteinb and myself when I first stumbled upon HPC Carpentry. I prefer C++/CUDA for similar performance reasons, and wanted to dive straight into buffered non-blocking MPI function calls, compiler optimizations, profiling, and scaling. These matter to me, as an HPC practitioner!
Adding other languages does not force you or any instructor to use them; it gives instructors and learners the choice of what they think is best for their situation. Languages have strengths and benefits. Many engineering codes still use Fortran, see for example https://pages.nist.gov/fds-smv/ Not sure if you read: New languages such as Julia and Go are replacing Python for scientific computing and for web services. Enterprises still use Java because, although programmer productivity for a single task is lower than with Python, it is easier to maintain a high quality Java codebase than it is to maintain a high quality Python codebase. As an example, h2o is written in Java; while not as popular as many of the machine learning libraries wrapped in Python, it is extremely easy to set up, partly because the discipline enforced by the language leads to higher quality code. Finally, note that the incubator has lessons on GPU programming and Java.
The slowness complaint in the essay refers to the relative slowness of Python due to its scripting character, in the context of programming contests or exercises where, as I read it, various algorithms of interest are implemented "from scratch" using native language features. This is true, and a valid complaint, to which the Python community's solution is to wrap C/C++/Fortran libraries in Python-accessible wrappers. The mpi4py tool used in the current example is one such wrapped library, which I think detracts from the relevance of the speed issue in the HPC Carpentry context. Mpi4py itself is apparently C-based, judging from the The actual language of our example is not by itself of high importance. The features we require are a fairly high level of abstraction to minimize MPI-specific distractions (which has already motivated a shift away from a prior example in C), and a correspondingly high level of readability (so learners can see where the parallelism is happening). High performance and an exhibition of HPC best-practices are useful, but of secondary importance. I personally remain undecided about having multiple language options. I feel like this opens the door to a parallel-programming rabbit-hole that properly belongs in a separate lesson whose focus is parallel job operations, where, to be clear, such a discussion would likely be of very high value. Multiple language options also impose an organizational burden on future instructors, who may not themselves be experts in HPC operations. The Python example ticks a lot of the boxes we want. What little I know of Julia suggests that it could be a good choice as well. I think C or C++ likely expose too much extraneous detail. I suspect this is true of Fortran also, but I am sufficiently unfamiliar with current Fortran best-practice that I'm less confident here than I am for C/C++.
A practical issue, related to the recent coordination meeting, might be to think about this in the context of what might be easily available in a purpose-built instructional cluster assembled using the ComputeCanada Magic Castle Terraform module set. Anything in that set, or easily added to it, is possibly a reasonable candidate. To harp on my newest hobby-horse, I also think we are starved for learner feedback. Arguments of the form "learners will be confused/disappointed/excited by X" are the strongest arguments we have for lesson modifications when they are backed by actual learner feedback expressing confusion/disappointment/excitement with X. |
I don't doubt the data they gathered in 2007, nor their order of magnitude difference for time limits at that time. I do wonder if a numpy solution to a problem is that much slower than Java or C/C++ today. At least from a single test of the numpy code versus the Java code at Stack Overflow: Python code 60x slower than Java, I get around a 3.7x speed difference in Java's favor. How much of that is the overhead of starting the Python interpreter, and so would be less significant on a larger problem, is currently unknown. The Python code certainly has room for improvement, as the adaptations I made to it were just to bring it in line with the topics covered in the original Python lessons for SC. So the only new concepts in the MPI version are just the basics of MPI itself. It may be different for others, but my intended audience for HPC Carpentry workshops is the new users at my university. The median user from science or engineering won't have done much at a command line, probably zero version control, and little to no Python. If feasible, I'd run them through the regular 2-day SC workshop, then follow up with the HPC-intro material. They're not generally going to try to eke out every bit of performance in their workflow, they're just trying to get science done more quickly without a huge amount of effort (i.e., going from novice to competent practitioner, not to expert). There's definitely a place for the other languages, but from what I've seen in other training, doing justice to OpenMP or MPI for users who already know their language of choice is a day-long or multi-day workshop on its own.
There was a really interesting post on Twitter regarding the performance of Python, which I think is related to this discussion (thanks @psteinb!). It references a recently published note in Nature Astronomy (which you can read online at https://rdcu.be/ciO0J). For me, the point to take from that is that there is a lot of value in showing people that writing efficient code with Python is possible, and the tools you need to do that are something we could/should cover in greater detail. This type of course is something that we do cover at Juelich Supercomputing Centre (see https://fz-juelich.de/SharedDocs/Termine/IAS/JSC/EN/courses/2021/ptc-hpc-python-2021.html).
In #341, @bkmgit proposes
Inclusion of other languages, lower-level/higher-performance languages, and teaching complexity has come up before, but it's worth discussing again to help clarify the intent of this lesson, from a pedagogical viewpoint, and if/when/where we intend to teach more about programming languages than HPC resources.
Do you see value in porting the Monte Carlo π calculator to any of the following?
If so, where should we teach these languages? In this lesson, or in a new one, or create examples for after-workshop exploration?