
Author Topic: Parallel programming model  (Read 2223 times)

Warfley

  • Hero Member
  • *****
  • Posts: 910
Re: Parallel programming model
« Reply #30 on: July 06, 2022, 04:34:13 pm »
Pascal does not really have great support for parallel programming. It has threads and synchronization mechanisms (like critical sections and events), but that's the bare minimum. That's a bit like saying "here you have some iron, copper, rubber, and other raw materials, go and build a car". Possible, but tedious.
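To make that concrete, here is roughly what those "raw materials" look like: a minimal Free Pascal sketch (all names are illustrative) that subclasses TThread and guards shared state with a hand-managed critical section.

```pascal
program MinimalThreading;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} // required for threading on Unix
  Classes, SysUtils;

var
  CS: TRTLCriticalSection;
  Counter: Integer = 0;

type
  // The bare minimum Pascal offers: subclass TThread, override Execute
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: Integer;
begin
  for i := 1 to 1000 do
  begin
    // Every access to shared state must be locked by hand
    EnterCriticalSection(CS);
    try
      Inc(Counter);
    finally
      LeaveCriticalSection(CS);
    end;
  end;
end;

var
  Workers: array[1..4] of TWorker;
  i: Integer;
begin
  InitCriticalSection(CS);
  for i := 1 to 4 do
    Workers[i] := TWorker.Create(False);
  for i := 1 to 4 do
  begin
    Workers[i].WaitFor;
    Workers[i].Free;
  end;
  DoneCriticalSection(CS);
  WriteLn(Counter); // 4000 when the locking is correct
end.
```

Everything here (creation, locking, joining, cleanup) is manual; forget one Leave or one WaitFor and you have a bug the compiler cannot see.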

There are languages out there that incorporate parallel programming into their language design. One such example is Erlang, which was specifically designed by Ericsson for high-performance networking code, and today around 90% of all internet traffic is handled by Erlang code somewhere along the way. It is also commonly used in server software that requires high uptime and has to manage many users, like Amazon's SimpleDB powering their AWS EC2 systems, or the Facebook and WhatsApp chat servers.

And I think it is always a great case study to look at what makes Erlang so successful for parallel programming, and where Pascal is lacking.

So beginning with the first reason: Erlang is VM based, which completely abstracts it from the underlying OS and hardware. Pascal threads are OS threads, while Erlang threads are VM threads. The VM controls the utilization of the hardware and maps the software threads to actual OS-level threads. With OS threads you are limited by the OS and hardware: on Linux you can only have a certain number of threads, spawning and killing threads is slow, and the scheduler will always take a certain amount of time due to context switches. Erlang VM threads, on the other hand, are really lightweight, can be spawned and released in no time, the scheduler is blazing fast, and there are no OS context switches involved.
This allows you to spawn millions of Erlang threads on pretty much any hardware and OS really efficiently. People often associate VMs with being slow, probably because of the garbage collection in Java or .Net, but this is not necessarily the case. As stated above, Erlang was developed for creating router firmware, and today Erlang is capable of handling up to terabytes of data each second.
Also, this is completely hardware independent, so your Erlang code will work the same on a single-CPU microcontroller or on a large computing cluster.

For example, if you write a server in Erlang that handles tens or hundreds of thousands of TCP streams, each of those streams would have its own thread. With standard OS threads, as in Pascal, this wouldn't work; what you usually do instead is have a few hundred threads, each serving a few thousand streams (using polling to access those streams in a non-blocking way).
So while this is theoretically possible to manage in Pascal, the effort involved is much greater.
This is also why other languages like C# introduced their Task system with a thread manager that does exactly the same: chopping your code up into tasks, which can then be software-scheduled transparently by the runtime onto real OS threads.

The second reason is probably the most obvious: Erlang is a functional language, which means it is side-effect free. Each piece of Erlang code is completely independent from the others, all data structures are immutable, and there is no such thing as shared memory (in fact, in functional programming there isn't even a concept of memory). Therefore you don't need synchronization mechanisms. The only way for data to be communicated between parts of the program is through well-defined communication channels at the language level. This first of all allows for formal analysis, as all data flows can be mapped through these well-defined channels. It also makes both optimization and distribution much easier: as you only have these channels, it doesn't matter how the data is transmitted. It could be sent through a message queue in memory managed by the VM, through IPC, or even over the network to connect completely different machines into one logical Erlang program.
Also, because memory is immutable, you can do lazy copies. If a number of threads use the same data (e.g. some lookup table), then, as it is immutable and no one can change it, you can simply let all of them access the same data without having to create a copy for each of them.

Immutability is a concept that is also coming more and more to other languages: Java, C#, Python, C++ and JavaScript all have concepts of immutable datatypes, for exactly that reason. Immutability makes a lot of things easier, and with language-level support for it, the compiler can also highly optimize such structures.

In Pascal this is actually kind of achievable by using forks that communicate through pipes. This separates the memory of each of the processes, but fork will only copy memory lazily, meaning shared pages that are not altered will be shared in memory. This is great for separating tasks and allows for structured communication between them, where the OS takes over all the hard threading and synchronization problems. The only problem is that this is rather inefficient: spawning processes is really slow and takes a lot of resources. So while this is a great solution that allows for an extremely clean separation of tasks, it is not scalable.
I recommend using it when possible, but scaling it to thousands, let alone millions, of processes will probably not end well.
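A minimal sketch of that fork-and-pipe pattern (Unix only, using the BaseUnix unit; this illustrates the idea, it is not production code):

```pascal
program ForkPipeDemo;
{$mode objfpc}
uses
  BaseUnix, SysUtils;

var
  PipeEnds: TFilDes; // [0] = read end, [1] = write end
  Pid: TPid;
  Status: cint;
  Msg, Received: String;
  Buf: array[0..255] of Char;
  Len: TSsize;
begin
  if FpPipe(PipeEnds) <> 0 then
    Halt(1);
  Pid := FpFork;
  if Pid = 0 then
  begin
    // Child: its memory is a lazy copy-on-write copy of the parent's;
    // the pipe is the only channel back to the parent.
    FpClose(PipeEnds[0]);
    Msg := 'result from child';
    FpWrite(PipeEnds[1], Msg[1], Length(Msg));
    FpClose(PipeEnds[1]);
    Halt(0);
  end;
  // Parent: read the child's result from the pipe
  FpClose(PipeEnds[1]);
  Len := FpRead(PipeEnds[0], Buf, SizeOf(Buf));
  SetString(Received, PChar(@Buf[0]), Len);
  WriteLn('Child said: ', Received);
  FpClose(PipeEnds[0]);
  FpWaitPid(Pid, @Status, 0);
end.
```

The OS guarantees the isolation here, which is exactly the appeal; the cost is that every FpFork is orders of magnitude more expensive than spawning an Erlang process.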
Also, Pascal does not really have much support for immutable datatypes. There are pretty much none in the RTL, FCL or LCL, and at the language level there is no real way to declare immutable datatypes (like final in Java or const in C++), so there is no way for the compiler to optimize for them.

If you want to do this at scale in a large system, you need to build your own scalable message queues and build your tasks/threads in a way that they don't touch shared memory. If they do, you need to make sure to use synchronization, or that the data is immutable (which, again, is not supported at the language level).
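Such a hand-built message queue could look roughly like this (a sketch using the syncobjs unit; TMessageQueue and its methods are invented for illustration). Threads then communicate only through Push and Pop, never through shared variables:

```pascal
{$mode objfpc}
uses
  Classes, SysUtils, syncobjs;

type
  // Illustrative thread-safe blocking queue of string messages
  TMessageQueue = class
  private
    FLock: TCriticalSection;
    FEvent: TEvent;      // auto-reset event to wake a waiting consumer
    FItems: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Push(const AMsg: String);
    function Pop(out AMsg: String; ATimeoutMs: Cardinal): Boolean;
  end;

constructor TMessageQueue.Create;
begin
  FLock := TCriticalSection.Create;
  FEvent := TEvent.Create(nil, False, False, '');
  FItems := TStringList.Create;
end;

destructor TMessageQueue.Destroy;
begin
  FItems.Free;
  FEvent.Free;
  FLock.Free;
  inherited;
end;

procedure TMessageQueue.Push(const AMsg: String);
begin
  FLock.Acquire;
  try
    FItems.Add(AMsg);
  finally
    FLock.Release;
  end;
  FEvent.SetEvent; // wake up one waiting consumer
end;

function TMessageQueue.Pop(out AMsg: String; ATimeoutMs: Cardinal): Boolean;
begin
  Result := False;
  repeat
    FLock.Acquire;
    try
      if FItems.Count > 0 then
      begin
        AMsg := FItems[0];
        FItems.Delete(0);
        Exit(True);
      end;
    finally
      FLock.Release;
    end;
  until FEvent.WaitFor(ATimeoutMs) <> wrSignaled; // wait for a Push or time out
end;
```

This is the kind of plumbing Erlang gives you for free as message mailboxes; in Pascal you write, test and debug it yourself.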

The last big reason is the unofficial slogan of Erlang: "let it fail". This refers to Erlang's error management. Each task is generally supervised by another task. For example with a server, you might have a server task which spawns a new task for each incoming connection; here the server task is the supervisor of the connection tasks. If an error occurs in one of the tasks, that task will simply fail and be stopped, and the supervisor will be notified. The supervisor can then choose to ignore it, handle it, or fail itself. For example in telephony systems, where Erlang was also heavily used, when there was an error during a call, the switch would simply drop the connection and ignore the error. The humans trying to phone would simply call again (or the phone software might even do this automatically), so the switch does not need to bother trying to fix the error; it will probably go away on a new call. If too many errors occur, it can try to resolve them; if the switch fails completely, the rest of the telephony network reroutes all of the incoming connections and an engineer can look at what went wrong.

Pascal has exceptions, which can be used for something similar. In fact, for my STAX I've basically rebuilt that system (even before I knew it from Erlang), but trust me, this is really tedious. So again, something you can do, but with a lot more effort, as you basically have to build everything yourself.
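The supervision idea can be approximated with a pattern like the following (a hypothetical sketch, not how STAX actually implements it; TWorker and FOnFailure are invented names):

```pascal
type
  // "Let it fail": the worker never lets an exception escape.
  // It dies quietly and reports the failure to its supervisor.
  TWorker = class(TThread)
  private
    FOnFailure: TNotifyEvent; // callback into the supervising task
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
begin
  try
    // ... the actual work, e.g. serving one connection ...
  except
    // The task simply stops; the supervisor decides whether to
    // restart it, ignore the failure, or fail itself.
    if Assigned(FOnFailure) then
      FOnFailure(Self);
  end;
end;
```

Erlang gives you supervision trees out of the box; here every link between worker and supervisor has to be wired up, made thread-safe and cleaned up by hand.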

All of this makes Erlang a really great language for parallel systems. Ericsson actually measured a decrease in development time by a factor of 10-25 and a reduction in code size by a factor of 2-10 compared to their previous development models (mainly imperative languages like C, which have similar properties to Pascal with respect to parallel programming).
So to summarize: you can write scalable parallel programs in Pascal, but Pascal only gives you the bare bones. It is not comparable to any language that was designed for this and has it built in at the language level, so you will have much more work to achieve similar results.
« Last Edit: July 06, 2022, 04:37:29 pm by Warfley »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1043
Re: Parallel programming model
« Reply #31 on: July 06, 2022, 05:35:55 pm »
The two largest challenges for x86-class CPUs are branch prediction and keeping the caches coherent. Both require black magic and AI in the current ones. That's why a model like Cell is such an improvement. But it requires a different way of thinking and programming, and it isn't backwards compatible. That's basically why most of the parallel stuff runs on GPUs.

Another approach is distributing the objects and running them where convenient. Both .Net and JavaScript have that built in. In that way, they are virtual tasks. But communicating the result can be hard.

A CPU that consists of ~1000 simple ARM cores (with wide busses and lots of DMA) would be interesting. Bandwidth is going to be the main problem if you're going many-core. And if size is not a problem, a lot of micro-controllers. Optical connections and switches would rule.

Golang uses the same ideas as Erlang, but in a procedural programming model. Every function can run as its own lightweight thread (a goroutine) and parameters can be streams (channels). No classes, though, so you have to keep it small.

Then again, as you said, the inter-task/process communication could use a massive overhaul in most current programming languages.

Warfley

  • Hero Member
  • *****
  • Posts: 910
Re: Parallel programming model
« Reply #32 on: July 06, 2022, 05:49:06 pm »
A CPU that consists of ~1000 simple ARM cores (with wide busses and lots of DMA) would be interesting. Bandwidth is going to be the main problem if you're going many-core. And if size is not a problem, a lot of micro-controllers. Optical connections and switches would rule.
To a certain degree these problems can be targeted by NUMA architectures, where different memory regions are assigned, or near, to different CPUs and connected through a shared memory bridge. Access to the "near" memory is extremely fast, while access to the more "distant" memory regions has to go through the shared bridge and is much slower. So the question is how to map threads to cores such that their local memory access is maximized while shared memory access is minimized.

To the best of my knowledge, the only solution for this right now is manual mapping of data onto NUMA memory regions and building your whole application around it, but this is also quite tedious, and you are building your application specifically for the kind of chip it is going to run on.

MarkMLl

  • Hero Member
  • *****
  • Posts: 4730
Re: Parallel programming model
« Reply #33 on: July 06, 2022, 06:18:25 pm »
I'm not convinced of the benefit of a VM in this context, and the industry as a whole seems to be tending towards the LLVM model.

However, I like the idea of a supervisor/worker arrangement, with some way to pass the worker's grumbles to the supervisor rather than propagating them up the call chain in a less controlled manner.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Warfley

  • Hero Member
  • *****
  • Posts: 910
Re: Parallel programming model
« Reply #34 on: July 06, 2022, 06:44:35 pm »
I'm not convinced of the benefit of a VM in this context, and the industry as a whole seems to be tending towards the LLVM model.
You do not necessarily need a VM; you can build similar software threading models natively as well (there are some C userspace scheduling libraries out there that are basically ready to use), but a VM makes this much easier and cleaner, which is why Erlang went this route.
« Last Edit: July 06, 2022, 06:48:34 pm by Warfley »

MarkMLl

  • Hero Member
  • *****
  • Posts: 4730
Re: Parallel programming model
« Reply #35 on: July 06, 2022, 07:05:26 pm »
You do not necessarily need a VM; you can build similar software threading models natively as well (there are some C userspace scheduling libraries out there that are basically ready to use), but a VM makes this much easier and cleaner, which is why Erlang went this route.

I'd suggest that, more importantly, at the time of its inception in the mid-80s there were still multiple commercial VMs, including (obviously) the P-System but also more obscure ones like the one used on https://en.wikipedia.org/wiki/Burroughs_B1700 , a terminal range from the same stable, ICL mainframes and so on. However, this was supplanted a few years later by JIT compilation to native code, which relegated the VM bytecodes to an intermediate representation that was not interpreted directly, followed a few years later by the replacement of bytecode with a serialised representation of the AST.

I don't know whether it's a coincidence that JIT rose to prominence at roughly the same time that computing became an x86 monoculture, or that it was superseded by LLVM etc. at roughly the same time that ARM became a major challenger (with RISC-V etc. minor ones, and IBM mainframes still shuffling on with zombie-like persistence).

MarkMLl

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1043
Re: Parallel programming model
« Reply #36 on: July 06, 2022, 10:38:17 pm »
The main difference with ARM is that it has a lot of conditional opcodes. The good: much less branching. The bad: you have to execute both paths if you want maximum speed. But because it's just a single opcode, and the registers and execution units on CPUs are highly virtualized anyway, that's a net gain. As long as you don't want to run a long pipeline, that is.

For comparison: CMOVE on x86 is one of the major speedups.

For the big picture: a current x86 CPU is a VM that emulates one. And it is filled with VMs that emulate execution units. Because the x86 architecture and opcodes do really poorly if you want to multiplex and multitask. But that's what everyone uses, so that's what you have to run.

On a higher level, a modern PC is just like an ancient one, with a bunch of stuff tacked on. Or a collection of them, like the virtual 8086 modes of an 80386, but with less fencing. Sharing all resources.

We're used to big cores that use time slices for multitasking, with unified memory and other resources. It's like the right-wing idea of liberalism: everyone should be able to do whatever they want. Like, driving on each side of the road, however they feel like.

Your storage units (disks, memory, websites), all contain a long string of ones and zeroes. The contents are just random noise unless you know the protocol used to store them all. It is a huge chain of conventions. And processing is just a set of binary instructions, which are executed without any understanding. And they're part of that data storage as well.

So, the trick is in optimizing understanding: the hard- and software needs to know what this sequence of bits represents and how they can interact with it. And, especially, what the programmer wants to happen.

Together with the pitfalls of unified resources, as discussed above, and transistors being dirt cheap, it makes a lot more sense to partition things as far as they go. And spend half your hardware budget on optimizing random communications between all the tasks, floating around and/or in the process of being executed.
« Last Edit: July 06, 2022, 10:48:55 pm by SymbolicFrank »

MarkMLl

  • Hero Member
  • *****
  • Posts: 4730
Re: Parallel programming model
« Reply #37 on: July 06, 2022, 10:52:42 pm »
The main difference with ARM is that it has a lot of conditional opcodes. The good: much less branching. The bad: you have to execute both paths if you want maximum speed. But because it's just a single opcode, and the registers and execution units on CPUs are highly virtualized anyway, that's a net gain. As long as you don't want to run a long pipeline, that is.

I know what you mean, but I suspect a better term is /predicated/... and I'm not sure it's survived in aarch64.

Quote
For the big picture: a current x86 CPU is a VM that emulates one. And it is filled with VM's that emulate execution units. Because the x86 architecture and opcodes do really poorly if you want to multiplex and multitask. But that's what everyone uses, so that's what you have to run.

Again, I think I'm with you. But if the stuff I was reading three or four years ago is still valid it's interesting that recent(ish) Intel chips appear to use microcode triplets internally, i.e. have surprising echoes of IA-64.

MarkMLl

Thaddy

  • Hero Member
  • *****
  • Posts: 11772
Re: Parallel programming model
« Reply #38 on: July 08, 2022, 03:04:22 pm »
Pascal does not really have great support for parallel programming.
It has, in trunk/main, since a month or two ago.
Things like this https://docwiki.embarcadero.com/RADStudio/Sydney/en/Anonymous_Methods_in_Delphi#Variable_Binding now work.
(Alas you will either have to use trunk or wait for the next major release 4.0)

« Last Edit: July 08, 2022, 03:05:59 pm by Thaddy »
Black themes should be banned.

PascalDragon

  • Hero Member
  • *****
  • Posts: 4290
  • Compiler Developer
Re: Parallel programming model
« Reply #39 on: July 08, 2022, 05:37:29 pm »
Pascal does not really have great support for parallel programming.
It has in trunk/main since a month or two or so.
Things like this https://docwiki.embarcadero.com/RADStudio/Sydney/en/Anonymous_Methods_in_Delphi#Variable_Binding now work.
(Alas you will either have to use trunk or wait for the next major release 4.0)

Anonymous functions and function references by themselves have nothing to do with parallel programming. Sure, they can be utilized for it (like the ability to start an anonymous thread or using an anonymous function for TThread.Queue), but in principle all these can be implemented without anonymous functions as well (it's just a bit more involved). Warfley is talking about a deeper language integration (simple example: language based parallel-for).

Warfley

  • Hero Member
  • *****
  • Posts: 910
Re: Parallel programming model
« Reply #40 on: July 08, 2022, 07:18:02 pm »
Warfley is talking about a deeper language integration (simple example: language based parallel-for).
Not necessarily; some of the things discussed above could be built within the language (as a library), while others would either require language extensions or at least be made much easier by them.

For example, a lightweight/micro threading model could be created as a library, but of course it would feel very tacked on (as the current threading model does, with its rigid requirement to subclass TThread).
The second point about immutability is twofold: Pascal does not provide many tools for creating immutable, lazy-copy-based data structures at the language level, but aside from that, one could, with a bit of effort, build immutable types within the language. For example, a linked list B->C->D could be extended to A1->B->C->D by one process and to A2->B->C->D by another. These two processes could share the tail B->C->D because of its immutability, and only have a different pointer to the first element.
Such data structures could be built using existing language tools (even though it is not that easy, because of memory management) and could be provided by a library. But there simply are not many datatypes provided by the RTL, FCL or LCL that are immutable and lazy-copy by design. While the standard libraries for Free Pascal and Lazarus are very vast with respect to different datatypes, they are completely lacking in this area.
These things exist for many other languages, and with good language support it is much easier to build them. For example, in a functional language like Haskell, all recursive datatypes use the same pattern as described above for the list, without the programmer having to write any code for it.
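The shared-tail list from the example above can be sketched with plain Pascal records (memory management is deliberately left out, since as noted that is the hard part):

```pascal
program SharedTailList;
{$mode objfpc}

type
  PNode = ^TNode;
  TNode = record
    Value: Integer;
    Next: PNode; // never mutated after construction
  end;

// Prepending allocates one new node; the tail is shared, never copied.
function Cons(AValue: Integer; ATail: PNode): PNode;
begin
  New(Result);
  Result^.Value := AValue;
  Result^.Next := ATail;
end;

var
  Tail, List1, List2: PNode;
begin
  Tail := Cons(2, Cons(3, Cons(4, nil))); // the shared B->C->D part
  List1 := Cons(1, Tail);                 // A1->B->C->D
  List2 := Cons(0, Tail);                 // A2->B->C->D
  // Both lists share the very same tail nodes, which is only safe
  // because nodes are never modified after creation.
  WriteLn(List1^.Next = List2^.Next); // TRUE
  // Freeing is deliberately omitted: as noted above, memory
  // management (e.g. reference counting) is the hard part.
end.
```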
Well-defined communication channels also do not necessarily require language support. On a very basic level, TThread.Queue, QueueAsyncCall and Synchronize do provide such functionality; they just aren't meant for communication between arbitrary threads, only for executing code in a specific context.
And the last point, the exception management, can also be built manually; my STAX exception model is basically quite similar. But again, there is the problem of doing it as such a hack: as exceptions are passed around between different execution contexts and raised again and again, the debugger will show the same message again and again, with the stack traces being completely useless after the first handover.

So, at least to a certain degree of convenience, most of this could be built as libraries, but it would still be much more effort, both in use and in debugging and maintenance, so it will never be as good as native language support.

But Pascal is a general-purpose language, and that it is worse at some task than a language that was specifically designed to make that task as easy as possible should not come as a surprise to anybody.
My personal philosophy is always to use the right tools for the right job, and as such I simply would choose the language best suited for the task. If I have to write a highly parallelized system, I would make my life as easy as possible and use a language that is optimized for exactly this sort of thing.

That said, I find the usual threading model, subclassing a thread class and doing all of the synchronization with primitive locks like critical sections, very archaic. A language-based asynchrony model (which does not necessarily require threading) could do wonders. When using threading, my code is full of:
Code: Pascal

EnterCriticalSection(CS);
try
  ...
finally
  LeaveCriticalSection(CS);
end;
Something you just don't see in a lot of other languages, and not just those with immutable types. Just take Java, the furthest away from poly-paradigm design you can get: there you simply mark a method or block as synchronized, and Java takes care of all the locking; no fancy new paradigms like immutable lazy-copy objects needed.
Writing threaded code is my least favourite thing in all of programming. It is so easy to make mistakes, and finding them is extremely hard. And the language does nothing to help you with that, so I generally try to avoid multithreading as much as possible, or use other languages if I really have to write something threaded.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1043
Re: Parallel programming model
« Reply #41 on: July 11, 2022, 01:34:12 pm »
You can write a lightweight threading library with communication queues using only atomic operations, no locking. It's on my to-do list, but not with a high priority (just like a BigInt library): it's a lot of work, so it can wait as long as you can do without it.
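For the single-producer/single-consumer case, such a lock-free queue can be sketched as a ring buffer using FPC's interlocked and memory-barrier intrinsics (a simplified illustration; a general multi-producer version is considerably harder to get right):

```pascal
{$mode objfpc}
const
  QSize = 1024; // power of two, so the indices wrap cheaply

type
  // One producer thread writes, one consumer thread reads; the only
  // shared state is the two indices, published with atomic stores.
  TSpscQueue = record
    Items: array[0..QSize - 1] of Pointer;
    Head: LongInt; // next slot to read (advanced by the consumer)
    Tail: LongInt; // next slot to write (advanced by the producer)
  end;

function TryPush(var Q: TSpscQueue; AItem: Pointer): Boolean;
var
  NewTail: LongInt;
begin
  NewTail := (Q.Tail + 1) mod QSize;
  if NewTail = Q.Head then
    Exit(False); // queue full
  Q.Items[Q.Tail] := AItem;
  WriteBarrier; // make the item visible before publishing the index
  InterLockedExchange(Q.Tail, NewTail);
  Result := True;
end;

function TryPop(var Q: TSpscQueue; out AItem: Pointer): Boolean;
begin
  if Q.Head = Q.Tail then
    Exit(False); // queue empty
  AItem := Q.Items[Q.Head];
  ReadBarrier; // read the item before releasing the slot
  InterLockedExchange(Q.Head, (Q.Head + 1) mod QSize);
  Result := True;
end;
```

Even this tiny sketch shows why it's on a to-do list rather than an afternoon project: the correctness depends entirely on the barrier placement and on nobody violating the one-producer/one-consumer assumption.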

And, as you said, copy-on-write fits the bill much better than locking for data access anyway. And that includes reference counting, so safe pointers aren't hard that way. But that should be part of the compiler, for the best results.

The main problem with tasks is that it is hard to track their status, especially across systems. And if you use sockets for everything, you quickly run into the maximum allowed number of sockets. So point-to-point communication is probably not the best idea. And, as with NUMA, waiting for data can take a long time.

Then again, blocked tasks aren't a problem with unified memory. They don't prevent other tasks from running. They can even be swapped out to disk. On the other hand, switching a task from 1 core to another one does incur a cache miss penalty. Flush and reload, which can take quite a bit of time. For a multi-CPU system that penalty is bigger, as you also have to take care of the level 2 or 3 cache. And it's worst in a cluster, from one system to another, because you also have to move the data around.

But an OS like Linux can already do all that, through forking and paging. And you could use streams or NFS to move those pages across sockets. It is much faster to map the same page of memory into the address space of multiple processes than it is to copy the data. And, you know what? You shouldn't count on it taking a specific time to access that data anyway.

The good: the OS takes care of it. You can move the whole process/task around. The bad: the size of data allocation or IPC messages is in 4k blocks.

The logical conclusion would be to give each function (our tasks) at least four pages of memory: one for code, one for data, one for the stack and one for the function result. More when needed. If the function is complete, map the result page to the process that created the task.

And that would go very well with the locality of the anonymous functions.

 
