Lazarus

Free Pascal => Beginners => Topic started by: pascal111 on July 03, 2022, 10:04:15 pm

Title: Parallel programming model
Post by: pascal111 on July 03, 2022, 10:04:15 pm
Does Free Pascal support the "Parallel programming model (paradigm)", and what does it look like?

I was reading this link and I think I understood 10% of it:

https://en.wikipedia.org/wiki/Parallel_programming_model
Title: Re: Parallel programming model
Post by: MarkMLl on July 03, 2022, 10:22:44 pm
It supports threads, hardware such as MMX/SSE, but not (automatic) parallelisation of arbitrary code.

MarkMLl
Title: Re: Parallel programming model
Post by: pascal111 on July 03, 2022, 10:28:14 pm
It supports threads, hardware such as MMX/SSE, but not (automatic) parallelisation of arbitrary code.

MarkMLl

Honestly speaking, I can say this is beyond my current knowledge and experience; I won't ask about it again unless I study threads one day.

thanks @MarkMLl
Title: Re: Parallel programming model
Post by: MarkMLl on July 03, 2022, 10:44:40 pm
Honestly speaking, I can say this is beyond my current knowledge and experience; I won't ask about it again unless I study threads one day.

I'll try to break that down a bit, but the risk here is that if I simplify too much one of the "usual suspects" will jump in and muddy the water :-)

A thread is effectively a procedure that runs in the background. An example would be something that reads input from a networking socket, serial port or physical device and places it on a queue.
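As a minimal Free Pascal sketch of that idea (hedged: the "device" here is simulated with Sleep, and a TThreadList of tagged pointers stands in for a real queue):

```pascal
program BackgroundReader;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

type
  { A background thread that "reads" from a device and queues what it finds. }
  TReaderThread = class(TThread)
  private
    FQueue: TThreadList;          // thread-safe list standing in for a queue
  protected
    procedure Execute; override;
  public
    constructor Create(AQueue: TThreadList);
  end;

constructor TReaderThread.Create(AQueue: TThreadList);
begin
  FQueue := AQueue;
  inherited Create(False);        // False = start running immediately
end;

procedure TReaderThread.Execute;
var
  i: Integer;
begin
  for i := 1 to 5 do
  begin
    Sleep(50);                    // pretend to wait on a socket/serial port
    FQueue.Add(Pointer(PtrInt(i)));   // place the "input" on the queue
  end;
end;

var
  Queue: TThreadList;
  Reader: TReaderThread;
  L: TList;
  i: Integer;
begin
  Queue := TThreadList.Create;
  Reader := TReaderThread.Create(Queue);
  // ... the main program is free to do other work here ...
  Reader.WaitFor;
  L := Queue.LockList;            // take the items off the queue
  try
    for i := 0 to L.Count - 1 do
      WriteLn('got item ', PtrInt(L[i]));
  finally
    Queue.UnlockList;
  end;
  Reader.Free;
  Queue.Free;
end.
```

A real reader would block on the device instead of sleeping, and would keep running until told to stop.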

MMX/SSE etc. are "vector extensions" to the x86 architecture, other architectures often have something comparable. Using this you can, as an example, perform a sequence of mathematical operations on a 4-element array of numbers in parallel.

Automatic parallelisation is when you e.g. have a for i := 0 to 15 do loop, and the compiler works out how to distribute that over multiple ALUs ** . Cray had a compiler that did that, written in Pascal, but reputedly when SGI bought Cray they had a beach-party bonfire of most of their erstwhile competitor's engineering material.
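FPC won't do that distribution automatically, but you can do it by hand with threads; a rough sketch (chunk sizes chosen naively, assuming N divides evenly by the thread count):

```pascal
program SplitLoop;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

const
  N = 16;
  NumThreads = 4;

var
  Data: array[0..N - 1] of Integer;

type
  { Each thread works on its own disjoint slice of Data, so no locking needed. }
  TChunkThread = class(TThread)
  private
    FLo, FHi: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(ALo, AHi: Integer);
  end;

constructor TChunkThread.Create(ALo, AHi: Integer);
begin
  FLo := ALo;
  FHi := AHi;
  inherited Create(False);
end;

procedure TChunkThread.Execute;
var
  i: Integer;
begin
  for i := FLo to FHi do
    Data[i] := Data[i] * Data[i];   // the loop body, run on this slice only
end;

var
  T: array[0..NumThreads - 1] of TChunkThread;
  i, Chunk: Integer;
begin
  for i := 0 to N - 1 do
    Data[i] := i;
  Chunk := N div NumThreads;
  for i := 0 to NumThreads - 1 do
    T[i] := TChunkThread.Create(i * Chunk, (i + 1) * Chunk - 1);
  for i := 0 to NumThreads - 1 do
  begin
    T[i].WaitFor;
    T[i].Free;
  end;
  WriteLn('Data[15] = ', Data[15]);   // 225
end.
```

An auto-parallelising compiler would do this slicing (and decide whether it's worth it) for you; here the programmer carries all of that burden.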

MarkMLl

** I'm being careful with my terminology here. There's lots of ways parallelisation can work, including distributing calculations over a vector processing unit (MMX etc.) and- in traditional terms- distributing it over multiple CPUs. However these days PC-derived terminology tends to equate "CPU" with "chip", and to call a processing unit a core within a CPU.
Title: Re: Parallel programming model
Post by: pascal111 on July 03, 2022, 10:58:44 pm
Thank you @MarkMLl, and forgive my aggressive responses sometimes. You know, you have something I can call a heavy or dark fingerprint or impression; it makes the other side, like me, think you're an evil person.

I know this topic is so rich and has some complex parts, but I got an idea about it from your explanation.
Title: Re: Parallel programming model
Post by: MarkMLl on July 03, 2022, 11:17:11 pm
Oh, I'm thoroughly evil: it's the Slav in me :-)

Quote
The worst kind of soul is the great Slav soul. People who suffer from it are usually very deep thinkers. They may say things like this: "Sometimes I am so merry and sometimes I am so sad. Can you explain why?" (You cannot, do not try.) Or they may say: "I am so mysterious ... I sometimes wish I were somewhere else than where I am." (Do not say: "I wish you were.")

http://f2.org/humour/howalien.html

You started it :-)

MarkMLl

Oh, and the Ll stands for Llwyd. Your problem :-)
Title: Re: Parallel programming model
Post by: pascal111 on July 03, 2022, 11:37:54 pm
Quote
Oh, I'm thoroughly evil: it's the Slav in me :-)

http://f2.org/humour/howalien.html

It seems a beautiful piece of English literature.

Llwyd:"(ˈhluːɪd, English lɔid) (in Welsh legend) a magician who avenged his friend Gwawl upon Pryderi, the son of Pwyll, by casting various spells upon Pryderi and his estate."

Yes, you have an evil name  :)

Voltaire discussed some of Pascal's thoughts in one of his books and called him, according to the Arabic translation I read, "the most wonderful enemy of humanity". That is because of a secret: that Pascal isn't human, or what Pascal will be in the last days. Pascal thought it was not a good idea to set the beast free, so he put very hard, difficult restrictions on him; the supposed freedom of the beast isn't a good thing, maybe because he has no inner conscience like humans, so maybe he would do bad things because nothing internally directs him to do right things like other humans. Restrictions cause a horrible life and can cause death.
Title: Re: Parallel programming model
Post by: MarkMLl on July 03, 2022, 11:46:48 pm
It seems a beautiful piece of English literature.

Albeit written by a Magyar.

Quote
Llwyd:"(ˈhluːɪd, English lɔid) (in Welsh legend) a magician who avenged his friend Gwawl upon Pryderi, the son of Pwyll, by casting various spells upon Pryderi and his estate."

Congratulations on your research :-)

Although /Morgan/ Llwyd was a renowned Christian preacher.

MarkMLl
Title: Re: Parallel programming model
Post by: alpine on July 04, 2022, 12:09:40 am
Oh, I'm thoroughly evil: it's the Slav in me :-)

Quote
The worst kind of soul is the great Slav soul. People who suffer from it are usually very deep thinkers. They may say things like this: "Sometimes I am so merry and sometimes I am so sad. Can you explain why?" (You cannot, do not try.) Or they may say: "I am so mysterious ... I sometimes wish I were somewhere else than where I am." (Do not say: "I wish you were.")

http://f2.org/humour/howalien.html

You started it :-)

MarkMLl

Oh, and the Ll stands for Llwyd. Your problem :-)
Is there something allegorical in the "great Slav soul", or is it somehow connected to your personal ancestry?
Because I know Lloyd is a Welsh surname meaning grey. Welsh and Slav? It should have some hidden context  :-\
Title: Re: Parallel programming model
Post by: pascal111 on July 04, 2022, 12:17:26 am
Quote
Although /Morgan/ Llwyd was a renowned Christian preacher.

By mentioning "Christian preacher": Christianity is the religion Pascal preferred, and it is what he chose for the beast, while Voltaire - as I thought - tried to free the beast from that choice and suggested many other religions, and we don't know what Pascal was thinking about when he chose Christianity. He wanted something we don't know. Some sources said that Christianity is for removing the terrorist mindset that can be caused by some other religions.

Reading history is so useful. I think you are a very accurate reader and thinker, like that piece said: "...very deep thinkers." You have very detailed information on any computer topic you are asked about; I noticed that.
Title: Re: Parallel programming model
Post by: dbannon on July 04, 2022, 04:19:32 am
I might (very boringly) return to parallel processing.  Having spent a good part of my working life telling various researchers that the 'sort' of parallel they had put all their effort into was a waste of time, it's fun to re-live it.

There are two broad 'sorts' of parallel (although some dispute even the word) -

* Message Passing - using an approach like MPI or MPICH (yes, works with FPC I believe)

* MultiProcessing, chmem - really just multithreading, but today's OSs know how to move a thread or process onto another core of the same 'box', either on the same CPU or another CPU in the same memory space.

MPI programming is the basis of almost all the supercomputers around these days; even Cray and SGI sell vastly more such systems. They consist of lots of individual 'boxes', each with their own memory space and a fast interconnect. Generally harder to programme, but some jobs can and do scale to thousands of processes.

MP on the other hand is always limited to the one box and while people like SGI mastered the art of cramming lots of CPUs in one box, it was always a more complicated, expensive and fragile solution.

Intel championed OpenMP with some nice 'self-parallelising compilers', but for many jobs the parallel efficiency was generally limited to 4 to 16 CPUs. chmem is a library that shares memory between boxes, but for many applications it still has quite a lot of overhead.

Oh, and then people started running parallel code on Video Cards.....

Davo

Title: Re: Parallel programming model
Post by: Zvoni on July 04, 2022, 10:01:52 am
In my experience, to understand something like threads/parallel computing, I always start from "real life".
Like this morning:
I entered my car and started it.
After getting my morning coffee at the local gas station I continued driving towards my workplace.
Once driving again (one process/thread), I turned on the music in my car (launching a second thread/process).
Some minutes later, while waiting at a red light, I lit a cigarette (yes, I smoke in the car) (launching a third thread/process).

Bottom line: as described above, there are three distinct "processes" happening at the same time.

Or let's take driving only.
When you drive a car, and you want to switch lanes, you hit your direction indicator, because your Thread "driving" received a message "You have to get to the other lane, because you want to use the next exit".
This is happening WHILE you're still driving. It's not like you stop driving, hit the indicator, and then start driving again.

The human brain itself is a marvel of "parallel computing", though many women will argue that males are not capable of "multi-Tasking" :-)

EDIT: Or to use another analogy:
1) Breaking up a big problem into smaller parts
Imagine a big Open-Air-Show moving from one venue to the next. What do they do?
They load up the whole Production (="One Big Problem" --> Stage, Lighting, Speakers, Instruments) into separate, multiple Trucks, wherein each Truck has its own "Load" ("smaller Problems")
They drive in "parallel" (think "Convoy") from Point A to Point B. The "Big Problem" is "solved" when each Truck arrives at their destination (Note: Not neccessarily at the same time)
2) Parallel Processes passing messages
Take a Football-Team. You have 11 Threads/Processes running at the same time.
The left defender "passes" the ball to the right midfielder, sending this midfielder the message "Pass the ball back to me because I will be running forward along the sideline" --> synchronous! It doesn't make sense for the right midfielder to pass the ball back if the left defender hasn't arrived at his intended destination.
Now the left defender has got the ball back, and shoots a pass blindly into the penalty box in front of the goal --> asynchronous! The left defender passed the message: "I'm done. Do with it whatever you want."
Title: Re: Parallel programming model
Post by: MarkMLl on July 04, 2022, 10:06:41 am
* Message Passing - using an approach like MPI or MPICH (yes, works with FPC I believe)

That is of course the dominant technology these days: while supercomputers in the 80s and 90s had some combination of vectorisation and multiple processors, the emphasis has switched to using a large number of commodity CPUs/GPUs and distributing the workload using one of the message-passing systems.

However this is entirely library-based, and I don't know what if any language implementations automate it.

Apropos threads, with the exception of some very early mainframes I'm unaware of any OS which was unable to migrate work between processors on closely-coupled systems i.e. within the context of a single system image.

Loosely-coupled systems (where each node has its own OS) are a far more tricky matter, since the POSIX programming model allows threads to have state which can clash if they're moved. In part, that accounts for the rise of virtualisation and containerisation: in many cases a running VM can be fairly easily suspended and migrated.

I think I can fairly throw https://en.wikipedia.org/wiki/Vector_Pascal into the ring as one of very few free compilers which make at least some attempt to vectorise/parallelise. However it's some while since I last looked at it, and I can't remember whether it still builds on modern systems.

MarkMLl
Title: Re: Parallel programming model
Post by: dbannon on July 05, 2022, 02:52:38 am

I did not see any Pascal applications running on my machines, although it may have happened. There is definitely an interface to MPI - https://wiki.lazarus.freepascal.org/MPICH - but it does not appear to have had any love for some time. Fortran and C/C++ dominated.

In terms of migrating processes on one box: yes, relatively easy, but not always as efficient as you would expect. Speaking only of Unix here: twenty years ago, on Unix systems (Ultrix, Compaq Unix), we were encouraging end users to use MPI even when they could request a single image. Good MPI libraries were smart enough to skip the interconnect if they could, and 8 CPUs in a box was a sweet spot for cost effectiveness. Memory bandwidth was always an issue, even with chips like the Alpha.

> virtualisation and containerisation ....

Is that, sort of, the opposite of parallel?

>  https://en.wikipedia.org/wiki/Vector_Pascal

Intel do now have some vector support, but nothing like that offered by, e.g., Cray. I wonder if the speed-up delivered would be less than the penalty of not using an optimized compiler like FPC?  Cray could do a complete step through a matrix with one op. But such machines are incredibly expensive, all purchased by US homeland defense and military. Apparently very efficient at listening in on trans-Atlantic communications ....

I see absolutely no reason why FPC would not be just as useful in an MPI environment as C++. It would be far more reliable code (important on long-running, compute-intensive jobs) and, hitting the same MPI libraries, would be just as efficient through the interconnect.

I have seen, for example, both Python and Java running MPI. But we would groan when we saw those users logging on ....

Davo

Title: Re: Parallel programming model
Post by: MarkMLl on July 05, 2022, 09:13:01 am
> virtualisation and containerisation ....

Is that, sort of, the opposite of parallel?

Yes, but I mentioned those since they're the dominant technology for moving work between systems larger than a single system image (noting that that term gets very indistinct considering e.g. large IBM mainframes, which have had distributed operation for decades).

Quote
>  https://en.wikipedia.org/wiki/Vector_Pascal

With the caveat- which I should possibly have made clear earlier- that there is a strong APL (or at least J) influence there. Nothing wrong with that: Wirth was involved with APL long before he designed Pascal.

In actual fact, I feel that an understanding of APL is no bad thing. Apart from anything else, it highlights two curious facts:

* A sizeable fraction of the CS fraternity is prepared to ignore memory requirement in the search for an improved O() applied to speed.

* A sizeable fraction of the CS and mathematical fraternities is prepared to throw transcendentals into an algorithm if doing so allows all elements of a vector to be handled in parallel.

Quote
Intel do now have some vector support, but nothing like that offered by, e.g., Cray. I wonder if the speed-up delivered would be less than the penalty of not using an optimized compiler like FPC?  Cray could do a complete step through a matrix with one op. But such machines are incredibly expensive, all purchased by US homeland defense and military. Apparently very efficient at listening in on trans-Atlantic communications ....

I see absolutely no reason why FPC would not be just as useful in an MPI environment as C++. It would be far more reliable code (important on long-running, compute-intensive jobs) and, hitting the same MPI libraries, would be just as efficient through the interconnect.

I have seen, for example, both Python and Java running MPI. But we would groan when we saw those users logging on ....

Noting obviously Chris Fenton's summary of the Cray https://www.chrisfenton.com/homebrew-cray-1a/

I agree that FPC support for this could benefit from some TLC. Unfortunately most of the workload distribution research over the last 20 years has focussed on better ways to download cat videos.

Apropos Python, Java etc.: however much we may loathe the syntax or architecture, it's possible to JIT-compile them to something at least half-way efficient. But my comment apropos APL applies: the resources required to do so might be so obscenely large that it would have been better to have an efficient toolchain in the first place.

But that, of course, is where the "Itanic" failed: the difficulty of an efficient toolchain.

MarkMLl
Title: Re: Parallel programming model
Post by: SymbolicFrank on July 05, 2022, 10:14:36 am
Long ago, at the dawn of the Information Age, a computer had only a single CPU that could run only a single program at a time. If multiple people had to use it, you used time sharing: make an appointment for a time slot.

A bit later came cooperative multitasking. Which meant: we increase the frequency of the time slices a fair bit, to about 5 per second at first, thousands later. So you can load and run multiple programs at the same time, as long as they're all considerate of one another and give each other the room they need to run. Which was actually quite hard, as access to many resources wasn't shared. Hey, we have no printer drivers or print spooler, so be gentle when accessing it, ok?

To make that actually work, the virtual machine was born. Like, if you had an 80386 CPU, you could go big and wide, by giving all resources to a single process, or you could go many and small, by dividing the available resources into virtual 8086 machines: chop up the RAM into 1 MB chunks and use all the extended functionality for the OS. As long as the OS took care of all the devices, it worked very well. That requires that the OS runs in protected/supervisor mode and that it is automatically triggered if a client tries to access something it shouldn't.

Now, chopping up the memory into pieces that are each the maximum size a child process can access is an easy thing to do, but it's not very efficient. It is much better if the client thinks it has all the RAM it can use, but actually only has the RAM it does use. Most often that is done by paging: if a client process tries to access memory somewhere, quickly insert a block of 4 kB of RAM at that location.

While all of that works quite well, it is a bit slow, simply because you have to switch between user mode and protected mode all the time to access resources, and because you have to do all that paging. And depending on the OS, communicating with other processes can be a hassle.

To fix that, threads were born: while you do get a new stack for each sub-process you create, and the registers are saved when it gets activated, you share all the other resources with the main program. The good: it is faster and you only have to set everything up once. The bad: no protection! You have to supply that yourself. So it's exactly like cooperative multitasking.

A thread is basically just a location the CPU starts executing commands from at intervals: a single "main" routine. If you exit it, processing stops. So you normally write a loop like this:

Code: Pascal

var
  Terminated: Boolean = False;  // global, so both the thread and the main program can see it

procedure Main;
begin
  while not Terminated do
  begin
    // the work you want to do goes here
  end;
end;

Well, that's an endless loop if you never change the state of Terminated. But because the address space is shared between your thread and the main process, both can change it: the main process can order the thread to stop by setting it to True. The good and the bad in one go, because if both try to access it at the same time, Bad Things happen.
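For what it's worth, FPC's TThread has this exact mechanism built in: Terminate sets a protected Terminated flag that Execute can poll, so you don't need to hand-roll the shared Boolean. A minimal sketch:

```pascal
program StopFlag;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
begin
  while not Terminated do     // Terminated is set by TThread.Terminate
    Sleep(10);                // the work you want to do goes here
end;

var
  W: TWorker;
begin
  W := TWorker.Create(False);
  Sleep(100);                 // main thread does something else for a while
  W.Terminate;                // politely ask the thread to stop
  W.WaitFor;                  // wait until the loop actually exits
  W.Free;
  WriteLn('worker stopped cleanly');
end.
```

Terminate only sets the flag; it is still the thread's job to check it often enough to stop promptly.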

So, the main thing with threads is: make sure they behave! Don't try to access any memory or resources you didn't allocate yourself within that thread. And prevent exceptions.

How to use them?

There are basically two ways: SMP (symmetric multiprocessing) and tasks. In the first, all the threads do the same thing for the same amount of time. This is the easiest to set up, but the hardest to get good use out of. In the second, you create each thread when needed and give it all the resources it needs at the start. When it is done, it reports the result and disappears. More bookkeeping, with a chance of erratic threads, but it makes the best use of the available resources (like CPU time).

In the first case (SMP), you divide the RAM and data up front (like, thread 1 does everything starting with "A..C"). Accessing resources meant for your sibling threads is a good way to crash. And in the second case (tasks), you make a copy of all the data the thread is allowed to access up front.
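A sketch of that task style in FPC (hedged: TSumTask and its fields are illustrative names; the thread gets its own copy of the data up front, and the result is read only after WaitFor):

```pascal
program SumTask;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

type
  TSumTask = class(TThread)
  private
    FData: array of Integer;  // private copy: no other thread touches it
  public
    Total: Int64;             // read by the creator only after WaitFor
    constructor Create(const AData: array of Integer);
  protected
    procedure Execute; override;
  end;

constructor TSumTask.Create(const AData: array of Integer);
var
  i: Integer;
begin
  SetLength(FData, Length(AData));
  for i := 0 to High(AData) do  // copy the input up front
    FData[i] := AData[i];
  inherited Create(False);
end;

procedure TSumTask.Execute;
var
  i: Integer;
begin
  Total := 0;
  for i := 0 to High(FData) do
    Total := Total + FData[i];
end;

var
  Task: TSumTask;
begin
  Task := TSumTask.Create([1, 2, 3, 4, 5]);
  // ... main program does other things ...
  Task.WaitFor;                    // the task "reports the result and disappears"
  WriteLn('sum = ', Task.Total);   // 15
  Task.Free;
end.
```

Because the task owns its copy and the result is only read after the task has finished, no locking is needed at all.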

That's about it. If you feel you need a lot of mechanisms to partition things (like critical sections) or to communicate (like semaphores), it's probably better to revise the design and see if you can do without them.
Title: Re: Parallel programming model
Post by: 440bx on July 05, 2022, 11:48:33 am
And in the second case (tasks), you make a copy of all the data the thread is allowed to access up front.
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

Title: Re: Parallel programming model
Post by: SymbolicFrank on July 05, 2022, 12:38:03 pm
And in the second case (tasks), you make a copy of all the data the thread is allowed to access up front.
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.
In that case, they'll have to queue to access it. One by one. You just serialized your application.

Tasks are unpopular because it requires thinking differently. See it like asynchronous programming with callbacks.
Title: Re: Parallel programming model
Post by: 440bx on July 05, 2022, 01:15:33 pm
You just serialized your application.
No.  Protecting a piece of data with a synchronization object does not serialize an application.  It only serializes access to that one piece of data and, that should be for an _extremely short_ time.
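To illustrate: two threads hammering one counter, each holding the lock only for the increment itself. Everything else in the loop runs in parallel and no updates are lost (this sketch uses the RTL critical-section routines from the FPC System unit):

```pascal
program CritSec;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils;

var
  Lock: TRTLCriticalSection;
  Counter: Integer = 0;

type
  TIncThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TIncThread.Execute;
var
  i: Integer;
begin
  for i := 1 to 100000 do
  begin
    EnterCriticalSection(Lock);   // held only for the increment itself
    Inc(Counter);
    LeaveCriticalSection(Lock);
    // ... anything else in the loop runs fully in parallel
  end;
end;

var
  A, B: TIncThread;
begin
  InitCriticalSection(Lock);
  A := TIncThread.Create(False);
  B := TIncThread.Create(False);
  A.WaitFor;
  B.WaitFor;
  A.Free;
  B.Free;
  DoneCriticalSection(Lock);
  WriteLn(Counter);   // 200000: access was serialized, the threads were not
end.
```

Without the lock, the two unsynchronised increments would race and the final count would usually come up short.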

Title: Re: Parallel programming model
Post by: dbannon on July 05, 2022, 01:56:12 pm
[a lot of sensible stuff]

But that, of course, is where the "Itanic" failed: the difficulty of an efficient toolchain.


Hmm, I tend to think the Itanium had a reasonable toolchain: Intel supplied its OpenMP code, and HP purchased Compaq so they could get the DEC compiler technology (and kill off the Alpha chip). The problem with the Itanium was its power consumption. A single Itanium-powered box was one thing, but a rack of them in full flight drained whole power stations. We operated a smallish cluster for an auto maker back then and had little trouble getting existing codes running on it. Difficult to say how efficiently, but it did do reasonable Linpack numbers.  But scaling it up, supplying power and cooling it made it a very poor solution. I was glad to see it go from my machine room, long before its expected lifetime.

But its real offense was killing the Alpha ....

Davo

Title: Re: Parallel programming model
Post by: MarkMLl on July 05, 2022, 02:18:01 pm
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

Generally speaking that's something that operating systems worked out fairly early: when a block of code or data is copied nothing really happens except that the original area is marked read-only, and then when one of the processes/threads wants to change it an actual copy is made in physical memory and both revert to their original permissions. The same happens with each of the cache levels.

That sort of arrangement fails if the OS doesn't have an accurate picture of which areas of memory are close to which CPUs, which is why explicit processor and interrupt affinity settings were added. It's a while since I've played with that sort of thing but the larger 32-bit Suns were a particular example: each comprised multiple identical cards with CPU, memory and a selection of I/O.

MarkMLl
Title: Re: Parallel programming model
Post by: PascalDragon on July 05, 2022, 02:23:53 pm
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

Generally speaking that's something that operating systems worked out fairly early: when a block of code or data is copied nothing really happens except that the original area is marked read-only, and then when one of the processes/threads wants to change it an actual copy is made in physical memory and both revert to their original permissions. The same happens with each of the cache levels.

Automatically this applies only to memory where the OS can determine that this would be useful (e.g. for the whole address space for fork on Unix systems or NtCreateProcess on Windows (which can be used to implement fork)). However for your own code you'd need to manually tell the OS to do such Copy-On-Write mapping (assuming it supports this mechanism at all).
Title: Re: Parallel programming model
Post by: 440bx on July 05, 2022, 02:24:30 pm
Generally speaking that's something that operating systems worked out fairly early:...
I don't think his comment was directed at how an O/S manages code and data (copy on write and that kind of stuff.)  Just run of the mill user threads and, he suggested making copies of the data for different threads to access.  The O/S isn't going to poke its fingers in there and, having multiple copies of a piece of data is likely always a bad idea (if there is an exception, I cannot think of it at this time.)


Title: Re: Parallel programming model
Post by: SymbolicFrank on July 05, 2022, 03:37:03 pm
Generally speaking that's something that operating systems worked out fairly early:...
I don't think his comment was directed at how an O/S manages code and data (copy on write and that kind of stuff.)  Just run of the mill user threads and, he suggested making copies of the data for different threads to access.  The O/S isn't going to poke its fingers in there and, having multiple copies of a piece of data is likely always a bad idea (if there is an exception, I cannot think of it at this time.)
Unified memory is the culprit. As MarkMLl and PascalDragon said, you're best off with local RAM (resources in general) for maximum parallelism. The best example is probably how the Cell CPU did it: move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent micro-controllers do it. All devices have their own micro-micro-controller and local storage, RAM is segmented and stuff moves around through DMA transfers. It might sound horrible for bandwidth and storage requirements, but it is the best way to maximize throughput.

(That's why it is strange that Gabe Newell was very vehement that programming like that was a waste of everyone's time ;) )
Title: Re: Parallel programming model
Post by: MarkMLl on July 05, 2022, 10:24:12 pm
Unified memory is the culprit. As MarkMLl and PascalDragon said, you're best off with local RAM (resources in general) for maximum parallelism. The best example is probably how the Cell CPU did it: move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent micro-controllers do it. All devices have their own micro-micro-controller and local storage, RAM is segmented and stuff moves around through DMA transfers. It might sound horrible for bandwidth and storage requirements, but it is the best way to maximize throughput.

(That's why it is strange that Gabe Newell was very vehement that programming like that was a waste of everyone's time ;) )

I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl
Title: Re: Parallel programming model
Post by: dbannon on July 06, 2022, 01:39:06 am
Generally, in a large MPI application (and we are talking about parallel here, folks) one process or even one node is appointed the data server; individual "compute nodes" get and maybe put back their data there. Each compute node pulls down only the data it needs and works on a local copy. The interconnect is fast, but nowhere near as fast as local access.

Alternatively, in some apps, the compute processes  get given their data from the master process at startup.

But in either case, nodes using a lot of data would have their own copy of the data (there are exceptions, of course). Because a node will have multiple processes, and good MPI apps make no assumptions about neighbouring processes, it does mean duplicate copies sometimes exist on one node. It's possible to optimize that, but doing so means your app must tell the scheduler it wants particular patterns of processors on nodes. And that often means your job will wait, queued up, for a lot longer before it starts.

Davo

Title: Re: Parallel programming model
Post by: SymbolicFrank on July 06, 2022, 10:13:49 am
I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl

The big picture: https://electronics.stackexchange.com/questions/229125/is-the-stm32-dma-destination-address-restricted

In detail: https://deepbluembedded.com/stm32-dma-tutorial-using-direct-memory-access-dma-in-stm32/
Title: Re: Parallel programming model
Post by: MarkMLl on July 06, 2022, 10:37:46 am
I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl

The big picture: https://electronics.stackexchange.com/questions/229125/is-the-stm32-dma-destination-address-restricted

In detail: https://deepbluembedded.com/stm32-dma-tutorial-using-direct-memory-access-dma-in-stm32/

I feel that you are reading much more into that material than is justified. All it's basically saying is that the STM32 (an ARM derivative) has DMA, that it can be used for transfers to or from peripheral devices, and that a peripheral (including a DMA controller) accessing main memory might have additional implications if the hardware has one or more layers of cache.

It's obviously noteworthy that low-end devices like that have cache and DMA, if for no other reason than that it introduces uncertainties in opcode timing. But I don't agree that the links you've given can be interpreted as describing

... move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent micro-controllers do it. All devices have their own micro-micro-controller and local storage, RAM is segmented and stuff moves around through DMA transfers.

MarkMLl
Title: Re: Parallel programming model
Post by: SymbolicFrank on July 06, 2022, 11:19:24 am
So, how would you incorporate a UART in your micro-controller application? Using interrupts and reading/writing the single byte in the UART register?
Title: Re: Parallel programming model
Post by: MarkMLl on July 06, 2022, 12:01:08 pm
So, how would you incorporate a UART in your micro-controller application? Using interrupts and reading/writing the single byte in the UART register?

Depends, among other things, on the buffering capability of the peripheral concerned, but also on clarity and maintainability.

But DMAing to a UART is a long way short of the level of sophistication that you implied. And it's particularly removed from the sort of thing OP is asking about, which can be loosely interpreted as relating to programming models where background activity can both read and write potentially-shared areas of memory.

MarkMLl
Title: Re: Parallel programming model
Post by: Warfley on July 06, 2022, 04:34:13 pm
Pascal does not really have great support for parallel programming. It has threads and synchronization mechanisms (like critical sections and events), but that's the bare minimum. That's a bit like saying "here you have some iron, copper, rubber and other raw materials, go and build a car". Possible, but tedious.

There are languages out there that incorporate parallel programming into their language design. One such example is Erlang, which was specifically designed by Ericsson for high-performance networking code; today around 90% of all internet traffic is handled by Erlang code somewhere along the way. It is also commonly used in server software that requires high uptime and manages many users, like Amazon's SimpleDB powering their AWS EC2 systems, or the Facebook and WhatsApp chat servers.

And I think it is always a great case study to look at what makes Erlang so successful at parallel programming, and where Pascal is lacking in comparison.

So beginning with the first reason: Erlang is VM-based, which completely abstracts it from the underlying OS and hardware. Pascal threads are OS threads, while Erlang threads are VM threads. The VM controls the utilization of the hardware and maps the software threads to actual OS-level threads. With OS threads you are limited by the OS and hardware: on Linux you can only have a certain number of threads, spawning and killing threads is slow, and the scheduler always takes a certain amount of time due to context switches. Erlang VM threads, on the other hand, are really lightweight, can be spawned and released in no time, and the scheduler is blazing fast with no OS context switches involved.
This allows you to spawn millions of Erlang threads really efficiently on pretty much any hardware and OS. People often associate VMs with being slow, probably due to the garbage collection in Java or .NET, but this is not necessarily the case. As stated above, Erlang was developed for creating router firmware, and today Erlang is capable of handling terabytes of data each second.
Also, this is completely hardware independent, so your Erlang code will work the same on a single-CPU microcontroller or on a large computing cluster.

For example: if you write a server in Erlang that handles tens to hundreds of thousands of TCP streams, each of those streams would have its own thread. With standard OS threads, as in Pascal, this wouldn't work; what you usually do instead is have a few hundred threads, each serving a few thousand streams (using polling to access those streams non-blockingly).
So while this is theoretically possible to manage in Pascal, the effort involved is much greater.
This is also why other languages like C# introduced their task system with a thread manager that does exactly the same: chopping your code up into tasks, which can then be software-scheduled transparently onto the real OS threads.
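To make that concrete, here is roughly what such hand-rolled scheduling looks like in Free Pascal: a small fixed pool of TThread workers pulling numbered tasks from a shared, lock-protected counter. A minimal sketch only (TWorker and the task counter are invented for the example; a real server would queue closures or client sockets and need proper shutdown handling):

```pascal
program TaskPool;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils, SyncObjs;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

var
  Lock: TCriticalSection;
  NextTask, TaskCount: Integer;

procedure TWorker.Execute;
var
  Task: Integer;
begin
  repeat
    { take the next task number under the lock }
    Lock.Acquire;
    try
      Task := NextTask;
      if Task <= TaskCount then Inc(NextTask);
    finally
      Lock.Release;
    end;
    if Task > TaskCount then Break;
    { the "work": in a real server this would serve one request }
    WriteLn('worker ', PtrUInt(ThreadID), ' runs task ', Task);
  until False;
end;

var
  Workers: array[1..4] of TWorker;
  i: Integer;
begin
  Lock := TCriticalSection.Create;
  NextTask := 1;
  TaskCount := 20;
  for i := 1 to 4 do
    Workers[i] := TWorker.Create(False);  // start immediately
  for i := 1 to 4 do
  begin
    Workers[i].WaitFor;
    Workers[i].Free;
  end;
  Lock.Free;
end.
```

Twenty "tasks" get multiplexed onto four OS threads; Erlang's VM does the same kind of mapping, just with millions of far cheaper threads and a preemptive scheduler of its own.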

The second reason is probably the most obvious: Erlang is a functional language, which means it is side-effect free. Each piece of Erlang code is completely independent of the others, all data structures are immutable, and there is no such thing as shared memory (in fact, in functional programming there isn't even a concept of memory). Therefore you don't need synchronization mechanisms. The only way for data to be communicated between parts of the program is through well-defined communication channels at the language level. This first allows for formal analysis, as all data flows can be mapped through these well-defined channels. It also makes both optimisation and distribution much easier: as you only have these channels, it doesn't matter how the data is transmitted. It could be sent through a message queue in memory, managed by the VM, through IPC, or even over the network to connect completely different machines into one logical Erlang program.
Also, because the memory is immutable, you can do lazy copies. If a number of threads use the same data (e.g. some lookup table), then, as it is immutable and no one can change it, you can simply let all of them access the same data, without having to create a copy for each one.

Immutability is a concept that is also coming more and more to other languages: Java, C#, Python, C++ and JavaScript all have concepts of immutable datatypes, exactly for that reason. Immutability makes a lot of things easier, and with language-level support for it, the compiler can also highly optimize such structures.

In Pascal this is actually kind of achievable, by using forks that communicate through pipes. This separates the memory of each of the processes, but fork will only copy memory lazily, meaning shared pages that are not altered will be shared in memory. This is great for separating tasks and allows for structured communication between those tasks, where the OS takes over all the hard threading and synchronization problems. I would highly advise doing this; the only problem is that it is extremely inefficient. Spawning processes is really slow and takes a lot of resources. So while it is a very clean solution for separating tasks, it is also not scalable.
So I recommend using this when possible, but scaling it to thousands, let alone millions, of processes will probably not end well.
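For illustration, a minimal sketch of the fork-and-pipe approach on Unix, using FpFork/FpPipe from the BaseUnix unit (Unix only; error handling is mostly omitted):

```pascal
program ForkPipeSketch;
{$mode objfpc}{$H+}
uses
  BaseUnix, SysUtils;   // Unix only: FpPipe, FpFork, FpRead, FpWrite

var
  Fds: TFilDes;         // Fds[0] = read end, Fds[1] = write end
  Pid: TPid;
  Status: cint;
  Buf: string;
  N: TSsize;
begin
  if FpPipe(Fds) <> 0 then Halt(1);
  Pid := FpFork;
  if Pid = 0 then
  begin
    { Child: its memory is a lazy copy-on-write copy of the parent's. }
    FpClose(Fds[0]);
    Buf := 'result from child ' + IntToStr(FpGetPid);
    FpWrite(Fds[1], Buf[1], Length(Buf));
    FpClose(Fds[1]);
    Halt(0);
  end;
  { Parent: the pipe is the only communication channel. }
  FpClose(Fds[1]);
  SetLength(Buf, 256);
  N := FpRead(Fds[0], Buf[1], Length(Buf));
  SetLength(Buf, N);
  WriteLn('parent received: ', Buf);
  FpClose(Fds[0]);
  FpWaitPid(Pid, Status, 0);
end.
```

On Linux the child's pages really are shared copy-on-write until written to, which is what makes this cheap per byte but still expensive per process.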
Also, Pascal does not really have much support for immutable datatypes. There are pretty much none in the RTL, FCL or LCL, and on the language level there is no real way to declare immutable datatypes (like final in Java or const in C++), so there is no way for the compiler to optimise for them.

If you want to do this scalably in a large-scale system, you need to build your own scalable message queues and build your tasks/threads in a way that they don't touch shared memory; if they do, you need to make sure to use synchronization, or that the data is immutable (which, again, is not supported at the language level).

The last big reason is the unofficial slogan of Erlang: "let it fail". This refers to Erlang's error management. Each task is generally supervised by another task. For example, with a server you might have your server task, which spawns a new task for each incoming connection; here the server task is the supervisor of the connection tasks. If you have an error in one of the tasks, the task will simply fail and be stopped, and the supervisor will be notified. The supervisor can then choose whether to ignore it, handle it, or fail itself. For example, in telephony systems, where Erlang was also heavily used, when there was an error during a call, the switch would simply drop the connection and ignore the error. The humans trying to phone would simply try to call again (or maybe the phone software would even do this automatically), so the connection switch does not need to bother trying to fix the error; it will probably go away on a new call. If too many errors occur, the switch can try to resolve this, or it fails completely, the rest of the telephony network reroutes all of the incoming connections, and an engineer can look at what went wrong.

Pascal has exceptions, which can be used for something similar. In fact, for my STAX I've basically rebuilt that system (even before I knew it from Erlang), but trust me, this is really tedious. So again, something you can do, but with a lot more effort, as you basically have to build everything yourself.
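As a rough illustration of what such a supervisor pattern looks like when rebuilt by hand in Pascal (a minimal sketch; TSupervisedWorker is an invented name, and a real supervisor would manage many workers plus restart policies):

```pascal
program SupervisorSketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils;

type
  { A worker in the "let it fail" style: any exception simply ends
    the task; the supervisor inspects FailReason afterwards and
    decides whether to restart, ignore or escalate. }
  TSupervisedWorker = class(TThread)
  public
    FailReason: string;   // empty = finished cleanly
  protected
    procedure Execute; override;
  end;

procedure TSupervisedWorker.Execute;
begin
  try
    { the actual job; here it just fails deliberately }
    raise Exception.Create('connection dropped');
  except
    on E: Exception do
      FailReason := E.Message;  // record the failure, don't propagate
  end;
end;

var
  W: TSupervisedWorker;
begin
  W := TSupervisedWorker.Create(False);
  W.WaitFor;
  if W.FailReason <> '' then
    WriteLn('supervisor: worker failed (', W.FailReason, '), restart or ignore')
  else
    WriteLn('supervisor: worker finished');
  W.Free;
end.
```

In Erlang the VM delivers that failure notification for you; here every link in the chain has to be written and maintained by hand.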

All of this makes Erlang a really great language for parallel systems. Ericsson actually measured a decrease in development time by a factor of 10-25 and a reduction in code size by a factor of 2-10 compared to their previous development models (mainly with imperative languages like C, which have similar properties to Pascal with respect to parallel programming).
So to summarize: you can write scalable parallel programs in Pascal, but Pascal only gives you the bare bones. It is not comparable to any language that was designed for this and has it built in at the language level, so you will have much more work to achieve similar results.
Title: Re: Parallel programming model
Post by: SymbolicFrank on July 06, 2022, 05:35:55 pm
The two largest challenges for x86-class CPUs are branch prediction and keeping the caches coherent. Both require black magic and AI in the current ones. That's why a model like Cell is such an improvement. But it requires a different way of thinking and programming. And it isn't backwards compatible. That's basically why most of the parallel stuff runs on GPUs.

Another approach is distributing the objects and running them where convenient. Both .NET and JavaScript have that built in; in that way, they are virtual tasks. But communicating the result can be hard.

A CPU that consists of ~1000 simple ARM cores (with wide buses and lots of DMA) would be interesting. Bandwidth is going to be the main problem if you're going many-core. And if size is not a problem, a lot of micro-controllers. Optical connections and switches would rule.

Golang uses the same ideas as Erlang, but in a procedural programming model. Any function can run as a thread (goroutine) and parameters can be streams (channels). No classes, though, so you have to keep it small.

Then again, as you said, the inter-task/process communication could use a massive overhaul in most current programming languages.
Title: Re: Parallel programming model
Post by: Warfley on July 06, 2022, 05:49:06 pm
A CPU that consists of ~1000 simple ARM cores (with wide buses and lots of DMA) would be interesting. Bandwidth is going to be the main problem if you're going many-core. And if size is not a problem, a lot of micro-controllers. Optical connections and switches would rule.
To a certain degree these problems can be targeted by NUMA architectures, where you have different memory regions assigned, or near, to different CPUs, connected through a shared memory bridge. Access to the "near" memory is extremely fast, while access to the more "distant" memory regions has to go through the shared bridge and is much slower. So the only question is how to map threads to cores such that their local memory access is maximized while shared memory access is minimized.

To the best of my knowledge, the only solution for this right now is manual mapping of data onto NUMA memory regions and building your whole application around it, but this is quite tedious, and you are building your application specifically for the kind of chip it is going to run on.
Title: Re: Parallel programming model
Post by: MarkMLl on July 06, 2022, 06:18:25 pm
I'm not convinced of the benefit of a VM in this context, and the industry as a whole seems to be tending towards the LLVM model.

However I like the idea of a supervisor/worker arrangement, with some way to pass the worker's grumbles to the supervisor rather than propagating it up the call chain in a less controlled manner.

MarkMLl
Title: Re: Parallel programming model
Post by: Warfley on July 06, 2022, 06:44:35 pm
I'm not convinced of the benefit of a VM in this context, and the industry as a whole seems to be tending towards the LLVM model.
You do not necessarily need a VM; you can build similar software threading models natively as well (there are some C userspace scheduling libraries out there that are basically ready to use), but a VM makes this much easier and cleaner, which is why Erlang went this route.
Title: Re: Parallel programming model
Post by: MarkMLl on July 06, 2022, 07:05:26 pm
You do not necessarily need a VM; you can build similar software threading models natively as well (there are some C userspace scheduling libraries out there that are basically ready to use), but a VM makes this much easier and cleaner, which is why Erlang went this route.

I'd suggest that, more importantly, at the time of its inception in the mid-80s there were still multiple commercial VMs, including (obviously) the P-System but also more obscure ones like that used on https://en.wikipedia.org/wiki/Burroughs_B1700 , a terminal range from the same stable, ICL mainframes and so on. However, this was supplanted a few years later by JIT compilation to native code, which relegated the VM bytecodes to an intermediate representation that was not interpreted directly, followed a few years later by the replacement of bytecode with a serialised representation of the AST.

I don't know whether it's a coincidence that JIT rose to prominence at roughly the same time that computing became an x86 monoculture, or that it was superseded by LLVM etc. at roughly the same time that ARM became a major challenger (with RISC-V etc. minor ones, and IBM mainframes still shuffling on with zombie-like persistence).

MarkMLl
Title: Re: Parallel programming model
Post by: SymbolicFrank on July 06, 2022, 10:38:17 pm
The main difference with ARM is that it has a lot of conditional opcodes. The good: much less branching. The bad: you have to execute both paths if you want maximum speed. But because it's just a single opcode, and the registers and execution units on CPUs are highly virtualized anyway, that's a net gain. As long as you don't want to run a long pipeline, that is.

For comparison: CMOVE on x86 is one of the major speedups.

For the big picture: a current x86 CPU is a VM that emulates one. And it is filled with VMs that emulate execution units. Because the x86 architecture and opcodes do really poorly if you want to multiplex and multitask. But that's what everyone uses, so that's what you have to run.

On a higher level, a modern PC is just like an ancient one, with a bunch of stuff tacked on. Or a collection of them, like the virtual 8086 modes of an 80386, but with less fencing. Sharing all resources.

We're used to big cores that use time slices for multitasking, with unified memory and other resources. It's like the right-wing idea of liberalism: everyone should be able to do whatever they want. Like, driving on each side of the road, however they feel like.

Your storage units (disks, memory, websites), all contain a long string of ones and zeroes. The contents are just random noise unless you know the protocol used to store them all. It is a huge chain of conventions. And processing is just a set of binary instructions, which are executed without any understanding. And they're part of that data storage as well.

So, the trick is in optimizing understanding: the hard- and software needs to know what this sequence of bits represents and how they can interact with it. And, especially, what the programmer wants to happen.

Together with the pitfalls of unified resources, as discussed above, and transistors being dirt cheap, it makes a lot more sense to partition things as far as they go. And spend half your hardware budget on optimizing random communications between all the tasks, floating around and/or in the process of being executed.
Title: Re: Parallel programming model
Post by: MarkMLl on July 06, 2022, 10:52:42 pm
The main difference with ARM is that it has a lot of conditional opcodes. The good: much less branching. The bad: you have to execute both paths if you want maximum speed. But because it's just a single opcode, and the registers and execution units on CPUs are highly virtualized anyway, that's a net gain. As long as you don't want to run a long pipeline, that is.

I know what you mean, but I suspect a better term is /predicated/... and I'm not sure it's survived in aarch64.

Quote
For the big picture: a current x86 CPU is a VM that emulates one. And it is filled with VM's that emulate execution units. Because the x86 architecture and opcodes do really poorly if you want to multiplex and multitask. But that's what everyone uses, so that's what you have to run.

Again, I think I'm with you. But if the stuff I was reading three or four years ago is still valid it's interesting that recent(ish) Intel chips appear to use microcode triplets internally, i.e. have surprising echoes of IA-64.

MarkMLl
Title: Re: Parallel programming model
Post by: Thaddy on July 08, 2022, 03:04:22 pm
Pascal does not really have great support for parallel programming.
It has, in trunk/main, since a month or two ago.
Things like this https://docwiki.embarcadero.com/RADStudio/Sydney/en/Anonymous_Methods_in_Delphi#Variable_Binding now work.
(Alas you will either have to use trunk or wait for the next major release 4.0)
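Untested sketch of how that looks (it assumes a trunk compiler with the anonymousfunctions/functionreferences modeswitches enabled, and that TThread.CreateAnonymousThread accepts the resulting function reference, as its Delphi counterpart does):

```pascal
program AnonThreadSketch;
{$mode objfpc}{$H+}
{$modeswitch anonymousfunctions}
{$modeswitch functionreferences}
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils;

var
  Captured: Integer;
  T: TThread;
begin
  Captured := 42;
  T := TThread.CreateAnonymousThread(
    procedure
    begin
      { the closure binds Captured by reference,
        as described in the Delphi docs linked above }
      WriteLn('worker sees ', Captured);
    end);
  T.Start;        // the anonymous thread frees itself on termination
  Sleep(100);     // crude: give the detached thread time to run
end.
```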

Title: Re: Parallel programming model
Post by: PascalDragon on July 08, 2022, 05:37:29 pm
Pascal does not really have great support for parallel programming.
It has, in trunk/main, since a month or two ago.
Things like this https://docwiki.embarcadero.com/RADStudio/Sydney/en/Anonymous_Methods_in_Delphi#Variable_Binding now work.
(Alas you will either have to use trunk or wait for the next major release 4.0)

Anonymous functions and function references by themselves have nothing to do with parallel programming. Sure, they can be utilized for it (like the ability to start an anonymous thread, or using an anonymous function for TThread.Queue), but in principle all of this can be implemented without anonymous functions as well (it's just a bit more involved). Warfley is talking about deeper language integration (simple example: a language-based parallel-for).
Title: Re: Parallel programming model
Post by: Warfley on July 08, 2022, 07:18:02 pm
Warfley is talking about deeper language integration (simple example: a language-based parallel-for).
Not necessarily; some of the things discussed above could be built within the language (as a library), while others would either require language extensions or be made much easier by them.

For example, a lightweight/micro-threading model could be created using a library, but of course it would feel very tacked on (as the current threading model does, through its rigid way of having to subclass TThread).
The second point, about immutability, is twofold: Pascal does not provide many tools for creating immutable, lazy-copy-based datastructures at the language level, but aside from that, one could build immutable types within the language with a bit of effort. For example, a linked list B->C->D could be amended to A1->B->C->D by one process and A2->B->C->D by another. These two processes could share the tail B->C->D because of its immutability, and only have a different pointer to the first element.
Such datastructures could (even though it is not that easy, because of memory management) be built using existing language tools, and could be provided by a library. But there simply are not many datatypes that are immutable and lazy-copy by design provided by the RTL, FCL or LCL. While the standard libraries for Free Pascal and Lazarus are very vast with respect to different datatypes, they are completely lacking in this area.
These things exist for many other languages, and with good language support it is much easier to build them; for example, in a functional language like Haskell, all recursive datatypes use the same pattern as described above for the list, without the programmer having to write any code for it.
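To make the linked-list example concrete, here is a minimal sketch of such a shared-tail list in today's Pascal. Note that it deliberately ignores the memory-management problem mentioned above: nothing frees or reference-counts the cells, which is exactly the part that is hard without language support:

```pascal
program SharedTailList;
{$mode objfpc}

type
  PCell = ^TCell;
  TCell = record
    Value: Integer;
    Next: PCell;   // never modified after construction => safe to share
  end;

{ Build a new head cell in front of an existing (immutable) tail. }
function Cons(V: Integer; Tail: PCell): PCell;
begin
  New(Result);
  Result^.Value := V;
  Result^.Next := Tail;
end;

var
  BCD, A1, A2: PCell;
begin
  { the shared immutable tail B->C->D }
  BCD := Cons(2, Cons(3, Cons(4, nil)));
  { two "amended" lists that physically share the same tail cells }
  A1 := Cons(10, BCD);
  A2 := Cons(20, BCD);
  WriteLn(A1^.Next = A2^.Next);  // True: the tail is shared, not copied
end.
```

Amending costs one allocation regardless of list length; the price is that safely freeing cells now requires reference counting or garbage collection, which the language does not provide for plain pointers.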
And well-defined communication channels do not necessarily require language support either; on a very basic level, TThread.Queue, QueueAsyncCall and Synchronize do provide such functionality, they just aren't general mechanisms for communicating between arbitrary threads, only for executing code in a particular context.
And the last point, the exception management, can also be built manually; my STAX exception model is basically quite similar. But again, there is a problem with doing it as such a hack: as exceptions are passed around different execution contexts and are raised again and again, the debugger is going to show the same message again and again, with the stack traces being completely useless after the first handover.

So, at least to a certain degree of convenience, most of this could be built as libraries, but of course it would still be much more effort, both in use as well as in debugging and maintenance, so it will never be as good as native language support.

But Pascal is a general-purpose language, and that it is worse for some task than a language that was specifically designed to make that task as easy as possible should not come as a surprise to anybody.
My personal philosophy is always to use the right tools for the right job, and as such I simply would choose the language best suited for the task. If I have to write a highly parallelized system, I would make my life as easy as possible and use a language that is optimized for exactly this sort of thing.

That said, I find the usual threading model, subclassing a thread class and doing all of the synchronization using primitive locks like critical sections, very archaic. A language-based asynchrony model (which does not necessarily require threading) could do wonders. When using threading my code is full of:
Code: Pascal
EnterCriticalSection(CS);
try
  ...
finally
  LeaveCriticalSection(CS);
end;
Something you just don't see in a lot of other languages. And not just in those with immutable types: take Java, about as far from poly-paradigm design as you can get. There you just mark a method or block as synchronized, and Java takes care of all the locking; no fancy new paradigms like immutable lazy-copy object programming needed.
Writing threaded code is my least favourite thing in all of programming. It is so easy to make mistakes, and finding them is extremely hard. And the language does nothing to help you with that, so generally I try to avoid multithreading as much as possible, or use other languages if I really have to write something threaded.
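One small mitigation that is possible in today's Pascal is wrapping the critical section in an interface-based guard, so the compiler-inserted reference-count release does the unlocking for you. A sketch (TLockGuard and Lock are invented names; note that FPC releases local interface references at routine exit, not at block exit):

```pascal
program LockGuardSketch;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} SysUtils, SyncObjs;

type
  { Acquire in the constructor, release in the destructor; the
    compiler's interface reference counting calls the destructor
    when the last reference to the guard goes away. }
  TLockGuard = class(TInterfacedObject)
  private
    FCS: TCriticalSection;
  public
    constructor Create(ACS: TCriticalSection);
    destructor Destroy; override;
  end;

constructor TLockGuard.Create(ACS: TCriticalSection);
begin
  inherited Create;
  FCS := ACS;
  FCS.Acquire;
end;

destructor TLockGuard.Destroy;
begin
  FCS.Release;
  inherited Destroy;
end;

function Lock(ACS: TCriticalSection): IInterface;
begin
  Result := TLockGuard.Create(ACS);
end;

var
  CS: TCriticalSection;

procedure TouchShared;
var
  Guard: IInterface;
begin
  Guard := Lock(CS);   // EnterCriticalSection, in effect
  { ... work on the shared state, no explicit try/finally here ... }
end;                   // Guard released on exit => LeaveCriticalSection

begin
  CS := TCriticalSection.Create;
  TouchShared;
  CS.Free;
end.
```

It removes the try/finally boilerplate but, unlike C++ RAII, the lock is only guaranteed released when the routine returns, so long routines still need care.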
Title: Re: Parallel programming model
Post by: SymbolicFrank on July 11, 2022, 01:34:12 pm
You can write a lightweight threading library with communication queues using only atomic operations, no locking. It's on my to-do list, but not with a high priority (just like a BigInt library): it's a lot of work, as long as you can do without it.
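As a taste of that atomic-operations approach, here is a lock-free stack push in Free Pascal using a compare-and-swap loop (a sketch only; it assumes the InterlockedCompareExchangePointer intrinsic from the system unit, and a matching Pop would additionally have to deal with the ABA problem and memory reclamation):

```pascal
program LockFreeSketch;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} SysUtils;

type
  PNode = ^TNode;
  TNode = record
    Value: Integer;
    Next: PNode;
  end;

var
  Top: Pointer = nil;   // head of the lock-free stack

{ Push without any lock: build the node privately, then retry the
  compare-and-swap until no other thread has raced us on Top. }
procedure Push(V: Integer);
var
  N: PNode;
  Old: Pointer;
begin
  New(N);
  N^.Value := V;
  repeat
    Old := Top;
    N^.Next := PNode(Old);
  until InterlockedCompareExchangePointer(Top, N, Old) = Old;
end;

begin
  Push(1);
  Push(2);
  WriteLn(PNode(Top)^.Value);  // last pushed value is on top
end.
```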

And, as you said, copy-on-write fits the bill much better than locking for data access anyway. And that includes reference counting, so safe pointers aren't hard that way. But that should be part of the compiler, for the best results.

The main problem with tasks is that it is hard to track their status, especially across systems. And if you use sockets for everything, you quickly run into the maximum allowed number of sockets. So point-to-point communication is probably not the best idea. And, like with NUMA, waiting for data can take a long time.

Then again, blocked tasks aren't a problem with unified memory. They don't prevent other tasks from running; they can even be swapped out to disk. On the other hand, switching a task from one core to another does incur a cache-miss penalty: flush and reload, which can take quite a bit of time. For a multi-CPU system that penalty is bigger, as you also have to take care of the level 2 or 3 cache. And it's worst in a cluster, from one system to another, because you also have to move the data around.

But an OS like Linux can already do all that, through forking and paging. And you could use streams or NFS to move those pages across sockets. It is much faster to map the same page of memory into the address space of multiple processes than it is to copy the data. And, you know what? You shouldn't count on access to that data taking a specific amount of time anyway.

The good: the OS takes care of it. You can move the whole process/task around. The bad: the size of data allocation or IPC messages is in 4k blocks.

The logical conclusion would be to give each function (our tasks) at least four pages of memory: one for code, one for data, one for the stack and one for the function result. More when needed. If the function is complete, map the result page to the process that created the task.

And that would go very well with the locality of the anonymous functions.