
Author Topic: Parallel programming model  (Read 4598 times)

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Parallel programming model
« Reply #15 on: July 05, 2022, 10:14:36 am »
Long ago, at the dawn of the Information Age, a computer had a single CPU that could run only one program at a time. If multiple people had to use it, you used time sharing: make an appointment for a time slot.

A bit later came cooperative multitasking. Which meant: we increase the frequency of the time slices quite a bit, to about 5 per second at first and thousands later. So you can load and run multiple programs at the same time, as long as they're all considerate of one another and give each other the room they need to run. That was actually quite hard, as access to many resources wasn't shared. Hey, we have no printer drivers or print spooler, so be gentle when accessing it, ok?

To make that actually work, the virtual machine was born. If you had an 80386 CPU, you could go big and wide by giving all resources to a single process, or you could go many and small by dividing the available resources into virtual 8086 machines: chop the RAM into 1 MB chunks and keep the extended functionality for the OS. As long as the OS took care of all the devices, it worked very well. That requires that the OS runs in protected/supervisor mode and that it is automatically triggered when a client tries to access something it shouldn't.

Now, chopping the memory up front into pieces that are each the maximum size a child process can access is easy to do, but it's not very efficient. It is much better if the client thinks it has all the RAM it can use, but actually only gets the RAM it does use. Most often that is done by paging: when a client process tries to access memory somewhere, the OS quickly maps a 4 kB block of RAM at that location.
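You can actually watch demand paging happen from user code. A minimal sketch, assuming a Unix-like system with FPC's BaseUnix unit (and MAP_ANONYMOUS support, which is platform-dependent): a large anonymous mapping costs almost no physical RAM until its pages are touched.

Code: Pascal
uses
  BaseUnix;

const
  OneGiB = 1024 * 1024 * 1024;
var
  p: PByte;
begin
  // Reserve 1 GiB of anonymous memory: address space only, no RAM committed yet.
  p := PByte(FpMmap(nil, OneGiB, PROT_READ or PROT_WRITE,
                    MAP_PRIVATE or MAP_ANONYMOUS, -1, 0));
  if Pointer(p) = Pointer(-1) then
    Halt(1);        // mapping failed
  // Each first touch faults in a single page (typically 4 kB of real RAM):
  p[0] := 42;       // first page materializes here
  p[4096] := 43;    // second page; the rest of the GiB stays purely virtual
  FpMunmap(p, OneGiB);
end.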

While all of that works quite well, it is a bit slow: you have to switch between user mode and supervisor mode all the time to access resources, and you have to do all that paging. And depending on the OS, communicating with other processes can be a hassle.

To fix that, threads were born: while you do get a new stack for each thread you create, and its registers are saved and restored when it is switched in and out, you share all the other resources with the main program. The good: it is faster and you only have to set everything up once. The bad: no protection! You have to supply that yourself. So, it's exactly like cooperative multitasking.

A thread is basically just a location the CPU starts executing instructions from at intervals: a single "main" function. If you exit it, the thread stops. So you normally write a loop like this:

Code: Pascal
var
  Terminated: Boolean;  // shared, so the main program can see and change it

procedure Main;
begin
  Terminated := False;
  while not Terminated do
  begin
    // the work you want to do goes here
  end;
end;

Well, that's an endless loop if you never change the state of Terminated. But because the address space is shared between your thread and the main process, both can change it: the main process can order the thread to stop by setting it to True. The good and the bad in one go, because if both try to access it at the same time, Bad Things happen.
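For what it's worth, this stop-flag pattern is what FPC's own TThread gives you out of the box: Terminate sets the protected Terminated flag, and your Execute loop polls it. A minimal sketch (TWorker is a made-up name):

Code: Pascal
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif}  // the thread manager must come first on Unix
  Classes;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
begin
  while not Terminated do         // TThread maintains this flag for you
  begin
    // the work you want to do goes here
  end;
end;

// In the main program:
//   Worker := TWorker.Create(False);  // False = start running immediately
//   ...
//   Worker.Terminate;                 // sets Terminated; the loop exits
//   Worker.WaitFor;
//   Worker.Free;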

So, the main thing with threads is: make sure they behave! Don't access any memory or resources you didn't allocate yourself within that thread. And prevent exceptions.

How to use them?

There are basically two ways: SMP (symmetric multiprocessing) and tasks. In the first, all the threads do the same thing, for the same amount of time. This is easiest to set up, but hardest to get good use out of. In the second, you create each thread when needed and give it all the resources it needs at the start. When it is done, it reports the result and disappears. More bookkeeping, with a chance of erratic threads, but it makes the best use of the available resources (like CPU time).

In the first case (SMP), you divide the RAM and the data up front (like: thread 1 does everything starting with "A..C"). Accessing resources meant for your sibling threads is a good way to crash. In the second case (tasks), you make a copy of all the data the thread is allowed to access up front, as in the sketch below.
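A minimal sketch of such a task with FPC's TThread (TSumTask and its members are made-up names): the constructor copies the input, so the running thread never touches memory owned by anyone else.

Code: Pascal
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes;

type
  TSumTask = class(TThread)
  private
    FData: array of Integer;   // the task's private copy of the input
    FSum: Int64;
  protected
    procedure Execute; override;
  public
    constructor Create(const Data: array of Integer);
    property Sum: Int64 read FSum;
  end;

constructor TSumTask.Create(const Data: array of Integer);
var
  i: Integer;
begin
  SetLength(FData, Length(Data));
  for i := 0 to High(Data) do
    FData[i] := Data[i];       // copy up front; no sharing afterwards
  inherited Create(False);     // start immediately
end;

procedure TSumTask.Execute;
var
  i: Integer;
begin
  FSum := 0;
  for i := 0 to High(FData) do
    Inc(FSum, FData[i]);       // touches only memory the task owns
end;

// Usage:
//   Task := TSumTask.Create(Numbers);
//   Task.WaitFor;      // the task reports its result...
//   WriteLn(Task.Sum);
//   Task.Free;         // ...and disappears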

That's about it. If you feel you need a lot of mechanisms to partition things, like critical sections, or to communicate, like semaphores, it's probably better to revise the design and see if you can do it without them.

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Parallel programming model
« Reply #16 on: July 05, 2022, 11:48:33 am »
And in the second case (tasks), you make a copy of all the data the thread is allowed to access up front.
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Parallel programming model
« Reply #17 on: July 05, 2022, 12:38:03 pm »
And in the second case (tasks), you make a copy of all the data the thread is allowed to access up front.
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.
In that case, they'll have to queue to access it. One by one. You just serialized your application.

Tasks are unpopular because they require thinking differently. See it as asynchronous programming with callbacks.
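A minimal sketch of that style, again with TThread (TCallbackTask and TResultCallback are made-up names); note that the callback fires on the worker thread, so whatever it does must be thread-safe:

Code: Pascal
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes;

type
  TResultCallback = procedure(Value: Int64);

  TCallbackTask = class(TThread)
  private
    FInput: Int64;               // private copy of the input
    FOnDone: TResultCallback;
  protected
    procedure Execute; override;
  public
    constructor Create(Input: Int64; OnDone: TResultCallback);
  end;

constructor TCallbackTask.Create(Input: Int64; OnDone: TResultCallback);
begin
  inherited Create(True);        // create suspended
  FInput := Input;
  FOnDone := OnDone;
  FreeOnTerminate := True;       // report, then disappear: it frees itself
  Start;
end;

procedure TCallbackTask.Execute;
begin
  // ... do the actual work on the private copy ...
  if Assigned(FOnDone) then
    FOnDone(FInput * 2);         // deliver the result via the callback
end;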

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Parallel programming model
« Reply #18 on: July 05, 2022, 01:15:33 pm »
You just serialized your application.
No. Protecting a piece of data with a synchronization object does not serialize an application. It only serializes access to that one piece of data, and that should be for an _extremely short_ time.
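A minimal sketch with FPC's TCriticalSection from the SyncObjs unit (Counter and BumpCounter are made-up names): competing threads queue only for the few instructions inside the lock, nothing else is serialized.

Code: Pascal
{$mode objfpc}
uses
  SyncObjs;

var
  Counter: Integer;
  CounterLock: TCriticalSection;  // protects Counter and nothing else

procedure BumpCounter;
begin
  CounterLock.Acquire;   // competing threads queue here, but only
  try                    // for the duration of the increment
    Inc(Counter);
  finally
    CounterLock.Release;
  end;
end;

// At startup: CounterLock := TCriticalSection.Create;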

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Parallel programming model
« Reply #19 on: July 05, 2022, 01:56:12 pm »
[a lot of sensible stuff]

But that, of course, is where the "Itanic" failed: the difficulty of an efficient toolchain.


Hmm, I tend to think the Itanium had a reasonable toolchain: Intel supplied its OpenMP code, and HP purchased Compaq so they could get the DEC compiler technology (and kill off the Alpha chip). The problem with the Itanium was its power consumption. A single Itanium-powered box was one thing, but a rack of them, in full flight, drained whole power stations. We operated a smallish cluster for an auto maker back then and had little trouble getting existing codes running on it. Difficult to say how efficiently, but it did post reasonable Linpack numbers. But scaling it up, supplying power to it and cooling it made it a very poor solution. I was glad to see it go from my machine room, long before its expected lifetime.

But its real offense was killing the Alpha ....

Davo

Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Parallel programming model
« Reply #20 on: July 05, 2022, 02:18:01 pm »
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

Generally speaking that's something that operating systems worked out fairly early: when a block of code or data is copied, nothing really happens except that the original area is marked read-only. Then, when one of the processes/threads wants to change it, an actual copy is made in physical memory and both revert to their original permissions. The same happens at each of the cache levels.

That sort of arrangement fails if the OS doesn't have an accurate picture of which areas of memory are close to which CPUs, which is why explicit processor and interrupt affinity settings were added. It's a while since I've played with that sort of thing but the larger 32-bit Suns were a particular example: each comprised multiple identical cards with CPU, memory and a selection of I/O.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: Parallel programming model
« Reply #21 on: July 05, 2022, 02:23:53 pm »
Having multiple copies of some data is usually a very bad idea because it requires keeping all those copies in synch.  Database theory figured that out long ago.  A program is better off having a single instance of the data and synchronization mechanisms to ensure every thread sees current and accurate information.

Generally speaking that's something that operating systems worked out fairly early: when a block of code or data is copied, nothing really happens except that the original area is marked read-only. Then, when one of the processes/threads wants to change it, an actual copy is made in physical memory and both revert to their original permissions. The same happens at each of the cache levels.

This applies automatically only to memory where the OS can determine that it would be useful (e.g. the whole address space for fork on Unix systems, or NtCreateProcess on Windows, which can be used to implement fork). For your own code, however, you'd need to manually ask the OS for such a copy-on-write mapping (assuming it supports the mechanism at all).
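A minimal sketch of the automatic case, assuming a Unix system and FPC's BaseUnix unit: after FpFork both processes see the same physical pages, and the actual copy only happens when one of them writes.

Code: Pascal
uses
  BaseUnix, UnixType;

var
  Data: array[0..1023] of Integer;
  pid: TPid;
  status: cint;
begin
  Data[0] := 42;
  pid := FpFork;        // the child gets a copy-on-write view of Data
  if pid = 0 then
  begin
    Data[0] := 7;       // first write: the kernel copies this page now
    FpExit(0);
  end;
  FpWaitPid(pid, status, 0);
  WriteLn(Data[0]);     // prints 42: the child wrote to its own copy
end.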

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Parallel programming model
« Reply #22 on: July 05, 2022, 02:24:30 pm »
Generally speaking that's something that operating systems worked out fairly early:...
I don't think his comment was directed at how an O/S manages code and data (copy-on-write and that kind of stuff), just run-of-the-mill user threads: he suggested making copies of the data for different threads to access. The O/S isn't going to poke its fingers in there, and having multiple copies of a piece of data is almost always a bad idea (if there is an exception, I cannot think of it at this time).


(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Parallel programming model
« Reply #23 on: July 05, 2022, 03:37:03 pm »
Generally speaking that's something that operating systems worked out fairly early:...
I don't think his comment was directed at how an O/S manages code and data (copy on write and that kind of stuff.)  Just run of the mill user threads and, he suggested making copies of the data for different threads to access.  The O/S isn't going to poke its fingers in there and, having multiple copies of a piece of data is likely always a bad idea (if there is an exception, I cannot think of it at this time.)
Unified memory is the culprit. As MarkMLl and PascalDragon said, you're best off with local RAM (resources in general) for maximum parallelism. The best example is probably how the Cell CPU did it: move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent microcontrollers do it: all devices have their own micro-microcontroller and local storage, RAM is segmented and stuff moves around through DMA transfers. It might sound horrible for bandwidth and storage requirements, but it is the best way to maximize throughput.

(That's why it is strange that Gabe Newell was very vehement that programming like that was a waste of everyone's time ;) )
« Last Edit: July 05, 2022, 03:39:03 pm by SymbolicFrank »

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Parallel programming model
« Reply #24 on: July 05, 2022, 10:24:12 pm »
Unified memory is the culprit. As MarkMLl and PascalDragon said, you're best off with local RAM (resources in general) for maximum parallelism. The best example is probably how the Cell CPU did it: move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent micro-controllers do it. All devices have their own micro-micro-controller and local storage, RAM is segmented and stuff moves around through DMA transfers. It might sound horrible for bandwidth and storage requirements, but it is the best way to maximize throughput.

(That's why it is strange that Gabe Newell was very vehement that programming like that was a waste of everyone's time ;) )

I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Parallel programming model
« Reply #25 on: July 06, 2022, 01:39:06 am »
Generally, in a large MPI application (and we are talking about parallel here, folks) one process or even one node is appointed the data server; the individual "compute nodes" get, and maybe put back, their data there. Each compute node pulls down only the data it needs and works on a local copy. The interconnect is fast, but nowhere near as fast as local access.

Alternatively, in some apps, the compute processes are given their data by the master process at startup.

But in either case, nodes using a lot of data would have their own copy of it (there are exceptions, of course). Because a node will have multiple processes, and good MPI apps make no assumptions about neighbouring processes, duplicate copies do sometimes exist on one node. It's possible to optimize that, but doing so means your app must tell the scheduler it wants particular patterns of processors on nodes. And that often means your job will wait, queued up, for a lot longer before it starts.

Davo

Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Parallel programming model
« Reply #26 on: July 06, 2022, 10:13:49 am »
I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl

The big picture.

In detail.

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Parallel programming model
« Reply #27 on: July 06, 2022, 10:37:46 am »
I don't recognise that paradigm in the context of what are generally understood to be microcontrollers. Do you have an example?

MarkMLl

The big picture.

In detail.

I feel that you are reading much more into that material than is justified. All it's basically saying is that the STM32 (an ARM derivative) has DMA, that it can be used for transfers to or from peripheral devices, and that a peripheral (including a DMA controller) accessing main memory might have additional implications if the hardware has one or more layers of cache.

It's obviously noteworthy that low-end devices like that have cache and DMA, if for no reason other than that it introduces uncertainties in opcode timing. But I don't agree that the links you've given can be interpreted as describing

... move all needed data to local storage with DMA (including the software), run the task and move the result back to main memory. And that's also how most recent micro-controllers do it. All devices have their own micro-micro-controller and local storage, RAM is segmented and stuff moves around through DMA transfers.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Parallel programming model
« Reply #28 on: July 06, 2022, 11:19:24 am »
So, how would you incorporate a UART in your microcontroller application? Using interrupts and reading/writing the single byte in the UART register?

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Parallel programming model
« Reply #29 on: July 06, 2022, 12:01:08 pm »
So, how would you incorporate a UART in your microcontroller application? Using interrupts and reading/writing the single byte in the UART register?

Depends, among other things, on the buffering capability of the peripheral concerned, but also on clarity and maintainability.

But DMAing to a UART is a long way short of the level of sophistication that you implied. And it's particularly removed from the sort of thing the OP is asking about, which can be loosely interpreted as relating to programming models where background activity can both read and write potentially-shared areas of memory.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories
