Topic: assembler; + inline; How do I make friends? :-)

simsee

« Reply #15 on: September 29, 2022, 11:51:48 pm »
> What about this version without any assembler?
> Code: Pascal
> procedure StrSwap(var S1, S2: widestring); inline;
> var
>   S3: Pointer;
> begin
>   // Exchange only the two string pointers; no character data is copied.
>   S3 := Pointer(S2);
>   Pointer(S2) := Pointer(S1);
>   Pointer(S1) := S3;
> end;

Very interesting example. One aspect that is not clear to me: why does casting a string to a pointer return the pointer to the string itself? Thank you.

beria

« Reply #16 on: September 30, 2022, 12:04:22 am »

> And trust the RTL and some tuned libraries to be fast enough for most of the purposes.
I already know of some completely irrational, almost religious things in the RTL. They have burned me before, and I had to implement the functionality myself....
By the way, the trick with the Pointer is also not a standard approach.
Anyway, I still don't understand why there is such an artificial restriction on assembler, and I couldn't even find it described in the manual...  I remember that practically in my childhood I used to insert "inline code", generally as raw processor opcodes, into Turbo Pascal on the old IBM PC-XT. And it all worked.
Anyone who uses it already knows the risks and knows what he's doing. If you ban everything dangerous, it would be simpler to ban all pointers, as some super-high-level scripting languages do......

beria

« Reply #17 on: September 30, 2022, 12:06:52 am »
> Very interesting example. One aspect that is not clear to me: why does casting a string to a pointer return the pointer to the string itself? Thank you.
It's very simple. A WideString is just an ordinary pointer to a binary structure that is described in the manual....
It is very similar to a PChar: it is also only a pointer, but the data it points to carries its own length instead of relying only on a null terminator at the end, as PChar does.
« Last Edit: September 30, 2022, 12:14:15 am by beria »

jamie

« Reply #18 on: September 30, 2022, 02:23:17 am »
InterlockedExchangePointer?

abouchez

« Reply #19 on: September 30, 2022, 01:53:06 pm »
> What about this version without any assembler?
> Very interesting example. One aspect that is not clear to me: why does casting a string to a pointer return the pointer to the string itself? Thank you.
If you assign one WideString variable to another on Windows, it uses OleStr/BSTR assignment, which allocates a new buffer from the slow Windows global heap and then copies all the UTF-16 code units.
So it is a very slow process, and it is not needed to swap variables. You end up allocating 3 BSTR instances!
(Note that on Linux/POSIX, WideString does not use BSTR but is an alias for UnicodeString in FPC, so it uses reference counting and will be MUCH faster.)

If you typecast with Pointer(), only the pointer itself is copied. This works because FPC (and Delphi) store AnsiString, UnicodeString, WideString, interface and dynamic array variables as pointers.
That is as fast as assembly, and perfect for swapping two variables.
This is exactly what the initial asm does.
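
A minimal sketch of the point above, assuming FPC in objfpc mode. It uses UnicodeString so that it behaves the same on every platform; SwapStr, A, B, PA and PB are names made up for this illustration, not something from the RTL.

```pascal
program SwapDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: exchanges two strings by swapping only the pointers
// stored in the variables. No character data is copied, and no reference
// count has to change because each buffer keeps exactly one owner.
procedure SwapStr(var S1, S2: UnicodeString); inline;
var
  Tmp: Pointer;
begin
  Tmp := Pointer(S1);
  Pointer(S1) := Pointer(S2);
  Pointer(S2) := Tmp;
end;

var
  A, B: UnicodeString;
  PA, PB: Pointer;
begin
  A := 'first';
  B := 'second';
  // Remember where each variable pointed before the swap.
  PA := Pointer(A);
  PB := Pointer(B);
  SwapStr(A, B);
  // After the swap each variable holds the other's original buffer address.
  WriteLn('A now holds the old buffer of B: ', Pointer(A) = PB);
  WriteLn('B now holds the old buffer of A: ', Pointer(B) = PA);
  WriteLn('Contents swapped: ', (A = 'second') and (B = 'first'));
end.
```

The typecast on the left-hand side is what keeps the compiler from generating a managed string assignment, so the swap costs three pointer moves.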

abouchez

« Reply #20 on: September 30, 2022, 02:06:02 pm »
> I already know of some completely irrational, almost religious things in the RTL. They have burned me before, and I had to implement the functionality myself....
Fair enough. This is why we recoded most of the RTL within mORMot. ;)
(also for a perfect switch between FPC and Delphi, because their RTLs are not always compatible)

> By the way, the trick with the Pointer is also not a standard approach.
It is a perfectly valid and standard approach, perfectly documented and used in several places in the RTL. It is also common to FPC and Delphi for AnsiString, UnicodeString, WideString, dynamic arrays and interfaces.

> Anyway, I still don't understand why there is such an artificial restriction on assembler, and I couldn't even find it described in the manual...  I remember that practically in my childhood I used to insert "inline code", generally as raw processor opcodes, into Turbo Pascal on the old IBM PC-XT. And it all worked.
I see several reasons:
1) On 8086/8087 it made sense because asm was much more needed, e.g. to call the OS, or for better performance, since the TP compiler was fast but did not optimize much.
2) Delphi followed an even worse pattern: asm..end blocks are not allowed inside begin..end blocks on Win64, and asm is not available at all on ARM/AARCH64.
3) Register allocation is hard work for the compiler, and pure Pascal code is easier to optimize when inlining than opaque assembly. One obvious example is constant propagation.
4) In practice, an algorithm with some loops or complex opcodes (e.g. AVX2) is better off in its own sub-function than inlined. There you can verify the AVX2 register allocation and the mandatory vzeroupper opcodes.
5) Making asm work on several OSes can be a PITA. You won't be able to write anything more complex than your exchange sample; otherwise you are likely to get stuck on the ABI differences (calling convention, volatile registers...).
6) I already wrote about the right way to use complex opcodes from the compiler's point of view: use intrinsics, not manual asm. No one writes huge amounts of inlined asm in VC/GCC; they use intrinsics.
7) Just look at the asm I wrote in mORMot, and you will see it is not so easy to write such assembly. And to be honest, I never needed inlined asm for real programming. But intrinsics might have helped.
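
To illustrate reason 3, here is a small sketch of a pure Pascal inline function called with a constant argument. Scale is a hypothetical helper invented for this example. Whether the compiler really folds the constant depends on the FPC version and the optimization settings, so treat the comment as what the optimizer is allowed to do, not a guarantee.

```pascal
program InlineConstDemo;
{$mode objfpc}

// Hypothetical helper, written in plain Pascal so the optimizer can see
// through it once it is inlined at the call site.
function Scale(Value, Factor: LongInt): LongInt; inline;
begin
  Result := Value * Factor;
end;

var
  X: LongInt;
begin
  X := 5;
  // Once inlined, the call below is simply "X * 8". The compiler knows the
  // constant and may replace the multiplication with a cheaper shift.
  // An opaque asm body would hide both the operation and the constant.
  WriteLn(Scale(X, 8));
end.
```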
« Last Edit: September 30, 2022, 02:09:23 pm by abouchez »

PascalDragon

  • Compiler Developer
« Reply #21 on: September 30, 2022, 03:47:47 pm »
> > Assembler routines can not be inlined, because the compiler has no real knowledge about what you're doing in that assembly code and what side effects some instruction might have. This would cause problems when the compiler assumes a certain state. A function call boundary protects here in most cases and thus assembly code either needs to be in an asm-end block (preferably with a list of modified registers) or it needs to be in a separate function which simply will not be inlined.
> Understandable and very reasonable.  However, it would be nice if there was a way to tell the compiler "inline it anyway, I take full responsibility".

We won't add something like that, because users will abuse it and then complain if something goes wrong.

> Anyway, I still don't understand why there is such an artificial restriction on assembler, and I couldn't even find it described in the manual...

The inline directive is merely a hint for the compiler. It is in no way required to inline a function and if for whatever reason it decides that it won't or can't do it then it won't. And currently one of the reasons not to inline a function is if it's an assembly function or contains an assembly block.

> I remember that practically in my childhood I used to insert "inline code", generally as raw processor opcodes, into Turbo Pascal on the old IBM PC-XT. And it all worked.

FPC also allows you to write inline assembly for all targets (compared to Delphi as abouchez wrote), however if you don't add a register list to an assembly block then the generated code might have worse performance than expected due to potential unnecessary spilling of registers to the stack.
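
A small sketch of what such a register list looks like, assuming an x86 or x86-64 target and Intel assembler syntax. AddOne, Tmp and the HAVE_X86_ASM symbol are hypothetical names for this example; on other CPUs the sketch falls back to plain Pascal so it still compiles.

```pascal
program AsmRegListDemo;
{$mode objfpc}

{$if defined(CPUX86_64) or defined(CPUI386)}
  {$asmmode intel}
  {$define HAVE_X86_ASM}
{$endif}

// Hypothetical example: increments a value inside an asm block that is
// embedded in an ordinary begin..end body.
function AddOne(Value: LongInt): LongInt;
var
  Tmp: LongInt;
begin
  Tmp := Value;
  {$ifdef HAVE_X86_ASM}
  asm
    mov eax, Tmp
    inc eax
    mov Tmp, eax
  end ['EAX'];   // register list: only EAX is modified by this block
  {$else}
  Inc(Tmp);      // portable fallback for non-x86 targets
  {$endif}
  Result := Tmp;
end;

begin
  WriteLn(AddOne(41));
end.
```

With the list in place the compiler only has to assume that EAX was changed, instead of treating every register as potentially modified.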

beria

« Reply #22 on: October 01, 2022, 12:35:55 am »
> > I already know of some completely irrational, almost religious things in the RTL. They have burned me before, and I had to implement the functionality myself....
>
> 7) Just look at the asm I wrote in mORMot, and you will see it is not so easy to write such assembly. And to be honest, I never needed inlined asm for real programming. But intrinsics might have helped.

I do about the same thing myself. For example, in the telemetry server for industrial equipment that I am currently working on, I exchange values between data buffers with a very simple procedure... It is not even the best one, because it does not take data alignment into account.


Code: Pascal
// Exchange 32 bytes between the buffers at P1 and P2 using two AVX registers.
// VMOVDQU is the unaligned move, so the buffers need no 32-byte alignment.
procedure Swap32(P1, P2: Pointer);
asm
         VMOVDQU ymm0, YMMWORD PTR [P1]   // load 32 bytes from P1
         VMOVDQU ymm1, YMMWORD PTR [P2]   // load 32 bytes from P2
         VMOVDQU YMMWORD PTR [P2], ymm0   // store the old P1 contents to P2
         VMOVDQU YMMWORD PTR [P1], ymm1   // store the old P2 contents to P1
end;

beria

« Reply #23 on: October 01, 2022, 12:50:59 am »


> We won't add something like that, because users will abuse it and then complain if something goes wrong.


This is certainly not my business, but assembler in code is not at all widespread. It is usually written, firstly, only by someone who actually knows the assembler of the specific processors (which, by the way, I don't particularly claim to), and secondly, only when the standard means cannot do something at all or the standard implementation, however universal, is extremely bad. Another concrete example in FPC, which I ran into much earlier, is the complex arithmetic library, which performs exceptionally badly because it does not use AVX. Rewriting just the few functions you need changes everything instantly, because the performance difference is almost 100 times...

PascalDragon

  • Compiler Developer
« Reply #24 on: October 01, 2022, 07:34:05 pm »
> Another concrete example in FPC, which I ran into much earlier, is the complex arithmetic library, which performs exceptionally badly because it does not use AVX. Rewriting just the few functions you need changes everything instantly, because the performance difference is almost 100 times...

FPC also targets x86 systems that do not provide AVX. If you have improvements that also work with older processors or that provides a dynamic detection then you can report an issue together with patches or provide a merge request.

beria

« Reply #25 on: October 03, 2022, 05:15:44 pm »
> > Another concrete example in FPC, which I ran into much earlier, is the complex arithmetic library, which performs exceptionally badly because it does not use AVX. Rewriting just the few functions you need changes everything instantly, because the performance difference is almost 100 times...
>
> FPC also targets x86 systems that do not provide AVX. If you have improvements that also work with older processors or that provides a dynamic detection then you can report an issue together with patches or provide a merge request.
This is done very simply with conditional compilation flags.
It's enough to have flags not only for win64/win32/linux/macos, but also for AVX, AVX2 and AVX512, set globally at the compiler core level. Then any procedure, including the system ones, can be implemented in several variants.  Moreover, I have to do exactly this right now because of performance problems... And once the flags are set, you can even compile just the base code for a processor with AVX, which is also needed. That is the first point.
Second, as in C++, you can have an "inline" library of elementary vector operations that lets you program AVX without dropping down to assembler.  An example is immintrin.h, and the complete set is not even important: in my experience, fewer than 5% of the instructions are used massively. But the possibility of implementing this (inline + assembler) is artificially turned off in FPC for reasons I can't understand, although there are implementations of this deep inside the compiler...
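
A rough sketch of the first idea, selecting a variant at compile time. USE_AVX is a hypothetical symbol chosen for this example (it could be passed as -dUSE_AVX on the fpc command line); it is not a define that FPC sets by itself, and SumValues and VariantName are likewise invented here.

```pascal
program CondCompileDemo;
{$mode objfpc}

// USE_AVX is a hypothetical user-defined symbol, e.g. "fpc -dUSE_AVX ...".
const
{$ifdef USE_AVX}
  VariantName = 'AVX variant';
{$else}
  VariantName = 'portable variant';
{$endif}

function SumValues(const Data: array of Double): Double;
var
  I: Integer;
begin
  Result := 0;
  {$ifdef USE_AVX}
  // An AVX-specific code path would be placed here. This sketch keeps the
  // portable loop for both variants so that it runs on any CPU.
  {$endif}
  for I := Low(Data) to High(Data) do
    Result := Result + Data[I];
end;

begin
  WriteLn(VariantName, ': ', SumValues([1.0, 2.0, 3.5]):0:1);
end.
```

The choice is fixed when the program is compiled: a binary built with the symbol defined always takes the AVX path, whatever CPU it later runs on.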



Thaddy

« Reply #26 on: October 03, 2022, 05:59:46 pm »
The latter is nonsense for Intel types since you would have to work around the cpuid instruction, which not all supported Xxxx intel family support.
Specialize a type, not a var.

beria

« Reply #27 on: October 04, 2022, 10:48:56 am »
> The latter is nonsense for Intel types since you would have to work around the cpuid instruction, which not all supported Xxxx intel family support.

???? As far as I know, you can always get information about the supported instruction set, for any Intel processor family.  That is, the compiler, depending on its revision for a particular type of processor, always knows this...  And then it's up to the programmer whether or not to use conditional compilation... What's more, the compiler already has...

Code: Pascal
   fputypestr : array[tfputype] of string[7] = (
     'NONE',
// 'SOFT',
     'SSE64',
     'SSE3',
     'SSSE3',
     'SSE41',
     'SSE42',
     'AVX',
     'AVX2',
     'AVX512F'
   );


PascalDragon

  • Compiler Developer
« Reply #28 on: October 04, 2022, 01:25:13 pm »
> > FPC also targets x86 systems that do not provide AVX. If you have improvements that also work with older processors or that provides a dynamic detection then you can report an issue together with patches or provide a merge request.
> This is done very simply with conditional compilation flags.

That's why I wrote “or that provides a dynamic detection”. But if there isn't a solution that covers that from the beginning then we will not even think about integrating it.

> But the possibility of implementing this (inline + assembler) is artificially turned off in FPC for reasons I can't understand, although there are implementations of this deep inside the compiler...

We have explained our reasons and we stand by those reasons and will not make exceptions for them. And I don't know what you mean with “implementations of this deep inside the compiler”.

> The latter is nonsense for Intel types since you would have to work around the cpuid instruction, which not all supported Xxxx intel family support.

The CPUID instruction was introduced with the 486 processors and any Intel compatible CPU since then supports this instruction. So for the context of this discussion (namely accelerated functions) you can simply fall back to the non-accelerated code path if there is no CPUID instruction available, because none of the acceleration functionalities in question will be available either.

> That is, the compiler, depending on its revision for a particular type of processor, always knows this...  And then it's up to the programmer whether or not to use conditional compilation... What's more, the compiler already has...

That is for compile time. However, the default RTL we distribute also needs to be able to run on systems that don't support that (and not everyone will recompile the RTL with their preferred setting). And thus it either needs to be compiled with the lowest setting for the oldest supported processor or it needs to have runtime detection.
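
One way such runtime detection could be structured is sketched below: the program picks an implementation once at startup and calls it through a procedural variable. This is only an illustration, not how the FPC RTL does it. CpuHasAvx is a hypothetical placeholder that always returns False here (a real probe would query CPUID and the OS support for the AVX state), and SumAccelerated merely stands in for an AVX routine.

```pascal
program RuntimeDispatchDemo;
{$mode objfpc}

type
  TSumFunc = function(const Data: array of Double): Double;

// Baseline implementation that runs on every supported CPU.
function SumPortable(const Data: array of Double): Double;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Data) to High(Data) do
    Result := Result + Data[I];
end;

// Stand-in for an accelerated routine. It reuses the portable code so the
// sketch stays runnable everywhere.
function SumAccelerated(const Data: array of Double): Double;
begin
  Result := SumPortable(Data);
end;

// Hypothetical probe: placeholder that reports "no AVX".
function CpuHasAvx: Boolean;
begin
  Result := False;
end;

var
  SumImpl: TSumFunc;

begin
  // Decide once at startup, then always call through the variable.
  if CpuHasAvx then
    SumImpl := @SumAccelerated
  else
    SumImpl := @SumPortable;
  WriteLn(SumImpl([1.0, 2.0, 3.5]):0:1);
end.
```

The same binary then runs on old and new processors, at the cost of one indirect call per use.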

 
