
Author Topic: Windows API and wide/unicode strings  (Read 4215 times)

Nadar

  • New Member
  • *
  • Posts: 17
Windows API and wide/unicode strings
« on: November 22, 2023, 06:36:30 am »
I need to make a small NSIS plugin ("C style DLL"), and chose Lazarus/FPC for the job because my C/C++ knowledge is terrible. I don't write much Pascal these days, but I used to code in Delphi back in the day.

I'm sure this has a very simple solution, but I seem to be incapable of figuring it out. I need to build both "Unicode" and "Ansi" versions of the DLL, since NSIS itself can run in both modes. "Unicode" here means that all strings are Windows wide strings, i.e. 16-bit chars. My problem is how to make sure that all "external" string handling uses one or the other type based on the chosen build mode. I've defined a custom compiler option "-dUNICODE" for the "Unicode" build, which seems to make some things happen.

But not all. In particular, I'm currently struggling with "TLVITEM" defined in "struct.inc". It points to "LV_ITEM" which is a record where one of the fields is a string - "pszText". It is defined as a "LPTSTR" - which is defined as either "Pchar" or "Pwidechar" depending on the "UNICODE" define. But, whatever I do, I can't seem to get it to actually USE that "UNICODE" version, so I get a compile error if I try to assign a PWideChar to "pszText" in "Unicode build mode".

I've searched and read lots and lots about string types, encodings etc - with lots of discussions about the different encodings - but I've not managed to find anything that points me in the right direction as to how to get the compiler to "use the types I want".

Any help/hints are appreciated.

AlexTP

  • Hero Member
  • *****
  • Posts: 2488
    • UVviewsoft
Re: Windows API and wide/unicode strings
« Reply #1 on: November 22, 2023, 06:50:25 am »
Maybe you can show _small_ project, which has problems you told?
And: -dUnicode define applies to your project, not to LCL.

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #2 on: November 22, 2023, 04:38:39 pm »
As to the example, I think https://nsis.sourceforge.io/Examples/Plugin/nsis.pas shows the essentials. This unit takes care of the "interface" between the NSIS caller and the DLL - all the rest is essentially "normal Pascal code". I'm currently trying to make "nsis.pas" work correctly with and without wide strings - and the location which is commented with "Unicode bug?" is the location that gives the compile error because of the type "confusion" if I try to cast to for example PWideChar instead of PChar (the cast might not be "valid", I don't really have enough overview over implicit conversions to tell - but the problem is the same if I create a new PWideChar local variable and try to assign it).

For completeness, here's an example code that uses the "nsis.pas" unit - but I really don't think that is relevant to this problem: https://nsis.sourceforge.io/Examples/Plugin/exdll_with_unit.dpr

Quote from: AlexTP
And: -dUnicode define applies to your project, not to LCL.

This might be something. As I'm quite unfamiliar with Lazarus, the LCL, RTL etc. references are confusing. I don't know which files are and aren't part of these "components". Since this plugin only provides logic/functions, it has no graphics, so I'd think that I don't use the LCL at all? I don't use any "components" or have any forms. The RTL, on the other hand, I assume is always involved somehow, and if "struct.inc" and "base.inc" are precompiled, it explains why nothing I do has any impact. But that raises the question of why "base.inc" uses "Unicode ifdefs" if there's no way to "activate" them. The definition of "LPTSTR" seems to be the crux of this exact problem, lines 185-190 in "my version" of "base.inc".

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #3 on: November 22, 2023, 06:40:56 pm »
If the problem is the precompiled parts, is there anything I can do to make that code compile with the "UNICODE" define "active"? Or do I have to define my own structures, API calls etc. "in parallel" and not actually use the standard libs to make this work?

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1432
    • Lebeau Software
Re: Windows API and wide/unicode strings
« Reply #4 on: November 23, 2023, 09:49:17 pm »
Quote from: Nadar
But not all. In particular, I'm currently struggling with "TLVITEM" defined in "struct.inc". It points to "LV_ITEM" which is a record where one of the fields is a string - "pszText". It is defined as a "LPTSTR" - which is defined as either "Pchar" or "Pwidechar" depending on the "UNICODE" define. But, whatever I do, I can't seem to get it to actually USE that "UNICODE" version, so I get a compile error if I try to assign a PWideChar to "pszText" in "Unicode build mode".

I may be wrong here, but TLVITEM is likely pre-compiled based on whether the RTL/LCL itself is built in ANSI or UNICODE mode, regardless of whether your project is built in ANSI or UNICODE mode.  That would explain the error you are seeing.

So, you will have to instead use TLVITEMA/LPSTR or TLVITEMW/LPWSTR explicitly, based on your project's build mode.
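
A minimal sketch of that approach, assuming the project is built with or without -dUNICODE and that the CommCtrl unit shipped with FPC exposes TLVITEMA and TLVITEMW. The TPlugin*/PPlugin* alias names are hypothetical and only illustrate keeping the {$IFDEF} in one place:

```pascal
program lvitem_alias;
{$mode objfpc}{$H+}

uses
  Windows, CommCtrl;

type
{$IFDEF UNICODE}
  TPluginChar   = WideChar;
  PPluginChar   = PWideChar;
  TPluginLVItem = TLVITEMW;   // explicit wide record (assumed to come from CommCtrl)
{$ELSE}
  TPluginChar   = AnsiChar;
  PPluginChar   = PAnsiChar;
  TPluginLVItem = TLVITEMA;   // explicit ANSI record (assumed to come from CommCtrl)
{$ENDIF}

var
  Item: TPluginLVItem;
  Buf: array[0..63] of TPluginChar;
begin
  FillChar(Item, SizeOf(Item), 0);
  Buf[0] := 'O';
  Buf[1] := 'K';
  Buf[2] := #0;
  Item.mask       := LVIF_TEXT;
  Item.pszText    := @Buf[0];      // matches the record's pointer type in both builds
  Item.cchTextMax := Length(Buf);  // capacity counted in characters, not bytes
  WriteLn('pszText element size: ', SizeOf(Item.pszText^), ' byte(s)');
end.
```

The rest of the code can then refer only to the aliases, so the build mode is decided in a single type section.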
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #5 on: November 29, 2023, 05:41:25 am »
Thanks Remy - I wasn't notified of your reply for some reason, but it's what I came to as well. Basically, all the structs, types and methods that can be either ANSI or Wide point to the ANSI version in the RTL. So, to make things work, I have to explicitly call the "W" versions of the methods and use the "wide" types based on my "build mode".

This, although a bit "cumbersome" (lots of {$IFDEFS}), works well now. String handling on the other hand is still not solved.
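
One way to keep the number of {$IFDEF}s down is to hide the A/W choice behind thin wrappers. A rough sketch, assuming the Windows unit declares lstrlenA/lstrlenW and GetModuleFileNameA/GetModuleFileNameW; the Plugin* names are made up for illustration:

```pascal
program explicit_aw;
{$mode objfpc}{$H+}

uses
  Windows;

type
{$IFDEF UNICODE}
  TPluginChar = WideChar;
  PPluginChar = PWideChar;
{$ELSE}
  TPluginChar = AnsiChar;
  PPluginChar = PAnsiChar;
{$ENDIF}

// Length in characters, via the explicit A or W entry point.
function PluginStrLen(S: PPluginChar): Integer;
begin
{$IFDEF UNICODE}
  Result := lstrlenW(S);
{$ELSE}
  Result := lstrlenA(S);
{$ENDIF}
end;

// Path of the running executable; BufChars is the buffer capacity in characters.
function PluginModulePath(Buf: PPluginChar; BufChars: DWORD): DWORD;
begin
{$IFDEF UNICODE}
  Result := GetModuleFileNameW(0, Buf, BufChars);
{$ELSE}
  Result := GetModuleFileNameA(0, Buf, BufChars);
{$ENDIF}
end;

var
  Path: array[0..MAX_PATH - 1] of TPluginChar;
begin
  if PluginModulePath(@Path[0], Length(Path)) > 0 then
    WriteLn('Path length in characters: ', PluginStrLen(@Path[0]));
end.
```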

I'm trying to twist my mind into finding a way to basically disable implicit codepage conversion, or at least prevent it - but I haven't yet seen the light. Using LazUTF8 isn't really an option since I need the "ANSI version" of the DLL to leave everything in its codepage encoding. At the same time, I need the "Wide version" to maintain whatever Unicode information it comes across. I would also prefer to keep the DLL as small as possible, and not include anything that isn't strictly necessary.

Since "everything" in FPC seems to be built around "8 bit strings", it would seem logical for me to convert everything between UTF-8 and UTF-16 very close to the Win API calls, and then use normal string handling internally. But the same code (except the UTF-16/UTF-8 conversion, which would happen within ifdefs) would also need to deal with the system codepage encoded ANSI strings. This would all work out nicely as far as I can tell if I could prevent implicit conversion, since I'm not going to go down the rabbit hole of string sorting etc.
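
A sketch of what "convert close to the API boundary" could look like, using only UTF8Encode/UTF8Decode from the System unit. The helper names WideToInternal and InternalToWide are hypothetical, and this only covers the "Unicode" build; the ANSI build would skip these calls:

```pascal
program boundary_conv;
{$mode objfpc}{$H+}

// Incoming: UTF-16 text from the caller becomes UTF-8 for internal use.
function WideToInternal(P: PWideChar): UTF8String;
var
  Tmp: UnicodeString;
begin
  Tmp := P;                    // copies up to the terminating #0
  Result := UTF8Encode(Tmp);   // explicit UTF-16 -> UTF-8 conversion
end;

// Outgoing: internal UTF-8 text becomes UTF-16 right before the "W" call.
function InternalToWide(const S: UTF8String): UnicodeString;
begin
  Result := UTF8Decode(S);     // explicit UTF-8 -> UTF-16 conversion
end;

var
  W, Back: UnicodeString;
  U: UTF8String;
begin
  W := 'bl';
  W := W + WideChar($00E5);    // U+00E5, a non-ASCII character
  U := WideToInternal(PWideChar(W));
  Back := InternalToWide(U);
  WriteLn('UTF-16 units: ', Length(W), ', UTF-8 bytes: ', Length(U),
          ', round trip ok: ', Back = W);
end.
```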

But there must be something fundamental I'm missing with the logic surrounding the code page handling. I get that you can define a string type to "be" a certain code page at compile time (type TSomeString = type AnsiString(CP_XX)) - but I can hardly imagine when that's useful. Controlling the codepage at runtime seems harder - if you just use AnsiString, I understand that its codepage is initially "CP_ACP". But, since an empty string is just a null pointer, you can't actually set the codepage for an empty string (before you put something in it). If you first assign something to it and then set the codepage (and do a conversion), the implicit conversion has already taken place as far as I understand - and information might already be lost.
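
For what it's worth, a string can be labelled with a code page at runtime without any transcoding: copy the raw bytes into a RawByteString and call SetCodePage with Convert = False. A minimal sketch (BytesToTagged is a hypothetical helper, and the sample bytes are just an example):

```pascal
program tag_codepage;
{$mode objfpc}{$H+}

type
  // Compile-time code page: the syntax is "type AnsiString(<codepage>)".
  TLatin1String = type AnsiString(28591);

// Copy Len raw bytes and label them as code page CP. Convert = False means
// the bytes are left untouched; only the label stored with the string changes.
function BytesToTagged(P: PAnsiChar; Len: SizeInt; CP: TSystemCodePage): RawByteString;
begin
  Result := '';
  if (P = nil) or (Len <= 0) then
    Exit;                      // an empty string has no header to carry a label
  SetLength(Result, Len);
  Move(P^, Result[1], Len);
  SetCodePage(Result, CP, False);
end;

const
  Utf8Bytes: array[0..2] of Byte = ($C3, $A5, 0);  // U+00E5 encoded as UTF-8
var
  S: RawByteString;
begin
  S := BytesToTagged(@Utf8Bytes[0], 2, CP_UTF8);
  WriteLn('Length: ', Length(S), ', code page: ', StringCodePage(S));
end.
```

Whether later assignments to other string types convert the data still depends on the declared code page of the target variable, so this only removes the conversion at the point of entry.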

At this point I'm seriously wondering if I have to resort to just using PChar/PWideChar or similar, and do all string processing using my own implementations or Win API calls. But I would really, really prefer not to have to go that way.

PS! It's quite "strange" for me that Remy Lebeau replies to me. In my world I guess you're some kind of celebrity or something ;P I used to read a lot of what you wrote in relation to the Indy components some 20+ years ago, and considered you to be pretty close to the almighty ;)
« Last Edit: November 29, 2023, 06:45:50 am by Nadar »

cdbc

  • Hero Member
  • *****
  • Posts: 1673
    • http://www.cdbc.dk
Re: Windows API and wide/unicode strings
« Reply #6 on: November 29, 2023, 09:37:44 am »
Hi
+1 to your ps. ;)
PPS: He still is today  :D
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 2.2.6 up until Jan 2024 from then on it's: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 3.0

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11947
  • FPC developer.
Re: Windows API and wide/unicode strings
« Reply #7 on: November 29, 2023, 10:49:07 am »
Nadar:  for Windows 10 (*) and later you can also have your application run in UTF8 only mode.

You can do this by going to project options and then to "application", enable manifest and turn on the "ansi code page is UTF8" option, as shown in http://www.stack.nl/~marcov/files/lazopts.png picture.

I generally also enable the long file name option to not hit 260 char limits.

In this case you could use the same 1-byte codebase for both options, and switching between using local encoding and unicode would only be toggling these tickmarks.

(*) Windows 10 version 1903 or later, so basically all supported versions

Quote from: Remy Lebeau
I may be wrong here, but TLVITEM is likely pre-compiled based on whether the RTL/LCL itself is built in ANSI or UNICODE mode, regardless of whether your project is built in ANSI or UNICODE mode.  That would explain the error you are seeing.

Correct. There is ongoing work into a dual precompiled RTL for 1-byte and 2-byte Delphi compatibility, but that is not finished yet.

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #8 on: November 29, 2023, 10:08:16 pm »
Quote from: marcov
Nadar:  for Windows 10 (*) and later you can also have your application run in UTF8 only mode.

Thanks, but that's not really viable here since I'm trying to make an NSIS plugin that should work "universally". If I were only to consider "my needs", I would just make the "Unicode version" (aka wide strings) work - but it is customary to provide NSIS plugins in 32 and 64 bit versions, in "ANSI" and "Unicode" mode (4 versions in total). NSIS itself works with everything from Win95/NT4 and later, so backwards compatibility is important. It's hard to find proper documentation for the really old stuff, so I've decided to set the "cap" at Windows 2000. I'm only using API calls that are Windows 2000 compatible.

My ideology is that if I need to make something like this, I make it open source and make it available to others. Thus I want it to work under "all" (or at least as many as I can) circumstances.

All I really need is to stop the implicit string conversions and I think it will all work out. I'm just not sure how to be certain that I've stopped all such conversions.

I guess my problem "boils down to" what codepage new strings "are assigned". In "Unicode mode" they should be UTF-8, in "ANSI mode" they should be the system codepage. I'm wondering if setting "DefaultSystemCodePage" is what controls what new strings are considered (that they are created with "codepage 0" which then defaults to "DefaultSystemCodePage"), but I'm not sure and I haven't managed to get to the bottom of this. If that IS the case, it should be enough for me to call "SetMultiByteConversionCodePage" with "CP_UTF8" when running in "Unicode mode" - and to leave it alone in "ANSI mode". But, I'm unable to figure out if this really is the case or not.
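
The idea described above would look roughly like this. It is only a sketch: it assumes the -dUNICODE define from the build mode, and it shows the call being made, not proof that every implicit conversion is avoided afterwards:

```pascal
program default_cp;
{$mode objfpc}{$H+}

var
  Before: TSystemCodePage;
begin
  Before := DefaultSystemCodePage;
{$IFDEF UNICODE}
  // "Unicode" build: plain AnsiString (CP_ACP) is then interpreted as UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);
{$ENDIF}
  // "ANSI" build: nothing is changed, so the system code page stays in effect.
  WriteLn('DefaultSystemCodePage before: ', Before,
          ', after: ', DefaultSystemCodePage);
end.
```

In a DLL, the call would presumably go into the library's initialization code so that it runs before any string is created.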

When I look at the LazUTF8 initialization code, it sets a lot of stuff that I honestly have no idea what it does, but it seems to be stuff to make sure that everything is automatically converted to UTF-8. I don't need that - I can convert it explicitly - I just need it to be left alone.

440bx

  • Hero Member
  • *****
  • Posts: 4750
Re: Windows API and wide/unicode strings
« Reply #9 on: November 30, 2023, 01:02:42 am »
Quote from: Nadar
I'm only using API calls that are Windows 2000 compatible.
<snip>
All I really need is to stop the implicit string conversions and I think it will all work out. I'm just not sure how to be certain that I've stopped all such conversions.
<snip>
I just need it to be left alone.

From what you've described, I think the only _reliable_ solution is to use null terminated character arrays (a la C) and forego Pascal strings.
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #10 on: November 30, 2023, 02:57:28 am »
Quote from: 440bx
from what you've described I think the only _reliable_ solution is to use null terminated character arrays (a la C) and forego Pascal strings.

That's exactly what I was hoping to avoid. If I have to manually allocate and deallocate strings and use all the C-style string handling functions, it might be easier to just make the whole thing in C/C++. But I'm terrible at that, so everything goes extremely slowly since I have to read and study to do the smallest things. And I have no idea what IDE to use, since I really don't like VS - in addition, I'd have to use a really old VS to avoid getting a runtime that is limited to recent Windows versions (Microsoft leverages their "power" with making VS to cause incompatibility and force people to upgrade).

So, in short, I was really hoping that I could avoid that and get this working using FPC. If it's not doable, I guess I'll just have to bite the bullet - but I'm still hoping.

I'm kind of "clinging" to the statement found on the mailing list here: https://www.mail-archive.com/fpc-pascal@lists.freepascal.org/msg42244.html
Quote
Quote
Best for me would be to be able to turn the conversions off completely.

You cannot, but you can set DefaultSystemCodePage to CP_UTF8.
Then no conversions will be done for all ansistrings that contain UTF8.

IF this is actually true, there should be a way.
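
A small experiment along these lines can at least check the claim for a single assignment. It is a sketch, not a guarantee for all code paths; SameBytes is a hypothetical helper that compares the raw bytes:

```pascal
program utf8_passthrough;
{$mode objfpc}{$H+}

// Byte-for-byte comparison; RawByteString parameters do not trigger conversions.
function SameBytes(const A, B: RawByteString): Boolean;
begin
  Result := (Length(A) = Length(B)) and
            ((Length(A) = 0) or (CompareByte(A[1], B[1], Length(A)) = 0));
end;

var
  W: UnicodeString;
  U: UTF8String;
  A: AnsiString;
begin
  SetMultiByteConversionCodePage(CP_UTF8);  // sets DefaultSystemCodePage
  W := WideChar($00E5);                     // one non-ASCII character
  U := UTF8Encode(W);
  A := U;                                   // UTF8String -> plain AnsiString
  WriteLn('bytes preserved: ', SameBytes(U, A),
          ', code page of A: ', StringCodePage(A));
end.
```

Running the same program without the SetMultiByteConversionCodePage call, on a system whose code page is not UTF-8, should show the difference.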

440bx

  • Hero Member
  • *****
  • Posts: 4750
Re: Windows API and wide/unicode strings
« Reply #11 on: November 30, 2023, 03:20:10 am »
Quote from: Nadar
That's exactly what I was hoping to avoid. If I have to manually allocate and deallocate strings and use all the C-style string handling functions, it might be easier to just make the whole thing in C/C++.
There is little, if any, difference in handling null terminated char arrays in Pascal and C.

As far as the language itself, it's still much easier to deal with Pascal than C++.

True that, using null terminated char arrays, you wouldn't get the convenience of Pascal strings but, at least you get the certainty that no unexpected "conversions" happen behind your back.

Just FYI, I personally never use Pascal strings - I only use null terminated char arrays - and the _small_ amount of additional work is well worth the control, flexibility and predictability of the code but, I concede I'm the "odd" programmer in this forum.

An advantage of doing it that way is that if you're building a dll/library, it would be compatible not only with FPC but also with C.  That should be a welcome plus.
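
To give an idea of the amount of work involved, here is a sketch of a bounded copy on null terminated wide character buffers, written without any Pascal string type in the copy itself. CopyWideZ is a hypothetical helper, not an RTL function:

```pascal
program cstyle_copy;
{$mode objfpc}{$H+}

// Copy Src into Dst, whose capacity DstChars is counted in characters and
// includes the terminator. Returns the number of characters copied.
function CopyWideZ(Dst: PWideChar; DstChars: SizeInt; Src: PWideChar): SizeInt;
begin
  Result := 0;
  if (Dst = nil) or (DstChars <= 0) then
    Exit;
  if Src <> nil then
    while (Result < DstChars - 1) and (Src[Result] <> #0) do
    begin
      Dst[Result] := Src[Result];
      Inc(Result);
    end;
  Dst[Result] := #0;   // always terminate, even when the source was truncated
end;

var
  S: UnicodeString;
  Dst: array[0..3] of WideChar;
begin
  S := 'hello';        // used only to provide some test input
  // The destination holds at most 3 characters plus the terminator.
  WriteLn('copied: ', CopyWideZ(@Dst[0], Length(Dst), PWideChar(S)));
end.
```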

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #12 on: November 30, 2023, 03:41:56 am »
Quote from: 440bx
There is little, if any, difference in handling null terminated char arrays in Pascal and C.

As far as the language itself, it's still much easier to deal with Pascal than C++.

True that, using null terminated char arrays, you wouldn't get the convenience of Pascal strings but, at least you get the certainty that no unexpected "conversions" happen behind your back.

Just FYI, I personally never use Pascal strings - I only use null terminated char arrays - and the _small_ amount of additional work is well worth the control, flexibility and predictability of the code but, I concede I'm the "odd" programmer in this forum.

An advantage of doing it that way is that if you're building a dll/library, it would be compatible not only with FPC but also with C.  That should be a welcome plus.

The DLLs I'm building have to be "C style" - they are called from NSIS which has no idea about FPC. So, everything that "interfaces with the outside world" must mimic C - but that doesn't mean that the internal handling has to. This part already works fine.

Unfortunately (?) I've gotten used to Java syntax over the years, so even though I started out with Pascal (or, actually Basic 8-) ), I'm now struggling a bit to write Pascal. Naming conventions, curly braces, the "?" operator, assignments in expressions, "==", "!=" etc. have become "second nature" to me so I'm having to correct myself a lot when writing Pascal. As such, writing C/C++ would in many ways be easier for me. What I don't like about it is the string handling in particular, the fact that it's so easy to create bugs - and the (to me) very confusing type naming "system". I usually end up having to browse through lots and lots of header files to try to figure out what types to use. I get that this would get easier quite quickly with some experience, but I was only planning to make a small plugin, not to teach myself a whole different "world" ;)

So, for me, the benefit of using Pascal is somewhat marginalized if I have to do C-style string handling.

I really think it should be possible to switch off the whole implicit conversion stuff with some directive. I see that it can be convenient in many situations, but when you need to have full control, it makes a mess out of everything.

cdbc

  • Hero Member
  • *****
  • Posts: 1673
    • http://www.cdbc.dk
Re: Windows API and wide/unicode strings
« Reply #13 on: November 30, 2023, 08:33:30 am »
Hi
Hmmm, how about "shortstring"?!?
It's a value type and as such limited to 255 chars, but AFAICS there's no codepage stuff connected to it, a sort of "easy Short pchar" with some string-convenience...
Maybe... I dunno, it's worth a try  ::)
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 2.2.6 up until Jan 2024 from then on it's: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 3.0

Nadar

  • New Member
  • *
  • Posts: 17
Re: Windows API and wide/unicode strings
« Reply #14 on: November 30, 2023, 08:36:00 am »
While 255 would probably suffice for most strings, I often don't know the length and I know for sure that some are longer. Otherwise I admit that shortstring looks "tempting".
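
A quick sketch of what that limit means in practice, assuming the usual implicit PAnsiChar to ShortString conversion: a longer source is cut off rather than rejected, so the loss is silent.

```pascal
program shortstring_limit;
{$mode objfpc}{$H+}

var
  Buf: array[0..300] of AnsiChar;
  S: ShortString;
begin
  FillChar(Buf, 300, 'x');   // 300 characters of payload in Buf[0..299]
  Buf[300] := #0;            // null terminator
  S := PAnsiChar(@Buf[0]);   // ShortString can hold at most 255 characters
  WriteLn('source length: 300, ShortString length: ', Length(S));
end.
```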

 
