
Topic: [SOLVED] Unicodestring type and Array of Widechar type - pros and cons?

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Hi

For a Windows application that commonly requires UTF-16, I've been using UnicodeString types and also array of WideChar types. But, to be honest, I am a bit confused about what the differences are and when one would use one over the other.

I've read https://wiki.freepascal.org/Character_and_string_types#UnicodeString that states "UnicodeStrings are reference counted, null-terminated arrays, but they are implemented as arrays of WideChars" and for WideChar it states "A variable of type WideChar, also referred to as UnicodeChar, is exactly 2 bytes in size and usually contains one Unicode code point (normally a character) in UTF-16 encoding."

So, if I have an array that is
Code: Pascal
var
  strA : array [0..99] of widechar; // strA can not be any larger than 100 bytes, i.e. 50 characters
  strB : unicodestring;                  // strB is unrestricted in size

begin
  strA := 'HellÖ';
  strB := 'HellÖ';
end;

So in this example, under what circumstances would you opt for UnicodeString over array of WideChar, and vice versa, aside from not wanting to limit the amount of data that can go into the string, which of course a fixed-length array does?
« Last Edit: January 23, 2021, 01:01:16 pm by Gizmo »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #1 on: January 22, 2021, 11:42:25 am »
In Delphi/FPC string operations work on string types, not on static array buffers. It is exactly the same as ansistrings vs static arrays of (ansi)char.   

One is an automated type with many operations defined for it; the other is an array with some minor magic so that assigning a literal to it works. For other operations, you either need to convert it to UnicodeString or roll your own routines.
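A minimal sketch of that difference, assuming FPC 3.x in objfpc mode and a source file saved as UTF-8 (the program and variable names are made up for illustration). The array only accepts the literal; anything beyond that is done after converting to UnicodeString:

```pascal
program ArrayVsUnicodeString;
{$mode objfpc}{$H+}
{$codepage utf8} // assumption: this source file is saved as UTF-8

var
  Buf: array[0..99] of WideChar; // fixed buffer of 100 UTF-16 code units
  S: UnicodeString;              // managed, reference-counted, grows as needed

begin
  Buf := 'HellÖ';    // the "minor magic": a literal can be assigned to the array
  S := Buf;          // convert to UnicodeString (copies up to the first #0)
  S := S + ' world'; // concatenation is defined for the string type
  WriteLn('Length in code units: ', Length(S));
  WriteLn('Position of "world":  ', Pos(UnicodeString('world'), S));
end.
```

The static buffer is still a reasonable fit when an external API fills a caller-supplied buffer; for everything else the string type saves writing the routines by hand.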

 


MarkMLl

  • Hero Member
  • *****
  • Posts: 6646
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #2 on: January 22, 2021, 12:07:53 pm »
I think the point that the wiki is trying to make is that anything based on WideChar has characters of uniform size (16-bit etc.).

Normal strings default to UTF-8 encoding, where characters have non-uniform size hence in the general case it's not safe to step through using sequential index values.

If I am wrong in that I'd appreciate being corrected, since this is a topic with which I'm not entirely happy: and I know I'm not alone in that.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Zvoni

  • Hero Member
  • *****
  • Posts: 2300
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #3 on: January 22, 2021, 02:27:52 pm »
Code:
strA : array [0..99] of widechar; // strA can not be any larger than 100 bytes, i.e. 50 characters
Sorry, but this doesn't make sense to me.
It's an array of 100 wide characters (WideChars), so the correct thing to say is: it's 200 bytes (= 100 WideChars).
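The arithmetic can be checked directly. A short sketch, assuming FPC in objfpc mode (the program name is arbitrary):

```pascal
program BufferSize;
{$mode objfpc}{$H+}

var
  strA: array[0..99] of WideChar;

begin
  WriteLn('Elements:          ', Length(strA));     // 100 (indices 0..99)
  WriteLn('Bytes per element: ', SizeOf(WideChar)); // 2
  WriteLn('Total bytes:       ', SizeOf(strA));     // 200
end.
```

If the buffer is handed to an API that expects a zero-terminated string, one of those 100 elements is needed for the terminating #0.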
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #4 on: January 22, 2021, 03:15:56 pm »
I think the point that the wiki is trying to make is that anything based on WideChar has characters of uniform size (16-bit etc.).

Normal strings default to UTF-8 encoding, where characters have non-uniform size hence in the general case it's not safe to step through using sequential index values.

If I am wrong in that I'd appreciate being corrected, since this is a topic with which I'm not entirely happy: and I know I'm not alone in that.

WideString/UnicodeString don't have uniform-sized characters either, but 16-bit code-points coded in UTF16. For most normal (read: Western) scripts that means that yes, a character can be represented as a single code-point, but that isn't true for other languages or if you use surrogates, composites (e.g. to represent tilded characters), etc.

So "it's not safe to step through using sequential index values" is valid also for Wide/UnicodeString, though perhaps a little less so.
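A small sketch of why index stepping can mislead, assuming FPC in objfpc mode. The emoji U+1F600 is built from its two UTF-16 surrogates so the example does not depend on the source file encoding:

```pascal
program SurrogateIndexing;
{$mode objfpc}{$H+}

var
  S: UnicodeString;
  I: Integer;

begin
  // 'A', then U+1F600 as a surrogate pair ($D83D $DE00), then 'B'
  S := 'A';
  S := S + WideChar($D83D) + WideChar($DE00) + 'B';

  // Three characters on screen, but four indexable elements
  WriteLn('Length: ', Length(S)); // 4
  for I := 1 to Length(S) do
    WriteLn(I, ': $', HexStr(LongInt(Ord(S[I])), 4));
end.
```

S[2] and S[3] are each only half of the emoji, so code that treats every index as one character would split it.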
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1311
    • Lebeau Software
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #5 on: January 23, 2021, 01:49:02 am »
WideString/UnicodeString don't have uniform-sized characters either, but 16-bit code-points coded in UTF16.

More accurately, they have 16-bit code-units encoded in UTF-16.  A code-point and a code-unit are two different things.

For most normal (read: Western) scripts that means that yes, a character can be represented as a single code-point, but that isn't true for other languages or if you use surrogates, composites (e.g. to represent tilded characters), etc.

Again, code-units, not code-points.  Code-points are the character values that Unicode defines.  Code-units are how those code-points are encoded in the various UTF-X encodings (where X is the bit size of each code-unit).
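To make the distinction concrete, here is a sketch that counts both. CodePointCount is a hypothetical helper written for this illustration, not an RTL function, and it assumes FPC in objfpc mode with default short-circuit Boolean evaluation:

```pascal
program CountCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by treating each well-formed
// surrogate pair as one code point.
function CodePointCount(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)  // high + low surrogate: one code point, two code units
    else
      Inc(I);    // BMP code point or unpaired surrogate: one code unit
    Inc(Result);
  end;
end;

var
  S: UnicodeString;

begin
  S := 'A';
  S := S + WideChar($D83D) + WideChar($DE00); // U+1F600 as a surrogate pair
  WriteLn('Code units:  ', Length(S));         // 3
  WriteLn('Code points: ', CodePointCount(S)); // 2
end.
```

Length always reports code units. Combining sequences add a further layer: one visible character can consist of several code points, which this helper does not merge.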
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #6 on: January 23, 2021, 04:29:54 am »
More accurately, they have 16-bit code-units encoded in UTF-16.  A code-point and a code-unit are two different things.

You're right, of course. Sorry, one often gets confused by Unicode concepts and terminology :-[

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #7 on: January 23, 2021, 11:16:47 am »
WideString/UnicodeString don't have uniform-sized characters either, but 16-bit code-points coded in UTF16. For most normal (read: Western) scripts that means that yes, a character can be represented as a single code-point, but that isn't true for other languages or if you use surrogates, composites (e.g. to represent tilded characters), etc.

You don't even need to look at non-western scripts, just look at the widespread use of Emojis nowadays. ;)

MarkMLl

  • Hero Member
  • *****
  • Posts: 6646
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #8 on: January 23, 2021, 12:46:09 pm »
You don't even need to look at non-western scripts, just look at the widespread use of Emojis nowadays. ;)

I always find myself wanting to pluralise those as "emojim" for some reason.

You're right of course, I'm skewed there by the fact I'm rarely likely to import something that includes them... although I should be able to handle them if they appear in a comment field.

I've found myself using combining characters for output, but fortunately not for input... my above comment should still apply :-/
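For reference, a sketch of what a combining sequence looks like at the code-unit level, assuming FPC in objfpc mode (variable names are illustrative). The same visible character, Ö, is stored once precomposed and once as a base letter plus U+0308:

```pascal
program CombiningSequence;
{$mode objfpc}{$H+}

var
  Precomposed, Decomposed: UnicodeString;

begin
  Precomposed := WideChar($00D6);                     // U+00D6, single code unit
  Decomposed := UnicodeString('O') + WideChar($0308); // 'O' + combining diaeresis

  WriteLn(Length(Precomposed));      // 1
  WriteLn(Length(Decomposed));       // 2
  WriteLn(Precomposed = Decomposed); // FALSE: the stored code units differ
end.
```

Input that may arrive in either form needs normalising before comparison; the plain = operator does not do that.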

MarkMLl

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #9 on: January 23, 2021, 02:39:41 pm »
I've found myself using combining characters for output, but fortunately not for input... my above comment should still apply :-/

The networking motto should apply here too; paraphrasing: be liberal in what you accept even if you're strict in what you produce. ;)

MarkMLl

  • Hero Member
  • *****
  • Posts: 6646
Re: Unicodestring type and Array of Widechar type - pros and cons?
« Reply #10 on: January 23, 2021, 02:58:41 pm »
The networking motto should apply here too; paraphrasing: be liberal in what you accept even if you're strict in what you produce. ;)

Yes, agreed.

MarkMLl

 
