Author Topic: issue with diacritical mark  (Read 305 times)

jacquesbg

  • Newbie
  • Posts: 2
issue with diacritical mark
« on: May 25, 2026, 07:30:24 pm »
I have an issue with diacritical marks in my code. Here is the code:

Code: Pascal
program RembrandtSearch;

{$mode objfpc}

const
  chunksize = 75;

var
  chunk, chunktail: string;
  message: string;
  i: integer;

begin
  message := 'levendige scènes vol dramatiek. Zijn opmerkelijke beheersing van het spel met licht en donker, waarbij hij vaak scherpe contrasten (clair-obscur) gebruikte om zo de toeschouwer de voorstelling binnen te voeren, leidde tot levendige scènes vol dramatiek.';

  chunk := message;
  for i := 1 to Length(chunk) do
  begin
    chunktail := Copy(chunk, 2, Length(chunk));
    chunk := Copy(chunk, 1, chunksize);
    Writeln(chunk);
    chunk := chunktail;
  end;
  Writeln(chunk);
end.

The code prints chunks of the message, shifting the chunk by one character at a time through the message.
When run, it gives the following output (snippet):

ènes vol dramatiek. Zijn opmerkelijke beheersing van het spel met licht en
�nes vol dramatiek. Zijn opmerkelijke beheersing van het spel met licht en
nes vol dramatiek. Zijn opmerkelijke beheersing van het spel met licht en d


The second line of the snippet starts with � (which causes a crash in another application).

It seems that è is seen as two characters, and copy('è') breaks it in two, leaving �.
(The same happens when è is at the end of the chunk.)
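
The two-character suspicion can be checked directly. A minimal sketch, assuming the text is UTF-8 encoded: è (U+00E8) is stored as the two bytes $C3 $A8, and Length and Copy count bytes, not characters. The è is written as byte escapes here so that the result does not depend on how the source file is saved.

```pascal
program ShowBytes;

{$mode objfpc}{$H+}

var
  s: string;
  i: Integer;
begin
  // U+00E8 given as its two UTF-8 bytes (assumption: the data is UTF-8)
  s := #$C3#$A8;
  Writeln('Length in bytes: ', Length(s));
  // Print every byte of the string in hexadecimal
  for i := 1 to Length(s) do
    Writeln('byte ', i, ': $', HexStr(Ord(s[i]), 2));
end.
```

Copy(s, 2, 1) on this string returns only the trailing byte $A8, which is not valid UTF-8 on its own, so a terminal shows it as �.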

How can I avoid this behavior, i.e. how can I have copy('è') regard this as one character instead of two?

Thanks for any help.
Jacques


tetrastes

  • Hero Member
  • *****
  • Posts: 766
Re: issue with diacritical mark
« Reply #1 on: May 25, 2026, 07:55:08 pm »
Use UTF8Copy (in LazUTF8 unit from LazUtils package).

cdbc

  • Hero Member
  • *****
  • Posts: 2818
    • http://www.cdbc.dk
Re: issue with diacritical mark
« Reply #2 on: May 25, 2026, 07:56:15 pm »
Hi
Use 'UTF8Copy' from LazUtils package (lazutf8.pas), instead of the normal ansi one.
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6/QT6 -> FPC Release -> Lazarus Release &  FPC Main -> Lazarus Main

Lutz Mändle

  • Jr. Member
  • **
  • Posts: 99
Re: issue with diacritical mark
« Reply #3 on: May 25, 2026, 08:11:57 pm »
Also use UTF8Length, likewise from the unit LazUTF8.

If you compile from the command line, you have to provide the search path for the unit LazUTF8 and the output path for the compiled units as additional parameters to fpc, like this:

Quote
fpc -Fu/usr/share/lazarus/components/lazutils/ -FU./lib rembrandtsearch.pas

Check whether the path given to the -Fu parameter is the right one for your system.

Here comes the adapted program:

Code: Pascal
program RembrandtSearch;

{$mode objfpc}{$H+}

uses
  LazUTF8;

const
  chunksize = 75;

var
  chunk, chunktail: string;
  message: string;
  i: integer;

begin
  message := 'levendige scènes vol dramatiek. Zijn opmerkelijke beheersing van het spel met licht en donker, waarbij hij vaak scherpe contrasten (clair-obscur) gebruikte om zo de toeschouwer de voorstelling binnen te voeren, leidde tot levendige scènes vol dramatiek.';

  chunk := message;
  for i := 1 to UTF8Length(chunk) do
  begin
    chunktail := UTF8Copy(chunk, 2, Length(chunk));
    chunk := UTF8Copy(chunk, 1, chunksize);
    Writeln(chunk);
    chunk := chunktail;
  end;
  Writeln(chunk);
end.

jacquesbg

  • Newbie
  • Posts: 2
Re: issue with diacritical mark
« Reply #4 on: May 25, 2026, 10:23:32 pm »
Thank you all for the fast response, your suggestions work.

But it was a bit complicated, as I am working in 'raw freepascal'.

So I had to download:
  • lazutf8.pas
  • lazutils_defines.inc
  • fpcadds.pas
  • unixlazutf8.inc
and compile the suggested code, and it worked :).

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: issue with diacritical mark
« Reply #5 on: May 25, 2026, 11:24:14 pm »
Alternatively: just install Lazarus and use the Lazarus IDE to create your "raw" fpc program.
Then all you need to do is add the LazUtils package as a requirement for your application in the Project Inspector.
You can then use all units that are in LazUtils (the IDE will set up the correct -Fu parameters for you).

Lazarus IDE has advanced features like CodeTools which make life so much easier.

Bart

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12428
  • Debugger - SynEdit - and more
    • wiki
Re: issue with diacritical mark
« Reply #6 on: May 26, 2026, 09:13:28 pm »
Well, the alternative is to read up on unicode, and implement your own parsing.

You may hear the advice to use UTF-16 (if there is conversion code, since the files you read appear to be in UTF-8).

But UTF-16 does not solve the issue. It may well hide it, but that isn't the same as solving it. E.g. you then get surrogate pairs. You can avoid those by going to UTF-32, but then you may still have to deal with combining codepoints (and they exist in each and every encoding).
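
To illustrate the surrogate pair point, here is a small sketch, assuming the RTL's UTF8Decode is used for the conversion. U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs two code units for it, and Length on a UnicodeString counts code units rather than characters.

```pascal
program SurrogateDemo;

{$mode objfpc}{$H+}

var
  s8: string;
  s16: UnicodeString;
begin
  // U+1F600 written as its four UTF-8 bytes (source-encoding independent)
  s8 := #$F0#$9F#$98#$80;
  // Convert to UTF-16; a code point above U+FFFF becomes a surrogate pair
  s16 := UTF8Decode(s8);
  Writeln('UTF-8 bytes:       ', Length(s8));
  Writeln('UTF-16 code units: ', Length(s16));
end.
```

A plain Copy(s16, 1, 1) would therefore cut this one character in half, just as Copy does with è in UTF-8.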

Btw, even UTF8Copy does not take care of combining sequences. So some characters that exist only as a combining sequence will still get broken.
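
For anyone who wants to stay with plain FPC and no LazUtils, the "implement your own parsing" route could look roughly like this. It is only a sketch: CodePointCopy and IsContinuation are hypothetical names made up for this example, the input is assumed to be valid UTF-8, and no validation is done. It relies on one property of UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so every other byte starts a new code point.

```pascal
program CodePointCopyDemo;

{$mode objfpc}{$H+}

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
function IsContinuation(c: Char): Boolean;
begin
  Result := (Ord(c) and $C0) = $80;
end;

// Hypothetical helper: copies Count code points starting at code point
// StartCP (1-based). Assumes s holds valid UTF-8.
function CodePointCopy(const s: string; StartCP, Count: SizeInt): string;
var
  i, cp, firstByte: SizeInt;
begin
  Result := '';
  if (StartCP < 1) or (Count < 1) then
    Exit;
  cp := 0;
  firstByte := 0;
  i := 1;
  while i <= Length(s) do
  begin
    if not IsContinuation(s[i]) then
    begin
      Inc(cp);                      // s[i] starts a new code point
      if cp = StartCP then
        firstByte := i;             // byte index where the copy begins
      if cp = StartCP + Count then  // first code point past the range
        Exit(Copy(s, firstByte, i - firstByte));
    end;
    Inc(i);
  end;
  // Ran off the end: return everything from firstByte, if it was reached.
  if firstByte > 0 then
    Result := Copy(s, firstByte, Length(s) - firstByte + 1);
end;

var
  s: string;
begin
  // 'scenes' with a grave accent on the first e, given as bytes $C3 $A8
  s := 'sc'#$C3#$A8'nes';
  Writeln(Length(CodePointCopy(s, 3, 1)));  // byte length of the copied code point
  Writeln(CodePointCopy(s, 4, 3) = 'nes');
end.
```

This has the same limitation as described above: it works on code points, so an e followed by the combining grave accent U+0300 (bytes $CC $80) counts as two units and can still be split between them.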

 
