Lazarus

Free Pascal => Beginners => Topic started by: CapelliC on October 27, 2020, 05:00:08 pm

Title: [solved?] fairly general conversion of utf8 to U+xxxx
Post by: CapelliC on October 27, 2020, 05:00:08 pm
For my application, UTF8ToWinCP and WinCPToUTF8 solved most of the problems, since they cover reading and writing strings correctly on Windows.

Now I'd need to refine the logic, translating - for instance - the UTF8 string 'è' (UTF8ToWinCP('è') correctly outputs 'è') to 'U+00E8', that is the literal I must use for a regex match issued by a VBS script.

I managed to perform such translations directly, hardcoding some patterns, but I'd like to play a bit more generally.

I have tried to read the source code of UTF8ToWinCP and found that it's a fancy interface to the Windows API WideCharToMultiByte, so not really useful here. I could be missing something, of course.

I'm a bit lost in the many helper functions... I could extend the hardcoded patterns by scraping a table (from https://www.utf8-chartable.de/, for instance) and looking up the output of UTF8ToWinCP in it, but I hope there is something more direct available.

Thanks, Carlo

Title: Re: fairly general conversion of utf8 to U+xxxx
Post by: Bart on October 27, 2020, 05:42:27 pm
UTF8CodepointToUnicode() maybe (unit LazUtf8).

Bart

Title: Re: fairly general conversion of utf8 to U+xxxx
Post by: CapelliC on October 27, 2020, 06:03:09 pm
Thanks Bart

It worked, once combined with UTF8ToWinCP

Code: Pascal
  x := UTF8ToWinCP(s);
  p := PChar(x);
  n := UTF8CodepointToUnicode(p, l);
  writeln('n:', n, ' l:', l);

outputs 232, the correct decimal value.

Title: Re: fairly general conversion of utf8 to U+xxxx
Post by: JuhaManninen on October 27, 2020, 08:42:31 pm
Quote
Code: Pascal
  x := UTF8ToWinCP(s);
  p := PChar(x);
  n := UTF8CodepointToUnicode(p, l);
  writeln('n:', n, ' l:', l);
That looks fishy. You first convert to WinCP, then use it as UTF8.
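
To make the point concrete, here is a minimal sketch of the conversion without the WinCP detour. It assumes the input really is valid UTF-8 and that the LazUtils package is available for unit LazUTF8. The helper name Utf8ToUPlusNotation is hypothetical, not part of any library; only UTF8CodepointToUnicode and IntToHex are existing routines.

```pascal
program Utf8ToUPlus;

{$mode objfpc}{$H+}

uses
  SysUtils, LazUTF8;

// Hypothetical helper (not part of LazUTF8): walks a UTF-8 string and
// returns every code point in 'U+XXXX' notation, concatenated.
function Utf8ToUPlusNotation(const S: string): string;
var
  P, PEnd: PChar;
  CodepointLen: Integer;
  Code: Cardinal;
begin
  Result := '';
  if S = '' then
    Exit;
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    // Decodes the code point starting at P and reports its length in bytes.
    Code := UTF8CodepointToUnicode(P, CodepointLen);
    if CodepointLen <= 0 then
      Break; // embedded #0 or nothing left to decode: stop instead of looping
    Result := Result + 'U+' + IntToHex(Int64(Code), 4);
    Inc(P, CodepointLen);
  end;
end;

begin
  // #$C3#$A8 is the two-byte UTF-8 encoding of 'è' (code point 232).
  WriteLn(Utf8ToUPlusNotation(#$C3#$A8)); // prints U+00E8
end.
```

The string is passed to UTF8CodepointToUnicode as it is, so the two UTF-8 bytes of 'è' are decoded into the single code point 232 with no code page conversion in between.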

Title: Re: [solved] fairly general conversion of utf8 to U+xxxx
Post by: CapelliC on October 28, 2020, 04:46:11 pm
@JuhaManninen

I agree it's weird. I have spent a lot of time examining the bits, and in the end I didn't find anything other than UTF8ToWinCP capable of coalescing the 2 codepoints in 'è' into the single one in 'è'.

Maybe it works by chance... Could you suggest an alternative, or some data on which I could verify that it misbehaves?

Also, I had some trouble because ReplaceSubstring (from LazUTF8) apparently misbehaves when there are UTF8 sequences *before* the target replacement position (as given by UTF8Pos), but I don't understand enough of this to flag and report it as an error. I changed my implementation, and it looks simpler now...

Title: Re: [solved] fairly general conversion of utf8 to U+xxxx
Post by: JuhaManninen on October 29, 2020, 12:23:13 pm
Quote
I agree it's weird. I have spent a lot of time examining the bits, and in the end I didn't find anything other than UTF8ToWinCP capable of coalescing the 2 codepoints in 'è' into the single one in 'è'.
Maybe it works by chance... Could you suggest an alternative, or some data on which I could verify that it misbehaves?
Do you use LCL? You started your first post with "For my application, ...", which hints that you do.
In that case all strings have UTF-8 encoding.
{$codepage utf8} / -FcUTF8 should not be used. You should not need UTF8ToWinCP() unless you are reading or writing external data with WinCP encoding. Just convert all data to UTF-8 and things go smoothly.
In a non-GUI program you can still use the same UTF-8 system. Only the Windows console, with its own codepage, will pose a challenge.
It is all explained here:
 https://wiki.freepascal.org/Unicode_Support_in_Lazarus

Quote
Also, I had some trouble because ReplaceSubstring (from LazUTF8) apparently misbehaves when there are UTF8 sequences *before* the target replacement position (as given by UTF8Pos), but I don't understand enough of this to flag and report it as an error. I changed my implementation, and it looks simpler now...
ReplaceSubstring is clearly in the wrong place, as it does not deal with UTF-8. I will move it to unit LazStringUtils. The one in LazUTF8 must be deprecated.
Note, however, that you can use ReplaceSubstring and other normal string functions on UTF-8 strings. You just have to understand what you are doing.
In your case, call Pos instead of UTF8Pos: ReplaceSubstring expects a byte position, while UTF8Pos returns a codepoint position. There is no error in ReplaceSubstring, you only used it wrongly.
Please see examples here:
 https://wiki.freepascal.org/UTF8_strings_and_characters
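
The difference between the two position functions can be sketched as follows. This is only an illustration under stated assumptions: ReplaceFirst is a hypothetical helper written for this example, and it uses the standard Pos, Delete and Insert routines rather than ReplaceSubstring itself. All three count bytes, so their indices agree, whereas UTF8Pos counts codepoints.

```pascal
program PosVsUtf8Pos;

{$mode objfpc}{$H+}

uses
  LazUTF8;

// Hypothetical helper: replaces the first occurrence of Target in S.
// Pos, Delete and Insert all work with byte indices, so they can be
// combined safely on a UTF-8 string as long as Target is valid UTF-8.
function ReplaceFirst(const S, Target, Replacement: string): string;
var
  BytePos: SizeInt;
begin
  Result := S;
  BytePos := Pos(Target, Result);
  if BytePos = 0 then
    Exit; // Target not found: return S unchanged
  Delete(Result, BytePos, Length(Target));
  Insert(Replacement, Result, BytePos);
end;

var
  S: string;
begin
  // 'èèabc' written as raw UTF-8 bytes: each 'è' occupies two bytes.
  S := #$C3#$A8#$C3#$A8 + 'abc';
  WriteLn('Pos:     ', Pos('abc', S));     // byte index: 5
  WriteLn('UTF8Pos: ', UTF8Pos('abc', S)); // codepoint index: 3
  // Compare against the expected bytes instead of printing the string,
  // so the result does not depend on the console codepage.
  WriteLn(ReplaceFirst(S, 'abc', 'XYZ') = #$C3#$A8#$C3#$A8 + 'XYZ');
end.
```

Passing the value 3 from UTF8Pos to a byte-based routine would point into the middle of the second 'è', which matches the symptom described in the quote: the replacement goes wrong only when multi-byte sequences precede the target.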