Author Topic: [solved?] fairly general conversion of utf8 to U+xxxx  (Read 1364 times)

CapelliC

  • Jr. Member
  • **
  • Posts: 58
[solved?] fairly general conversion of utf8 to U+xxxx
« on: October 27, 2020, 05:00:08 pm »
For my application, UTF8ToWinCP and WinCPToUTF8 solved most of the problems, since they cover reading and writing strings correctly on Windows.

Now I'd need to refine the logic, translating - for instance - the UTF8 string 'è' (UTF8ToWinCP('è') correctly outputs 'è') to 'U+00E8', that is the literal I must use for a regex match issued by a VBS script.

I managed to perform such translations directly by hardcoding some patterns, but I'd like a more general solution.

I have tried to read the source code of UTF8ToWinCP and found that it's a fancy interface to the Windows API WideCharToMultiByte, so it's not really useful here. I could be missing something, of course.

I'm a bit lost in the many helper functions... I could extend the hardcoded patterns by scraping a table (from https://www.utf8-chartable.de/, for instance) and looking up the output of UTF8ToWinCP in it, but I hope there is something more direct available.

Thanks, Carlo

« Last Edit: October 28, 2020, 04:46:24 pm by CapelliC »

Bart

  • Hero Member
  • *****
  • Posts: 5274
    • Bart en Mariska's Webstek
Re: fairly general conversion of utf8 to U+xxxx
« Reply #1 on: October 27, 2020, 05:42:27 pm »
UTF8CodepointToUnicode() maybe (unit LazUtf8).

Bart
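
A minimal sketch of how the suggestion above could produce the 'U+xxxx' literal directly from the UTF-8 string, without going through UTF8ToWinCP. The helper name UTF8ToUPlus is hypothetical (it is not part of LazUTF8), and the input is written as raw byte escapes so the example does not depend on the encoding of the source file.

```pascal
program Utf8ToUPlusDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8;

// Hypothetical helper, not part of LazUTF8: walks a UTF-8 string and
// emits one 'U+XXXX' token per codepoint.
function UTF8ToUPlus(const s: string): string;
var
  p, pEnd: PChar;
  len: Integer;
  cp: Cardinal;
begin
  Result := '';
  if s = '' then
    Exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do
  begin
    cp := UTF8CodepointToUnicode(p, len);
    if len < 1 then
      len := 1;                       // defensive: always advance
    Result := Result + 'U+' + IntToHex(Int64(cp), 4);
    Inc(p, len);
  end;
end;

begin
  // #$C3#$A8 are the two UTF-8 bytes of U+00E8 (e with grave accent).
  WriteLn(UTF8ToUPlus(#$C3#$A8));     // prints U+00E8
end.
```

Note that a decomposed sequence (e followed by a combining grave accent, U+0065 U+0300) would come out as two tokens; this sketch does not normalize its input.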

CapelliC

Re: fairly general conversion of utf8 to U+xxxx
« Reply #2 on: October 27, 2020, 06:03:09 pm »
Thanks Bart

It worked, once combined with UTF8ToWinCP

Code: Pascal
  x := UTF8ToWinCP(s);
  p := PChar(x);
  n := UTF8CodepointToUnicode(p, l);
  writeln('n:', n, ' l:', l)

outputs 232, the correct decimal value.


JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: fairly general conversion of utf8 to U+xxxx
« Reply #3 on: October 27, 2020, 08:42:31 pm »
Quote
Code: Pascal
  x := UTF8ToWinCP(s);
  p := PChar(x);
  n := UTF8CodepointToUnicode(p, l);
  writeln('n:', n, ' l:', l)
That looks fishy. You first convert to WinCP, then use it as UTF8.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.
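
One way to probe the concern above is to feed UTF8CodepointToUnicode both the UTF-8 bytes and the Windows-1252 byte of the same character. The sketch below hardcodes the CP1252 bytes instead of calling UTF8ToWinCP, so it assumes an ANSI codepage of 1252; the procedure name Show is hypothetical. For 'è' the CP1252 byte ($E8) happens to equal the codepoint, which would explain the 232; for '€' the CP1252 byte is $80 while the codepoint is U+20AC, so a single-byte input cannot yield the right answer.

```pascal
program WinCPPitfall;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8;

// Hypothetical helper: decodes the first codepoint of Bytes as if it
// were UTF-8 and prints the result.
procedure Show(const Caption, Bytes: string);
var
  len: Integer;
  cp: Cardinal;
begin
  cp := UTF8CodepointToUnicode(PChar(Bytes), len);
  WriteLn(Caption, ': U+', IntToHex(Int64(cp), 4), ' (len ', len, ')');
end;

begin
  // e-grave: UTF-8 bytes C3 A8, CP1252 byte E8 (numerically equal to U+00E8).
  Show('e-grave, UTF-8 bytes ', #$C3#$A8);
  Show('e-grave, CP1252 byte ', #$E8);
  // Euro sign: UTF-8 bytes E2 82 AC (U+20AC), CP1252 byte 80.
  Show('euro,    UTF-8 bytes ', #$E2#$82#$AC);
  Show('euro,    CP1252 byte ', #$80);
end.
```

Decoding the UTF-8 bytes directly should report U+00E8 and U+20AC. What the function returns for the lone CP1252 bytes is not a documented guarantee; it appears to fall back to the raw byte value for invalid sequences, which matches only for characters in the Latin-1 range.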

CapelliC

Re: [solved] fairly general conversion of utf8 to U+xxxx
« Reply #4 on: October 28, 2020, 04:46:11 pm »
@JuhaManninen

I agree it's weird. I have spent a lot of time examining the bits, and in the end I didn't find anything apart from UTF8ToWinCP capable of coalescing the 2 codepoints in 'è' to the single one in 'è'.

Maybe it works by chance... Could you suggest an alternative, or some data with which I could verify that it misbehaves?

Also, I had some trouble because ReplaceSubstring (from LazUTF8) apparently misbehaves when there are UTF8 sequences *before* the target replacement position (as given by UTF8Pos), but I don't understand enough of this stuff to flag and report this as an error. I changed my implementation; it looks simpler now...
« Last Edit: October 28, 2020, 05:04:51 pm by CapelliC »

JuhaManninen

Re: [solved] fairly general conversion of utf8 to U+xxxx
« Reply #5 on: October 29, 2020, 12:23:13 pm »
Quote
I agree it's weird. I have spent a lot of time examining the bits, and in the end I didn't find anything apart from UTF8ToWinCP capable of coalescing the 2 codepoints in 'è' to the single one in 'è'.
Maybe it works by chance... Could you suggest an alternative, or some data with which I could verify that it misbehaves?
Do you use LCL? You started your first post with "For my application, ...", which hints that you do.
In that case all strings have UTF-8 encoding.
{$codepage utf8} / -FcUTF8 should not be used. You should not need UTF8ToWinCP() unless you are reading or writing external data with WinCP encoding. Just convert all data to UTF-8 and things will go smoothly.
In a non-GUI program you can still use the same UTF-8 system. Only the Windows console, with its own codepage, will pose a challenge.
It is all explained here:
 https://wiki.freepascal.org/Unicode_Support_in_Lazarus

Quote
Also, I had some trouble because ReplaceSubstring (from LazUTF8) apparently misbehaves when there are UTF8 sequences *before* the target replacement position (as given by UTF8Pos), but I don't understand enough of this stuff to flag and report this as an error. I changed my implementation; it looks simpler now...
ReplaceSubstring is clearly in the wrong place, as it does not deal with UTF-8. I will move it to unit LazStringUtils. The one in LazUTF8 must be deprecated.
Note however that you can use ReplaceSubstring and other normal string functions on UTF-8 strings. You just have to understand what you are doing: they work with byte positions, not codepoint positions.
In your case, call Pos instead of UTF8Pos. There is no error in ReplaceSubstring; you only used it wrongly.
Please see examples here:
 https://wiki.freepascal.org/UTF8_strings_and_characters
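
A small sketch of the Pos versus UTF8Pos difference described above. It uses the standard Delete and Insert procedures rather than ReplaceSubstring, on the assumption that ReplaceSubstring takes byte positions in the same way; the test string is written as byte escapes so it does not depend on the source file encoding.

```pascal
program BytePosVsCodepointPos;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  s: string;
  bytePos, cpPos: PtrInt;
begin
  // Two e-grave characters (UTF-8 bytes C3 A8 each) followed by 'ab'.
  s := #$C3#$A8#$C3#$A8 + 'ab';

  bytePos := Pos('ab', s);      // byte index: 5
  cpPos := UTF8Pos('ab', s);    // codepoint index: 3
  WriteLn('Pos: ', bytePos, '  UTF8Pos: ', cpPos);

  // Delete and Insert count bytes, so they need the result of Pos.
  // Passing cpPos (3) here would cut out the second e-grave instead.
  Delete(s, bytePos, Length('ab'));
  Insert('XY', s, bytePos);

  WriteLn('Length in bytes: ', Length(s),
          '  in codepoints: ', UTF8Length(s));
end.
```

The two indices agree only while everything before the match is plain ASCII, which is consistent with the replacement going wrong only when UTF-8 sequences precede the target position.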
« Last Edit: October 29, 2020, 01:44:32 pm by JuhaManninen »