Example of how to decode GB-2312-HZ Chinese text?

AlexTP:
I need example code for reading HZ-encoded files, i.e. for converting the contents of such a file to a UTF8String. I have attached a small example file. I would like to do this with libiconv, since that is the main conversion method on Unix.

HZ files are plain 7-bit ASCII; the Chinese characters are encoded as ASCII sequences wrapped in ~{....} parts.
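To illustrate the ~{....} mechanism, here is a minimal sketch of an HZ decoder in Python. It assumes the rules of RFC 1843: `~{` switches to two-byte GB mode, `~}` switches back to ASCII, `~~` is a literal tilde, and `~` followed by a newline is a line continuation. The name `decode_hz` is hypothetical, and this is only an illustration of the format, not the libiconv-based solution asked for.

```python
def decode_hz(data: bytes) -> str:
    """Decode HZ-GB-2312 bytes (per RFC 1843) into a str. Illustrative sketch."""
    out = []
    i = 0
    gb_mode = False  # HZ text always starts in ASCII mode
    while i < len(data):
        pair = data[i:i + 2]
        if not gb_mode:
            if pair == b"~{":        # enter two-byte GB mode
                gb_mode = True
                i += 2
            elif pair == b"~~":      # escaped literal tilde
                out.append("~")
                i += 2
            elif pair == b"~\n":     # line continuation, produces nothing
                i += 2
            else:                    # ordinary ASCII byte
                out.append(data[i:i + 1].decode("ascii"))
                i += 1
        else:
            if pair == b"~}":        # leave GB mode
                gb_mode = False
                i += 2
            elif len(pair) == 2:
                # Setting the high bit of both bytes gives the EUC-CN
                # (GB 2312) byte pair for the same character.
                euc = bytes(b | 0x80 for b in pair)
                out.append(euc.decode("gb2312"))
                i += 2
            else:
                raise ValueError("truncated two-byte sequence in GB mode")
    return "".join(out)


if __name__ == "__main__":
    # "Dc:C" with the high bits set is C4 E3 BA C3, i.e. two GB 2312 characters.
    print(decode_hz(b"Hello ~{Dc:C~}!"))
```

The same idea carries over to Pascal: strip the escapes, set the high bit on each byte pair, and pass the result to any ordinary GB 2312 (EUC-CN) converter.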

I would prefer a cross-platform way that uses the SetCodepage() RTL procedure.

paweld:
unit LConvEncoding: https://github.com/alrieckert/lazarus/blob/master/components/lazutils/lconvencoding.pas#L200

AlexTP:
No! See the code: 'Chinese, essentially the same as GB 2312'. But GB 2312-HZ is a different encoding!

tetrastes:
It seems that libiconv doesn't support GB 2312-HZ, at least on Ubuntu 22.04.1, which has a fairly recent glibc:

--- Code: Bash ---
$ iconv -V
iconv (Ubuntu GLIBC 2.35-0ubuntu3.1) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
$ iconv -l | grep 2312
CSGB2312//
GB2312//

For comparison:

--- Code: Bash ---
$ uconv -l | grep 2312
ibm-1383_P110-1999 ibm-1383 GB2312 csGB2312 cp1383 1383 EUC-CN ibm-eucCN hp15CN ibm-1383_VPUA
ibm-5478_P100-1995 ibm-5478 GB_2312-80 chinese iso-ir-58 csISO58GB231280 gb2312-1980 GB2312.1980-0
HZ HZ-GB-2312

So I'm afraid you will have to use libicu, but the good news is that it is portable (and Windows 10 already ships with it: https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-).

Just in case, I attach the result of:

--- Code: Bash ---
$ uconv -f HZ -t UTF-8 -o hz2ut8.txt test-gb2312-hz.txt

Is it correct?
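One way to cross-check that result would be to convert the same file with a second, independent HZ implementation and compare the outputs. A minimal sketch, assuming Python 3 is available (its standard codec set includes `hz`); the function name `hz_file_to_utf8` and the output file name `hz2utf8-python.txt` are hypothetical, and the input name is the one used in the command above:

```python
from pathlib import Path


def hz_file_to_utf8(src: str, dst: str) -> None:
    """Read an HZ-encoded file and write its contents as UTF-8."""
    # Python's built-in "hz" codec handles the ~{ ... ~} escapes.
    text = Path(src).read_bytes().decode("hz")
    # Write bytes directly so that no newline translation takes place.
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # File names are assumptions; adjust them to the attached sample.
    hz_file_to_utf8("test-gb2312-hz.txt", "hz2utf8-python.txt")
```

If the file produced this way is byte-identical to the uconv output (for example when compared with cmp), both converters agree on the decoding.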

AlexTP:
So I would need to use ICU on Windows, but that is not a good option for my cross-platform app CudaText.
