
Author Topic: Codepages, encodings, Unicode etc.  (Read 10244 times)

vick

  • New Member
  • *
  • Posts: 15
Codepages, encodings, Unicode etc.
« on: August 26, 2014, 07:01:35 pm »
I started to learn Pascal because I wanted to create cross-platform GUI applications (where 'cross-platform' means Linux and Windows - I don't care about OS X). I intend to work with lots of text. So, I've already read the reference manual and I'm starting to explore the standard library and the Lazarus library. But this made me think twice:
http://wiki.freepascal.org/FPC_Unicode_support
Really, I don't understand any of that, but this all sure sounds like bad news. Can you please give some examples of the following:
Quote
If a string with a static code page X1 is assigned to a string with static code page X2 and X1<>X2, the string data will generally first be converted to said code page X2 before assignment
Anyway, is there any difference between 'code page' and encoding in the Wiki's usage of these terms?
Quote
The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual code page of the data currently held by that AnsiString.
How does it 'define' the 'code page', and what is the compiler actually DOING with that 'definition'?
Quote
Code page-aware strings are the only exception to this rule: concatenating two or more strings always occurs without data loss, although afterwards the resulting string will of course still be converted to the static code page of the destination (which may result in data loss).
Can you give some examples of concatenating strings with different 'code pages' without data loss and then losing data anyways?
I've searched this forum (and some others) and it seems lots of people have troubles with Unicode in Free Pascal. Some workarounds involve temporarily renaming files to ASCII (!), and then back... That's pretty mind-boggling in 2014, to be frank. Maybe Free Pascal is simply not suitable for what I want to do.
Sorry if that sounds a little inflammatory. I understand that the Free Pascal team is working on Unicode support. Still, I have to figure out the state of string handling in FPC before I can proceed any further with it.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Codepages, encodings, Unicode etc.
« Reply #1 on: August 26, 2014, 08:03:41 pm »
I started to learn Pascal because I wanted to create cross-platform GUI applications (where 'cross-platform' means Linux and Windows - I don't care about OS X). I intend to work with lots of text. So, I've already read the reference manual and I'm starting to explore the standard library and the Lazarus library. But this made me think twice:
http://wiki.freepascal.org/FPC_Unicode_support
Really, I don't understand any of that, but this all sure sounds like bad news. Can you please give some examples of the following:
Quote
If a string with a static code page X1 is assigned to a string with static code page X2 and X1<>X2, the string data will generally first be converted to said code page X2 before assignment

It's simple. Let's assume the following definitions:
Code: [Select]
var
  Txt1 : widestring; //utf16 string type
  Txt2 : Utf8String;
Now you want to do something as simple as Txt2 := Txt1; the compiler will convert this to something like
Code: [Select]
  Txt2 := UTF8Encode(Txt1); // UTF8Encode converts from UTF-16 to UTF-8
Furthermore, if you try to concatenate two strings, they are first converted to the encoding of the result string and then concatenated into one, e.g.
Code: [Select]
const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := str1+str2;
end;
will be converted to
Code: [Select]
const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := utf8encode(str1)+utf8encode(str2);
end;
 

Anyway, is there any difference between 'code page' and encoding in the Wiki's usage of these terms?

Yes, a code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know, it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (0 to 127); the rest (128 to 255) are code page dependent: by loading a different code page's characters into that space, you change the secondary language support.


Quote
The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual code page of the data currently held by that AnsiString.
How does it 'define' the 'code page', and what is the compiler actually DOING with that 'definition'?

I have no idea. I was under the impression that if you do not specify a code page in your AnsiString declaration, then the default Windows ANSI code page will be used. On Linux there should be something similar, but I have not looked it up yet, so I'll leave that to more knowledgeable people to answer.
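
For readers wondering what the dynamic code page looks like in practice, here is a minimal sketch. It assumes FPC 3.0 or later (the code page-aware strings described on the wiki, not the 2.6.x series discussed in this thread), a source file saved as UTF-8, and the cwstring unit for conversions on Unix. The type name TCP1252String is made up for the illustration. It prints the dynamic code page and the byte length of the same text after each assignment.

```pascal
program DynamicCodePageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix
  SysUtils;

type
  // Hypothetical type name: an AnsiString whose static code page is 1252
  TCP1252String = type AnsiString(1252);

var
  U: UTF8String;     // static code page: UTF-8
  W: TCP1252String;  // static code page: 1252
  R: RawByteString;  // no static code page, accepts any data as-is
begin
  U := 'é';
  WriteLn('UTF8String:    code page ', StringCodePage(U), ', ', Length(U), ' byte(s)');

  // The static code pages differ, so the data is converted on assignment
  W := U;
  WriteLn('TCP1252String: code page ', StringCodePage(W), ', ', Length(W), ' byte(s)');

  // RawByteString takes the data unchanged, together with its dynamic code page
  R := U;
  WriteLn('RawByteString: code page ', StringCodePage(R), ', ', Length(R), ' byte(s)');
end.
```

StringCodePage reads the code page stored in the string header next to the length and reference count, which is what the wiki means by "dynamic code page".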


Quote
Code page-aware strings are the only exception to this rule: concatenating two or more strings always occurs without data loss, although afterwards the resulting string will of course still be converted to the static code page of the destination (which may result in data loss).
Can you give some examples of concatenating strings with different 'code pages' without data loss and then losing data anyways?
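
As an illustration of the quoted wiki passage, here is a hedged sketch, again assuming FPC 3.0 or later, a UTF-8 source file and cwstring on Unix. The type names TCP1251String and TCP1252String are invented for the example. The same concatenation is assigned to two destinations: one that can hold every character and one that cannot.

```pascal
program ConcatDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix
  SysUtils;

type
  // Hypothetical type names for two single-byte code pages
  TCP1251String = type AnsiString(1251);  // Cyrillic
  TCP1252String = type AnsiString(1252);  // Western European

var
  Cyr: TCP1251String;
  West: TCP1252String;
  All: UTF8String;
  Lossy: TCP1252String;
begin
  Cyr := 'Привет';
  West := 'café';

  // The destination is UTF-8, which can represent both operands
  All := Cyr + ' ' + West;

  // Same concatenation, but the destination is code page 1252, which
  // has no Cyrillic letters, so those characters cannot survive
  Lossy := Cyr + ' ' + West;

  WriteLn(Length(All), ' bytes in the UTF-8 result');
  WriteLn(Length(Lossy), ' bytes in the CP1252 result');
end.
```

The concatenation itself does not drop anything; the loss happens only in the final conversion to the destination's static code page.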

I've searched this forum (and some others) and it seems lots of people have troubles with Unicode in Free Pascal. Some workarounds involve temporarily renaming files to ASCII (!), and then back... That's pretty mind-boggling in 2014, to be frank. Maybe Free Pascal is simply not suitable for what I want to do.
Sorry if that sounds a little inflammatory. I understand that the Free Pascal team is working on Unicode support. Still, I have to figure out the state of string handling in FPC before I can proceed any further with it.

I never had any problems with strings in FPC; then again, I only have national applications to worry about, not international ones.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

vick

  • New Member
  • *
  • Posts: 15
Re: Codepages, encodings, Unicode etc.
« Reply #2 on: August 26, 2014, 09:25:07 pm »
Quote
Furthermore, if you try to concatenate two strings, they are first converted to the encoding of the result string and then concatenated into one, e.g.
Code: [Select]
const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := str1+str2;
end;
will be converted to
Code: [Select]
const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := utf8encode(str1)+utf8encode(str2);
end;
I see... So, by default, on Windows, FPC will insert some invisible code to convert ansistrings to/from whatever the GetACP function returns. And on Linux, it will use $LANG. Don't think I want any of that... So I guess my best option is to avoid ansistrings altogether?
Quote
Yes, a code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know, it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (0 to 127); the rest (128 to 255) are code page dependent: by loading a different code page's characters into that space, you change the secondary language support.
ASCII and UTF-16 are encodings, though... And all this legacy one-byte Windows stuff bothers me. All of that looks extremely error-prone.
What's the deal with filenames on Windows? For example, this thread says
Quote
As Taazz explained, FPC used ANSI version of Windows API functions. To be precise, when you call reset it calls CreateFileA instead of CreateFileW.
http://forum.lazarus.freepascal.org/index.php?topic=24765.0
While the Wiki says
Quote
the RTL uses UTF-16 OS API calls
I was under the impression that CreateFileA accepts filenames in a one-byte encoding, while CreateFileW accepts UTF-16... Can I create a Windows file that's called 'C:\Program Files\فارسی\ქართული\日本語\Русский\Türkçe.txt' ? Can I open this file?

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Codepages, encodings, Unicode etc.
« Reply #3 on: August 26, 2014, 10:22:55 pm »
@vick, are you familiar with ANSI, UTF8, UTF16? If not, then I encourage you to read this article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Meanwhile, the path you mentioned above that includes a few different languages is using, according to the meta code of the page, UTF8 encoding:
Quote
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

vick

  • New Member
  • *
  • Posts: 15
Re: Codepages, encodings, Unicode etc.
« Reply #4 on: August 26, 2014, 11:04:19 pm »
Quote
@vick, are you familiar with ANSI, UTF8, UTF16?
Yes. The latter two are encodings for Unicode (Unicode Transformation Format), while the former is the American National Standards Institute, which is some USA organization :)

Quote
Meanwhile, the path you mentioned above that includes a few different languages is using, according to the meta code of the page, UTF8 encoding:
Quote
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
PHP's Unicode story is pretty pathetic, though... to put it mildly :)

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Codepages, encodings, Unicode etc.
« Reply #5 on: August 26, 2014, 11:11:37 pm »
I see... So, by default, on Windows, FPC will insert some invisible code to convert ansistrings to/from whatever the GetACP function returns. And on Linux, it will use $LANG. Don't think I want any of that... So I guess my best option is to avoid ansistrings altogether?
No, the idea is that if your ansistring has no declared code page, it uses the system default, so no conversion is needed. There is no conversion from ansistring to ansistring unless you specify that one is needed by setting the code page when the variable is declared, e.g.
Code: [Select]
function RTrim(const inStr: ansistring): ansistring;
begin
  Result := trim(instr);
end;

The above function involves no string-to-string conversion. Furthermore, since ansistring is a managed type, there is no in-memory copy, just a reference count increment, until a change is made.
But
Code: [Select]
function RTrim(const inStr: UTF8String): ansistring; // UTF8String is an ansistring with the UTF-8 code page
begin
  Result := trim(instr);
end;

will force a conversion to UTF-8 when RTrim is called, then force another one when inStr is passed to Trim, but not when the result from Trim is assigned to the RTrim result.

Every conversion means that a new string is created in memory, slowing things down even more.

Quote
Yes, a code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know, it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (0 to 127); the rest (128 to 255) are code page dependent: by loading a different code page's characters into that space, you change the secondary language support.
ASCII and UTF-16 are encodings, though... And all this legacy one-byte Windows stuff bothers me. All of that looks extremely error-prone.
What's the deal with filenames on Windows? For example, this thread says
Quote
As Taazz explained, FPC used ANSI version of Windows API functions. To be precise, when you call reset it calls CreateFileA instead of CreateFileW.
http://forum.lazarus.freepascal.org/index.php?topic=24765.0
While the Wiki says
Quote
the RTL uses UTF-16 OS API calls
I was under the impression that CreateFileA accepts filenames in a one-byte encoding, while CreateFileW accepts UTF-16... Can I create a Windows file that's called 'C:\Program Files\فارسی\ქართული\日本語\Русский\Türkçe.txt' ? Can I open this file?

The wiki talks about the latest development. As far as I know, the 2.6.X series of the compiler is ANSI based and not UTF-16 on Windows, while the 2.7.X series is UTF-16 based. On Linux I guess that both versions are UTF-8 based, since that is the default system encoding.

When the 2.8.X series is released, I would expect all the xxxxxA Windows API calls to be converted to xxxxxW and the string type to be an alias of unicodestring instead of ansistring, avoiding unnecessary string conversions.

vick

  • New Member
  • *
  • Posts: 15
Re: Codepages, encodings, Unicode etc.
« Reply #6 on: August 26, 2014, 11:59:46 pm »
Quote
No, the idea is that if your ansistring has no declared code page, it uses the system default, so no conversion is needed. There is no conversion from ansistring to ansistring unless you specify that one is needed by setting the code page when the variable is declared
Thank you. I was thinking about the scenario where all my strings are utf8 (or utf16). So, let's say I want to read a utf-8 encoded file into memory, and work with it as either an array of code points or (if that's impossible) just bytes, but I certainly don't want FPC to change it in any way (when passing strings from the file to some functions in the standard library).
Quote
The wiki talks about the latest development. As far as I know, the 2.6.X series of the compiler is ANSI based and not UTF-16 on Windows, while the 2.7.X series is UTF-16 based. On Linux I guess that both versions are UTF-8 based, since that is the default system encoding.
That's a pity... oh well. I'll see if I can deal with it.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Codepages, encodings, Unicode etc.
« Reply #7 on: August 27, 2014, 12:13:55 am »
Quote
No, the idea is that if your ansistring has no declared code page, it uses the system default, so no conversion is needed. There is no conversion from ansistring to ansistring unless you specify that one is needed by setting the code page when the variable is declared
Thank you. I was thinking about the scenario where all my strings are utf8 (or utf16). So, let's say I want to read a utf-8 encoded file into memory, and work with it as either an array of code points or (if that's impossible) just bytes, but I certainly don't want FPC to change it in any way (when passing strings from the file to some functions in the standard library).

If by standard library you mean the FPC RTL, FCL and the rest of the libraries that come by default with a Lazarus installation, then you do have to be careful if the encoding you use is different from the default. There is no magic bullet for that.
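
One way to keep the file's bytes untouched, sketched here under the assumption of FPC 3.0 or later (SetCodePage and RawByteString are not part of the 2.6.x RTL discussed in this thread): read the file through a stream into a RawByteString and then only label the data as UTF-8. The helper name ReadFileBytes and the file name input.txt are placeholders for the illustration.

```pascal
program ReadRawDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: returns the file content byte for byte
function ReadFileBytes(const FileName: string): RawByteString;
var
  Stream: TFileStream;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Stream.Size > 0 then
      Stream.ReadBuffer(Result[1], Stream.Size);
  finally
    Stream.Free;
  end;
end;

var
  Data: RawByteString;
begin
  Data := ReadFileBytes('input.txt');  // placeholder file name

  // Convert = False only labels the bytes as UTF-8; the data is not changed
  SetCodePage(Data, CP_UTF8, False);

  WriteLn(Length(Data), ' bytes read, code page ', StringCodePage(Data));
end.
```

Conversions can still happen later if the data is assigned to a string type with a different static code page, so the care taazz mentions remains necessary when calling library routines.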

Quote
The wiki talks about the latest development. As far as I know, the 2.6.X series of the compiler is ANSI based and not UTF-16 on Windows, while the 2.7.X series is UTF-16 based. On Linux I guess that both versions are UTF-8 based, since that is the default system encoding.
That's a pity... oh well. I'll see if I can deal with it.
Deal with what? I still don't understand what you are missing.

vick

  • New Member
  • *
  • Posts: 15
Re: Codepages, encodings, Unicode etc.
« Reply #8 on: August 27, 2014, 12:35:59 am »
Quote
If by standard library you mean the FPC RTL, FCL and the rest of the libraries that come by default with a Lazarus installation, then you do have to be careful if the encoding you use is different from the default. There is no magic bullet for that.
Sure, no magic bullets... still, I consider all silent conversions of text a misfeature. That reminds me of Perl:
Code: [Select]
$ perl -E 'binmode STDOUT, ":encoding(utf-8)"; say "Привет, мир!"'
Привет, мир!
Does it have to work like that?  :( Is that really useful? Not for me. Still, I like Perl, but this thing here is a pretty severe design flaw, IMO.
Python 3 completely broke backwards compatibility to fix a very similar issue.
Quote
Deal with what? I still don't understand what you are missing.
For example, with rather obscure stuff like that:
Quote
As I said in my last post rewriting the fpc function HashFile to use streams instead of typed files is the only way to make your application work for any unicode filename now

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Codepages, encodings, Unicode etc.
« Reply #9 on: August 27, 2014, 01:12:13 am »
You will always have problems like that in all languages/IDEs. It only means that the people behind the project cannot respond fast enough to changes in the operating system, a common problem in open source projects without commercial support.

At this point I get the impression that you haven't tried to implement your solution yet. Although there is a strong community behind the project, which has already proved that it can and will help you (as you can see in the thread you are quoting), you have fallen prey to your fears and you are missing the simplicity that the current implementation offers.
My advice to you is to at least start the design: create something concrete that people can poke holes in and post it here. You will get much more accurate help that way.
In any case, I would like to hear from you what you choose to use for your project, and why, when your evaluation is over.

vick

  • New Member
  • *
  • Posts: 15
Re: Codepages, encodings, Unicode etc.
« Reply #10 on: August 27, 2014, 01:24:14 am »
Quote
My advice to you is to at least start the design: create something concrete that people can poke holes in and post it here. You will get much more accurate help that way.
In any case, I would like to hear from you what you choose to use for your project, and why, when your evaluation is over.
That sounds very reasonable, I'll do that. Thank you for your time.

mse

  • Sr. Member
  • ****
  • Posts: 286
Re: Codepages, encodings, Unicode etc.
« Reply #11 on: August 27, 2014, 07:16:48 am »
I started to learn Pascal because I wanted to create cross-platform GUI applications (where 'cross-platform' means Linux and Windows - I don't care about OS X).
MSEide+MSEgui is a GUI framework for Free Pascal completely made with the 16 bit "UnicodeString". It also has its own UnicodeString file and system access functions and components. The MSEgui database framework is based on UnicodeString too. Maybe it suits your needs.
http://sourceforge.net/projects/mseide-msegui/
Lazarus and fpGUI use utf-8 in 8 bit "AnsiString" everywhere and provide utf-8 versions of the FPC system and utility functions.

« Last Edit: August 27, 2014, 07:55:14 am by mse »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Codepages, encodings, Unicode etc.
« Reply #12 on: August 27, 2014, 01:36:54 pm »
... made with the 16 bit "UnicodeString".
That's bad and wasteful.
The 5-character word "Hello" gets translated into 10 bytes
48 00 65 00 6C 00 6C 00 6F 00

That's why we have UTF8 where:
The 5-character word "Hello" gets translated into 5 bytes
48 65 6C 6C 6F.
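
The byte counts above can be checked with a short sketch. It uses only ASCII text, so it does not depend on the source file encoding or on any conversion unit; the variable names are arbitrary.

```pascal
program ByteSizeDemo;
{$mode objfpc}{$H+}

var
  U8: UTF8String;      // one byte per code unit
  U16: UnicodeString;  // two bytes per code unit (UTF-16)
begin
  U8 := 'Hello';
  U16 := 'Hello';

  // Length counts code units, so multiply by the size of one unit
  WriteLn('UTF-8:  ', Length(U8) * SizeOf(AnsiChar), ' bytes');
  WriteLn('UTF-16: ', Length(U16) * SizeOf(UnicodeChar), ' bytes');
end.
```

This reproduces the 5 versus 10 bytes for "Hello". The ratio depends on the text, though: characters in the range U+0800 to U+FFFF take three bytes in UTF-8 but only two in UTF-16.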

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Codepages, encodings, Unicode etc.
« Reply #13 on: August 27, 2014, 05:07:55 pm »
That's why we have UTF8 where:

The trouble is that UTF-8 on Windows is only an encoding used in documents; it is not really used in the Windows API itself.

The same goes for Qt, .NET, Java and Cocoa, which also default to 2-byte encodings.

For economics of encodings see http://www.stack.nl/~marcov/unicode.pdf   paragraph 0.1.3

 
