Codepages, encodings, Unicode etc.

vick:
I started to learn Pascal because I wanted to create cross-platform GUI applications (where 'cross-platform' means Linux and Windows - I don't care about OS X). I intend to work with lots of text. So, I've already read the reference manual and I'm starting to explore the standard library and Lazarus library. But this made me think twice:
http://wiki.freepascal.org/FPC_Unicode_support
Really, I don't understand any of that, but this all sure sounds like bad news. Can you please give some examples of the following:

--- Quote ---If a string with a static code page X1 is assigned to a string with static code page X2 and X1<>X2, the string data will generally first be converted to said code page X2 before assignment
--- End quote ---
Anyway, is there any difference between 'code page' and encoding in the Wiki's usage of these terms?

--- Quote ---The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual code page of the data currently held by that AnsiString.
--- End quote ---
How does it 'define' 'code page' and what the compiler is actually DOING with that 'definition'?

--- Quote ---Code page-aware strings are the only exception to this rule: concatenating two or more strings always occurs without data loss, although afterwards the resulting string will of course still be converted to the static code page of the destination (which may result in data loss).
--- End quote ---
Can you give some examples of concatenating strings with different 'code pages' without data loss and then losing data anyways?
I've searched this forum (and some others) and it seems lots of people have troubles with Unicode in Free Pascal. Some workarounds involve temporarily renaming files to ASCII (!), and then back... That's pretty mindboggling in 2014, to be frank. Maybe Free Pascal is simply not suitable for what I want to do.
Sorry if that sounds a little inflammatory. I understand that the Free Pascal team is working on Unicode support. Still, I have to figure out the state of string handling in FPC before I can proceed any further with it.

taazz:

--- Quote from: vick on August 26, 2014, 07:01:35 pm ---I started to learn Pascal because I wanted to create cross-platform GUI applications (where 'cross-platform' means Linux and Windows - I don't care about OS X). I intend to work with lots of text. So, I've already read the reference manual and I'm starting to explore the standard library and Lazarus library. But this made me think twice:
http://wiki.freepascal.org/FPC_Unicode_support
Really, I don't understand any of that, but this all sure sounds like bad news. Can you please give some examples of the following:

--- Quote ---If a string with a static code page X1 is assigned to a string with static code page X2 and X1<>X2, the string data will generally first be converted to said code page X2 before assignment
--- End quote ---

--- End quote ---

It's simple. Let's assume the following definitions:

--- Code: ---var
  Txt1 : widestring; //utf16 string type
  Txt2 : Utf8String;

--- End code ---
Now you want to do something as simple as Txt2 := Txt1; this will be converted by the compiler to something like

--- Code: ---  Txt2 := UTF8Encode(Txt1); // UTF8Encode converts from UTF-16 to UTF-8

--- End code ---
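A minimal sketch of that assignment as a complete program, assuming FPC 3.0 or later (the version with the code page-aware strings the wiki page describes). The helper name ImplicitToUtf8 is made up for illustration, and the cwstring unit is pulled in on Unix only so that a widestring manager is installed:

```pascal
program WideToUtf8;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // installs a widestring manager on Unix
{$endif}

// Hypothetical helper: relies only on the implicit conversion.
function ImplicitToUtf8(const W: WideString): UTF8String;
begin
  Result := W;  // the compiler inserts the UTF-16 -> UTF-8 conversion here
end;

var
  Txt1: WideString;
  Txt2: UTF8String;
begin
  Txt1 := WideChar($00E9);  // U+00E9 (e with acute), one UTF-16 code unit
  Txt2 := ImplicitToUtf8(Txt1);
  WriteLn('UTF-16 code units:  ', Length(Txt1));
  WriteLn('UTF-8 bytes:        ', Length(Txt2));
  WriteLn('same as UTF8Encode: ', Txt2 = UTF8Encode(Txt1));
end.
```

The character is written as a numeric code so that the result does not depend on the encoding of the source file.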
Furthermore, if you try to concatenate two strings, they are first converted to the type of the result string and then concatenated into one, e.g.

--- Code: --- const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := str1+str2;
end;

--- End code ---
will be converted to

--- Code: --- const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := utf8encode(str1)+utf8encode(str2);
end;

--- End code ---
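To make the X1<>X2 case from the wiki quote concrete, here is a rough sketch with two explicitly declared code pages, assuming FPC 3.0 or later. The type CP1251String and the helpers BytesAsHex and Cp1251ToUtf8 are names invented for this example, not part of the RTL:

```pascal
program StaticCodePageAssign;
{$mode objfpc}{$H+}
uses
  {$ifdef unix} cwstring, {$endif}  // widestring manager on Unix
  SysUtils;

type
  CP1251String = type AnsiString(1251);  // static code page X1 = 1251

// Hypothetical helper: dumps the raw bytes; RawByteString parameters
// accept any AnsiString without converting it.
function BytesAsHex(const S: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

// Hypothetical helper: X1 = 1251 assigned to X2 = UTF-8.
function Cp1251ToUtf8(const S: CP1251String): UTF8String;
begin
  Result := S;  // the data is converted before the assignment
end;

var
  U: UnicodeString;
  Cyr: CP1251String;
begin
  U := WideChar($0416);  // Cyrillic capital Zhe
  Cyr := U;              // UTF-16 -> CP1251
  WriteLn('CP1251 bytes: ', BytesAsHex(Cyr));
  WriteLn('UTF-8 bytes:  ', BytesAsHex(Cp1251ToUtf8(Cyr)));
end.
```

U+0416 is the single byte C6 in CP1251 and the two bytes D0 96 in UTF-8, so the two lines should differ if the platform's conversion routines know CP1251.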
 


--- Quote from: vick on August 26, 2014, 07:01:35 pm ---Anyway, is there any difference between 'code page' and encoding in the Wiki's usage of these terms?

--- End quote ---

Yes, a code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (codes 0 to 127); the rest (128 to 255) depend on the code page, so by loading a different code page's characters into that range you change the secondary language support.
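
As an illustration of the 128 to 255 range, the sketch below decodes the same byte under two code pages. It assumes FPC 3.0 or later; DecodeByte is a hypothetical helper, and SetCodePage is called with Convert = False so that only the code page tag changes, not the data:

```pascal
program SameByteTwoCodePages;
{$mode objfpc}{$H+}
uses
  {$ifdef unix} cwstring, {$endif}  // widestring manager on Unix
  SysUtils;

// Hypothetical helper: interpret one byte according to code page CP
// and return the resulting UTF-16 code unit (0 if nothing came back).
function DecodeByte(B: Byte; CP: TSystemCodePage): Word;
var
  Raw: RawByteString;
  U: UnicodeString;
begin
  SetLength(Raw, 1);
  Raw[1] := AnsiChar(B);
  SetCodePage(Raw, CP, False);  // retag only, the byte stays the same
  U := UnicodeString(Raw);      // decode according to the tag
  if U = '' then
    Result := 0
  else
    Result := Ord(U[1]);
end;

begin
  WriteLn('byte $E9 as CP1252: U+', IntToHex(DecodeByte($E9, 1252), 4));
  WriteLn('byte $E9 as CP1251: U+', IntToHex(DecodeByte($E9, 1251), 4));
end.
```

Byte E9 is U+00E9 (Latin e with acute) in CP1252 and U+0439 (Cyrillic short i) in CP1251, so the two lines should show different code points on a system that supports both code pages.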


--- Quote from: vick on August 26, 2014, 07:01:35 pm ---

--- Quote ---The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual code page of the data currently held by that AnsiString.
--- End quote ---
How does it 'define' 'code page' and what the compiler is actually DOING with that 'definition'?

--- End quote ---

I have no idea. I was under the impression that if you did not define a code page in your AnsiString declaration, then the default Windows ANSI code page will be used. On Linux there should be something similar, but I have not looked it up yet, so I'll leave it to more knowledgeable people to answer that.
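
One way to look at the dynamic code page is to print it. This is only a sketch, assuming FPC 3.0 or later, where StringCodePage reports the tag stored with the string data and DefaultSystemCodePage holds the code page used for plain AnsiString. CP1252String and ToCP1252 are names made up for the example:

```pascal
program DynamicCodePage;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // installs a widestring manager on Unix
{$endif}

type
  CP1252String = type AnsiString(1252);

// Hypothetical helper: the assignment converts the data to CP1252.
function ToCP1252(const S: UTF8String): CP1252String;
begin
  Result := S;
end;

var
  U: UnicodeString;
  U8: UTF8String;
  Raw: RawByteString;
  Plain: AnsiString;
begin
  U := WideChar($00E9);
  U8 := UTF8Encode(U);
  Raw := U8;    // RawByteString: no conversion, the tag travels with the data
  Plain := U8;  // plain AnsiString: converted to the default code page
  WriteLn('UTF8String:            ', StringCodePage(U8));
  WriteLn('RawByteString:         ', StringCodePage(Raw));
  WriteLn('CP1252String:          ', StringCodePage(ToCP1252(U8)));
  WriteLn('plain AnsiString:      ', StringCodePage(Plain));
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
end.
```

The last two values depend on the machine: on Windows they should reflect the ANSI code page, on Linux whatever the locale settings select.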


--- Quote from: vick on August 26, 2014, 07:01:35 pm ---

--- Quote ---Code page-aware strings are the only exception to this rule: concatenating two or more strings always occurs without data loss, although afterwards the resulting string will of course still be converted to the static code page of the destination (which may result in data loss).
--- End quote ---
Can you give some examples of concatenating strings with different 'code pages' without data loss and then losing data anyways?

I've searched this forum (and some others) and it seems lots of people have troubles with Unicode in Free Pascal. Some workarounds involve temporarily renaming files to ASCII (!), and then back... That's pretty mindboggling in 2014, to be frank. Maybe Free Pascal is simply not suitable for what I want to do.
Sorry if that sounds a little inflammatory. I understand that the Free Pascal team is working on Unicode support. Still, I have to figure out the state of string handling in FPC before I can proceed any further with it.

--- End quote ---

I have never had any problems with strings in FPC; then again, I don't have international applications to worry about, only national ones.
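
For the concatenation question quoted above, a possible experiment, going only by what the wiki passage says and assuming FPC 3.0 or later. The two string types and the two helper functions are invented for this sketch; the same Cyrillic plus Latin pair is concatenated once into a UTF-8 destination and once into a CP1252 destination:

```pascal
program ConcatCodePages;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // installs a widestring manager on Unix
{$endif}

type
  CP1251String = type AnsiString(1251);  // Cyrillic
  CP1252String = type AnsiString(1252);  // Western European

// Hypothetical helper: the destination can represent both inputs.
function ConcatToUtf8(const A: CP1251String;
  const B: CP1252String): UTF8String;
begin
  Result := A + B;
end;

// Hypothetical helper: the destination cannot represent Cyrillic.
function ConcatToCP1252(const A: CP1251String;
  const B: CP1252String): CP1252String;
begin
  Result := A + B;
end;

var
  U: UnicodeString;
  Cyr: CP1251String;
  Lat: CP1252String;
begin
  U := WideChar($0416);  // Cyrillic capital Zhe
  Cyr := U;
  U := WideChar($00E9);  // Latin e with acute
  Lat := U;
  WriteLn('UTF-8 result length in bytes: ',
    Length(ConcatToUtf8(Cyr, Lat)));
  WriteLn('CP1252 result still equals the UTF-8 one: ',
    UnicodeString(ConcatToCP1252(Cyr, Lat)) =
    UnicodeString(ConcatToUtf8(Cyr, Lat)));
end.
```

If the wiki description holds, the UTF-8 result keeps both characters (two bytes each), while the CP1252 result loses the Cyrillic letter, typically to a placeholder such as '?', when the concatenated text is converted to the destination's code page.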

vick:

--- Quote ---Furthermore, if you try to concatenate two strings, they are first converted to the type of the result string and then concatenated into one, e.g.

--- Code: ---const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := str1+str2;
end;

--- End code ---
will be converted to

--- Code: ---const
  str1 : ansistring = 'This is an Ansi String ';
  str2 : ansistring = ' to be converted';
var
  Txt : utf8String;
begin
   txt := utf8encode(str1)+utf8encode(str2);
end;

--- End code ---

--- End quote ---
I see... So, by default, on Windows, FPC will insert some invisible code to convert ansistrings to/from whatever the GetACP function returns. And on Linux, it will use $LANG. I don't think I want any of that... So I guess my best option is to avoid ansistrings altogether?

--- Quote ---Yes, a code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (codes 0 to 127); the rest (128 to 255) depend on the code page, so by loading a different code page's characters into that range you change the secondary language support.
--- End quote ---
ASCII and UTF-16 are encodings, though... And all this legacy one-byte Windows stuff bothers me. All of that looks extremely error-prone.
What's the deal with filenames on Windows? For example, this thread says

--- Quote ---As Taazz explained, FPC used ANSI version of Windows API functions. To be precise, when you call reset it calls CreateFileA instead of CreateFileW.
--- End quote ---
http://forum.lazarus.freepascal.org/index.php?topic=24765.0
While the Wiki says

--- Quote ---the RTL uses UTF-16 OS API calls

--- End quote ---
I was under the impression that CreateFileA accepts filenames in a one-byte encoding, while CreateFileW accepts UTF-16... Can I create a Windows file that's called 'C:\Program Files\فارسی\ქართული\日本語\Русский\Türkçe.txt' ? Can I open this file?

engkin:
@vick, are you familiar with ANSI, UTF8, UTF16? If not, then I encourage you to read this article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Meanwhile, the path you mentioned above that includes a few different languages is using, according to the meta code of the page, UTF8 encoding:

--- Quote ---<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
--- End quote ---

vick:

--- Quote ---@vick, are you familiar with ANSI, UTF8, UTF16?
--- End quote ---
Yes. The latter two are encodings for Unicode (Unicode Transformation Format), while the former is the American National Standards Institute, which is some USA organization :)


--- Quote ---Meanwhile, the path you mentioned above that includes a few different languages is using, according to the meta code of the page, UTF8 encoding:

--- Quote ---<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
--- End quote ---

--- End quote ---
PHP's Unicode story is pretty pathetic, though... to put it mildly :)
