Codepages, encodings, Unicode etc.

taazz:

--- Quote from: vick on August 26, 2014, 09:25:07 pm ---I see... So, by default, on Windows, FPC will insert some invisible code to convert ansistrings to/from whatever the GetACP function returns. And on Linux, it will use $LANG. Don't think I want any of that... So I guess my best option is to avoid ansistrings altogether?

--- End quote ---
No. The idea is that an ansistring declared without a code page uses the system default code page, so no conversion is needed. There is no conversion from ansistring to ansistring unless you ask for one by setting the code page when the variable is declared. For example:

--- Code: ---function RTrim(const inStr: ansistring): ansistring;
begin
  Result := trim(instr);
end;

--- End code ---

The above function involves no string-to-string conversion. Furthermore, since ansistring is a managed type, there is no in-memory copy, just a reference count increment, until a change is made.
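
The reference counting can be observed directly. This is only a minimal sketch, assuming FPC 3.0 or later (StringRefCount is not available in the 2.6.x series); the program and variable names are made up for illustration.

--- Code: ---program RefCountDemo;
{$mode objfpc}{$H+}

var
  A, B: AnsiString;
begin
  // Build the string at run time so that it lives on the heap
  // (string constants are not reference counted).
  A := StringOfChar('x', 5);

  // Plain assignment: B shares A's buffer, only the reference count changes.
  B := A;
  WriteLn('same buffer after assignment: ', Pointer(A) = Pointer(B));
  WriteLn('reference count of A: ', StringRefCount(A));

  // First write to B: the runtime gives B its own copy (copy-on-write).
  B[1] := 'X';
  WriteLn('same buffer after write: ', Pointer(A) = Pointer(B));
  WriteLn('reference count of A: ', StringRefCount(A));
end.

--- End code ---

The two pointer comparisons should differ: shared before the write, separate after it.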
But the following version behaves differently:

--- Code: ---function RTrim(const inStr: UTF8String): ansistring;
begin
  Result := trim(instr);
end;

--- End code ---

This will force a conversion to UTF-8 when RTrim is called, then force another one when inStr is passed to Trim, but not when the result from Trim is assigned to the RTrim result.

Every conversion means that a new string is created in memory, slowing things down even more.
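
To see where a conversion can happen, the code page attached to each string can be printed. A rough sketch, assuming a code-page-aware compiler (FPC 2.7.1 / 3.0 or later) and a source file saved as UTF-8; the values printed depend on the system, so none are shown here.

--- Code: ---program CodePageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}

var
  U8: UTF8String;      // always tagged as UTF-8
  Plain: AnsiString;   // uses the system default code page
  Raw: RawByteString;  // accepts any code page without converting
begin
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  U8 := 'Türkçe';
  WriteLn('U8 code page:    ', StringCodePage(U8));

  // UTF8String -> AnsiString: converted to DefaultSystemCodePage
  // if that differs from UTF-8, which allocates a new string.
  Plain := U8;
  WriteLn('Plain code page: ', StringCodePage(Plain));

  // UTF8String -> RawByteString: no conversion, the source code page is kept.
  Raw := U8;
  WriteLn('Raw code page:   ', StringCodePage(Raw));
end.

--- End code ---

If the default code page is already UTF-8, the assignment to Plain should not need to convert anything.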


--- Quote from: vick on August 26, 2014, 09:25:07 pm ---
--- Quote ---Yes, code page is a property of the encoding. E.g. UTF-16 has no code pages; as far as I know, it is designed to support all languages in a single continuous space. ASCII and ANSI, on the other hand, have code pages. For example, ASCII describes only the first 128 characters (0 to 127); the rest (from 128 to 255) depend on the code page. By loading a different code page into that space you change the secondary language support.
--- End quote ---
ASCII and UTF-16 are encodings, though... And all this legacy one-byte Windows stuff bothers me. All of that looks extremely error-prone.
What's the deal with filenames on Windows? For example, this thread says

--- Quote ---As Taazz explained, FPC used ANSI version of Windows API functions. To be precise, when you call reset it calls CreateFileA instead of CreateFileW.
--- End quote ---
http://forum.lazarus.freepascal.org/index.php?topic=24765.0
While the Wiki says

--- Quote ---the RTL uses UTF-16 OS API calls

--- End quote ---
I was under the impression that CreateFileA accepts filenames in a one-byte encoding, while CreateFileW accepts UTF-16... Can I create a Windows file that's called 'C:\Program Files\فارسی\ქართული\日本語\Русский\Türkçe.txt' ? Can I open this file?

--- End quote ---

The wiki talks about the latest development. As far as I know, the 2.6.x series of the compiler is ANSI-based, not UTF-16-based, on Windows, while the 2.7.x series is UTF-16-based. On Linux, I guess both versions are UTF-8-based, since that is the default system encoding.

When the 2.8.x series is released, I would expect all the xxxxxA Windows API calls to be converted to xxxxxW, and the string type to become an alias of unicodestring instead of ansistring, avoiding unnecessary string conversions.
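
Independently of what the RTL does internally, the wide API can be called directly from the Windows unit. This is only a sketch: it is Windows-only, it assumes the source file is saved as UTF-8, and the file name is just an example created in the current directory.

--- Code: ---program CreateUnicodeFile;
{$mode objfpc}{$H+}
{$codepage utf8}

uses
  Windows;

var
  Name: UnicodeString;  // UTF-16, the form that the xxxxxW calls expect
  H: THandle;
begin
  Name := 'Türkçe.txt';  // example name only

  // CreateFileW takes a UTF-16 name, so no ANSI code page is involved.
  H := CreateFileW(PWideChar(Name), GENERIC_WRITE, 0, nil,
                   CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
  if H = INVALID_HANDLE_VALUE then
    WriteLn('CreateFileW failed, error ', GetLastError)
  else
  begin
    WriteLn('file created');
    CloseHandle(H);
  end;
end.

--- End code ---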

vick:

--- Quote ---No. The idea is that an ansistring declared without a code page uses the system default code page, so no conversion is needed. There is no conversion from ansistring to ansistring unless you ask for one by setting the code page when the variable is declared.
--- End quote ---
Thank you. I was thinking about the scenario where all my strings are UTF-8 (or UTF-16). So, let's say I want to read a UTF-8 encoded file into memory and work with it as either an array of code points or (if that's impossible) just bytes. I certainly don't want FPC to change it in any way when passing strings from the file to functions in the standard library.

--- Quote ---The wiki talks about the latest development. As far as I know, the 2.6.x series of the compiler is ANSI-based, not UTF-16-based, on Windows, while the 2.7.x series is UTF-16-based. On Linux, I guess both versions are UTF-8-based, since that is the default system encoding.
--- End quote ---
That's a pity... oh well. I'll see if I can deal with it.

taazz:

--- Quote from: vick on August 26, 2014, 11:59:46 pm ---
--- Quote ---No. The idea is that an ansistring declared without a code page uses the system default code page, so no conversion is needed. There is no conversion from ansistring to ansistring unless you ask for one by setting the code page when the variable is declared.
--- End quote ---
Thank you. I was thinking about the scenario where all my strings are UTF-8 (or UTF-16). So, let's say I want to read a UTF-8 encoded file into memory and work with it as either an array of code points or (if that's impossible) just bytes. I certainly don't want FPC to change it in any way when passing strings from the file to functions in the standard library.

--- End quote ---

If by standard library you mean the FPC RTL, the FCL and the rest of the libraries that come by default with a Lazarus installation, then you do have to be careful if the encoding you use is different from the default. There is no magic bullet for that.
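
One way to stay out of string conversions entirely is to keep the file contents as bytes. A minimal sketch, assuming the file holds valid UTF-8; LoadBytes is a made-up helper name, not an RTL function.

--- Code: ---program ReadUtf8Bytes;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: read the whole file into a byte array, untouched.
function LoadBytes(const FileName: string): TBytes;
var
  FS: TFileStream;
begin
  Result := nil;
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, FS.Size);
    if Length(Result) > 0 then
      FS.ReadBuffer(Result[0], Length(Result));
  finally
    FS.Free;
  end;
end;

var
  Data: TBytes;
  I, CodePoints: SizeInt;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: ReadUtf8Bytes <file>');
    Halt(1);
  end;

  Data := LoadBytes(ParamStr(1));

  // In valid UTF-8 every code point starts with a byte that is not
  // a continuation byte (continuation bytes look like 10xxxxxx).
  CodePoints := 0;
  for I := 0 to High(Data) do
    if (Data[I] and $C0) <> $80 then
      Inc(CodePoints);

  WriteLn(Length(Data), ' bytes, ', CodePoints, ' code points');
end.

--- End code ---

Since TBytes is not a string type, nothing here is tagged with a code page, so the compiler has nothing to convert.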


--- Quote from: vick on August 26, 2014, 11:59:46 pm ---
--- Quote ---The wiki talks about the latest development. As far as I know, the 2.6.x series of the compiler is ANSI-based, not UTF-16-based, on Windows, while the 2.7.x series is UTF-16-based. On Linux, I guess both versions are UTF-8-based, since that is the default system encoding.
--- End quote ---
That's a pity... oh well. I'll see if I can deal with it.

--- End quote ---
Deal with what? I still don't understand what you are missing.

vick:

--- Quote ---If by standard library you mean the FPC RTL, the FCL and the rest of the libraries that come by default with a Lazarus installation, then you do have to be careful if the encoding you use is different from the default. There is no magic bullet for that.
--- End quote ---
Sure, no magic bullets... still, I consider all silent conversions of text a misfeature. That reminds me of Perl:

--- Code: ---$ perl -E 'binmode STDOUT, ":encoding(utf-8)"; say "Привет, мир!"'
Привет, мир!
--- End code ---
Does it have to work like that?  :( Is that really useful? Not for me. Still, I like Perl, but this thing here is a pretty severe design flaw, IMO.
Python 3 completely broke backwards compatibility to fix a very similar issue.

--- Quote ---Deal with what? I still don't understand what you are missing.
--- End quote ---
For example, with rather obscure stuff like that:

--- Quote ---As I said in my last post rewriting the fpc function HashFile to use streams instead of typed files is the only way to make your application work for any unicode filename now
--- End quote ---

taazz:
You will always have problems like that in all languages and IDEs. It only means that the people behind the project cannot respond fast enough to changes in the operating system, which is a common problem in open source projects without commercial support.

At this point I get the impression that you haven't tried to implement your solution yet. There is a strong community behind the project, which has already proved that it can and will help you (as you can see in the thread you are quoting), but you have fallen prey to your fears and you are missing the simplicity that the current implementation offers.
My advice to you is to at least start the design: create something concrete that people can poke holes in and post it here. You will get a lot more accurate help that way.
In any case, when your evaluation is over, I would like to hear what you chose to use for your project and why.
