Author Topic: Unicode Support  (Read 26149 times)

Astral

  • New Member
  • *
  • Posts: 49
Unicode Support
« on: June 22, 2009, 09:13:29 pm »
In Dec. of 2007 I posted here about my attempts to convert my translator program to Lazarus/FPC.  I had some luck then, under Windows XP and Ubuntu Linux.  But I ended up abandoning the effort for lack of a reasonable way to replace TRichEdit.  I think at the time SynEdit didn't support Unicode as it does now, via UTF-8.

It's been quite a while, so I thought I would give it another try.

I use Unicode extensively in my Delphi program.  I use TNTware components in Delphi 7 and recently started using Delphi 2009.

However, I still want to build a Lazarus/FPC version for use on Linux and possibly Mac OS X.  I'm considering making my program open source at some point, but I'm not quite sure I want to do it now.

I use a Pascal unit called Unicode.pas written by Mike Lischke (from Germany).  I think it's the same one used in the TNTware, and it includes support for TWideStringList, which I use in my program.

But I also need the capability to classify characters, based on ranges, as to what "codepage" they fall into, Cyrillic, Hangul, Greek, Hebrew, Arabic, Japanese, Chinese, etc.

And I need the routines which support upper and lower case tests and conversions: isupper, islower, toupper, tolower, ispunctuation, isdigit, isalpha, isalphanum, etc.
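
For illustration, range-based classification of this kind can be sketched in plain Free Pascal without any resource file. The type TScript and the function ClassifyChar below are hypothetical names, not part of Unicode.pas or utf8tools, and the table lists only a handful of well-known Unicode blocks, so treat it as a starting point rather than a complete classifier.

```pascal
program ClassifyDemo;
{$mode objfpc}{$H+}

type
  // Hypothetical coarse script classes; extend as needed.
  TScript = (scUnknown, scLatin, scGreek, scCyrillic, scHebrew, scArabic,
             scKana, scCJK, scHangul);

// Map one UTF-16 code unit to a script class by Unicode block range.
// Characters outside the listed blocks (including surrogates) are scUnknown.
function ClassifyChar(C: WideChar): TScript;
begin
  case Ord(C) of
    $0041..$005A, $0061..$007A,
    $00C0..$00D6, $00D8..$00F6, $00F8..$024F: Result := scLatin;
    $0370..$03FF: Result := scGreek;     // Greek and Coptic
    $0400..$04FF: Result := scCyrillic;
    $0590..$05FF: Result := scHebrew;
    $0600..$06FF: Result := scArabic;
    $3040..$30FF: Result := scKana;      // Hiragana and Katakana
    $4E00..$9FFF: Result := scCJK;       // CJK Unified Ideographs
    $AC00..$D7AF: Result := scHangul;    // Hangul Syllables
  else
    Result := scUnknown;
  end;
end;

begin
  // Print the ordinal of the class for a Cyrillic and a Hangul character.
  WriteLn(Ord(ClassifyChar(WideChar($0416))));
  WriteLn(Ord(ClassifyChar(WideChar($AC00))));
end.
```

Case tests and conversions (isupper, tolower and so on) need per-character data rather than block ranges, which is why units such as Unicode.pas ship a lookup table.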

Delphi 2009 has many of these routines in their new Character unit.

Unfortunately, I have to be able to function without all the new stuff in Delphi 2009, so I would like to get Mike Lischke's unit to compile under Lazarus/FPC.

I have succeeded at that, but still can't get it to work, due to some sort of problem with the included Unicode.res file.  This is a 68 KB binary resource file containing information about all the Unicode characters.

I don't know how to adapt this resource file, or if it's even worth the trouble.  But I figured I'd ask on this forum and see if anyone else has tried to convert this Unicode.pas and succeeded.

There are also many other versions of this unit, including JclUnicode.pas and SynUnicode.pas.  I tried compiling the Jedi Code Library version, but ran into some problems and finally gave up.

Has anyone else succeeded at getting the JCL to compile and work on Lazarus/FPC?  I know it's been talked about for years, but I have yet to find any definitive info about it.

I'd love to use the JCL in my program, especially some of the Unicode functions and date/time routines, but if I can't get it to compile, then it's not very useful.

I have made a lot of progress, and I could make my program independent of this unit Unicode.pas, but I particularly wanted to use the TWideStringList, for storing lists of Unicode strings, for things like spell checking and translation dictionaries.  If I have to live without TWideStringList, I can still make my program work, but it is a great convenience.

If anyone has any comments or suggestions, don't be shy.  I can use all the help I can get.  I have been working on my program as long as anyone can remember, and it's still not "finished" yet, nor will it ever be until it can run on all the major platforms.

In the past few months I've tried porting my program to C++/CLI for .NET under Microsoft Visual C++ 2008 Express Edition.  I also tried NetBeans/Java, and Eclipse, without much success.  I do have most of the code running in C++/CLI, but lost interest in it, in favor of trying out Delphi 2009, which turns out to have all the required functionality.

I want my program to be easily portable between at least Windows, Linux, and Mac OS X, and Lazarus/FPC is one possible solution, maybe not the best one.

I'm using SynEdit right now, but might end up using something else for my editor windows.  I allow the user to input text in one editor and produce translated results in another editor window.  I have other hidden windows internally, which can be selectively displayed.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Unicode Support
« Reply #1 on: June 22, 2009, 09:54:17 pm »
But I also need the capability to classify characters, based on ranges, as to what "codepage" they fall into, Cyrillic, Hangul, Greek, Hebrew, Arabic, Japanese, Chinese, etc.

And I need the routines which support upper and lower case tests and conversions: isupper, islower, toupper, tolower, ispunctuation, isdigit, isalpha, isalphanum, etc.

This part of your problem can probably be solved with my utf8tools:
http://www.theo.ch/lazarus/utf8tools.zip

Have a look at the demo in the "charandscan" folder.
If you move the caret in the text, you can see in the statusbar what you want to know about each character at the caret position.

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #2 on: June 22, 2009, 10:29:43 pm »
Thanks very much!  I will look into your unit and see what I can use.

Meanwhile, I have managed to get past some more problems and actually got something running using a bit of magic (muttering under my breath) and commenting out stuff that I could not get to compile.

I just translated a line of French into English!  That's not such a tremendous accomplishment, it might seem, but getting my Delphi program to run under Lazarus has not been easy, nor is it anywhere near finished.

I'm hoping to hear from some others about their Unicode experiences.

Right now, my program seems to run very slow under Lazarus.  Maybe it's hung in a loop somewhere.  lol

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #3 on: June 23, 2009, 12:08:30 am »

I'm having some luck with my program, but it seems to hang
in odd spots, like in an if statement that has to return true or false.

I did notice that the word being translated is était,
which has an acute accent on the first character of the string.

It is stored as a WideString, yet the debugger seems to be displaying it as an Ansi String (using the UTF-8 codes).

It looks like this:  Ã@tait   or something similar to that.

The é seems to be replaced with the ANSI equivalent of the UTF-8 representation of the character.
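
A minimal sketch of where such a display can come from, assuming the debugger is showing the UTF-8 bytes of the string as if they were single-byte ANSI characters. It uses UTF8Encode from FPC's System unit and IntToHex from SysUtils; the program name is arbitrary.

```pascal
program MojibakeDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  W: WideString;
  U: UTF8String;
  I: Integer;
begin
  W := WideChar($00E9);        // "é" as a single UTF-16 code unit
  U := UTF8Encode(W);          // the same character encoded as UTF-8

  // Dump the UTF-8 bytes in hex; for U+00E9 these are C3 and A9.
  for I := 1 to Length(U) do
    Write(IntToHex(Ord(U[I]), 2), ' ');
  WriteLn;
end.
```

Read as Windows-1252 or Latin-1, byte C3 is "Ã" and byte A9 is "©", so a capital A with a tilde followed by a second symbol in front of "tait" suggests a display issue rather than corrupted data.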

The program appears to hang on a simple comparison:

Code:
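for I := 1 to Length(InpWord) do begin
  // Start each pass from a fresh lower-cased copy of the input word.
  TestWord := iUnicode.ToLower(InpWord);
  Recognized := False;
  for J := Low(RomanceCodes) to High(RomanceCodes) do begin
    if TestWord[I] = RomanceCodes[J].Normal then begin
      // Substitute the table's Lower form at position I and retry the lookup.
      TestWord[I] := RomanceCodes[J].Lower;
      iLook := LookUp( ForwardDictFile, TestWord );
      if iLook <> nil then begin
        Recognized := True;
        break;
      end
      else begin
        // No match: restore the original character before trying the next entry.
        TestWord[I] := RomanceCodes[J].Normal;
      end;
    end;
  end;
  if Recognized then break;
end;
end;  // closes an enclosing block that starts before this excerpt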


I need to buy a clue as to why it would hang there.

Note that break is used to exit the inner loop when a match is found, and an if statement is used to break out of the outer loop once the word is recognized.

It may be a coding error, but I've used this code pretty extensively in the Delphi version of my program.



It seems to be hung on the if statement just inside the "for J" loop.

I have gone through this section of code previously with other words, but this is the first time the WideString InpWord has contained an accented character.


Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #4 on: June 23, 2009, 08:50:18 am »
Thanks for all the help.

I have gotten past a lot of problems today, but I still think that Lazarus/FPC is not quite ready for me yet.  In my humble opinion, it seems unstable, but it might just be a matter of learning curve.  I've been working with Delphi since it first came out and it's had its share of problems, and I've often been tempted to give up on it, but it's the best game in town.

The debugger is not quite as good as what I would expect.  For an array of WideStrings, it kindly shows me a list of addresses, which are virtually useless for figuring out what is going on.

I had to write a UTF8ToUTF16 routine today, to take strings from the SynEdit control and convert them back into WideStrings, so that I have a common representation across all languages.  UTF-8 is fine for a lot of things, but it's easier to work with a one-for-one relationship between array elements and characters (with the exception of surrogates, and those don't pop up often in my experience).
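
As a point of comparison, FPC's System unit already provides UTF8Decode and UTF8Encode for this conversion. The sketch below is not the hand-written routine mentioned above; it only shows the built-in call and how the element counts differ between the two encodings. The variable names are arbitrary.

```pascal
program Utf8ToWideDemo;
{$mode objfpc}{$H+}

var
  Utf8Text: UTF8String;
  Wide: WideString;
  I: Integer;
begin
  // Build "était" from explicit UTF-8 bytes so that the encoding of this
  // source file does not matter: C3 A9 is the UTF-8 form of "é".
  Utf8Text := 'xxtait';
  Utf8Text[1] := #$C3;
  Utf8Text[2] := #$A9;

  // Convert UTF-8 to UTF-16.
  Wide := UTF8Decode(Utf8Text);

  WriteLn('UTF-8 bytes:       ', Length(Utf8Text));
  WriteLn('UTF-16 code units: ', Length(Wide));

  // List the UTF-16 code units as decimal numbers.
  for I := 1 to Length(Wide) do
    Write(Ord(Wide[I]), ' ');
  WriteLn;
end.
```

Here the UTF-8 string is 6 bytes long while the WideString holds 5 code units, one per character, which is the one-for-one relationship described above. It holds only for characters in the Basic Multilingual Plane; anything beyond it takes two code units (a surrogate pair).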

I've managed to translate at least 20 lines of input so far, but the debugger seems to go off into Never Never Land.  If I Reset the Debugger, it tells me there's some kind of addressing exception, but it's not very good at informing me where to look for the problem.

I'll keep trying.

Thanks again for the help and the sample code.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Unicode Support
« Reply #5 on: June 23, 2009, 09:50:35 am »
UTF-8 is fine for a lot of things, but it's easier to work with a one-for-one relationship between array elements and characters (with the exception of surrogates, and those don't pop up often in my experience).

Index-based access is possible with utf8tools. You don't need widestring for this. But you need to rewrite your code.

Example:

 
Code:
s := TUTF8Scanner.Create(Memo1.Text);
for i := 1 to s.Length do
  if (not TCharacter.IsLetterOrDigit(s[i])) and
     (not TCharacter.IsControl(s[i])) then s[i] := '.';
Memo1.Text := s.UTF8String;
s.Free;

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #6 on: June 23, 2009, 10:29:11 am »
Thanks for the tips.  I will consider doing that.

My goal was to AVOID having to rewrite 100,000 lines of code using WideString.   Perhaps I could do it, but I much prefer working with the 16-bit characters (and I mostly just ignore the characters outside the first 65,536).  I suppose it would be nice to deal with those too.

Right now my program runs fine for a while, then seems to go off into la la land after translating about 30 lines of French to English.  Based on the log files, it looks like everything is working the same as in the Delphi version, but then I have to click on Stop or Reset Debugger to get control again, and then it reports an exception.  I suppose I will find the problem eventually, by trial and error, divide and conquer, etc.

My goal was to see how far I could get converting the Delphi code without a major rewrite of thousands of lines of code dealing with WideStrings.  I would like to avoid having to think about the internal representation of UTF-8, just transform it back and forth as necessary, using foolproof conversion routines.

My dictionaries are all stored in UTF-8 and I read them in and convert them to WideString format.  I suppose they might be more efficient in some ways if I dealt with the UTF-8 format directly, but it's a pain, and I'd prefer to avoid it.

I am particularly concerned that the debugger often shows no information in the Call Stack display under View Debug Windows.

I don't like that an array of WideStrings is shown to me as a list of hex addresses.  Having a display of the actual strings is really helpful for debugging.

I'm not a complete stranger to gdb or Linux.  I've used them in the past, using command line compilers.

I miss the Run till Return feature of Delphi.

I just got through rebooting after a hard lockup of the Lazarus IDE, after attempting to do a replace on a string.  It just locked up and there was no way to recover but to kill the environment and restart it.  I rebooted just to make sure I was starting fresh, because there were two other processes that refused to go away.  I don't know if they were part of Lazarus or not.

I'm very happy with the progress I've made, but it's still not there yet.  I suppose it's likely that I'm doing something questionable in my code that works fine on Delphi, but doesn't work on Lazarus/FPC, but I can't seem to isolate what that might be.

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #7 on: June 23, 2009, 10:33:05 am »
One more question:

I'm using SynEdit.  I see a column on the left of each SynEdit window.  It seems to be line numbers, but they are truncated to only show the first digit, 1 repeats 10 times, then 2 repeats 10 times, etc.

Is there a trick to turning off line numbering?  Or a way to make the line numbering show all the digits?

Also the character spacing doesn't seem to work very well in SynEdit.  SynEdit works fine for loading my input files, but never gets past the first page on the output window.  That's when the program seems to go off into la la land.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Unicode Support
« Reply #8 on: June 23, 2009, 10:59:16 am »
Is there a trick to turning off line numbering? 

No, it's not a trick, it's a property which you might find in
Gutter -> ShowLineNumbers

The Lazarus IDE uses SynEdit and is not going to the land you mentioned.
So I don't know what your problem is.

And no, afaik the Debugger doesn't show WideStrings. One more reason not to use them.

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #9 on: June 23, 2009, 09:07:23 pm »
Thanks.

What about the character spacing?  I changed from fixed to variable width characters and it seemed to have no effect whatsoever.  In one case the letters M and i were jammed together and there was a big space between the i and the next character, which makes the input almost unreadable.  Fixed spacing is great for some things, but for human readable text variable width is easier on the eye.

I do appreciate your help, but we each have our own outlook on things and I happen to like dealing with WideStrings.   I should not be faulted for that.  The debugger could easily display the strings; if it can display UTF-8, it can display UTF-16.  UTF-8 works great, takes up less space, is more efficient, etc., but it looks ugly when it gets shown incorrectly.  Why can the debugger not show UTF-8 strings correctly?  Or am I doing something wrong when I see a capital A with a tilde over it, followed by an @?

I'm hoping that two years from now things will be greatly improved.

BTW, I don't understand why Lazarus doesn't support resource files.  I have a set of message strings in an rc file, which I convert to a res file using the "resource compiler".  I don't see why Lazarus/FPC doesn't support accessing data from a resource file.  Many Delphi programs rely on res files and it seems very crippling to disallow them.  I managed to convert my own messages into a unit by compiling them as an array of strings.  But for something like Unicode.res, for which I have no source code, it's very difficult to work around.

Vincent Snijders

  • Administrator
  • Hero Member
  • *
  • Posts: 2661
Re: Unicode Support
« Reply #10 on: June 23, 2009, 09:15:50 pm »
BTW, I don't understand why Lazarus doesn't support resource files.
Because the compiler supports resource files only for win32, unless you use FPC 2.3.1, which added support for Windows-like resources on other targets.
« Last Edit: June 23, 2009, 09:17:21 pm by Vincent Snijders »

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #11 on: June 23, 2009, 10:06:25 pm »
Is there a trick to turning off line numbering? 

No, it's not a trick, it's a property which you might find in
Gutter -> ShowLineNumbers

The Lazarus IDE uses SynEdit and is not going to the land you mentioned.
So I don't know what your problem is.

And no, afaik the Debugger doesn't show WideStrings. One more reason not to use them.

Thanks for all your help.

I don't know what my problem is either, that's why I am here.  :)

I think SynEdit is working fine and doing exactly what I expect, except that the character spacing is too wide.  If I knew how to fix that it would be nice.  I'm sure there's a way to do it, I'm just not sure how.

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #12 on: June 23, 2009, 10:34:30 pm »
BTW, I don't understand why Lazarus doesn't support resource files.
Because the compiler supports resource files only for win32, unless you use FPC 2.3.1, which added support for Windows-like resources on other targets.


Thanks!  I will go look for fpc 2.3.1!

Astral

  • New Member
  • *
  • Posts: 49
Re: Unicode Support
« Reply #13 on: June 23, 2009, 11:09:20 pm »
Incidentally:

My program is still "hanging" on that same innocuous looking line of code I reported earlier.

I single step up to the line, then hit F8 to step over it, and it just sits there and does nothing, no indication of any problem.

The data in the record is valid (I logged it, and I can examine it in the debugger).

I'm comparing the first character of a WideString with a WideChar in a table, and it's just hung.  The values of the array indices appear to be correct.  I'm not accessing outside the bounds of the WideString or the WideChar in the record.

I don't know how else to describe it.  This is no reflection upon my character or my Character handling abilities.

Something is wrong and I have no clue how to make it go any further.  I'm sure I could make it avoid this section of the code, but I'm wondering what are the possibilities?

Is there a debugger feature I could use here to figure out what is happening?  It's obviously hung, but I'm not sure what it's doing and why.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Unicode Support
« Reply #14 on: June 23, 2009, 11:36:12 pm »
What about the character spacing?  I changed from fixed to variable width characters and it seemed to have no effect whatsoever.  In one case the letters M and i were jammed together and there was a big space between the i and the next character, which makes the input almost unreadable.  Fixed spacing is great for some things, but for human readable text variable width is easier on the eye.

SynEdit is a source code editor and supports only fixed-width fonts.
Variable-width fonts are not displayed correctly.

I don't know why your program hangs. How should I?