
Author Topic: Considerations on IME - Japanese in SynEdit (probably all other IME users too)  (Read 25490 times)

Martin_fr

Please add it to the bug report, so I can add it, time permitting.

The problem here is that all codepoints would have to be converted to UTF-16 first. This would add to the time consumption of this code.
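
(For illustration, a minimal sketch of decoding a UTF-8 sequence directly to a codepoint, so no UTF-16 round trip is needed before a range lookup. This is not SynEdit code and does no validation of malformed input.)
Code: [Select]
  // Decode one UTF-8 character to its Unicode codepoint, straight from the bytes.
  // Returns the codepoint and reports the character length in bytes.
  function Utf8ToCodepoint(p: PChar; out charLen: Integer): Cardinal;
  begin
    case Ord(p[0]) of
      $00..$7F: begin charLen := 1; Result := Ord(p[0]); end;
      $C0..$DF: begin charLen := 2;
                  Result := ((Ord(p[0]) and $1F) shl 6) or (Ord(p[1]) and $3F);
                end;
      $E0..$EF: begin charLen := 3;
                  Result := ((Ord(p[0]) and $0F) shl 12) or ((Ord(p[1]) and $3F) shl 6)
                            or (Ord(p[2]) and $3F);
                end;
      else      begin charLen := 4;
                  Result := ((Ord(p[0]) and $07) shl 18) or ((Ord(p[1]) and $3F) shl 12)
                            or ((Ord(p[2]) and $3F) shl 6) or (Ord(p[3]) and $3F);
                end;
    end;
  end;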

The current implementation is already on my "must be changed" list, because it turned out to be too slow.
It did not matter with just one caret (except when you paste huge text from the clipboard, or when you indent many thousands of lines).

But now that there is multi-caret, and the time consumption is multiplied for each caret, it matters more.

But I will add it, at least to be able to compare. I just need to find the time to do it.

Do you know what license this code has? I need something compatible with SynEdit.

skalogryz

The problem here is that all codepoints would have to be converted to UTF-16 first. This would add to the time consumption of this code.
So are you looking for the ability to find the width directly from UTF-8?
A function like these:
Code: [Select]
  // utf8 - pointer to the first byte of the utf8 sequence
  // bytesinchar - length of the utf8 character in bytes. Though it could be easily read from the first byte itself.
  function GetCJKWidth(utf8: PChar; bytesinchar: Integer): TCharWidth; overload;
  function GetCJKWidth(utf8: PChar): TCharWidth; overload;

Do you know what license this code has? I need something compatible with SynEdit.
It's whatever license is needed, since I'm the author.

Martin_fr

Look at the current code; it has a nested case statement.

However, I suspect that is not really good. It also cannot be modified at run time (see below).

So what I want to do is a tree-like lookup table. However, it needs to be optimized to use little memory, so that it fits nicely into as few cache lines as possible.

lookup: array  [byte] of ... // firstbyte

Each page then has a low and a high byte, and further pages. I have a tree like this in another place already, for searching multiple search terms in a string within one search run.

But that's just one idea. I need to see if it really speeds things up.
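
(A rough sketch of that kind of lookup tree; the type and field names here are made up, and TCharWidth/cwN are taken from the GetCJKWidth code discussed in this thread.)
Code: [Select]
  // Sketch only: the first UTF-8 byte indexes the top-level table; each page
  // either yields a width directly (leaf) or narrows the valid range of the
  // next byte and points to sub-pages for it.
  type
    PWidthPage = ^TWidthPage;
    TWidthPage = record
      Low, High: Byte;                 // valid range for the next byte
      Width: TCharWidth;               // result, if this page is a leaf
      Next: array of PWidthPage;       // sub-pages indexed by (NextByte - Low); empty for a leaf
    end;
  var
    FirstByte: array [Byte] of PWidthPage;  // top level, indexed by the first UTF-8 byte

  function LookupWidth(p: PChar): TCharWidth;
  var
    page: PWidthPage;
    b: Byte;
  begin
    Result := cwN;                     // default width
    page := FirstByte[Ord(p^)];
    while page <> nil do begin
      Result := page^.Width;           // width decided so far
      if Length(page^.Next) = 0 then Exit;              // leaf: done
      Inc(p);
      b := Ord(p^);
      if (b < page^.Low) or (b > page^.High) then Exit; // next byte out of range: keep current result
      page := page^.Next[b - page^.Low];
    end;
  end;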

----------
Making it configurable will also allow mixing in font info. You can ask which Unicode ranges are supported by a font (eastern chars are in a fallback font).

Because on my system, some but not all ambiguous-width chars are full width. Maybe that can be detected. But that's extra.

-----------
Currently there are several passes:
- find codepoint borders (in the byte array)
- find LTR/RTL info
- find charwidth

Maybe those can be combined, but again, that needs testing.
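
(For illustration, a minimal sketch of such a combined pass, using the GetCJKWidth(utf8, bytesinchar) overload from earlier in the thread; the scan/storage details here are made up.)
Code: [Select]
  // One scan over the line: determine the byte length of each character from
  // its lead byte and look up its display width in the same pass. An LTR/RTL
  // classification could be hooked into the same loop.
  procedure ScanLine(p: PChar; ByteLen: Integer);
  var
    i, charLen: Integer;
    w: TCharWidth;
  begin
    i := 0;
    while i < ByteLen do begin
      case Ord(p[i]) of                 // codepoint border from the lead byte
        $00..$7F: charLen := 1;
        $C0..$DF: charLen := 2;
        $E0..$EF: charLen := 3;
        else      charLen := 4;
      end;
      w := GetCJKWidth(p + i, charLen); // char width in the same pass
      // ... store charLen, w and the direction info for this character ...
      Inc(i, charLen);
    end;
  end;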

----------
The code is also in the wrong place. It was a hack to start adding some support at all.

But then, I don't want the above to delay improving the current code.

So the next step would be to aim for a lookup similar to the current one (never mind whether the structure is in data or in code).

----------
Anyway, I have one or two other items on my list that I want to finish first.

skalogryz

Here you go! A UTF-8 based lookup, as promised.
Code: [Select]
function GetCJKWidth(utf8: PChar; defWidth: TCharWidth = cwN): TCharWidth; overload;
function GetCJKWidth(utf8: PChar; charLen: Integer; defWidth: TCharWidth = cwN): TCharWidth; overload;
charLen is the size of the utf8 character in bytes.

Internally it stores UTF-8 character ranges as pairs of Int64. Whenever a character is looked up, it is also converted to an Int64, and then a binary search is performed over an array, similar to the original version.

A smaller improvement could still be made to the search: for each character, only a smaller section of the array would be searched (based on the character length), rather than the full array.
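
(Roughly like this, if I read it right: a sketch with made-up names, not the attached code; the comment in the search shows where the per-length restriction would go.)
Code: [Select]
  // Ranges of UTF-8 characters stored as sorted (lo, hi) pairs of Int64 keys.
  type
    TUtf8Range = record
      lo, hi: Int64;
      w: TCharWidth;
    end;

  // Pack the UTF-8 bytes of one character into an Int64 key.
  function Utf8Key(utf8: PChar; charLen: Integer): Int64;
  var
    i: Integer;
  begin
    Result := 0;
    for i := 0 to charLen - 1 do
      Result := (Result shl 8) or Ord(utf8[i]);
  end;

  // Binary search for the key over the sorted ranges.
  function FindWidth(const Ranges: array of TUtf8Range; key: Int64;
    defWidth: TCharWidth): TCharWidth;
  var
    l, r, m: Integer;
  begin
    Result := defWidth;
    l := 0;                 // the per-length improvement would set l and r to the
    r := High(Ranges);      // sub-array that holds keys of this byte length only
    while l <= r do begin
      m := (l + r) div 2;
      if key < Ranges[m].lo then
        r := m - 1
      else if key > Ranges[m].hi then
        l := m + 1
      else begin
        Result := Ranges[m].w;
        Exit;
      end;
    end;
  end;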

upd: attached to the bug tracker
« Last Edit: April 12, 2015, 06:45:11 am by skalogryz »

Martin_fr

I added your code, but for now I left it inside the ifdef.

I did some measurements:
- add 250,000 lines to SynEdit (inside BeginUpdate)
- add 250,000 carets (column mode selection on the 3rd column, top to bottom, zero width):
  ecEditorTop, ecRight, ecRight, ecColSelEditorBottom
- type an "X" (which gets inserted at all 250,000 carets)

I admit that 250k carets is not normal, but inserting that many lines can happen. This will also affect copy and paste, or changing the indent of selected lines.
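
(For reference, roughly what those steps look like in code: a sketch assuming the Lazarus SynEdit API (CommandProcessor, BeginUpdate/EndUpdate, ecChar), not the exact test code used here.)
Code: [Select]
  // Needs SynEdit and SynEditKeyCmds in the uses clause.
  procedure RunBenchmark(Ed: TSynEdit);
  var
    i: Integer;
  begin
    // 1) add 250,000 lines inside BeginUpdate/EndUpdate
    Ed.BeginUpdate;
    for i := 1 to 250000 do
      Ed.Lines.Add('some text');
    Ed.EndUpdate;

    // 2) zero-width column selection over all lines -> one caret per line
    Ed.CommandProcessor(ecEditorTop, #0, nil);
    Ed.CommandProcessor(ecRight, #0, nil);
    Ed.CommandProcessor(ecRight, #0, nil);
    Ed.CommandProcessor(ecColSelEditorBottom, #0, nil);

    // 3) type an "X", inserted at every caret
    Ed.CommandProcessor(ecChar, 'X', nil);
  end;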

The times include the entire execution. Since the double-width code accounts for only part of that time, the actual difference for that code alone is much bigger.

Also, for fairness, I did not optimize the embedding of your code, that is, I did not inline the innermost call into the loop. No idea how much that would affect the result.

Accuracy can be +/- 0.1 sec

Times when compiled with all kinds of debug checks (-Criot -Sa -O1 and others):
Old:
 1.24
 0.85
 5.00

Yours:
 2.48
 2.05
 8.83


Times when compiled without any of those, and with -O3:
Old:
 0.70
 0.57
 3.21

Yours:
 1.34
 1.21
 5.21


skalogryz

OK, here's an update.

The performance gain should be achieved by eliminating unnecessary searches.

For example: if a character is one byte long and its code is out of the range of the "Na" widths, then don't search, and just return the width as "N".
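
(Something along these lines, I assume: a sketch, the wrapper name is made up, and per EastAsianWidth.txt the only single-byte "Na" range is U+0020..U+007E.)
Code: [Select]
  // Skip the table search for one-byte (ASCII) characters whose code cannot
  // match any "Na" entry; everything else goes through the normal lookup.
  function GetCJKWidthFast(utf8: PChar; charLen: Integer;
    defWidth: TCharWidth = cwN): TCharWidth;
  begin
    if (charLen = 1) and ((utf8^ < #$20) or (utf8^ > #$7E)) then begin
      Result := cwN;   // out of the "Na" range: no search needed
      Exit;
    end;
    Result := GetCJKWidth(utf8, charLen, defWidth);  // regular range lookup
  end;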

Martin_fr

Can you supply a patch to the unit, please? (Your previous code is already committed to SVN, so it should be a small patch.)

 
