[Solved] StringReplace and Unicode Characters with changing Up/Low Byte Length

LazProgger:
When calling the standard StringReplace function in the following way, we get unexpected results:


--- Code: Pascal ---
StringReplace('ı_abc', 'a', 'x', [rfIgnoreCase]);  // returns ıxabc
StringReplace('ſ_abc', 'a', 'x', [rfIgnoreCase]);  // returns ſxabc
StringReplace('ɐ_abc', 'a', 'x', [rfIgnoreCase]);  // returns ɐ_axc
StringReplace('ȿ_abc', 'a', 'x', [rfIgnoreCase]);  // returns ȿ_axc

The same happens for every string that contains, somewhere before the replacement position, a lowercase letter whose uppercase form has a different number of bytes. The first two lines are examples where the uppercase form has fewer bytes than the lowercase one (shifting the replacement to the left). The next two lines are examples of the opposite, where the uppercase form has more bytes (shifting the replacement to the right). The more such letters precede the replacement position, the further the replacement shifts.

The reason seems to be that, when the rfIgnoreCase flag is set, the function internally compares an uppercase copy of the input string with the search string. Because uppercasing changes the byte sizes, the position of the match in the uppercase copy is not the same as in the original string.
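
The offset drift described above can be reproduced by hand without any Unicode library. What follows is only a minimal sketch, not the RTL source: NaiveReplaceFirst is a hypothetical helper, and the uppercase copies are written out by hand as assumptions about what an uppercasing step would produce. All strings are given as UTF-8 byte escapes so the outcome does not depend on the encoding of the source file.

```pascal
program OffsetDrift;
{$mode objfpc}{$H+}

// Hypothetical helper (not the RTL implementation): searches in an
// uppercased copy, then applies the byte offset found there to the original.
function NaiveReplaceFirst(const Original, UpperCopy, UpperPattern,
  NewText: string): string;
var
  P: SizeInt;
begin
  Result := Original;
  P := Pos(UpperPattern, UpperCopy);       // byte offset in the uppercase copy
  if P = 0 then
    Exit;
  Delete(Result, P, Length(UpperPattern)); // same offset reused on the original
  Insert(NewText, Result, P);
end;

begin
  // 'ı_abc': ı (U+0131) is C4 B1, two bytes. The hand-written uppercase copy
  // 'I_ABC' has one byte for I, so the match offset is one too small.
  WriteLn(NaiveReplaceFirst(#$C4#$B1'_abc', 'I_ABC', 'A', 'x')
    = #$C4#$B1'xabc');                     // TRUE: the bytes of 'ıxabc'

  // 'ɐ_abc': ɐ (U+0250) is C9 90, two bytes. Its uppercase form Ɐ (U+2C6F)
  // is E2 B1 AF, three bytes, so the match offset is one too large.
  WriteLn(NaiveReplaceFirst(#$C9#$90'_abc', #$E2#$B1#$AF'_ABC', 'A', 'x')
    = #$C9#$90'_axc');                     // TRUE: the bytes of 'ɐ_axc'
end.
```

Both comparisons match the results reported above: the replacement lands one byte too far left in the first case and one byte too far right in the second.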

Is this behavior known and are there any workarounds for that?

jamie:
Yes, it does not work with UTF-8 characters.

Code a Unicode replacement using UnicodeString or WideString.
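
One way to read this suggestion, sketched under assumptions: decode the UTF-8 string to UTF-16, do the replacement there, and encode the result back. ReplaceViaUtf16 is a hypothetical helper name; WideStringReplace, UTF8Decode and UTF8Encode are taken from the FPC RTL, and the cwstring unit is assumed to be needed on Unix for case mapping. For the four letters in this thread, both case forms fit in one UTF-16 code unit, so uppercasing should not move the match position.

```pascal
program WideReplaceSketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // assumed: widestring manager on Unix
  SysUtils;

// Hypothetical helper: S, OldPattern and NewPattern are taken to be UTF-8.
function ReplaceViaUtf16(const S, OldPattern, NewPattern: RawByteString;
  Flags: TReplaceFlags): RawByteString;
var
  W: WideString;
begin
  // Search and replace on UTF-16 code units instead of UTF-8 bytes.
  W := WideStringReplace(UTF8Decode(S), UTF8Decode(OldPattern),
    UTF8Decode(NewPattern), Flags);
  Result := UTF8Encode(W);
end;

var
  R: RawByteString;
begin
  // 'ı_abc' as UTF-8 bytes, so the source file encoding does not matter.
  R := ReplaceViaUtf16(#$C4#$B1'_abc', 'a', 'x', [rfIgnoreCase]);
  WriteLn(Length(R));   // byte length of the result
  WriteLn(Pos('x', R)); // byte position of the replacement
end.
```

The sketch prints byte counts and positions instead of the string itself, because console output depends on the terminal code page.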

paweld:
This is not a bug. For UTF-8 strings, use UTF8StringReplace from the LazUTF8 unit.
This unit has other functions for working with UTF-8 strings, such as UTF8Length, UTF8Copy, UTF8Pos, UTF8CompareText and UTF8UpperCase.
--- Code: Pascal ---
Memo1.Lines.Add(UTF8StringReplace('ı_abc', 'a', 'x', [rfIgnoreCase]));  // ı_xbc
Memo1.Lines.Add(UTF8StringReplace('ſ_abc', 'a', 'x', [rfIgnoreCase]));  // ſ_xbc
Memo1.Lines.Add(UTF8StringReplace('ɐ_abc', 'a', 'x', [rfIgnoreCase]));  // ɐ_xbc
Memo1.Lines.Add(UTF8StringReplace('ȿ_abc', 'a', 'x', [rfIgnoreCase]));  // ȿ_xbc

LazProgger:
Good to know! Thank you very much for that hint!

I already knew some of the UTF8 functions; I have often used UTF8Length, for example. But since StringReplace() had always worked for me with every character I used it on, it never occurred to me to look for a UTF8 replace function. My fault.

vfclists:
If everything is alright now, it would be better to change the title to mark it as solved.
