Author Topic: PosEx variant for case-insensitive search  (Read 481 times)

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
PosEx variant for case-insensitive search
« on: September 17, 2020, 09:37:10 pm »
PosEx is ASM-based, so it's very fast. (It uses the ASM-based IndexWord function.)
For CudaText, I need a variant with case-insensitive matching and WideChar/UnicodeString params.
It can avoid the WidestringManager by using a callback. (CudaText already has such a callback to do UpperCase/LowerCase for a WideChar. It doesn't use the WidestringManager; it uses a table lookup.)
Please?
« Last Edit: September 17, 2020, 09:39:44 pm by Alextp »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8810
  • FPC developer.
Re: PosEx variant for case-insensitive search
« Reply #1 on: September 17, 2020, 09:41:05 pm »
Quote
>PosEx is ASM based so it's very fast. (Uses IndexWord ASM based func.)
>For CudaText, I need variant with case-insensitive match, with WideChar/UnicodeString params.
>Please?

Nope, it uses IndexByte.

For UnicodeString it would need to be based on IndexWord, but that assumes there is a single word-sized value to search for.

And that is hard, because Unicode (and Unicode-based case sensitivity) is simply hard. There is no chance that such a version would even be in the same ballpark as the ASCII version.

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
Re: PosEx variant for case-insensitive search
« Reply #2 on: September 17, 2020, 11:23:36 pm »
Then we can use a trick: pass TWO UnicodeString params to PosExI (example name), str1 and str2 (the uppercase and lowercase forms). It is the app's job to prepare them. CudaText will prepare them using its table lookup.

ASBzone

  • Sr. Member
  • ****
  • Posts: 476
  • Automation leads to relaxation...
    • Free Console Utilities for Windows from BrainWaveCC
Re: PosEx variant for case-insensitive search
« Reply #3 on: September 18, 2020, 02:33:16 am »
Quote
>Then we can make a trick- pass TWO UnicodeString params to PosExI (example name) - str1, str2 (uppercase and lowercase) - it is app's work to prepare them. CudaText will prepare them using its table lookup.
Okay, but UPPERCASE and lowercase are only two options in the case-insensitive continuum. What about CamelCase, or jUsTmIxEdUpCaSe?
-ASB: https://www.BrainWaveCC.com

Lazarus v2.0.11 r64032 / FPC v3.2.1-r47152 (via FpcUpDeluxe) -- Windows 64-bit install w/32-bit cross-compile
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042.572)
Other Systems: Windows 10 Pro x64, Version 2004 or greater

CM630

  • Hero Member
  • *****
  • Posts: 917
  • I'm not sure I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: PosEx variant for case-insensitive search
« Reply #4 on: September 18, 2020, 09:12:48 am »

Just to mention:
In English the capital letter for „i“ is „I“.
In Turkish the capital letter for „i“ is „İ“, while the capital letter for „ı“ is „I“. This is the only exception that I am aware of, but there might be hundreds.
So lowercase and uppercase might be problematic.
Lazarus 2.0.10; W10 or W7 64-bit; FPC 3.2.0; rev 63526

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
Re: PosEx variant for case-insensitive search
« Reply #5 on: September 18, 2020, 10:08:22 am »
Quote
>Okay, but UPPERCASE and lowercase are only two options in the case-insensitive continuum. What about CamelCase, or jUsTmIxEdUpCaSe?
PosExI will search WideChar by WideChar, using the chars from both str1 and str2. It requires Len(str1)=Len(str2) and compares each text char with the pair str1_i and str2_i. If both comparisons are False, the char does not match. Otherwise, the char is OK and the search moves to the next one.
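The pairwise comparison described here can be sketched roughly as follows. This is only a plain Pascal illustration of the idea, not an ASM-optimized routine and not anything that exists in FPC: the name PosExI, its parameter order, and the Offset default are assumptions made for the example, and the uppercase/lowercase needles are written out by hand instead of coming from a lookup table.

```pascal
program PosExIDemo;
{$mode objfpc}{$H+}

// Hypothetical sketch of the proposed PosExI (not an FPC library function).
// Upper and Lower are the needle in both cases, prepared by the caller,
// and must have the same length.
// Returns the 1-based position of the first match, or 0 if there is none.
function PosExI(const Upper, Lower, S: UnicodeString; Offset: SizeInt = 1): SizeInt;
var
  i, j, N: SizeInt;
begin
  Result := 0;
  N := Length(Upper);
  if (N = 0) or (N <> Length(Lower)) or (Offset < 1) then
    Exit;
  for i := Offset to Length(S) - N + 1 do
  begin
    j := 1;
    // A text char is accepted if it equals either prepared form.
    while (j <= N) and
          ((S[i + j - 1] = Upper[j]) or (S[i + j - 1] = Lower[j])) do
      Inc(j);
    if j > N then
      Exit(i);
  end;
end;

begin
  WriteLn(PosExI('TEXT', 'text', 'CudaText editor')); // prints 5
  WriteLn(PosExI('TEXT', 'text', 'some tExT here'));  // prints 6
  WriteLn(PosExI('TEXT', 'text', 'nothing'));         // prints 0
end.
```

Because every position is checked against both forms, mixed case such as tExT matches without any case conversion of the searched text. A faster version could presumably locate the first char with two IndexWord scans before running the inner loop, but that is left out here.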
« Last Edit: September 18, 2020, 10:11:38 am by Alextp »

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
Re: PosEx variant for case-insensitive search
« Reply #6 on: September 18, 2020, 10:10:59 am »
Quote
>In English the capital letter for „i“ is „I“. In Turkish the capital letter for „i“ is „İ“,
No, in Unicode we have a single result for UpperCase(wchar).

Thaddy

  • Hero Member
  • *****
  • Posts: 10528
Re: PosEx variant for case-insensitive search
« Reply #7 on: September 18, 2020, 10:17:28 am »
Quote
>>In English the capital letter for „i“ is „I“. In Turkish the capital letter for „i“ is „İ“,
>No, in Unicode we have single result for UpperCase(wchar).
No, wchar does not expand to unicodechar by itself. So that only partially works (for the UCS2 subset of UTF16, afaik).
« Last Edit: September 18, 2020, 10:22:04 am by Thaddy »

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
Re: PosEx variant for case-insensitive search
« Reply #8 on: September 18, 2020, 10:20:32 am »
If a wchar is not in the Unicode surrogate range (my code has functions IsCharSurrogateLow/...High), then it maps directly to a unicodechar. If it is in that range, we need the next wchar2 to build the unicodechar from the 2 wchars.
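As a rough illustration of that decoding step, here is a minimal sketch. The helpers IsSurrogateHigh, IsSurrogateLow and CodePointAt are stand-ins written for this example; the actual CudaText functions may be named and implemented differently.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

// Stand-in helpers for this example only (assumed, not CudaText's code).
function IsSurrogateHigh(ch: WideChar): Boolean;
begin
  Result := (Ord(ch) >= $D800) and (Ord(ch) <= $DBFF);
end;

function IsSurrogateLow(ch: WideChar): Boolean;
begin
  Result := (Ord(ch) >= $DC00) and (Ord(ch) <= $DFFF);
end;

// Decodes the code point that starts at S[Index] and reports in Len how
// many WideChars it occupies (1 or 2). An unpaired surrogate is returned
// unchanged with Len = 1.
function CodePointAt(const S: UnicodeString; Index: SizeInt; out Len: SizeInt): Cardinal;
begin
  Len := 1;
  Result := Ord(S[Index]);
  if IsSurrogateHigh(S[Index]) and (Index < Length(S)) and
     IsSurrogateLow(S[Index + 1]) then
  begin
    Result := $10000 + ((Result - $D800) shl 10) + (Ord(S[Index + 1]) - $DC00);
    Len := 2;
  end;
end;

var
  S: UnicodeString;
  CP: Cardinal;
  Len: SizeInt;
begin
  // U+1D11E is stored in UTF-16 as the surrogate pair D834 DD1E.
  SetLength(S, 3);
  S[1] := WideChar($D834);
  S[2] := WideChar($DD1E);
  S[3] := WideChar($41); // 'A'

  CP := CodePointAt(S, 1, Len);
  WriteLn(CP, ' ', Len); // prints 119070 2  (119070 = $1D11E)
  CP := CodePointAt(S, 3, Len);
  WriteLn(CP, ' ', Len); // prints 65 1
end.
```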

Thaddy

  • Hero Member
  • *****
  • Posts: 10528
Re: PosEx variant for case-insensitive search
« Reply #9 on: September 18, 2020, 10:26:14 am »
Quote
>If wchar is not in unicode surrogate range (my code has functions IsCharSurrogateLow/...High), then it's mapped to unicodechar. If it is in, we need next wchar2 to make unicodechar from 2 wchars.
Maybe UTF32 is an option, because it maps to everything (including both UTF8 and UTF16). It is expensive in space but cheap in compute.

Alextp

  • Hero Member
  • *****
  • Posts: 1149
    • UVviewsoft
Re: PosEx variant for case-insensitive search
« Reply #10 on: September 18, 2020, 10:30:37 am »
That is no problem for my idea about str1+str2 (of the same Len). If we have a surrogate pair in str1, we must have the same surrogate pair in str2 (because UpperCase/LowerCase doesn't change a surrogate pair, AFAIK).

 
