
Author Topic: PosEx variant for case-insensitive search  (Read 1778 times)

AlexTP

  • Hero Member
  • *****
  • Posts: 2401
    • UVviewsoft
PosEx variant for case-insensitive search
« on: September 17, 2020, 09:37:10 pm »
PosEx is ASM-based, so it's very fast. (It uses the ASM-based IndexWord function.)
For CudaText, I need a variant with case-insensitive matching that takes WideChar/UnicodeString parameters.
It can avoid the WidestringManager by using a callback. (CudaText has such a callback to do UpperCase/LowerCase for a WideChar. It doesn't use the WidestringManager; it uses a table lookup.)
Please?
« Last Edit: September 17, 2020, 09:39:44 pm by Alextp »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11451
  • FPC developer.
Re: PosEx variant for case-insensitive search
« Reply #1 on: September 17, 2020, 09:41:05 pm »
Quote
>PosEx is ASM-based, so it's very fast. (It uses the ASM-based IndexWord function.)
>For CudaText, I need a variant with case-insensitive matching that takes WideChar/UnicodeString parameters.
>Please?

Nope, it uses IndexByte.

For UnicodeString you would need to base it on IndexWord, but that assumes there is a single word-sized value to search for.

And this is hard because Unicode (and Unicode-based case sensitivity) is simply hard. There is no chance that such a version would even be in the same ballpark as the ASCII version.

AlexTP

Re: PosEx variant for case-insensitive search
« Reply #2 on: September 17, 2020, 11:23:36 pm »
Then we can use a trick: pass TWO UnicodeString params to PosExI (example name), str1 and str2 (the uppercase and lowercase forms of the search string). It is the app's job to prepare them. CudaText will prepare them using its table lookup.

ASBzone

  • Hero Member
  • *****
  • Posts: 678
  • Automation leads to relaxation...
    • Free Console Utilities for Windows (and a few for Linux) from BrainWaveCC
Re: PosEx variant for case-insensitive search
« Reply #3 on: September 18, 2020, 02:33:16 am »
Quote
>Then we can use a trick: pass TWO UnicodeString params to PosExI (example name), str1 and str2 (the uppercase and lowercase forms of the search string). It is the app's job to prepare them. CudaText will prepare them using its table lookup.

Okay, but UPPERCASE and lowercase are only two options in the case-insensitive continuum. What about CamelCase, or jUsTmIxEdUpCaSe?
-ASB: https://www.BrainWaveCC.com/

Lazarus v2.2.7-ada7a90186 / FPC v3.2.3-706-gaadb53e72c
(Windows 64-bit install w/Win32 and Linux/Arm cross-compiles via FpcUpDeluxe on both instances)

My Systems: Windows 10/11 Pro x64 (Current)

CM630

  • Hero Member
  • *****
  • Posts: 1091
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: PosEx variant for case-insensitive search
« Reply #4 on: September 18, 2020, 09:12:48 am »

Just to mention:
In English the capital letter for „i“ is „I“.
In Turkish the capital letter for „i“ is „İ“, while the capital letter for „ı“ is „I“. This is the only exception that I am aware of, but there might be hundreds.
So lowercase and uppercase might be problematic.
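
To make the concern concrete, here is a small sketch in Python (used only as a neutral illustration language, not Pascal). It relies on Python's built-in `str.upper()`/`str.lower()`, which apply Unicode's locale-independent mappings; a table lookup such as CudaText's may well give different answers. The helper names `utf16_len` and `case_report` are made up for this example.

```python
def utf16_len(s):
    """Number of UTF-16 code units (WideChars) needed to store s."""
    return len(s.encode("utf-16-le")) // 2

def case_report(ch):
    """Return (upper, lower, UTF-16 length of upper, UTF-16 length of lower)."""
    upper, lower = ch.upper(), ch.lower()
    return upper, lower, utf16_len(upper), utf16_len(lower)

# i, dotless i (U+0131), capital I with dot (U+0130), sharp s (U+00DF)
for ch in ["i", "\u0131", "\u0130", "\u00df"]:
    print(repr(ch), case_report(ch))
```

With these default mappings, dotless „ı“ uppercases to „I“, which lowercases back to dotted „i“, so a round trip does not return the original letter. Some mappings also change the length of the string, which matters for any scheme that assumes the uppercase and lowercase forms have the same length.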
Lazarus 3.2 32 bit (sometimes 64 bit); FPC 3.2.2; rev: Lazarus_3_0 on Win10 64bit.

AlexTP

Re: PosEx variant for case-insensitive search
« Reply #5 on: September 18, 2020, 10:08:22 am »
Quote
>Okay, but UPPERCASE and lowercase are only two options in the case-insensitive continuum. What about CamelCase, or jUsTmIxEdUpCaSe?
PosExI will search one WideChar at a time, using the chars from both str1 and str2. It requires Len(str1)=Len(str2) and compares each next char of the text against the pair str1[i] and str2[i]. If both comparisons are False, that char does not match. Otherwise, that char is OK.
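
The proposal above can be sketched as follows. This is only an illustration in Python, not the ASM-optimized routine being requested, and it is not an existing FPC function: the name `pos_ex_i`, its parameter order, and the 1-based result with 0 for "not found" are assumptions modeled loosely on PosEx.

```python
def pos_ex_i(sub_upper, sub_lower, text, offset=1):
    """Hypothetical PosExI: find the first 1-based position >= offset where
    every char of text equals the matching char of sub_upper OR sub_lower.
    Returns 0 when there is no match."""
    if len(sub_upper) != len(sub_lower):
        raise ValueError("sub_upper and sub_lower must have the same length")
    n = len(sub_upper)
    if n == 0 or offset < 1:
        return 0
    for start in range(offset - 1, len(text) - n + 1):
        # Each position is accepted if either case variant matches.
        if all(text[start + k] == sub_upper[k] or text[start + k] == sub_lower[k]
               for k in range(n)):
            return start + 1
    return 0

print(pos_ex_i("HELLO", "hello", "Say HeLlO"))
```

Because each position is checked against both variants independently, mixed-case text such as "HeLlO" matches even though it equals neither prepared string as a whole.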
« Last Edit: September 18, 2020, 10:11:38 am by Alextp »

AlexTP

Re: PosEx variant for case-insensitive search
« Reply #6 on: September 18, 2020, 10:10:59 am »
Quote
>In English the capital letter for „i“ is „I“. In Turkish the capital letter for „i“ is „İ“,
No, in Unicode we have a single result for UpperCase(wchar).

Thaddy

  • Hero Member
  • *****
  • Posts: 14371
  • Sensorship about opinions does not belong here.
Re: PosEx variant for case-insensitive search
« Reply #7 on: September 18, 2020, 10:17:28 am »
Quote
>In English the capital letter for „i“ is „I“. In Turkish the capital letter for „i“ is „İ“,
>No, in Unicode we have a single result for UpperCase(wchar).
No, a wchar does not expand to a unicodechar by itself. So that only partially works (for the UCS-2 subset of UTF-16, AFAIK).
« Last Edit: September 18, 2020, 10:22:04 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

AlexTP

Re: PosEx variant for case-insensitive search
« Reply #8 on: September 18, 2020, 10:20:32 am »
If a wchar is not in the Unicode surrogate range (my code has functions IsCharSurrogateLow/...High), then it maps directly to a unicodechar. If it is in that range, we need the next wchar2 to make one unicodechar from the 2 wchars.
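
The rule described here is the standard UTF-16 decoding step and can be sketched as below. The Python helpers `is_high_surrogate`, `is_low_surrogate` and `decode_utf16` are hypothetical stand-ins written for this example; they are not CudaText's IsCharSurrogateLow/...High functions.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def decode_utf16(units):
    """Turn a list of 16-bit code units into a list of Unicode code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (is_high_surrogate(u) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            # Combine the high/low pair into one code point above U+FFFF.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # A BMP unit maps to itself (an unpaired surrogate is passed through).
            out.append(u)
            i += 1
    return out

print([hex(cp) for cp in decode_utf16([0x0041, 0xD83D, 0xDE00])])
```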

Thaddy

Re: PosEx variant for case-insensitive search
« Reply #9 on: September 18, 2020, 10:26:14 am »
Quote
>If a wchar is not in the Unicode surrogate range (my code has functions IsCharSurrogateLow/...High), then it maps directly to a unicodechar. If it is in that range, we need the next wchar2 to make one unicodechar from the 2 wchars.
Maybe UTF-32 is an option, because it maps to everything (including both UTF-8 and UTF-16). It is expensive in space but cheap in compute.
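
A rough illustration of that space/compute trade-off, using Python's built-in codecs (the sample string is arbitrary): in UTF-32 every code point occupies exactly one 4-byte unit, so the k-th character sits at a fixed offset and no surrogate handling is needed.

```python
text = "a\u03a9\U0001F600"  # 'a', Greek capital omega, one emoji outside the BMP

utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

units16 = len(utf16) // 2   # the emoji needs two 16-bit units (a surrogate pair)
units32 = len(utf32) // 4   # always one 32-bit unit per code point

# Fixed-width indexing: the third code point starts at byte offset 2 * 4.
third = int.from_bytes(utf32[8:12], "little")

print(len(utf16), units16, len(utf32), units32, hex(third))
```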

AlexTP

Re: PosEx variant for case-insensitive search
« Reply #10 on: September 18, 2020, 10:30:37 am »
No problem with my idea about str1+str2 (of the same Len). If we have a surrogate pair in str1, we must have the same surrogate pair in str2 (because Uppercase/Lowercase for a surrogate pair doesn't change it, AFAIK).
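
The "AFAIK" can be checked with a quick sketch. Using Python's built-in Unicode case mapping (which may differ from CudaText's table lookup), at least some characters outside the BMP do have case pairs, for example the Deseret letters U+10400/U+10428. The helper `utf16_units` is a made-up name for this example.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

lower = "\U00010428"    # DESERET SMALL LETTER LONG I
upper = lower.upper()   # DESERET CAPITAL LETTER LONG I (U+10400)

print([hex(u) for u in utf16_units(lower)])
print([hex(u) for u in utf16_units(upper)])
```

In this example the low surrogate differs between the two forms, so the pair is not identical in str1 and str2. Both forms are still two code units long, so the Len(str1)=Len(str2) requirement holds here, and a per-WideChar comparison against both strings would still accept either form.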

 
