
Author Topic: IsApostrophe anyone?  (Read 854 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 395
IsApostrophe anyone?
« on: October 31, 2025, 08:12:30 am »
This has turned into a mess ...
Ask: Given a string, I want to determine whether a right single curly quote is an apostrophe ...
Problem: I've landed in a UTF swamp ...

Here’s the code for that function ... obviously, it’s a heuristic since there’s no foolproof way to figure out in English if a quote is an apostrophe or the closure of a quoted sentence:
‘Hello, how are you?’ ==> simple quote
I wouldn’t ==> apostrophe
Whatcha doin’ ==> apostrophe (missing ?)

In any event, my problem is with the conversion between different string types: specifically, I want to use Character.IsLetter, which requires a WideChar. I’ve used UTF8CodePointToUnicode to obtain the code point and then performed a straight WideChar conversion (WideChar(CodePoint)) to convert it into a WideChar, but that’s failing.

Code: Pascal
function IsApostrophe(const S: String; const aPos: Integer): Boolean;
var aPrevChar: WideChar;
    aNextChar: WideChar;
    CodePoint: LongInt;
    aLen: Integer;
begin
  If UTF8CompareStr(UTF8Copy(S,aPos,1),RightCurlySingleQuote) <> 0 then Exit(False);
  aLen := UTF8Length(S);
  //If at start or end, assume it’s a quote (even if it’s the wrong quote.)
  If (aPos = 1) or (aPos = aLen)
  then Exit(False);
  UTF8CodepointToUnicode(@S[aPos-1], Codepoint);
  aPrevChar := WideChar(CodePoint);
  UTF8CodepointToUnicode(@S[aPos+1], Codepoint);
  aNextChar := WideChar(CodePoint);
  If Not Character.IsLetter(aNextChar)
  then Result := (Lowercase(aPrevChar) = 's')                              or
                 ((aPos > 3) and (Lowercase(UTF8Copy(S,aPos-2,2)) = 'in')) or
                 IsDigit(aNextChar)
  else If IsLetter(aPrevChar)
       then Exit(True)
       else if (LowerCase(UTF8Copy(S,aPos+1,3)) = 'tis'  ) or
               (LowerCase(UTF8Copy(S,aPos+1,4)) = 'twas' ) or
               (LowerCase(UTF8Copy(S,aPos+1,5)) = 'cause') or
               (LowerCase(UTF8Copy(S,aPos+1,2)) = 'em'   ) or
               (LowerCase(UTF8Copy(S,aPos+1,3)) = 'til'  ) or
               (LowerCase(UTF8Copy(S,aPos+1,5)) = 'round') or
               (LowerCase(UTF8Copy(S,aPos+1,4)) = 'fore' )
       then Exit(True);
end;

Suggestions?

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: IsApostrophe anyone?
« Reply #1 on: October 31, 2025, 08:34:51 am »
Way too complex:

Use CharInSet() which has a widechar overload.
Silly example:
Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  cset:TSysCharSet = [#39,#145,#146];// apostrophe and left/right single quotation marks
begin
  if CharInSet('''',cset) then caption := 'Apostrophe';
end;
Note that CharInSet only works for WideChars that fit into the AnsiChar range: the lower half plus the upper half as defined by the ANSI codepage. It does not work for higher code points.
Since the apostrophe is #39, this works fine.
What you call an apostrophe is called "right single quotation mark" and is #146. The left single quotation mark is #145.

Good overview chart: https://www.alanwood.net/demos/ansi.html

My example code covers them all if the ansi codepage is correct!
No it doesn't. Will write an alternative.
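A minimal sketch of why the set cannot cover the curly quotes. It assumes the SysUtils WideChar overload of CharInSet only accepts characters up to #255 (as far as I know that is how FPC implements it). #145 and #146 are the Windows-1252 byte positions of the curly quotes, whereas in a WideChar they are U+2018 and U+2019. The program name and the QuoteSet constant are made up for the illustration.

```pascal
program CharInSetSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // Same members as the set in the example above: byte values, not code points.
  QuoteSet: TSysCharSet = [#39, #145, #146];

begin
  // The ASCII apostrophe U+0027 fits in a byte, so the set can contain it.
  WriteLn('U+0027 in set: ', CharInSet(WideChar(#39), QuoteSet));
  // U+2019 RIGHT SINGLE QUOTATION MARK has no slot in a set of AnsiChar.
  WriteLn('U+2019 in set: ', CharInSet(WideChar($2019), QuoteSet));
  // The ordinal is far above the largest possible set element (255).
  WriteLn('U+2019 above #255: ', Ord(WideChar($2019)) > High(Byte));
end.
```

In a UTF-8 string the right quote is the three bytes E2 80 99, so a byte-wise set test misses it there as well.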
« Last Edit: October 31, 2025, 10:42:18 am by Thaddy »
If Europe sells their USA bonds the USD will collapse. Europe can afford that given average state debts. The USA can't afford that. Just some advice...

avk

  • Hero Member
  • *****
  • Posts: 825
Re: IsApostrophe anyone?
« Reply #2 on: October 31, 2025, 10:03:04 am »
...
Suggestions?

It would probably make sense to use the UTF8CodepointToUnicode() function the right way:
Code: Pascal
function IsApostrophe(const S: String; const aPos: Integer): Boolean;
var
    ...
    CodePoint: Cardinal;
    aLen: Integer;
begin
  ...
  // The function returns the code point; the second argument receives its length in bytes.
  CodePoint := UTF8CodepointToUnicode(@S[aPos-1], aLen);
  ...
end;
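One more thing to watch when applying this: in the original function aPos is a code-point index (it is passed to UTF8Copy), while S[aPos-1] indexes bytes, so the two only coincide when everything before the quote is plain ASCII. A minimal sketch of the idea, assuming well-formed UTF-8. DecodeAt and ByteOffsetOf are hypothetical helpers, hand-rolled here so the program compiles with plain FPC and without LazUTF8; they are not part of any library.

```pascal
program Utf8IndexSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: decode the UTF-8 sequence that starts at byte BytePos.
// Returns the code point; Len receives the sequence length in bytes.
// Assumes well-formed input; no validation is done.
function DecodeAt(const S: String; BytePos: Integer; out Len: Integer): Cardinal;
var
  B: Byte;
  I: Integer;
begin
  B := Byte(S[BytePos]);
  if B < $80 then
  begin
    Len := 1;
    Result := B;
  end
  else if B < $E0 then
  begin
    Len := 2;
    Result := B and $1F;
  end
  else if B < $F0 then
  begin
    Len := 3;
    Result := B and $0F;
  end
  else
  begin
    Len := 4;
    Result := B and $07;
  end;
  // Each continuation byte contributes its low six bits.
  for I := 1 to Len - 1 do
    Result := (Result shl 6) or (Byte(S[BytePos + I]) and $3F);
end;

// Hypothetical helper: byte offset (1-based) of the CharIndex-th code point.
function ByteOffsetOf(const S: String; CharIndex: Integer): Integer;
var
  Len: Integer;
begin
  Result := 1;
  while (CharIndex > 1) and (Result <= Length(S)) do
  begin
    DecodeAt(S, Result, Len);  // only the length is needed here
    Inc(Result, Len);
    Dec(CharIndex);
  end;
end;

const
  // The word "don't" with U+2019 spelled out as its UTF-8 bytes E2 80 99.
  Sample: String = 'don'#$E2#$80#$99't';
var
  Len: Integer;
  CP: Cardinal;
begin
  // The quote is the 4th code point; here it also starts at byte 4.
  CP := DecodeAt(Sample, ByteOffsetOf(Sample, 4), Len);
  WriteLn('is U+2019: ', CP = $2019, ', bytes: ', Len);
  // The 5th code point starts at byte 7, not byte 5.
  WriteLn('5th code point starts at byte ', ByteOffsetOf(Sample, 5));
end.
```

With this sample, indexing S[aPos+1] for aPos = 4 would land in the middle of the quote's own byte sequence rather than on the following letter.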


Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: IsApostrophe anyone?
« Reply #3 on: October 31, 2025, 10:44:03 am »
Yes, that is a better solution until CharInSet is provided for UTF8 in a meaningful way.
Proposed overloads:
[removed: too many bugs, will post back after fixing]
« Last Edit: November 01, 2025, 08:09:40 am by Thaddy »

EganSolo

  • Sr. Member
  • ****
  • Posts: 395
Re: IsApostrophe anyone?
« Reply #4 on: November 01, 2025, 03:35:54 am »
Thanks guys, especially avk; that's what I was missing.
Thaddy: It's not a question of detecting the right single curly quote; it's figuring out whether it's acting as an end quote or as an apostrophe.

'Hey, he said,' <= closing quote.
That's Carlos's <= apostrophe.

But yeah, AVK's fix did the trick. Thanks.

dbannon

  • Hero Member
  • *****
  • Posts: 3684
    • tomboy-ng, a rewrite of the classic Tomboy
Re: IsApostrophe anyone?
« Reply #5 on: November 01, 2025, 08:17:48 am »
Egan, not wanting to mess with your day, but I was just answering a post in a different thread and realised I did something you might not like. I wanted to indicate I was talking about multiple ppu files, and did what is quite common (but incorrect):

ppu's

That's neither quote nor apostrophe; it's bad grammar, but people do it. And English is an evolving language.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: IsApostrophe anyone?
« Reply #6 on: November 01, 2025, 08:49:37 am »
Anyway, I debugged my charinset overloads, which are really string in set of string overloads:
Code: Pascal
program untitled;
{$mode delphi}{$H+}{$codepage utf8}
uses
  sysutils;

// UTF-8 CharInSet
function CharInSet(const C: UTF8String; const CharSet: array of UTF8String): Boolean; overload;
var
  I: Integer;
  FirstChar: UTF8String;
  Len: Integer;
begin
  Result := False;

  if C = '' then
    Exit;

  // Get the first UTF-8 character properly
  // imho better than the UTF8CodepointToUnicode way
  Len := 1;
  case Byte(C[1]) of
    $C2..$DF: Len := 2;  // 2-byte UTF-8
    $E0..$EF: Len := 3;  // 3-byte UTF-8
    $F0..$F4: Len := 4;  // 4-byte UTF-8
  end;

  // Extract first UTF-8 character
  FirstChar := Copy(C, 1, Len);

  // Check against character set
  for I := Low(CharSet) to High(CharSet) do
    if FirstChar = CharSet[I] then
      Exit(True);
end;


// Or for UnicodeString
function CharInSet(const C: UnicodeString; const CharSet: array of UnicodeString): Boolean; overload;
var
  I: Integer;
begin
  Result := False;
  for I := Low(CharSet) to High(CharSet) do
    if C = CharSet[I] then Exit(True);
end;

var
  ch: utf8string = '😊';
  a: array of utf8String;
  ch2: UnicodeString = '😀';
  a2: array of UnicodeString;

begin
  a := ['😀', '😊', '👍'];
  if CharInSet(ch,a) then
    WriteLn('Surrogate pair character found:',ch);
  a2 := ['😀', '😊', '👍'];
  if CharInSet(ch2,a2) then
    WriteLn('Surrogate pair character found:',ch2);
end.
Don't forget the {$codepage UTF8} directive, because of the string literals.
« Last Edit: November 01, 2025, 09:30:52 am by Thaddy »

EganSolo

  • Sr. Member
  • ****
  • Posts: 395
Re: IsApostrophe anyone?
« Reply #7 on: November 01, 2025, 08:46:35 pm »
Thaddy,
  Thank you very much for this code. It helped quite a bit! Much appreciated!

 
