
IsApostrophe anyone?

EganSolo:
This has turned into a mess ...
Ask: Given a string, I want to determine whether a right single curly quote is acting as an apostrophe ...
Problem: I've landed in a UTF swamp ...

Here’s the code for that function ... obviously, it’s a heuristic since there’s no foolproof way to figure out in English if a quote is an apostrophe or the closure of a quoted sentence:
‘Hello, how are you?’ ==> simple quote
I wouldn’t ==> apostrophe
Whatcha doin’ ==> apostrophe (missing ?)

In any event, my problem is with the conversion between different string types: specifically, I want to use Character.IsLetter, which requires a WideChar. I’ve used UTF8CodePointToUnicode to obtain the code point and then performed a straight WideChar conversion (WideChar(CodePoint)) to convert it into a WideChar, but that’s failing.


--- Code: Pascal ---
function IsApostrophe(const S: String; const aPos: Integer): Boolean;
var aPrevChar: WideChar;
    aNextChar: WideChar;
    CodePoint: LongInt;
    aLen: Integer;
begin
  If UTF8CompareStr(UTF8Copy(S,aPos,1),RightCurlySingleQuote) <> 0 then Exit(False);
  aLen := UTF8Length(S);
  //If at start or end, assume it’s a quote (even if it’s the wrong quote.)
  If (aPos = 1) or (aPos = aLen)
  then Exit(False);
  UTF8CodepointToUnicode(@S[aPos-1], Codepoint);
  aPrevChar := WideChar(CodePoint);
  UTF8CodepointToUnicode(@S[aPos+1], Codepoint);
  aNextChar := WideChar(CodePoint);
  If Not Character.IsLetter(aNextChar)
  then Result := (Lowercase(aPrevChar) = 's')                              or
                 ((aPos > 3) and (Lowercase(UTF8Copy(S,aPos-2,2)) = 'in')) or
                 IsDigit(aNextChar)
  else If IsLetter(aPrevChar)
       then Exit(True)
       else if (LowerCase(UTF8Copy(S,aPos+1,3)) = 'tis'  ) or
               (LowerCase(UTF8Copy(S,aPos+1,4)) = 'twas' ) or
               (LowerCase(UTF8Copy(S,aPos+1,5)) = 'cause') or
               (LowerCase(UTF8Copy(S,aPos+1,2)) = 'em'   ) or
               (LowerCase(UTF8Copy(S,aPos+1,3)) = 'til'  ) or
               (LowerCase(UTF8Copy(S,aPos+1,5)) = 'round') or
               (LowerCase(UTF8Copy(S,aPos+1,4)) = 'fore' )
       then Exit(True);
end;
--- End code ---

Suggestions?

Thaddy:
Way too complex:

Use CharInSet() which has a widechar overload.
Silly example:
--- Code: Pascal ---
procedure TForm1.FormCreate(Sender: TObject);
var
  cset:TSysCharSet = [#39,#145,#146];// apostrophe and left/right single quotation marks
begin
  if CharInSet('''',cset) then caption  := 'Apostrophe';
end;
--- End code ---

Note that CharInSet only works for WideChars that fit into the AnsiChar lower-half range plus the upper-half range of the ANSI code page. It does not work for other, higher characters.
Since the apostrophe is #39, this works fine.
What you call an apostrophe is called the "right single quotation mark" and is #146 in the Windows-1252 code page. The left single quotation mark is #145.

Good overview chart: https://www.alanwood.net/demos/ansi.html

My example code covers them all if the ANSI code page is correct!
No it doesn't. Will write an alternative.
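
For reference, #145 and #146 are single-byte Windows-1252 values. In a UTF-8 string the right single quotation mark is the code point U+2019, stored as the three bytes E2 80 99, so a set of single-byte characters cannot match it. The following is only a minimal sketch, assuming plain Free Pascal with the SysUtils unit; the program name and the constant name are made up for illustration.

```pascal
program QuoteBytes;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // U+2019 RIGHT SINGLE QUOTATION MARK written as its UTF-8 bytes, so the
  // example does not depend on the encoding of the source file.
  RightQuoteUTF8 = #$E2#$80#$99;

var
  S: String;
  I: Integer;
begin
  S := RightQuoteUTF8;
  // Length counts bytes, not code points.
  WriteLn('Byte length: ', Length(S));
  for I := 1 to Length(S) do
    Write(IntToHex(Ord(S[I]), 2), ' ');
  WriteLn;
  // The first byte is #$E2, which is none of #39, #145 or #146.
  WriteLn('First byte in set: ', CharInSet(S[1], [#39, #145, #146]));
end.
```

The set test only ever sees one byte at a time, which is why the code point has to be decoded first before it can be compared.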

avk:

--- Quote from: EganSolo on October 31, 2025, 08:12:30 am ---...
Suggestions?

--- End quote ---

It would probably make sense to use the UTF8CodepointToUnicode() function the right way: its result is the code point, and its second parameter receives the code point's length in bytes.

--- Code: Pascal ---
function IsApostrophe(const S: String; const aPos: Integer): Boolean;
var
    ...
    CodePoint: Cardinal;
    aLen: Integer;
begin
  ...
  CodePoint := UTF8CodepointToUnicode(@S[aPos-1], aLen);
  ...
end;
--- End code ---
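
One thing to watch when applying this: S[aPos-1] indexes bytes, while UTF8Copy and UTF8Length count code points, and the two only coincide while everything before the position is ASCII. The sketch below shows one possible way to fetch the neighbours by code point index instead. It assumes the LazUTF8 unit from Lazarus (LazUtils package) is available; CodepointAt and ShowCp are hypothetical helpers, not something posted in this thread.

```pascal
program NeighbourCodepoints;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8;  // LazUTF8 ships with Lazarus (LazUtils package)

// Hypothetical helper: returns the code point at the 1-based code point
// index aIndex of the UTF-8 string S, or 0 when aIndex is out of range.
function CodepointAt(const S: String; aIndex: PtrInt): Cardinal;
var
  P: PChar;
  CpLen: Integer;
begin
  Result := 0;
  if (aIndex < 1) or (aIndex > UTF8Length(S)) then Exit;
  // UTF8CodepointStart expects a 0-based code point index.
  P := UTF8CodepointStart(PChar(S), Length(S), aIndex - 1);
  if P = nil then Exit;
  // The function result is the code point; CpLen receives its byte length.
  Result := UTF8CodepointToUnicode(P, CpLen);
end;

// Hypothetical helper: prints a code point in U+XXXX notation.
procedure ShowCp(const aLabel: String; Cp: Cardinal);
begin
  WriteLn(aLabel, ' U+', IntToHex(Int64(Cp), 4));
end;

var
  S: String;
begin
  // "wouldn’t" with U+2019 written as its UTF-8 bytes E2 80 99.
  S := 'wouldn'#$E2#$80#$99't';
  ShowCp('Quote:   ', CodepointAt(S, 7));
  ShowCp('Previous:', CodepointAt(S, 6));
  ShowCp('Next:    ', CodepointAt(S, 8));
end.
```

With this input the three lines should show U+2019, U+006E and U+0074. Casting the result to WideChar is only safe for code points up to U+FFFF.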

Thaddy:
Yes, that is a better solution until CharInSet is provided for UTF-8 in a meaningful way.
Proposed overloads:
[removed: too many bugs, post back after fixing]

EganSolo:
Thanks, guys, especially avk; that's what I was missing.
Thaddy: It's not a question of detecting the right single curly quote; it's figuring out whether it's acting as an end quote or as an apostrophe.

'Hey, he said,' <= closing quote.
That's Carlos's <= apostrophe.
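
The simplest part of that distinction, a letter on both sides of the quote, can be sketched as follows. This is only an illustration, assuming the neighbouring code points have already been decoded and that the Character unit's TCharacter.IsLetter is used for the letter test. IsLetterCp and LooksLikeApostrophe are hypothetical names, and the word-list cases such as doin’ or ’tis are deliberately left out.

```pascal
program ApostropheHeuristic;
{$mode objfpc}{$H+}
uses
  Character;

// Hypothetical helper: True when Cp is a letter in the Basic Multilingual
// Plane. Code points above U+FFFF do not fit in one UnicodeChar and are
// treated as non-letters here.
function IsLetterCp(Cp: Cardinal): Boolean;
begin
  Result := (Cp <= $FFFF) and TCharacter.IsLetter(UnicodeChar(Cp));
end;

// Hypothetical helper: a quote with a letter on both sides is treated as an
// apostrophe. Pass 0 for a neighbour that does not exist (start or end of text).
function LooksLikeApostrophe(PrevCp, NextCp: Cardinal): Boolean;
begin
  Result := IsLetterCp(PrevCp) and IsLetterCp(NextCp);
end;

begin
  // Carlos’s: 's' before the quote, 's' after it.
  WriteLn('Carlos’s: ', LooksLikeApostrophe(Ord('s'), Ord('s')));
  // ...he said,’ with a comma before the quote and nothing after it.
  WriteLn('said,’:   ', LooksLikeApostrophe(Ord(','), 0));
end.
```

The first call should report TRUE and the second FALSE. Anything that falls through this test still needs the kind of extra rules used in the original function.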

But yeah, AVK's fix did the trick. Thanks.
