
Topic: Custom string type

MarkMLl

  • Hero Member
  • *****
  • Posts: 8045
Custom string type
« on: October 10, 2024, 10:27:33 pm »
Is it possible to tell the compiler to not assign any special meaning to "String", so that a project can define its own?

I'm hoping at some point to look at a reference implementation for Tree Meta, an early (but powerful) compiler generation tool. I would, obviously, prefer to use FPC for this plus the Lazarus IDE, but I have four very specific requirements:

1) Be able to handle arbitrary character sets and codepages, but base strings on 8-bit characters unless specifically needed.

2) Be able to adapt to any text input file (including- gasp- EBCDIC if that's what the user wants).

3) Ensure that a string can be iterated like an array (i.e. UTF-8 is definitely out). Using "classical Pascal"- as described by many books from its heyday, and broadly similar to how C used to do it- is essential, since anything else will alienate readers who think it's obsolete.

4) As such, avoiding anything that smacks of "cleverness": specialist iterators, generics and so on.

I anticipate a "black box" similar to a StringList to encapsulate a file, with each string being based on either 8- or 16-bit characters (32-bit is left as an exercise, but would probably be unlikely). Other than that, very "Turbo Pascal like" code.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

PascalDragon

  • Hero Member
  • *****
  • Posts: 5766
  • Compiler Developer
Re: Custom string type
« Reply #1 on: October 10, 2024, 10:44:14 pm »
Is it possible to tell the compiler to not assign any special meaning to "String", so that a project can define its own?

No.

I'm hoping at some point to look at a reference implementation for Tree Meta, an early (but powerful) compiler generation tool. I would, obviously, prefer to use FPC for this plus the Lazarus IDE, but I have four very specific requirements:

1) Be able to handle arbitrary character sets and codepages, but base strings on 8-bit characters unless specifically needed.

2) Be able to adapt to any text input file (including- gasp- EBCDIC if that's what the user wants).

3) Ensure that a string can be iterated like an array (i.e. UTF-8 is definitely out). Using "classical Pascal"- as described by many books from its heyday, and broadly similar to how C used to do it- is essential, since anything else will alienate readers who think it's obsolete.

4) As such, avoiding anything that smacks of "cleverness": specialist iterators, generics and so on.

Just use AnsiString, as it's code page aware and can transparently convert between code pages (and to/from Unicode when assigned to/from a UnicodeString). Aside from inside the LCL, no one forces you to use UTF-8 encoded AnsiStrings.

Also you can provide conversions from/to EBCDIC if desired.
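For illustration, a table-driven translation of this kind could look roughly like the following. It is only a minimal sketch: the names ToAscii, InitTable and Translate are made up here, and the table covers just the space, the uppercase letters and the digits as laid out in common EBCDIC code pages such as CP037. A real table would fill all 256 entries, and a second table would handle the reverse direction.

```pascal
program EbcdicSketch;
{$mode objfpc}{$H+}

const
  { "HELLO" in EBCDIC, used only as demo input }
  Sample: array[1..5] of Byte = ($C8, $C5, $D3, $D3, $D6);

var
  ToAscii: array[Byte] of AnsiChar;  { hypothetical, deliberately partial table }

procedure InitTable;
var
  B: Integer;
begin
  for B := 0 to 255 do
    ToAscii[B] := '?';                       { anything unmapped becomes '?' }
  ToAscii[$40] := ' ';
  for B := 0 to 8 do
    ToAscii[$C1 + B] := Chr(Ord('A') + B);   { A..I }
  for B := 0 to 8 do
    ToAscii[$D1 + B] := Chr(Ord('J') + B);   { J..R }
  for B := 0 to 7 do
    ToAscii[$E2 + B] := Chr(Ord('S') + B);   { S..Z }
  for B := 0 to 9 do
    ToAscii[$F0 + B] := Chr(Ord('0') + B);   { 0..9 }
end;

{ Translate byte by byte; input and output have the same length. }
function Translate(const Src: AnsiString): AnsiString;
var
  I: Integer;
begin
  SetLength(Result, Length(Src));
  for I := 1 to Length(Src) do
    Result[I] := ToAscii[Ord(Src[I])];
end;

var
  Src: AnsiString;
  I: Integer;
begin
  InitTable;
  SetLength(Src, Length(Sample));
  for I := 1 to Length(Sample) do
    Src[I] := Chr(Sample[I]);
  WriteLn(Translate(Src));                   { prints HELLO }
end.
```

Because the mapping is one byte in, one byte out, the translated string can still be indexed like a plain array.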

I anticipate a "black box" similar to a StringList to encapsulate a file, with each string being based on either 8- or 16-bit characters (32-bit is left as an exercise, but would probably be unlikely). Other than that, very "Turbo Pascal like" code.

The problem is that you can't simply iterate with only a single type through 8-, 16- or 32-bit characters unless you use an iterator that converts to UTF-32, so you need at least three types. And FPC provides them: AnsiString (for single- and multi-byte characters), UnicodeString (for 2-byte characters, including surrogate pairs) and UCS4String (for 4-byte characters). The latter is simply an array of UCS4Char, with UCS4Char being a LongWord.
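As a rough illustration of the three types, the sketch below indexes each of them with an ordinary for loop. The program name and variable names are invented for the example. One detail worth checking against your FPC version: as far as I know, UnicodeStringToUCS4String appends a terminating 0 element, and UCS4String is a 0-based dynamic array, so the loop bounds differ from those of the two string types.

```pascal
program ThreeWidths;
{$mode objfpc}{$H+}

var
  A: AnsiString;     { 1-byte code units, indexed from 1 }
  U: UnicodeString;  { 2-byte code units, indexed from 1 }
  C: UCS4String;     { 4-byte code points, dynamic array indexed from 0 }
  I: Integer;
begin
  A := 'abc';
  U := 'abc';
  C := UnicodeStringToUCS4String(U);

  for I := 1 to Length(A) do
    WriteLn('A[', I, '] = ', Ord(A[I]));
  for I := 1 to Length(U) do
    WriteLn('U[', I, '] = ', Ord(U[I]));
  { Assumption: the last element of C is the terminating 0, hence Length(C) - 2. }
  for I := 0 to Length(C) - 2 do
    WriteLn('C[', I, '] = ', LongWord(C[I]));
end.
```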

Sieben

  • Sr. Member
  • ****
  • Posts: 367
Re: Custom string type
« Reply #2 on: October 11, 2024, 12:55:21 am »
Is that why low level functions like Length and SetLength work on both strings and dynamic arrays? Isn't a string in any case basically none other than a one-dimensional dynamic array?
Lazarus 2.2.0, FPC 3.2.2, .deb install on Ubuntu Xenial 32 / Gtk2 / Unity7

MarkMLl

  • Hero Member
  • *****
  • Posts: 8045
Re: Custom string type
« Reply #3 on: October 11, 2024, 09:39:11 am »
The problem is that you can't simply iterate with only a single type through 8-, 16- or 32-bit characters unless you use an iterator that converts to UTF-32, so you need at least three types. And FPC provides them: AnsiString (for single- and multi-byte characters), UnicodeString (for 2-byte characters, including surrogate pairs) and UCS4String (for 4-byte characters). The latter is simply an array of UCS4Char, with UCS4Char being a LongWord.

Being able to index character-by-character as though a string is an array is an absolute prerequisite. Since there is at present no Tree-Meta reference implementation (and a substantial amount of historical material was lost- together with its custodian's home- in one of the California fires), I anticipate that readers will not be familiar with "modern Pascal" and might actually be hostile to the language. At that point the further that it moves from the language as it was in its heyday the worse for the project.

So using a variable-length encoding (UTF-8) is out. I could obviously use a 16-bit fixed-length encoding, but that would give problems to anybody who tried to backport it to an older system (e.g. an ICL mainframe, which is where much of the extant documentation originates)... although most older systems would have a limited character repertoire.

Another possibility would be to have an 8-bit internal representation, with translation on input and output. I'm particularly interested in being able to handle things like the old 8-bit APL codepages, which were never allocated "official" numbers... can an application "plug-in" a custom codepage?

Being able to handle non-Western text is by no means essential: /possibly/ permissible as variable content, /possibly/ permissible as procedure etc. names.

MarkMLl

Warfley

  • Hero Member
  • *****
  • Posts: 1771
Re: Custom string type
« Reply #4 on: October 11, 2024, 12:52:18 pm »
I mean conceptually it's quite easy. Build your internal engine based on UTF-32, then write a conversion function that takes a file in a specific encoding, e.g. UTF-8, ANSI with codepages, UTF-32 with unicode planes, whatever, and convert it to UTF-32 before utilization in your engine.

But I don't know why you need a fixed-size character model for this. I mean, all you need is a backtracking DFA for your lexer, which can easily be implemented for UTF-8 strings with Unicode codepoints for the character sets:
Code: Pascal
function BacktrackDFA(const AString: String; var StartPos: SizeInt; DFA: TDFA): TNullable<TMatch>;
var
  TokenStart, CurrentPos: SizeInt;
  CurrentState, CPLen, U32Char: Integer;
begin
  Result := null; // no match until a final state is reached
  TokenStart := StartPos; // StartPos is advanced below, so remember where the token begins
  CurrentPos := StartPos;
  CurrentState := DFA.InitialState;
  // Stop on the error state or at the end of the input
  while (CurrentState <> DFA.ErrorState) and (CurrentPos <= Length(AString)) do
  begin
    // UTF8CharLen and UTF8Codepoint are helper routines assumed to be defined elsewhere
    CPLen := UTF8CharLen(AString[CurrentPos]);
    // Copy is 1-based like AString[], whereas the Substring helper is 0-based
    U32Char := UTF8Codepoint(Copy(AString, CurrentPos, CPLen));
    CurrentState := DFA.Step(CurrentState, U32Char);
    Inc(CurrentPos, CPLen);
    if DFA.isFinal(CurrentState) then
    begin
      // Longest match so far, measured from the start of the token
      Result := CreateMatch(TokenStart, AString, CurrentPos - TokenStart, DFA.FinalToken(CurrentState));
      StartPos := CurrentPos; // Set Backtrack
    end;
  end;
end;

It's really easy, even without iterators or other "cleverness" stuff as you call it. I've built this already a few times; the last one was for my GOLD engine lexer, which does its character matching based on UTF-32 and reads strings in UTF-8: https://github.com/Warfley/FPCGoldEngine/blob/master/src/lexer.pas

And I have built a UTF-8 based lexer for GOLD engines already in 4 different languages. So it's really not difficult
« Last Edit: October 11, 2024, 12:55:27 pm by Warfley »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5766
  • Compiler Developer
Re: Custom string type
« Reply #5 on: October 12, 2024, 05:38:44 pm »
Being able to index character-by-character as though a string is an array is an absolute prerequisite.

Then best go the route of using UTF-32 and convert upon load from/save to whatever the encoding is. This gives you the benefit of indexing at the expense of runtime storage.
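A minimal sketch of that load/convert/save route, assuming the file content arrives as UTF-8 bytes. The function names Utf8ToWorking and WorkingToUtf8 are made up for the example; the conversion routines they call (UTF8Decode, UTF8Encode, UnicodeStringToUCS4String, UCS4StringToUnicodeString) come from the FPC system unit. Other input encodings would need their own decoding step in front.

```pascal
program Utf32Working;
{$mode objfpc}{$H+}

{ Decode UTF-8 bytes into a fixed-width working buffer. }
function Utf8ToWorking(const Bytes: RawByteString): UCS4String;
begin
  Result := UnicodeStringToUCS4String(UTF8Decode(Bytes));
end;

{ Encode the working buffer back to UTF-8 for saving. }
function WorkingToUtf8(const Buf: UCS4String): RawByteString;
begin
  Result := UTF8Encode(UCS4StringToUnicodeString(Buf));
end;

var
  Raw: UTF8String;
  Buf: UCS4String;
  I: Integer;
begin
  { Demo input: U+2374 (APL rho) as three UTF-8 bytes, followed by ASCII 'x' }
  SetLength(Raw, 4);
  Raw[1] := Chr($E2);
  Raw[2] := Chr($8D);
  Raw[3] := Chr($B4);
  Raw[4] := 'x';

  Buf := Utf8ToWorking(Raw);

  { One array element per code point; the final element is a terminating 0. }
  for I := 0 to Length(Buf) - 2 do
    WriteLn('code point ', I, ' = ', LongWord(Buf[I]));

  if WorkingToUtf8(Buf) = Raw then
    WriteLn('round trip ok');
end.
```

The cost mentioned above is visible here: four bytes of input become two 4-byte elements plus a terminator in the working buffer.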

Another possibility would be to have an 8-bit internal representation, with translation on input and output. I'm particularly interested in being able to handle things like the old 8-bit APL codepages, which were never allocated "official" numbers... can an application "plug-in" a custom codepage?

You'd need to retrieve the WideString manager using GetWideStringManager, replace/wrap the Wide2AnsiMoveProc, Ansi2WideMoveProc, Unicode2AnsiMoveProc and Ansi2UnicodeMoveProc function pointers and set the WideString manager again using SetWideStringManager. For the wrapped functions just pick some code pages that are not in use by Windows (considering that the code page's range is 16-bit there should be some ;) ).

MarkMLl

  • Hero Member
  • *****
  • Posts: 8045
Re: Custom string type
« Reply #6 on: October 12, 2024, 05:59:17 pm »
Then best go the route of using UTF-32 and convert upon load from/save to whatever the encoding is. This gives you the benefit of indexing at the expense of runtime storage.
...
You'd need to retrieve the WideString manager using GetWideStringManager, replace/wrap the Wide2AnsiMoveProc, Ansi2WideMoveProc, Unicode2AnsiMoveProc and Ansi2UnicodeMoveProc function pointers and set the WideString manager again using SetWideStringManager. For the wrapped functions just pick some code pages that are not in use by Windows (considering that the code page's range is 16-bit there should be some ;) ).

Thanks for all that. I was /hoping/ there was some realistic way to organise it so that the working strings (i.e. those currently being operated on), which should be fairly small in number, could transparently be based on 8-, 16- or (/possibly/) 32-bit characters, but you've successfully convinced me that the effort is excessive.

It sounds as though a UTF-8 based TStringList or similar is the way for backend storage, with something wider where stuff is being manipulated.

Looking at the reference manual, I note the way that strings are declared with a static codepage. Since I want to be able to look at the input file and either deduce the encoding (including byte ordering) or refer to a commandline parameter, choosing the codepage at run time is very much my responsibility.
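Deducing the encoding from the file itself usually starts with the byte order mark, if there is one. The sketch below checks the first bytes of a file against the standard Unicode BOMs; the type and function names (TEncodingGuess, ByteAt, GuessFromBom) are invented for the example. A file without a BOM, or one in EBCDIC or an APL codepage, would fall through to egUnknown and need the commandline parameter.

```pascal
program BomSketch;
{$mode objfpc}{$H+}

type
  TEncodingGuess = (egUnknown, egUtf8, egUtf16LE, egUtf16BE, egUtf32LE, egUtf32BE);

{ Return the byte at 1-based position I, or -1 when past the end. }
function ByteAt(const S: AnsiString; I: Integer): Integer;
begin
  if I <= Length(S) then
    Result := Ord(S[I])
  else
    Result := -1;
end;

function GuessFromBom(const Head: AnsiString): TEncodingGuess;
begin
  Result := egUnknown;
  { The 4-byte marks are tested first: FF FE 00 00 also begins with the UTF-16 LE mark. }
  if (ByteAt(Head, 1) = $FF) and (ByteAt(Head, 2) = $FE) and
     (ByteAt(Head, 3) = 0) and (ByteAt(Head, 4) = 0) then
    Result := egUtf32LE
  else if (ByteAt(Head, 1) = 0) and (ByteAt(Head, 2) = 0) and
          (ByteAt(Head, 3) = $FE) and (ByteAt(Head, 4) = $FF) then
    Result := egUtf32BE
  else if (ByteAt(Head, 1) = $EF) and (ByteAt(Head, 2) = $BB) and
          (ByteAt(Head, 3) = $BF) then
    Result := egUtf8
  else if (ByteAt(Head, 1) = $FF) and (ByteAt(Head, 2) = $FE) then
    Result := egUtf16LE
  else if (ByteAt(Head, 1) = $FE) and (ByteAt(Head, 2) = $FF) then
    Result := egUtf16BE;
end;

var
  Head: AnsiString;
begin
  { Demo input: UTF-16 BE mark followed by the letter 'A' (00 41) }
  SetLength(Head, 4);
  Head[1] := Chr($FE);
  Head[2] := Chr($FF);
  Head[3] := Chr(0);
  Head[4] := 'A';
  WriteLn(Ord(GuessFromBom(Head)));  { ordinal of egUtf16BE }
end.
```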

I've got my previous Meta-2 implementation as a bit of a cautionary tale though: it's coded in a way to accommodate Turbo Pascal, TopSpeed Pascal, and originally MT+86... after 35ish years of maintenance it's unarguably "difficult" and that particularly applies to the bits that implement extensible syntax.

MarkMLl

MarkMLl

  • Hero Member
  • *****
  • Posts: 8045
Re: Custom string type
« Reply #7 on: October 12, 2024, 11:01:00 pm »
But I don't know why you need a fixed-size character model for this. I mean, all you need is a backtracking DFA for your lexer, which can easily be implemented for UTF-8 strings with Unicode codepoints for the character sets:
...
And I have built a UTF-8 based lexer for GOLD engines already in 4 different languages. So it's really not difficult

Broadly agreed and respected, but Tree Meta is a very specific implementation using (in the language of its day) a virtual machine. The code going into the VM survives, no VM implementation survives: fixing that would probably take one to two weeks of my remaining life.

Since I've been using its predecessor Meta-2 heavily for many years, I'd like to code a reference implementation, and I'd like that to be able to accommodate the various APL (etc.) operators if required by a language. But at this point there's a choice: make it look too unfamiliar and it will alienate people, or if I can make Pascal look friendly it might even persuade one or two people that the language still has something going for it.

My experience with Meta-2 convinces me that I can define an extensible language. Tree Meta also allows a modicum of optimisation since it propagates concepts like "is this a constant?" down to the code generator. Plus I've also got a few useful twists relating to pragmata processing etc. that I think I could usefully get into the record...

MarkMLl

 
