
Author Topic: String and PChar Relationship  (Read 1828 times)

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliability / DevOps Engineer at Domibay S.L.
String and PChar Relationship
« on: June 22, 2020, 04:34:41 pm »
Most of my applications do text parsing and processing,
so correct memory management of strings is the key factor in how fast or slow they run.

According to this Comment:
But once you change the string it will be made unique. This is the Copy-On-Write mechanism of managed strings.

This reminded me of the "immutable string" mechanism of Python, where the string is copied on every change.

To check this I wrote a little test procedure:
https://godbolt.org/z/BHfBvM
It produces the output:

s1 0: 'Test1234 batz 1.1'
s2 0: 'Test1234 batz 1.1'
s3 0: 'Test1234 batz 1.1'
s1 1: 'f0o 1234 batz 1.1'
s2 1: 'f0o 1234 batz 1.1'
s3 1: 'Test1234 batz 1.1'
s1 2: '20o 1234 batz 1.1'
s2 2: '30o 1234 batz 1.1'
s3 2: 'Test1234 batz 1.1'


I see that although s1, s2 and s3 initially contain the same text,
only s1 and s2 actually reference the same memory buffer.

But then, on writing to position s1[1], the fpc_ansistr_unique function is invoked.

So I wonder: why does the whole string need to be copied when writing a single byte?

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: String and PChar Relationship
« Reply #1 on: June 22, 2020, 04:56:45 pm »
Quote
So I wonder: why does the whole string need to be copied when writing a single byte?
Because of the copy-on-write mechanism. If two different strings refer to the same memory buffer and one of them is modified, the other must remain unchanged, so the data must be copied.
And it is a big mistake to change strings through a PChar, precisely because that bypasses the copy-on-write mechanism. Just remove the line s1 := s1 + 'batz 1.1'; from your program and the error will show up immediately as a SIGSEGV.
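
A minimal sketch of the copy-on-write behaviour described above, assuming FPC with {$H+} (AnsiString). The program name and variable names are made up for illustration and this is not the code behind the godbolt link. It compares the raw buffer pointers before and after an indexed write:

```pascal
program CowDemo;
{$mode objfpc}{$H+}

var
  s1, s2: AnsiString;
begin
  s1 := 'Test1234';
  s1 := s1 + ' batz';   // run-time concatenation, result lives on the heap
  s2 := s1;             // no copy yet: both variables share one buffer

  // Pointer(s) exposes the address of the character buffer
  WriteLn('shared before write: ', Pointer(s1) = Pointer(s2));

  s1[1] := 'X';         // the indexed write makes s1 unique first

  WriteLn('shared after write:  ', Pointer(s1) = Pointer(s2));
  WriteLn('s1 = ', s1);
  WriteLn('s2 = ', s2);
end.
```

The first comparison should print TRUE and the second FALSE, with s2 still holding the original text. Writing through a PChar instead of s1[1] would skip the uniqueness step, so both variables would see the change.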

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: String and PChar Relationship
« Reply #2 on: June 22, 2020, 06:18:23 pm »
Quote
So I wonder: why does the whole string need to be copied when writing a single byte?

The copy-on-write mechanism exists for passing by-value parameters through chains of methods. Several methods might only pass the string on; they don't need a deep copy, only a reference count increase.

If some method starts modifying the string, though, and no copy were made, then once that method finished the calling methods would continue with the modified string instead of the original, despite passing by value.

So the whole idea is lazy by-value semantics.
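
A small sketch of these lazy by-value semantics, assuming FPC with {$H+}. The procedure names PassOn and Modify are hypothetical and only serve the illustration:

```pascal
program ByValueDemo;
{$mode objfpc}{$H+}

procedure Modify(s: AnsiString);
begin
  s[1] := 'X';                      // the copy happens here, for this local s only
  WriteLn('inside Modify: ', s);
end;

procedure PassOn(s: AnsiString);
begin
  Modify(s);                        // just passes it on: reference count only
  WriteLn('inside PassOn: ', s);    // still sees the original text
end;

var
  original: AnsiString;
begin
  original := 'Test';
  original := original + '1234';    // build the string at run time
  PassOn(original);
  WriteLn('caller:        ', original);
end.
```

Only Modify should print the changed text (Xest1234); PassOn and the caller keep seeing Test1234, even though no deep copy was made when the string was passed down.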

As far as PChar goes: since (ansi/unicode)strings are null-terminated, there are some easy conversions to and from PChar for interfacing with external APIs. Note, though, that PChars can't contain NULs as part of the string data, and strings CAN!
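
To illustrate the NUL caveat, here is a short sketch assuming FPC with {$H+}; the program name is made up. It compares the managed string length with what a NUL-terminated view of the same buffer reports:

```pascal
program NulDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;   // StrLen

var
  s: AnsiString;
  p: PChar;
begin
  s := 'abc' + #0 + 'def';               // a NUL in the middle of the data
  p := PChar(s);
  WriteLn('Length(s) = ', Length(s));    // counts all 7 bytes
  WriteLn('StrLen(p) = ', StrLen(p));    // stops at the first #0, so 3
end.
```

Any external API that treats p as a C string would see only the part before the embedded #0.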

It also means that a PChar pointing into a string is only safe as long as the string remains in scope.


domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliability / DevOps Engineer at Domibay S.L.
Re: String and PChar Relationship
« Reply #3 on: July 08, 2020, 04:33:16 pm »
Quote
The copy-on-write mechanism exists for passing by-value parameters through chains of methods. Several methods might only pass the string on; they don't need a deep copy, only a reference count increase.

If some method starts modifying the string, though, and no copy were made, then once that method finished the calling methods would continue with the modified string instead of the original, despite passing by value.

So the whole idea is lazy by-value semantics.

I see your point about by-value parameters.

It actually comes in very handy when returning a string from a function, which makes string properties in classes painless.

Quote
It also means that a PChar pointing into a string is only safe as long as the string remains in scope.

According to this explanation:
Pascal strings always end with a NUL to simplify conversion to PChar.

Please note that this applies to AnsiString, WideString and UnicodeString, but not to ShortString.

It seems most convenient to create a PChar with the typecast PChar() rather than with the address of the first string byte,
as demonstrated at:
https://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints
Code: Pascal
  CurP := PChar(S);        // if S='' then PChar(S) returns a pointer to #0
  EndP := CurP + length(S);
  while CurP < EndP do
  begin
    // ...
    inc(CurP);
  end;

As the comment states, this does not raise an exception on empty strings,
unlike what is known from working with streams,
and unlike:
Code: Pascal
  CurP := @S[1];        // if S='' then this results in an EAccessViolation exception because this index does not exist
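
A quick way to check the empty-string case of the PChar() typecast is sketched below, assuming FPC with {$H+}; the program name is made up. It only inspects the pointers and never indexes into the empty string:

```pascal
program EmptyCastDemo;
{$mode objfpc}{$H+}

var
  s: AnsiString;
  p: PChar;
begin
  s := '';
  p := PChar(s);    // the typecast substitutes a pointer to a #0 byte

  WriteLn('Pointer(s) = nil: ', Pointer(s) = nil);  // an empty string is a nil pointer
  WriteLn('p = nil:          ', p = nil);           // the typecast result is not
  WriteLn('p^ = #0:          ', p^ = #0);           // and it can be dereferenced
end.
```

So a loop that starts from PChar(S) needs no special case for S = '', whereas code that starts from @S[1] has to test for the empty string first.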

Of course, working with PChar always carries a risk, as stated very diplomatically in this C documentation:
https://www.man7.org/linux/man-pages/man3/memchr.3.html
Quote
If an instance of c is not found, the results are unpredictable.
So when a PChar runs past the NUL byte without checking for it, it just walks on and on through the process address space.
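
One way to limit that risk is to bound the walk by Length(S) rather than by the terminator, following the wiki pattern shown earlier. A minimal sketch, assuming FPC with {$H+}; the function CountSpaces is a made-up example:

```pascal
program BoundedWalk;
{$mode objfpc}{$H+}

// Counts the space characters in S with a pointer walk that can
// never leave the string buffer, because it is bounded by Length(S).
function CountSpaces(const S: AnsiString): Integer;
var
  CurP, EndP: PChar;
begin
  Result := 0;
  CurP := PChar(S);            // valid even for S = ''
  EndP := CurP + Length(S);    // one past the last byte of the data
  while CurP < EndP do
  begin
    if CurP^ = ' ' then
      Inc(Result);
    Inc(CurP);
  end;
end;

begin
  WriteLn(CountSpaces('f0o 1234 batz 1.1'));  // 3
  WriteLn(CountSpaces(''));                   // 0
end.
```

Because the end pointer comes from the managed length, the loop also handles strings that contain #0 as data and does not depend on finding a terminator at all.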

 
