Lazarus

Programming => Databases => Topic started by: emilt on July 10, 2020, 12:19:25 pm

Title: Firebird character fields with UTF8 connection charset
Post by: emilt on July 10, 2020, 12:19:25 pm
I think the problem starts with Firebird itself - when the connection charset is UTF8, it always allocates the maximum possibly needed memory for a field, which is 4 bytes per character, and as a result, the buffer for a CHAR(3) field is 12 bytes, and its length is given as 12. No information is provided about the length of the buffer as number of characters, this is only available in the field metadata. (That description may not be entirely correct, if anyone can provide a better explanation - would be nice)

As a result of this many (or at least some) clients will return such CHAR(3) field as up to 12 UTF characters (depending on how many of them as ASCII), padded right with spaces. The PHP interbase driver does so, and FPC's IBConnection does so, and IBX for Lazarus too.

There is however a difference - when you create the TField object for such a firebird field, TIBConnection will create it with Size and DisplayWidth of 12. IBX for lazarus, on the other hand, will create the field with the correct Size and DisplayWidth of 3. And then, both of them will actually return the data in the AsString property as up to 12, padded with spaces to the length of 12 bytes.

There is a kind of workaround for this with TIBConnection - you can set the TField.Size to the correct value, and it will work - as long as your content is only ASCII. I couldn't find similar one for IBX.

Now the questions are - would it be possible to fix this on the level of data access components, and how. If not, why exactly? Do you know how this works on current Delphi versions? What about other db components supporting Firebird?
Title: Re: Firebird character fields with UTF8 connection charset
Post by: mangakissa on July 10, 2020, 03:31:14 pm
The SQLdb framework has the same problem. I've made a workaround when firebird TField requires UTF-8 to devide the fieldsize with 4.

The real problem relies on writing back to a table. If your fieldsize on Lazarus is 12 and your charsize is 3, an error occurred. The modifications can't be saved
Title: Re: Firebird character fields with UTF8 connection charset
Post by: EgonHugeist on July 11, 2020, 08:34:59 am
The SQLdb framework has the same problem. I've made a workaround when firebird TField requires UTF-8 to devide the fieldsize with 4.

The real problem relies on writing back to a table. If your fieldsize on Lazarus is 12 and your charsize is 3, an error occurred. The modifications can't be saved

Because of this (and missing fieldtypes/performance penalties of the tField-design) Zeos 7.3 https://sourceforge.net/p/zeoslib/code-0/HEAD/tree/branches/testing-7.3/ (https://sourceforge.net/p/zeoslib/code-0/HEAD/tree/branches/testing-7.3/) did introduce it's own T(?)Field descendants. In case of TStringField Size is the amound of characters, datasize is multiplied by bytes per char. The new SQLDB hase same logic as Zeos 7.2 inbetween but only for UTF8, all other multibyte charsets are treaded a single byte charset which is wrong.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: LacaK on July 14, 2020, 07:42:32 am
As of FPC 3.2 this should be handled by TIBConnection and TStringField.
So with FPC 3.2 and UTF-8 connection charset you should get:
- TStringField.Size = 3 (chars)
- TStringField.DataSize = 12 (bytes)
Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 14, 2020, 11:14:15 am
Yes! It seems this really has been fixed in FPC 3.2. I will probably re-create my datasets and fields just to be sure that everything is OK now.

Now, how could I have known that? I couldn't find anything about the changes in FCL in the FPC 3.2 release notes. Are these documented somewhere separately?

And, by the way, do you happen to know who is the author/maintainer of IBConnection? There is no copyright header in the source file...
Title: Re: Firebird character fields with UTF8 connection charset
Post by: tonyw on July 16, 2020, 01:19:37 pm
There is however a difference - when you create the TField object for such a firebird field, TIBConnection will create it with Size and DisplayWidth of 12. IBX for lazarus, on the other hand, will create the field with the correct Size and DisplayWidth of 3. And then, both of them will actually return the data in the AsString property as up to 12, padded with spaces to the length of 12 bytes.
You have highlighted an interesting problem with Firebird CHAR types versus VARCHAR types.

When the Firebird column type is VARCHAR, the server always returns the byte length of the characters string. In IBX this is used to return an AnsiString with a byte length set to the byte length returned by the server. If the string codepage is UTF8 then you can get the actual number of characters using the FPC UTF8Length() function.

When the Firebird column type is CHAR, the server returns a fixed size buffer (set to the character size * max no. of bytes per character). This contains the encoded characters right padded with spaces. At present, IBX does always return an AnsiString with a byte length set to the buffer size. As you observe, this may give an odd result as the character length can be longer than expected with UTF8 - although the displayed text looks OK given that the right padding is white space.

Arguably this is a bug. IBX always translates strings read from a server to the UTF8 code page so that they are compatible with the LCL - if you want the original code page then you need to use the underlying Firebird Pascal Interface. Therefore at the IBX  level it is easy enough to truncate the (UTF8) string to the expected character length and I'll make that change for the next release.

For the record, with IBX and string columns, the TField properties are set as follows:

Size: the maximum character width of the string.
DataSize: the maximum byte length of the string.
DisplayWidth: defaults to "Size" unless overridden at design time or at run time by assigning to the DisplayWidth property

Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 17, 2020, 07:36:18 am
Yes! It seems this really has been fixed in FPC 3.2. I will probably re-create my datasets and fields just to be sure that everything is OK now.
Well, actually I had to recreate the fields, otherwise strange things started happening - AVs and all that. So it would really have been nice if there were some documentation about the [breaking] changes to the FCL packages with the new FPC version. A quick look showed there were really quite some changes to both IBConnection.pp and Fields.inc, didn't look at others.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 17, 2020, 07:41:21 am
... and I'll make that change for the next release.
Any idea when that next release will be? Or any chance to to preview and test the changes before that?
Title: Re: Firebird character fields with UTF8 connection charset
Post by: LacaK on July 17, 2020, 01:47:02 pm
Now, how could I have known that? I couldn't find anything about the changes in FCL in the FPC 3.2 release notes. Are these documented somewhere separately?

https://wiki.freepascal.org/User_Changes_3.2.0
Title: Re: Firebird character fields with UTF8 connection charset
Post by: tonyw on July 18, 2020, 12:36:58 pm
... and I'll make that change for the next release.
Any idea when that next release will be? Or any chance to to preview and test the changes before that?
No formal release planned at present. However, you can always access up-to-date fixes at

https://www.mwasoftware.co.uk/svn-repository

You should review the change logs (linked from the above page). I have added the UTF8 truncation to the current fixes. There is also code for separating SQL Filters from client side filters. However, that is flagged as not yet tested - so don't rely on it at present.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 20, 2020, 03:57:41 pm
I had a look at the updates, especially FBIntf, but I fail to understand the approach.

The way I see it, if I have a field defined in the database as CHAR(8), I should get always a string of 8 characters. If the connection charset is UTF8, then UTF8Length(AsString) should be 8.

The released version used to trim these char values, so value of '123' had length 3. Now with the changes, value of  '123'  in the field produces a string with Length/UTF8Length of 32 - padded with spaces.

Am I missing something?
Title: Re: Firebird character fields with UTF8 connection charset
Post by: tonyw on July 21, 2020, 10:08:11 am
I had a look at the updates, especially FBIntf, but I fail to understand the approach.

The way I see it, if I have a field defined in the database as CHAR(8), I should get always a string of 8 characters. If the connection charset is UTF8, then UTF8Length(AsString) should be 8.

The released version used to trim these char values, so value of '123' had length 3. Now with the changes, value of  '123'  in the field produces a string with Length/UTF8Length of 32 - padded with spaces.

Am I missing something?
My mistake. I had it in my head that Firebird returned the character width rather than the byte length in the column metadata. Testing then gave the result I expected rather what I should have expected.

The svn trunk is now updated so that the string should now be correctly truncated to the UTF8 character width. The regression tests have also been updated to ensure that this stays so. Please check and let me know if there are any more issues.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 24, 2020, 04:44:45 pm
The UTF8 handling now seems correct everywhere I could check it. Thanks!

I now have another question - is there a specific reason why IBQuery parameters are not created when the SQL text is changed at runtime? (IBQuery, lines 240-244)
Title: Re: Firebird character fields with UTF8 connection charset
Post by: tonyw on July 25, 2020, 04:41:30 pm
I now have another question - is there a specific reason why IBQuery parameters are not created when the SQL text is changed at runtime? (IBQuery, lines 240-244)
TIBQuery was modified in January 2016 (looking back over the svn revision logs) to change the run time behaviour so that if the SQL was updated, the params were re-created after the statement was prepared rather than as soon as the statement was updated. If you change the SQL at run time, you thus have to call TIBQuery.Prepare before setting any parameter values. I do not have a note of why this change was made, however, the changelog does record:

IBQuery: Bug fixed that prevented SQL being set programmatically once the Parser had been invoked.

and this is probably the reason why the change was made. Do you have a specific problem in mind as a result of the change?
Title: Re: Firebird character fields with UTF8 connection charset
Post by: Thaddy on July 25, 2020, 05:38:08 pm
Complete ignorance. In a database (storage) the UTF8 really needs to reserve storage space for four bytes, just like UTF16 or UTF32. The latter is actually the best, because unlike ASCII(1 byte except CP_UTF8) it does not bother....and reserves one byte or USC2 which takes always two bytes....
I am really bothered that people do not get it? Annoying. >:D >:D >:D - yes, the grumpy mode.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: tonyw on July 25, 2020, 05:46:51 pm
Complete ignorance. In a database (storage) the UTF8 really needs to store four bytes.
Correct. That is what Firebird does. The issue under discussion is the byte length of a Pascal AnsiString.

The problem was that when a fixed length string is stored by Firebird, if the UTF8 encoded byte length is less than the allocated column storage then Firebird right pads it with space characters. If the database driver (e.g. IBX) does not truncate the string so that the number of UTF8 characters matches the character width of the column then the UTF8Length of the AnsiString will be longer than it should be. This is at best misleading. At worst, if the user tried to write back the same string (as returned from the database), they would get a string too long error.
Title: Re: Firebird character fields with UTF8 connection charset
Post by: Soner on July 25, 2020, 06:29:06 pm
I am using ZEOS-components and Ii definied characterfields as varchar utf-8. TFIeld reverses 4 byte for each charakter.

The problem with utf8 fields is you can never know how length your text can be.

I solved the problem with TDBEdit by setting TEDBedit.MaxLength to the right length manually.
Also I modified TDBGrid to set fieldlength:
Code: Pascal  [Select][+][-]
  1. // usage MyDBGrid.Columns[n].Fieldlength
  2. dbgrids.pas
  3.   TColumn = class(TGridColumn)
  4.   private
  5.     fFieldLength: integer; //soner added
  6.   //..
  7.   public
  8.     property FieldLength: integer read fFieldLength write fFieldLength default 0; //soner added for manipulate fieldlength, especially to solve problem with unicode problem fieldlenghtx4!
  9.   end;
  10.  
  11. procedure TCustomDBGrid.SelectEditor;
  12. //..
  13.     if (SelectedField is TStringField) then begin //soner add: begin
  14.       if SelectedColumn.FieldLength=0 then aMaxLen := SelectedField.Size //soner add: if SelectedColumn.FieldLength=0 then
  15.       else aMaxLen := SelectedColumn.FieldLength; //soner add: else aMaxLen := SelectedColumn.FieldLength;
  16.     end //soner add: end
  17.     else
  18.       aMaxLen := 0;  
  19.  
Title: Re: Firebird character fields with UTF8 connection charset
Post by: emilt on July 25, 2020, 07:13:36 pm
Do you have a specific problem in mind as a result of the change?
No.
I am creating IBQuery components programatically, and I expected that since CheckParams is True by default, the Params would be created when I set the SQL.Text. (TSQLQuery does so). When I found out that they are not, I actually commented out the "if" part I was referring to, so that FParams.ParseSQL is called always, not only when State is csDesigning. It seems to work at the moment, so I was wondering why it had to be otherwise. Of course, in my case I only set SQL.Text once, maybe there would be a problem if Params are not empty and have to be re-created.
But, to be on the safe side, I will probably reverse this change and start calling Prepare after setting SQL.Text as you suggest.
TinyPortal © 2005-2018