Recent

Author Topic: Firebird character fields with UTF8 connection charset  (Read 1129 times)

emilt

  • New Member
  • *
  • Posts: 20
Firebird character fields with UTF8 connection charset
« on: July 10, 2020, 12:19:25 pm »
I think the problem starts with Firebird itself - when the connection charset is UTF8, it always allocates the maximum possibly needed memory for a field, which is 4 bytes per character, and as a result, the buffer for a CHAR(3) field is 12 bytes, and its length is given as 12. No information is provided about the length of the buffer as number of characters, this is only available in the field metadata. (That description may not be entirely correct, if anyone can provide a better explanation - would be nice)

As a result of this many (or at least some) clients will return such CHAR(3) field as up to 12 UTF characters (depending on how many of them as ASCII), padded right with spaces. The PHP interbase driver does so, and FPC's IBConnection does so, and IBX for Lazarus too.

There is however a difference - when you create the TField object for such a firebird field, TIBConnection will create it with Size and DisplayWidth of 12. IBX for lazarus, on the other hand, will create the field with the correct Size and DisplayWidth of 3. And then, both of them will actually return the data in the AsString property as up to 12, padded with spaces to the length of 12 bytes.

There is a kind of workaround for this with TIBConnection - you can set the TField.Size to the correct value, and it will work - as long as your content is only ASCII. I couldn't find similar one for IBX.

Now the questions are - would it be possible to fix this on the level of data access components, and how. If not, why exactly? Do you know how this works on current Delphi versions? What about other db components supporting Firebird?

mangakissa

  • Hero Member
  • *****
  • Posts: 1086
Re: Firebird character fields with UTF8 connection charset
« Reply #1 on: July 10, 2020, 03:31:14 pm »
The SQLdb framework has the same problem. I've made a workaround when firebird TField requires UTF-8 to devide the fieldsize with 4.

The real problem relies on writing back to a table. If your fieldsize on Lazarus is 12 and your charsize is 3, an error occurred. The modifications can't be saved
Lazarus 2.06 (64b) / FPC 3.0.4 / Windows 10
stucked on Delphi 10.3.1

EgonHugeist

  • Jr. Member
  • **
  • Posts: 71
Re: Firebird character fields with UTF8 connection charset
« Reply #2 on: July 11, 2020, 08:34:59 am »
The SQLdb framework has the same problem. I've made a workaround when firebird TField requires UTF-8 to devide the fieldsize with 4.

The real problem relies on writing back to a table. If your fieldsize on Lazarus is 12 and your charsize is 3, an error occurred. The modifications can't be saved

Because of this (and missing fieldtypes/performance penalties of the tField-design) Zeos 7.3 https://sourceforge.net/p/zeoslib/code-0/HEAD/tree/branches/testing-7.3/ did introduce it's own T(?)Field descendants. In case of TStringField Size is the amound of characters, datasize is multiplied by bytes per char. The new SQLDB hase same logic as Zeos 7.2 inbetween but only for UTF8, all other multibyte charsets are treaded a single byte charset which is wrong.

LacaK

  • Hero Member
  • *****
  • Posts: 600
Re: Firebird character fields with UTF8 connection charset
« Reply #3 on: July 14, 2020, 07:42:32 am »
As of FPC 3.2 this should be handled by TIBConnection and TStringField.
So with FPC 3.2 and UTF-8 connection charset you should get:
- TStringField.Size = 3 (chars)
- TStringField.DataSize = 12 (bytes)

emilt

  • New Member
  • *
  • Posts: 20
Re: Firebird character fields with UTF8 connection charset
« Reply #4 on: July 14, 2020, 11:14:15 am »
Yes! It seems this really has been fixed in FPC 3.2. I will probably re-create my datasets and fields just to be sure that everything is OK now.

Now, how could I have known that? I couldn't find anything about the changes in FCL in the FPC 3.2 release notes. Are these documented somewhere separately?

And, by the way, do you happen to know who is the author/maintainer of IBConnection? There is no copyright header in the source file...

tonyw

  • Full Member
  • ***
  • Posts: 173
    • MWA Software
Re: Firebird character fields with UTF8 connection charset
« Reply #5 on: July 16, 2020, 01:19:37 pm »
There is however a difference - when you create the TField object for such a firebird field, TIBConnection will create it with Size and DisplayWidth of 12. IBX for lazarus, on the other hand, will create the field with the correct Size and DisplayWidth of 3. And then, both of them will actually return the data in the AsString property as up to 12, padded with spaces to the length of 12 bytes.
You have highlighted an interesting problem with Firebird CHAR types versus VARCHAR types.

When the Firebird column type is VARCHAR, the server always returns the byte length of the characters string. In IBX this is used to return an AnsiString with a byte length set to the byte length returned by the server. If the string codepage is UTF8 then you can get the actual number of characters using the FPC UTF8Length() function.

When the Firebird column type is CHAR, the server returns a fixed size buffer (set to the character size * max no. of bytes per character). This contains the encoded characters right padded with spaces. At present, IBX does always return an AnsiString with a byte length set to the buffer size. As you observe, this may give an odd result as the character length can be longer than expected with UTF8 - although the displayed text looks OK given that the right padding is white space.

Arguably this is a bug. IBX always translates strings read from a server to the UTF8 code page so that they are compatible with the LCL - if you want the original code page then you need to use the underlying Firebird Pascal Interface. Therefore at the IBX  level it is easy enough to truncate the (UTF8) string to the expected character length and I'll make that change for the next release.

For the record, with IBX and string columns, the TField properties are set as follows:

Size: the maximum character width of the string.
DataSize: the maximum byte length of the string.
DisplayWidth: defaults to "Size" unless overridden at design time or at run time by assigning to the DisplayWidth property

« Last Edit: July 16, 2020, 01:21:58 pm by tonyw »

emilt

  • New Member
  • *
  • Posts: 20
Re: Firebird character fields with UTF8 connection charset
« Reply #6 on: July 17, 2020, 07:36:18 am »
Yes! It seems this really has been fixed in FPC 3.2. I will probably re-create my datasets and fields just to be sure that everything is OK now.
Well, actually I had to recreate the fields, otherwise strange things started happening - AVs and all that. So it would really have been nice if there were some documentation about the [breaking] changes to the FCL packages with the new FPC version. A quick look showed there were really quite some changes to both IBConnection.pp and Fields.inc, didn't look at others.

emilt

  • New Member
  • *
  • Posts: 20
Re: Firebird character fields with UTF8 connection charset
« Reply #7 on: July 17, 2020, 07:41:21 am »
... and I'll make that change for the next release.
Any idea when that next release will be? Or any chance to to preview and test the changes before that?

LacaK

  • Hero Member
  • *****
  • Posts: 600
Re: Firebird character fields with UTF8 connection charset
« Reply #8 on: July 17, 2020, 01:47:02 pm »
Now, how could I have known that? I couldn't find anything about the changes in FCL in the FPC 3.2 release notes. Are these documented somewhere separately?

https://wiki.freepascal.org/User_Changes_3.2.0

tonyw

  • Full Member
  • ***
  • Posts: 173
    • MWA Software
Re: Firebird character fields with UTF8 connection charset
« Reply #9 on: July 18, 2020, 12:36:58 pm »
... and I'll make that change for the next release.
Any idea when that next release will be? Or any chance to to preview and test the changes before that?
No formal release planned at present. However, you can always access up-to-date fixes at

https://www.mwasoftware.co.uk/svn-repository

You should review the change logs (linked from the above page). I have added the UTF8 truncation to the current fixes. There is also code for separating SQL Filters from client side filters. However, that is flagged as not yet tested - so don't rely on it at present.

emilt

  • New Member
  • *
  • Posts: 20
Re: Firebird character fields with UTF8 connection charset
« Reply #10 on: July 20, 2020, 03:57:41 pm »
I had a look at the updates, especially FBIntf, but I fail to understand the approach.

The way I see it, if I have a field defined in the database as CHAR(8), I should get always a string of 8 characters. If the connection charset is UTF8, then UTF8Length(AsString) should be 8.

The released version used to trim these char values, so value of '123' had length 3. Now with the changes, value of  '123'  in the field produces a string with Length/UTF8Length of 32 - padded with spaces.

Am I missing something?

tonyw

  • Full Member
  • ***
  • Posts: 173
    • MWA Software
Re: Firebird character fields with UTF8 connection charset
« Reply #11 on: July 21, 2020, 10:08:11 am »
I had a look at the updates, especially FBIntf, but I fail to understand the approach.

The way I see it, if I have a field defined in the database as CHAR(8), I should get always a string of 8 characters. If the connection charset is UTF8, then UTF8Length(AsString) should be 8.

The released version used to trim these char values, so value of '123' had length 3. Now with the changes, value of  '123'  in the field produces a string with Length/UTF8Length of 32 - padded with spaces.

Am I missing something?
My mistake. I had it in my head that Firebird returned the character width rather than the byte length in the column metadata. Testing then gave the result I expected rather what I should have expected.

The svn trunk is now updated so that the string should now be correctly truncated to the UTF8 character width. The regression tests have also been updated to ensure that this stays so. Please check and let me know if there are any more issues.

emilt

  • New Member
  • *
  • Posts: 20
Re: Firebird character fields with UTF8 connection charset
« Reply #12 on: July 24, 2020, 04:44:45 pm »
The UTF8 handling now seems correct everywhere I could check it. Thanks!

I now have another question - is there a specific reason why IBQuery parameters are not created when the SQL text is changed at runtime? (IBQuery, lines 240-244)

tonyw

  • Full Member
  • ***
  • Posts: 173
    • MWA Software
Re: Firebird character fields with UTF8 connection charset
« Reply #13 on: July 25, 2020, 04:41:30 pm »
I now have another question - is there a specific reason why IBQuery parameters are not created when the SQL text is changed at runtime? (IBQuery, lines 240-244)
TIBQuery was modified in January 2016 (looking back over the svn revision logs) to change the run time behaviour so that if the SQL was updated, the params were re-created after the statement was prepared rather than as soon as the statement was updated. If you change the SQL at run time, you thus have to call TIBQuery.Prepare before setting any parameter values. I do not have a note of why this change was made, however, the changelog does record:

IBQuery: Bug fixed that prevented SQL being set programmatically once the Parser had been invoked.

and this is probably the reason why the change was made. Do you have a specific problem in mind as a result of the change?

Thaddy

  • Hero Member
  • *****
  • Posts: 10282
Re: Firebird character fields with UTF8 connection charset
« Reply #14 on: July 25, 2020, 05:38:08 pm »
Complete ignorance. In a database (storage) the UTF8 really needs to reserve storage space for four bytes, just like UTF16 or UTF32. The latter is actually the best, because unlike ASCII(1 byte except CP_UTF8) it does not bother....and reserves one byte or USC2 which takes always two bytes....
I am really bothered that people do not get it? Annoying. >:D >:D >:D - yes, the grumpy mode.
« Last Edit: July 25, 2020, 05:44:45 pm by Thaddy »
I am more like donkey than shrek

 

TinyPortal © 2005-2018