Topic: UTF-8 and Laz 2.0.10/fpc 3.2.0 difference?

jc99

Re: UTF-8 and Laz 2.0.10/fpc 3.2.0 difference?
« Reply #45 on: March 27, 2021, 09:18:43 am »
TestForm:
Code: Pascal
unit frm_TestDBEncoding;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, ExtCtrls, StdCtrls;

type

  { TFrmTstDBEncodingMain }

  TFrmTstDBEncodingMain = class(TForm)
    memo1: TMemo;
    procedure FormShow(Sender: TObject);
  private

  public

  end;

var
  FrmTstDBEncodingMain: TFrmTstDBEncodingMain;

implementation

uses
  LazUTF8; // provides Utf8ToWinCP

{$R *.lfm}

{ TFrmTstDBEncodingMain }

procedure TFrmTstDBEncodingMain.FormShow(Sender: TObject);
var
  s: String;
begin
  s := 'Maaß';
  memo1.Append(s); // gives the correct word

  // Relabel the bytes as CP1252 without converting them.
  SetCodePage(RawByteString(s), 1252, False);
  memo1.Append(s); // gives the wrong word

  // Double-encoded literals, as they are stored in the DB;
  // Utf8ToWinCP removes the extra encoding layer.
  s := 'WÃ¤lde'; // -> Wälde
  memo1.Append(s + ' --> ' + Utf8ToWinCP(s));
  s := 'GÃ¼nther'; // -> Günther
  memo1.Append(s + ' --> ' + Utf8ToWinCP(s));
  s := 'Maaß'; // -> Maaß
  memo1.Append(s + ' --> ' + Utf8ToWinCP(s));
end;

end.
Result:
Maaß
Maaß
WÃ¤lde --> Wälde
GÃ¼nther --> Günther
Maaß --> Maaß

OK, so the data in the DB is double-UTF-8-encoded and can be repaired by decoding it a second time ...
... and with RawByteString, new data can be encoded the same way so that it stays consistent with the DB.
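
For reference, the round trip described here can be sketched as a small console program. This is only a sketch under assumptions: it assumes the surplus encoding layer went through CP1252, and it uses UTF8ToCP1252 and CP1252ToUTF8 from LConvEncoding (LazUtils package) instead of Utf8ToWinCP, so that the result does not depend on the ANSI code page of the machine. The names RepairDoubleUtf8 and MakeDoubleUtf8 are made up for this example.

```pascal
program DoubleUtf8RoundTrip;

{$mode objfpc}{$H+}

uses
  SysUtils,
  LConvEncoding; // LazUtils: table-based CP1252 <-> UTF-8 conversion

// Removes one surplus encoding layer. The stored text is UTF-8 whose bytes
// were once taken for CP1252 and encoded to UTF-8 a second time.
// (Hypothetical helper name; assumes the extra layer was CP1252.)
function RepairDoubleUtf8(const Stored: string): string;
begin
  Result := UTF8ToCP1252(Stored);
end;

// Reproduces the faulty encoding, so that newly written rows match what is
// already in the database. (Hypothetical helper name.)
function MakeDoubleUtf8(const Clean: string): string;
begin
  Result := CP1252ToUTF8(Clean);
end;

const
  // 'Maaß' as UTF-8 bytes, written with escapes so that the encoding of
  // this source file does not matter.
  Clean = 'Maa'#$C3#$9F;
var
  Stored: string;
begin
  Stored := MakeDoubleUtf8(Clean);
  WriteLn('clean length : ', Length(Clean));
  WriteLn('stored length: ', Length(Stored));
  WriteLn('round trip ok: ', RepairDoubleUtf8(Stored) = Clean);
end.
```

Strings that contain bytes without a CP1252 mapping would need extra care; the sketch does not handle that case.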

So what is the next step?
Is there an injection point somewhere to get the data displayed and written right [correctly-wrong] in DB components, e.g. a DBGrid?
[edit] Off topic: the edit button in the forum seems to open a broken entry editor; only the modify button works ...
« Last Edit: March 27, 2021, 09:35:58 am by jc99 »
OS: Win XP x64, Win 7, Win 7 x64, Win 10, Win 10 x64, Suse Linux 13.2
Laz: 1.4 - 1.8.4, 2.0
https://github.com/joecare99/public
'~|    /''
,_|oe \_,are
If you want to do something for the environment: Twitter: #reduceCO2 or
https://www.betterplace.me/klimawandel-stoppen-co-ueber-preis-reduzieren

dseligo

« Reply #46 on: March 27, 2021, 11:05:30 am »
Quote from: jc99 on March 27, 2021, 09:18:43 am
OK, so the data in the DB is double-UTF-8-encoded and can be repaired by decoding it a second time ...
... and with RawByteString, new data can be encoded the same way so that it stays consistent with the DB.

So what is the next step?
Is there an injection point somewhere to get the data displayed and written right [correctly-wrong] in DB components, e.g. a DBGrid?
[edit] Off topic: the edit button in the forum seems to open a broken entry editor; only the modify button works ...

I don't know about an injection point; hopefully someone else can answer that.

I would recommend that you correct the data in the database; after that you could use the database components without any decoding. You mentioned that you filled the database with programs compiled with FPC 3.0.4 and older versions; I don't know whether you still use them or whether you could replace them.

Another benefit would be that you could use HeidiSQL (and other tools).

jc99

« Reply #47 on: March 27, 2021, 11:43:42 am »
Quote from: dseligo on March 27, 2021, 11:05:30 am
[...] I would recommend that you correct the data in the database; after that you could use the database components without any decoding. You mentioned that you filled the database with programs compiled with FPC 3.0.4 and older versions; I don't know whether you still use them or whether you could replace them.

Replacing the programs is not the problem. The problem is that I would then have inconsistent data, because there are many programs doing different tasks. I would have to switch all of them, including the data, in a split second.
... And have I mentioned that the database is BIG?

I cannot shut down the programs for long because the gathered data is very short-lived; that is why I started the project in the first place. The end goal is a correctly encoded database, but I need a mid-term solution.
It's going to be a step-by-step solution.

Quote from: dseligo on March 27, 2021, 11:05:30 am
Another benefit would be that you could use HeidiSQL (and other tools).

I already use HeidiSQL 11.2; I just have to be careful with the encodings.

engkin

« Reply #48 on: March 27, 2021, 01:51:42 pm »
Can you produce a project that shows the problem? I just tried with a small project using SQLite but could not reproduce it. I used Laz 1.8.0 and Laz 2.0.10.

I suspect the problem is caused by the code page that was added to TStringField, but I am not sure. The change might be related to bug report #35796.

Assuming this is the right track, here is the log of fields.inc.

All string-type fields are affected.
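
One way to check this suspicion is to print the code page that each string field reports. The following is a minimal sketch, not something taken from the bug report: it assumes FPC 3.2.0 or newer (where TStringField is expected to have a CodePage property), and it uses a TBufDataset as a stand-in for the real query. DescribeStringFields is a made-up helper name.

```pascal
program ShowFieldCodePages;

{$mode objfpc}{$H+}

uses
  SysUtils, DB, BufDataset;

// Lists every string field of a dataset together with its CodePage value.
// Assumption: TStringField.CodePage exists, i.e. FPC 3.2.0 or newer.
function DescribeStringFields(DataSet: TDataSet): string;
var
  i: Integer;
  F: TField;
begin
  Result := '';
  for i := 0 to DataSet.FieldCount - 1 do
  begin
    F := DataSet.Fields[i];
    if F is TStringField then
      Result := Result
        + Format('%s: CodePage=%d', [F.FieldName, TStringField(F).CodePage])
        + LineEnding;
  end;
end;

var
  DS: TBufDataset;
begin
  // Stand-in dataset; in the real program this would be the opened query.
  DS := TBufDataset.Create(nil);
  try
    DS.FieldDefs.Add('Name', ftString, 50);
    DS.CreateDataset;
    Write(DescribeStringFields(DS));
  finally
    DS.Free;
  end;
end.
```

Running the same helper against the MariaDB query under both compiler versions should show whether the field code page is what differs.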

« Last Edit: March 27, 2021, 02:08:50 pm by engkin »

dseligo

« Reply #49 on: March 28, 2021, 11:22:48 pm »
Quote from: jc99 on March 27, 2021, 11:43:42 am
Replacing the programs is not the problem. The problem is that I would then have inconsistent data, because there are many programs doing different tasks. I would have to switch all of them, including the data, in a split second.
... And have I mentioned that the database is BIG?

I cannot shut down the programs for long because the gathered data is very short-lived; that is why I started the project in the first place. The end goal is a correctly encoded database, but I need a mid-term solution.
It's going to be a step-by-step solution.

I don't know how complex (not how big) your database is, or how complex your programs are.
Depending on whether (and for how long) your database can be offline, I suggest two approaches.

I.) If the database can be offline for some time, prepare, test and distribute new versions of the programs that work correctly with UTF-8. Then take the database offline (for these programs) and correct the encoding of the data in the database. After that, start using the new programs. You should ensure that the old programs can no longer access the database.

II.) If you can't take the database offline, things get much more complicated.
1.) Add a column named 'version' or similar to your tables and give it a default of '1'.
2.) Change your programs so that they can decode both "bad" UTF-8 and "good" UTF-8, depending on the 'version' value of the row being read ("bad" UTF-8 is version 1, "good" UTF-8 is version 2).
3.) Start rolling out the updated programs. They should still write "bad" UTF-8, but they must be able to read both versions. Once all programs are updated, every one of them can read both versions of table rows.
4.) Now you can start switching the programs to write version '2' to the database.
5.) Once all programs write "good" UTF-8 to the database (and can read both versions), start correcting the encoding of the strings in the database (and update the 'version' value to 2).
6.) When that is done, deploy a new version of the programs that no longer needs to check the row's 'version' value and only uses the correct encoding.
7.) Remove the 'version' columns from your tables.

While I think approach II. is doable, approach I. is definitely much easier to accomplish.
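
Steps 2.) to 4.) of approach II. could be sketched roughly like this. It is only an illustration under assumptions: the "bad" data is assumed to be UTF-8 that was run through a CP1252-to-UTF-8 conversion a second time, the conversion functions come from LConvEncoding (LazUtils), and DecodeForRead, EncodeForWrite and the two constants are made-up names.

```pascal
program VersionedEncoding;

{$mode objfpc}{$H+}

uses
  SysUtils,
  LConvEncoding; // LazUtils: UTF8ToCP1252 / CP1252ToUTF8

const
  EncVersionLegacy = 1; // row holds double-encoded ("bad") UTF-8
  EncVersionClean  = 2; // row holds plain ("good") UTF-8

// Step 2: pick the decoding according to the row's 'version' value.
function DecodeForRead(const Stored: string; RowVersion: Integer): string;
begin
  case RowVersion of
    EncVersionLegacy: Result := UTF8ToCP1252(Stored); // strip the extra layer
    EncVersionClean:  Result := Stored;               // already correct
  else
    raise Exception.CreateFmt('Unknown encoding version %d', [RowVersion]);
  end;
end;

// Steps 3 and 4: a program writes the legacy form until it is switched over,
// and stores WriteVersion in the row's 'version' column alongside the text.
function EncodeForWrite(const Clean: string; WriteVersion: Integer): string;
begin
  case WriteVersion of
    EncVersionLegacy: Result := CP1252ToUTF8(Clean); // reproduce the old fault
    EncVersionClean:  Result := Clean;
  else
    raise Exception.CreateFmt('Unknown encoding version %d', [WriteVersion]);
  end;
end;

const
  Clean = 'Maa'#$C3#$9F; // 'Maaß' as UTF-8 bytes
var
  Stored: string;
begin
  Stored := EncodeForWrite(Clean, EncVersionLegacy);
  WriteLn('legacy row reads back ok: ',
    DecodeForRead(Stored, EncVersionLegacy) = Clean);
  WriteLn('clean row reads back ok : ',
    DecodeForRead(Clean, EncVersionClean) = Clean);
end.
```

Keeping both directions behind one version switch means step 4.) is a configuration change per program rather than a code change.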

jc99

« Reply #50 on: March 29, 2021, 10:59:21 am »
Quote from: engkin on March 27, 2021, 01:51:42 pm
Can you produce a project that shows the problem? I just tried with a small project using SQLite but could not reproduce it. I used Laz 1.8.0 and Laz 2.0.10.

I suspect the problem is caused by the code page that was added to TStringField, but I am not sure. The change might be related to bug report #35796.

Assuming this is the right track, here is the log of fields.inc.

All string-type fields are affected.

I use MariaDB, the free fork of MySQL.

I suspected as much; a breaking change relating to TStringField was reported in the version changes.
... In my opinion the DB is irrelevant now. Just use the SQL creation script I provided for places and make a program display and write the data correctly (both ways).


jc99

« Reply #51 on: March 29, 2021, 11:02:05 am »
Quote from: dseligo on March 28, 2021, 11:22:48 pm
[...] II.) If you can't take database offline then things get much more complicated. [...]
While I think number II. is doable, number I. is definitely much easier to accomplish.

I have to go for II., but first I need a solution for handling the incorrect old data. Then the other steps will fall into place.


jc99

« Reply #52 on: March 29, 2021, 10:22:10 pm »
I found a solution (for me)
Code: SQL
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET NAMES utf8 */;
/*!50503 SET NAMES utf8mb4 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

CREATE DATABASE IF NOT EXISTS `test` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `test`;

CREATE TABLE IF NOT EXISTS `daten` (
  `idDaten` INT(11) NOT NULL AUTO_INCREMENT,
  `Description` VARCHAR(50) DEFAULT NULL,
  `Nummer` VARCHAR(50) DEFAULT NULL,
  `Name` VARCHAR(50) DEFAULT NULL,
  PRIMARY KEY (`idDaten`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8;

/*!40000 ALTER TABLE `daten` DISABLE KEYS */;
INSERT INTO `daten` (`idDaten`, `Description`, `Nummer`, `Name`) VALUES
    (1, 'Hand', '6', 'Hand Maaß'),
    (2, 'Fritz', '3', 'Fritz Ãœhlein'),
    (3, 'Peter', '9', 'Peter Öhringer'),
    (4, 'Otto', '2', 'Otto Ährenmal'),
    (5, 'Franz', '11', 'Franz Währentaz'),
    (6, 'Ludwig', '8', 'Ludwig Sören Schütze');
/*!40000 ALTER TABLE `daten` ENABLE KEYS */;

/*!40101 SET SQL_MODE=IFNULL(@OLD_SQL_MODE, '') */;
/*!40014 SET FOREIGN_KEY_CHECKS=IFNULL(@OLD_FOREIGN_KEY_CHECKS, 1) */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40111 SET SQL_NOTES=IFNULL(@OLD_SQL_NOTES, 1) */;
The test project uses a custom TStringField2.
In the attached project, the injection is done by redefining the DefaultFieldClasses of the DB unit:
Code: Pascal
initialization
  DefaultFieldClasses[ftString] := TStringField2;
  DefaultFieldClasses[ftFixedChar] := TStringField2;
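
The attached project is not reproduced in this thread, so the following is only a guess at what such a field class might look like, not the actual TStringField2. It assumes the double encoding went through CP1252, uses UTF8ToCP1252 and CP1252ToUTF8 from LConvEncoding (LazUtils), and covers AsString only; the other accessors of TStringField (for example the UTF-8 or Unicode ones) are left untouched. TDoubleUtf8StringField is a made-up name.

```pascal
unit DoubleUtf8Field;

{$mode objfpc}{$H+}

interface

uses
  DB,
  LConvEncoding; // LazUtils: UTF8ToCP1252 / CP1252ToUTF8

type
  // Hypothetical string field that hides the double encoding from the
  // application: decoded on read, re-encoded on write.
  TDoubleUtf8StringField = class(TStringField)
  protected
    function GetAsString: string; override;
    procedure SetAsString(const AValue: string); override;
  end;

implementation

function TDoubleUtf8StringField.GetAsString: string;
begin
  // The database delivers double-encoded UTF-8; strip the surplus layer.
  Result := UTF8ToCP1252(inherited GetAsString);
end;

procedure TDoubleUtf8StringField.SetAsString(const AValue: string);
begin
  // Write back in the same ("correctly wrong") form the old programs use.
  inherited SetAsString(CP1252ToUTF8(AValue));
end;

initialization
  // Same injection as in the post above, with the hypothetical class.
  DefaultFieldClasses[ftString] := TDoubleUtf8StringField;
  DefaultFieldClasses[ftFixedChar] := TDoubleUtf8StringField;

end.
```

Because the replacement happens in the unit's initialization section, the unit only has to be listed in a uses clause; datasets that create their fields afterwards would then pick up the new class.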