Author Topic: How to load read large CSV with character encoding?  (Read 1049 times)

Xenno

« on: December 04, 2025, 01:05:29 pm »
While rewriting my CSV editor from Delphi to Lazarus, I ran into two issues:
  • I cannot find an efficient way to load large CSV files in an arbitrary character encoding. Loading the whole file into memory (LoadFromFile, ReadAllText) takes a long, uninterruptible time and uses a lot of memory, even though it runs in a separate thread. I would like to read line by line so that I can show loading progress and let the user cancel at any time. In Delphi I use TStreamReader, which handles character encoding and reads line by line. Lazarus's TStreamReader can read line by line but has no character encoding option. I tried LConvEncoding, but it is oriented toward code pages rather than the simple TEncoding.Unicode, TEncoding.ASCII, etc. By the way, the app writes the CSV content to a database, so it has no memory issue when working with massive files.
  • I was using TCSVParser (with a TFileStream), but it is noticeably slow when I challenge my app with a 1-million-row CSV (real-world files could be larger). I ended up using TFileReader to read each line and a TStringList to parse it. That is surprisingly faster, even with validation (double quotes, trimming, column count, etc.). In any case, TCSVParser does not handle character encoding either.

Is there a better alternative? Especially for issue #1.

TIA,
Lazarus 4.0, Windows 10, https://www.youtube.com/@bsprograms

wp

« Reply #1 on: December 04, 2025, 02:40:14 pm »
Can you give more information for those who want to play with this? How many rows? (You mentioned 1 million at least) How many columns? Content of the fields: Strings, Integers, Floats, Dates? (I mean: are string-to-number/date conversions required?) What is the encoding of the string fields? And what do you want it to be converted to?

I do remember a text file reading discussion here some years ago, and the essence was that the winner was plain old Readln with increased text buffer size.
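To make the Readln idea concrete, here is a minimal sketch, not a tested benchmark. It assumes the file uses a single-byte or other ASCII-compatible code page (Windows-1252 is only an example value) and takes the file name from the first command-line argument. It relies on SetTextBuf to enlarge the text buffer and on SetTextCodePage so that Readln converts each line to the declared code page of the target string. It will not work for UTF-16 files, and a quoted CSV field that contains a line break will be split across two reads.

```pascal
program ReadCsvLines;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix
  SysUtils;

const
  SourceCodePage = 1252;      // assumption: the file is Windows-1252
  TextBufSize    = 1 shl 20;  // 1 MiB, an arbitrary starting point to tune

var
  F: Text;
  Buf: array of Byte;         // must stay alive while F is open
  Line: UTF8String;
  Count: Int64;
begin
  Count := 0;
  SetLength(Buf, TextBufSize);
  Assign(F, ParamStr(1));
  SetTextBuf(F, Buf[0], TextBufSize);   // replace the small default buffer
  Reset(F);
  SetTextCodePage(F, SourceCodePage);   // tell the RTL what the bytes mean
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);                  // converted to UTF-8 on the fly
      Inc(Count);
      if Count mod 100000 = 0 then
        WriteLn(Count, ' lines read');  // progress hook
      // Parse Line here. In a worker thread, check Terminated and Break.
    end;
  finally
    Close(F);
  end;
  WriteLn('Total: ', Count, ' lines');
end.
```

Because only one line is held at a time, memory use stays flat regardless of file size, and the loop can be cancelled between any two lines.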

Zvoni

« Reply #2 on: December 04, 2025, 03:13:19 pm »
Quote from: wp
I do remember a text file reading discussion here some years ago, and the essence was that the winner was plain old Readln with increased text buffer size.

BlockRead?

Since BlockRead reads the data into a byte buffer "as is", the "conversion" would take place in the frontend.
Though I seem to remember that BlockRead needs a "fixed" record size.
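On the record size: for an untyped file it is set by the second argument of Reset, so Reset(F, 1) makes BlockRead count in bytes. A minimal sketch of chunked reading under that assumption follows. The 64 KiB chunk size is arbitrary, the file name comes from the first command-line argument, and counting byte 10 as a line end is only valid for ASCII-compatible encodings (not UTF-16).

```pascal
program BlockReadChunks;
{$mode objfpc}{$H+}

var
  F: file;                        // untyped file
  Buf: array[0..65535] of Byte;   // 64 KiB chunk, arbitrary size
  NumRead, I: Longint;
  TotalBytes, LineFeeds: Int64;
begin
  TotalBytes := 0;
  LineFeeds := 0;
  Assign(F, ParamStr(1));
  FileMode := 0;                  // open read-only
  Reset(F, 1);                    // record size = 1 byte
  try
    repeat
      // NumRead receives the number of bytes actually read (0 at end of file)
      BlockRead(F, Buf, SizeOf(Buf), NumRead);
      Inc(TotalBytes, NumRead);
      for I := 0 to NumRead - 1 do
        if Buf[I] = 10 then
          Inc(LineFeeds);
      // A progress update or cancel check would fit here, once per chunk.
    until NumRead = 0;
  finally
    Close(F);
  end;
  WriteLn(TotalBytes, ' bytes, ', LineFeeds, ' line feeds');
end.
```

The caller still has to stitch together lines that straddle two chunks and decode the bytes, which is the "conversion in the frontend" part.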
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Xenno

« Reply #3 on: December 04, 2025, 06:52:53 pm »
Thank you for your quick reply.

In some languages, we can do something like this:

Delphi:
Code: Pascal
Reader := TStreamReader.Create(vFileStream, vEncoding, True);

Java:
Code: Java
Reader in = new InputStreamReader(new FileInputStream("file"), vEncoding);

VB:
Code: Visual Basic
sr = New StreamReader(infile, vEncoding)

Here vEncoding is a variable holding an encoding name or constant. After that, the file content is read line by line, without reading the whole file first.

Maybe the implicit question is: how do I do that with Lazarus or Free Pascal?

TIA,

wp

« Reply #4 on: December 04, 2025, 08:05:51 pm »
Quote from: Xenno
In some languages, we can do something like this:

Delphi:
Code: Pascal
Reader := TStreamReader.Create(vFileStream, vEncoding, True);
Maybe the implicit question is: how do I do that with Lazarus or Free Pascal?

The TStreamReader in FPC/Fixes and FPC/Main has these constructors:
Code: Pascal
TStreamReader = class(TTextReader)
...
public
  constructor Create(AStream: TStream; ABufferSize: Integer; AOwnsStream: Boolean); virtual;
  constructor Create(AStream: TStream); virtual;
  constructor Create(const aFilename: string);
  constructor Create(const aFilename: string; aDetectBOM: Boolean);
  constructor Create(const aFilename: string; aEncoding: TEncoding; aDetectBOM: Boolean; aBufferSize: Integer); overload;
  ...

Nimbus

« Reply #5 on: December 04, 2025, 08:29:24 pm »
And if we look into the implementation...  ::)

Code: Pascal
constructor TStreamReader.Create(const aFilename: string; aEncoding: TEncoding; aDetectBOM: Boolean; aBufferSize: Integer);
var
  F : TFileStream;

begin
  // DetectBOM & encoding ignored for the moment.
  F:=TFileStream.Create(aFileName,fmOpenRead or fmShareDenyWrite);
  Create(F,aBufferSize,True);
end;

https://gitlab.com/freepascal.org/fpc/source/-/blob/main/packages/fcl-base/src/streamex.pp#L969
« Last Edit: December 04, 2025, 08:31:09 pm by Nimbus »
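Since the encoding argument is ignored for now, one possible workaround is to let TStreamReader deliver the raw bytes of each line and convert them afterwards. The sketch below is untested and makes several assumptions: the source code page (1252 here) is known in advance and is ASCII-compatible, RawToUtf8 is a hypothetical helper name, and the 64 KiB buffer size is arbitrary. It uses only the three-argument constructor listed above plus the RTL's SetCodePage.

```pascal
program StreamReaderLines;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix
  Classes, SysUtils, StreamEx;

const
  SourceCodePage = 1252;  // assumption: the file is Windows-1252

// Hypothetical helper: label the raw bytes with their real code page,
// then let the RTL convert them to UTF-8.
function RawToUtf8(const Raw: RawByteString; SrcCP: TSystemCodePage): UTF8String;
var
  Tmp: RawByteString;
begin
  Tmp := Raw;
  SetCodePage(Tmp, SrcCP, False);   // relabel only, bytes unchanged
  SetCodePage(Tmp, CP_UTF8, True);  // convert the bytes to UTF-8
  Result := Tmp;
end;

var
  Reader: TStreamReader;
  RawLine: string;
  Line: UTF8String;
  Count: Int64;
begin
  Count := 0;
  // AOwnsStream = True: the reader frees the file stream for us.
  Reader := TStreamReader.Create(
    TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite),
    64 * 1024, True);
  try
    while not Reader.Eof do
    begin
      Reader.ReadLine(RawLine);
      Line := RawToUtf8(RawLine, SourceCodePage);
      Inc(Count);
      // Parse Line here. In a worker thread, check Terminated and Break.
    end;
  finally
    Reader.Free;
  end;
  WriteLn('Total: ', Count, ' lines');
end.
```

This keeps the line-by-line behaviour and the cancel point, but it cannot handle UTF-16 input, because the reader splits lines on single bytes.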

Xenno

« Reply #6 on: December 05, 2025, 04:43:12 am »
That's very good news! In FPC 3.2.2, TStreamReader does not have that constructor. That addition will be great when ready. I use Stream Reader (combined with File Stream) a lot when working on text files with or without character encoding.

Thank you very much for the enlightenment.

Nicole

« Reply #7 on: December 05, 2025, 10:26:03 am »
I have spent days on this topic.
I have an old US CSV file and read it with TStringList in Lazarus 4.0.

The nastiest character is: -
It looks the same, but it gets changed somewhere along the way: by the IDE, by Unicode handling, by the operating system (Win 7), or elsewhere. You copy it, and it is a different character.

Is it a letter or is it a math operator? At the moment I am still experimenting and will find my solution sooner or later.

I just want to draw your attention to this odd issue of a sign that looks exactly the same to a human but is different for the computer.

Xenno

« Reply #8 on: December 05, 2025, 10:44:13 am »
Hello, Nicole. If you are on Windows 10 or later, can you give my app a try? I published it in the Microsoft Store with a trial version limited to 200 rows. If your CSV contains more than 200 rows, the app has a File Splitter.
https://apps.microsoft.com/store/detail/9NBFNRTPHSQQ?cid=DevShareMCLPCS

avk

« Reply #9 on: December 05, 2025, 11:21:42 am »
Quote from: Xenno
In FPC 3.2.2, TStreamReader does not have that constructor. That addition will be great when ready.

Maybe as a temporary solution:
Code: Pascal
...
type
  TMyStreamReader = class(TStreamReader)
    constructor Create(const aFileName: string; aCodePage: Integer); virtual; overload;
  end;

constructor TMyStreamReader.Create(const aFileName: string; aCodePage: Integer);
var
  s: string;
begin
  // Load the whole file through a string stream tagged with the source code page
  with TStringStream.Create('', aCodePage) do
    try
      LoadFromFile(aFileName);
      s := DataString;  // decoded content of the file
    finally
      Free;
    end;
  // Hand the decoded text to the regular reader, which owns the new stream
  inherited Create(TStringStream.Create(s), StreamEx.BUFFER_SIZE, True);
end;

Of course, the fact that the entire file is loaded into memory is a noticeable drawback.

Xenno

« Reply #10 on: December 05, 2025, 12:08:44 pm »
Thank you for the code. Yes, currently the app has to load the entire file when the user opens a file with an encoding other than the default. I use TStringList.

Code: Pascal
  ...
  FS := TStringList.Create;
  try
    FS.LoadFromFile(FFileName, FEncoding);
    row := 0;
    while not(Terminated) and (row < FS.Count) do begin
      LineStr := FS[row];
  ...

If there is no encoding, I use TFileReader.

Code: Pascal
  ...
  FR := TFileReader.Create(FFileName);
  try
    while not(Terminated or FR.Eof) do begin
      FR.ReadLine(LineStr);
  ...

jcmontherock

« Reply #11 on: December 05, 2025, 06:20:34 pm »
You can also try:  TStringGrid.LoadFromCSVFile
Windows 11 UTF8-64 - Lazarus 4.4-64 - FPC 3.2.2

Xenno

« Reply #12 on: December 06, 2025, 05:43:10 am »
Thank you. Yes, it is a nice option. A very handy method of TStringGrid. But I prefer to use TDrawGrid.

 
