
Author Topic: How to determine the unknown codepage of a textfile?  (Read 3427 times)

Hartmut

How to determine the unknown codepage of a textfile?
« on: February 08, 2026, 09:46:46 am »
I want to display the content of a text file in a TMemo, which requires UTF8. But I don't know the codepage of the text file, so I may have to convert it to UTF8 first. Only these 3 codepages are possible in the text file:
 - UTF8
 - cp1252
 - cp850
or plain ASCII.

How can I determine which of these codepages a given file uses?

What I found:
If LazUTF8.FindInvalidUTF8Codepoint() returns >= 0 then it is not UTF8.
If LazUTF8.FindInvalidUTF8Codepoint() returns < 0 then it can be UTF8 or plain ASCII, but I assume that a few byte combinations from cp1252 or cp850 text could pass as valid UTF8 too.
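The two observations above can be sketched as a three-way classification: only 7-bit bytes means plain ASCII, high bytes that all form well-formed UTF8 sequences means probably UTF8, anything else means one of the 8-bit codepages. This is a minimal sketch in plain FPC; `ClassifyBytes` and `TTextKind` are hypothetical names and not part of LazUTF8. As noted above, a cp1252 or cp850 file could by coincidence contain only valid UTF8 sequences, so `tkUtf8` is a strong hint, not proof.

```pascal
program ClassifyUtf8;
{$mode objfpc}{$H+}

type
  TTextKind = (tkAscii, tkUtf8, tkOther);

// Hypothetical helper, not part of LazUTF8: classifies a raw byte buffer.
function ClassifyBytes(const S: RawByteString): TTextKind;
var
  i, k, n, need: Integer;
  b: Byte;
  cp: LongWord;
  sawHigh: Boolean;
begin
  sawHigh := False;
  n := Length(S);
  i := 1;
  while i <= n do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      Inc(i);                       // 7-bit byte, same meaning in all candidates
      Continue;
    end;
    sawHigh := True;
    // The lead byte decides how many continuation bytes must follow.
    if (b and $E0) = $C0 then begin need := 1; cp := b and $1F; end
    else if (b and $F0) = $E0 then begin need := 2; cp := b and $0F; end
    else if (b and $F8) = $F0 then begin need := 3; cp := b and $07; end
    else Exit(tkOther);             // stray continuation byte or invalid lead
    if i + need > n then Exit(tkOther);   // sequence cut off at end of buffer
    for k := 1 to need do
    begin
      b := Ord(S[i + k]);
      if (b and $C0) <> $80 then Exit(tkOther);
      cp := (cp shl 6) or (b and $3F);
    end;
    // Reject overlong forms, UTF-16 surrogates and values above U+10FFFF.
    if ((need = 1) and (cp < $80)) or ((need = 2) and (cp < $800)) or
       ((need = 3) and ((cp < $10000) or (cp > $10FFFF))) or
       ((cp >= $D800) and (cp <= $DFFF)) then
      Exit(tkOther);
    Inc(i, need + 1);
  end;
  if sawHigh then Result := tkUtf8 else Result := tkAscii;
end;

begin
  WriteLn(ClassifyBytes('plain text'));       // only 7-bit bytes
  WriteLn(ClassifyBytes('Gr'#$C3#$BC'n'));    // "Grün" as UTF8 bytes
  WriteLn(ClassifyBytes('Gr'#$FC'n'));        // "Grün" as cp1252 bytes
end.
```

Only when the result is `tkOther` does the cp1252 versus cp850 question remain to be answered.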

https://lazarus-ccr.sourceforge.io/docs/lazutils/lconvencoding/guessencoding.html seems to be able to detect some UTF8 variants, but if it cannot, it simply returns a default for the OS, which has nothing to do with the content of the text file.

Does FPC contain a function for that purpose?
Or has someone already created something for that purpose?
Or an idea how to solve this?
Thanks in advance.

Thaddy

« Reply #1 on: February 08, 2026, 09:54:33 am »
How would you do that?
The same text can be valid in multiple codepages.
Which one to pick?
That said, if a text has enough entropy, i.e. enough occurrences of characters typical for a given codepage, then yes, codepages can be recognized, but it is way too much trouble.

d4eva

« Reply #2 on: February 08, 2026, 11:40:23 am »
I'm using this - https://gitlab.freedesktop.org/uchardet/uchardet.
You will need to build so/dylib/dll, but it has worked fine for my needs :)

CM630

« Reply #3 on: February 08, 2026, 12:21:14 pm »
Maybe you can find the answer here:
https://wiki.freepascal.org/UTF8_Tools
AFAIR it tries to guess the CP, as it can never be sure what it is. I suppose you can force the encoding, if you know it.
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

Hartmut

« Reply #4 on: February 08, 2026, 12:33:00 pm »
Quote from: Thaddy
if a text has enough entropy, i.e. enough occurrences of characters typical for a given codepage, then yes, codepages can be recognized
As so often, your post is useless.
You only write "it's not simple, but depending on the data it can be done".
This is obvious and helps nobody.

(Nearly) every text editor has to solve this problem, so there must be solutions.



Quote from: d4eva
I'm using this - https://gitlab.freedesktop.org/uchardet/uchardet.
You will need to build so/dylib/dll, but it has worked fine for my needs :)
Thanks a lot, d4eva, for your post. I had a quick look at it and it sounds very interesting. But I'm not familiar with GitLab, so I'm a bit overwhelmed by how to "use" it in an FPC program. It seems to be originally C++ code, and if I understand you correctly, I must create a DLL from it (which I have never done).

First question: I need this "codepage detector" primarily for Linux (Ubuntu 24.04 with KDE Plasma desktop), but if it works on Windows too, that would be welcome.
Can I use "uchardet" on Linux too?
If yes, can you (or someone else) help me with some information on how to start?



Quote from: CM630
Maybe you can find the answer here:
https://wiki.freepascal.org/UTF8_Tools
AFAIR it tries to guess the CP, as it can never be sure what it is. I suppose you can force the encoding, if you know it.
Thanks to CM630 too. At first glance it looks more geared towards dealing with Unicode (?)

Code: Pascal
  f := TCharEncStream.Create;
  f.LoadFromFile(OpenDialog1.FileName);
  Memo1.Text := f.UTF8Text;
  f.Free;
This seems to load the file and convert it to UTF8, but how can I query the detected codepage of the text file, so that I can write it back (after changes) in the original codepage?
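One way to keep the write-back option open, independent of how TCharEncStream works internally, is to do the conversion yourself and remember the encoding name. The following is a minimal sketch, assuming the source encoding is already known from some separate detection step. It uses `ConvertEncoding` and the `EncodingCP850` / `EncodingUTF8` constants from the LConvEncoding unit (LazUtils). The helpers `ReadAllBytes` and `WriteAllBytes` and the file names `input.txt` / `output.txt` are hypothetical.

```pascal
program RoundTrip;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, LConvEncoding;   // LConvEncoding is part of LazUtils

// Hypothetical helper: reads the raw bytes of a file without any conversion.
function ReadAllBytes(const FileName: string): string;
var
  fs: TFileStream;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[1], fs.Size);
  finally
    fs.Free;
  end;
end;

// Hypothetical helper: writes the bytes of Data unchanged.
procedure WriteAllBytes(const FileName, Data: string);
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmCreate);
  try
    if Data <> '' then
      fs.WriteBuffer(Data[1], Length(Data));
  finally
    fs.Free;
  end;
end;

var
  SourceEnc, Raw, Utf8Text: string;
begin
  // Assumption: the detection step decided on cp850 for this file.
  SourceEnc := EncodingCP850;
  Raw := ReadAllBytes('input.txt');
  Utf8Text := ConvertEncoding(Raw, SourceEnc, EncodingUTF8);
  // ... assign Utf8Text to Memo1.Text, let the user edit, read it back ...
  WriteAllBytes('output.txt', ConvertEncoding(Utf8Text, EncodingUTF8, SourceEnc));
end.
```

Characters that the user types in the memo but that do not exist in the original codepage cannot survive the conversion back, so a real editor would probably want to check for that before saving.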

LeP

« Reply #5 on: February 08, 2026, 12:42:16 pm »
Determining the code page of unknown text with certainty is not possible.
For example, Portuguese (used in Portugal and Brazil) can be mapped to CP-1140, CP-1252, CP-850, CP-860, Unicode....
Not to mention other languages.

So, the best choice is to have everything in Unicode (like UTF-8) or know the original source.

The indicated tools can probably find the correct code page in most cases, but from experience I wouldn't trust them too much.

JuhaManninen

« Reply #6 on: February 08, 2026, 01:12:38 pm »
Quote from: LeP
So, the best choice is to have everything in Unicode (like UTF-8) or know the original source.
That is my feeling, too. Convert the files manually to get it right. It may be laborious but you only need to do it once.
Local codepages are history, or at least they should be history.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

CM630

« Reply #7 on: February 08, 2026, 03:26:56 pm »
Quote from: Hartmut
...
Code: Pascal
  f := TCharEncStream.Create;
  f.LoadFromFile(OpenDialog1.FileName);
  Memo1.Text := f.UTF8Text;
  f.Free;
This seems to load the file and convert it to UTF8, but how can I query the detected codepage of the text file, so that I can write it back (after changes) in the original codepage?
I tried it with a CP1252 snippet. It detected that the text is not UTF8, but applied CP1251 to it, which is the ANSI codepage according to my localisation.
I believe @Theo is the author of the component; maybe he is reachable.

But you wrote that your expected codepages are limited to UTF8, cp1252, cp850 and plain ASCII.
If the files are in one language only, maybe it will not be so difficult to autodetect whether they are cp1252 or cp850 and autoconvert them to UTF8.

Hartmut

« Reply #8 on: February 08, 2026, 03:38:33 pm »
Hello LeP and JuhaManninen, of course life would be easier if every text file were in Unicode or UTF8 - but that's not reality, and if the solution were as easy as converting a limited number of files once, I would not have started this topic...

If a text file is not in UTF8, e.g. because it's a log file that is written / changed often by a certain program, or because the text file needs to be read / imported by a certain program, I can't change its codepage. And this should not be the subject of this topic anyway.

Please, back to my question of how to determine the codepage of a text file. As said, fortunately I don't have to distinguish between various Portuguese codepages - only 3 codepages are possible:
 - UTF8
 - cp1252
 - cp850

As said, (nearly) every text editor has to solve this problem, so there must be solutions.



Meanwhile I had a 2nd look at https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/README.md where I saw that utf8 and cp1252 are supported, but cp850 does not appear in the long list of supported encodings.
Because experimenting with uchardet seems to take considerable effort:
 - is anyone sure that uchardet supports cp850?
 - can it be used on Linux (Ubuntu 24.04 with KDE Plasma desktop)?




Quote from: CM630
I tried it with a CP1252 snippet. It detected that the text is not UTF8, but applied CP1251 to it, which is the ANSI codepage according to my localisation.
I believe @Theo is the author of the component; maybe he is reachable.
Thank you for testing this. I understand that you got a certain codepage (CP1251) reported as the result. I will try to do some tests with existing files and, if necessary, try to contact @Theo.

LeP

« Reply #9 on: February 08, 2026, 03:48:28 pm »
Quote from: CM630
I tried it with a CP1252 snippet. It detected that the text is not UTF8, but applied CP1251 to it, which is the ANSI codepage according to my localisation.

That is the point ... your localization. But if you are in France and read a Polish text (with unknown CP-????), how does the tool interpret the text?
As I wrote, my experience is that those tools make wrong guesses (very often ... but that was my experience some years ago).

I used them (I don't remember which of them) some time ago because I wrote applications with multilanguage support (Arabic, Chinese, English, Sanskrit, European languages, etc.), but I ended up solving everything with Unicode text files (with BOM!) or simple ASCII text (ANSI truncated to 7 bit) for files without a BOM.
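Checking for a BOM is cheap and, when one is present, far more reliable than any guessing, so it makes sense as the first step. A minimal sketch in plain FPC follows; `DetectBom` and the returned labels are hypothetical names. It only covers the UTF8 and UTF-16 marks; a UTF-32 LE mark starts with the same two bytes as UTF-16 LE and is not distinguished here.

```pascal
program BomCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a label for the BOM at the start of S,
// or an empty string if the buffer does not start with a known BOM.
function DetectBom(const S: RawByteString): string;
begin
  Result := '';
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Result := 'utf8'
  else if (Length(S) >= 2) and (S[1] = #$FF) and (S[2] = #$FE) then
    Result := 'utf16le'
  else if (Length(S) >= 2) and (S[1] = #$FE) and (S[2] = #$FF) then
    Result := 'utf16be';
end;

begin
  WriteLn('[', DetectBom(#$EF#$BB#$BF'abc'), ']');  // buffer starting with a UTF8 BOM
  WriteLn('[', DetectBom('abc'), ']');              // no BOM, label stays empty
end.
```

Files without a BOM still need the content-based checks discussed elsewhere in this topic.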
« Last Edit: February 08, 2026, 04:08:21 pm by LeP »

Lutz Mändle

« Reply #10 on: February 08, 2026, 04:06:59 pm »
Quote from: Hartmut
Because experimenting with uchardet seems to take considerable effort:
 - is anyone sure that uchardet supports cp850?
 - can it be used on Linux (Ubuntu 24.04 with KDE Plasma desktop)?

The uchardet library doesn't recognise CP850 or CP437 encoded text files. I have tested this with the command-line tool uchardet and some CP850 encoded files with German text. The output is "unknown".

CM630

« Reply #11 on: February 08, 2026, 05:00:57 pm »
Something which might be useful:
As already mentioned, UTF8_Tools detects whether the text file is UTF* or ANSI.
If it is ANSI, you have to detect whether it is cp1252 or cp850.
All Windows codepages (CP125x) that I have checked use #$A0 as NBSP (non-breaking space), while in cp850 it corresponds to á. I suppose that á is used much more often than NBSP.
So if you have #$A0 in your document, you can be almost sure that it is not CP125x. In your case it should be cp850.
Maybe you can find some more tokens to increase the chance of guessing correctly.
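The "more tokens" idea can be sketched as a simple vote: count the bytes that are common letters in one codepage but unlikely characters in the other, and pick the codepage with the higher count. The sketch below assumes German text, which is an assumption and not something stated for the files in question; it uses the byte values of ä ö ü Ä Ö Ü ß in cp850 and cp1252. For other languages the two sets would need different bytes, and `GuessCp850OrCp1252` is a hypothetical name.

```pascal
program GuessAnsiCodepage;
{$mode objfpc}{$H+}

type
  TByteSet = set of Byte;

const
  // Byte values of ä ö ü Ä Ö Ü ß in each codepage (assumption: German text).
  Cp850Letters:  TByteSet = [$84, $94, $81, $8E, $99, $9A, $E1];
  Cp1252Letters: TByteSet = [$E4, $F6, $FC, $C4, $D6, $DC, $DF];

// Hypothetical helper: returns 'cp850', 'cp1252' or '' when undecided.
function GuessCp850OrCp1252(const S: RawByteString): string;
var
  i, score850, score1252: Integer;
  b: Byte;
begin
  score850 := 0;
  score1252 := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b in Cp850Letters then
      Inc(score850)
    else if b in Cp1252Letters then
      Inc(score1252);
  end;
  if score850 > score1252 then
    Result := 'cp850'
  else if score1252 > score850 then
    Result := 'cp1252'
  else
    Result := '';   // tie or no high bytes at all: cannot decide
end;

begin
  WriteLn(GuessCp850OrCp1252('Gr'#$81'n'));   // "Grün" as cp850 bytes
  WriteLn(GuessCp850OrCp1252('Gr'#$FC'n'));   // "Grün" as cp1252 bytes
end.
```

The vote is only meaningful after UTF8 has been ruled out, and it can be fooled by files that contain cp850 box-drawing characters or text in another language, so an undecided or narrow result is best treated as "ask the user".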

Hartmut

« Reply #12 on: February 08, 2026, 05:47:06 pm »
Quote from: Lutz Mändle
The uchardet library doesn't recognise CP850 or CP437 encoded text files. I have tested this with the command-line tool uchardet and some CP850 encoded files with German text. The output is "unknown".
Thanks a lot, Lutz, for this valuable info. It's not good news, but it saves me from wasting time.



Quote from: CM630
Something which might be useful:
As already mentioned, UTF8_Tools detects whether the text file is UTF* or ANSI.
If it is ANSI, you have to detect whether it is cp1252 or cp850.
All Windows codepages (CP125x) that I have checked use #$A0 as NBSP (non-breaking space), while in cp850 it corresponds to á. I suppose that á is used much more often than NBSP.
So if you have #$A0 in your document, you can be almost sure that it is not CP125x. In your case it should be cp850.
Maybe you can find some more tokens to increase the chance of guessing correctly.
Thanks, CM630, for that info. That's not good news either.

Of course I could do any amount of investigating / experimenting / testing... but the reason for starting this topic was to avoid just that and not reinvent the wheel.
As said, (nearly) every text editor faces this problem, so there must be solutions.

LV

« Reply #13 on: February 08, 2026, 07:14:44 pm »
Quote from: Hartmut
Or has someone already created something for that purpose?

There is an open-source project created using Lazarus called Double Commander. This project features an auto-detection codepage capability in its text file viewer. You can examine the implementation in the Pascal source code, probably in the files named ufileview.pas and uShowText.pas. Interestingly, the author of this project occasionally visits this forum.

Hartmut

« Reply #14 on: February 08, 2026, 07:25:50 pm »
Meanwhile I tried https://wiki.freepascal.org/UTF8_Tools. From what I could see, it can only detect some UTF8-variants and "is not UTF8" (reported always as ANSI), as CM630 already said.

When you use its demo
Code: Pascal
  f := TCharEncStream.Create;
  f.LoadFromFile(OpenDialog1.FileName);
  Memo1.Text := f.UTF8Text;
  f.Free;
with a cp850 text file, the result f.UTF8Text is unusable, because the file was treated as ANSI.

Quote from: CM630
All Windows codepages (CP125x) that I have checked use #$A0 as NBSP (non-breaking space), while in cp850 it corresponds to á. I suppose that á is used much more often than NBSP.
So if you have #$A0 in your document, you can be almost sure that it is not CP125x. In your case it should be cp850.
Maybe you can find some more tokens to increase the chance of guessing correctly.
I checked more than 600 files which are in codepage cp850, but only 9 of them contained one or more #$A0 characters, although the files have sizes of up to more than 1 MB, so this idea does not seem to work.



Quote from: LV
Quote from: Hartmut
Or has someone already created something for that purpose?

There is an open-source project created using Lazarus called Double Commander. This project features an auto-detection codepage capability in its text file viewer. You can examine the implementation in the Pascal source code, probably in the files named ufileview.pas and uShowText.pas. Interestingly, the author of this project occasionally visits this forum.
That sounds very interesting, thanks a lot, LV. I will investigate tomorrow.

 
