Author Topic: How to determine the unknown codepage of a textfile?  (Read 3396 times)

LV

  • Sr. Member
  • ****
  • Posts: 427
Re: How to determine the unknown codepage of a textfile?
« Reply #30 on: February 13, 2026, 07:37:32 pm »
Quote
Would you mind sharing a sample of your files? I am still interested in the challenge.  :)

Quote
I had to search for some examples again because I had only noted a couple of results but not the corresponding filenames. I removed some personal content, but the results always remained the same. I attached them for you (results are from your 'Charset Detector'):

demo1.txt has 27 "Umlaute" = ÄÖÜäöüß in cp1252, but is reported as GB18030 (Chinese).
demo2.txt is also cp1252, but is reported as ISO-8859-7 (Greek).
demo3.txt has 36 "Umlaute" = ÄÖÜäöüß in cp850, but is reported as cp1252.
demo4.txt has 27 "Umlaute" = ÄÖÜäöüß in cp850, but is reported as Shift_JIS (Japanese).

It seems you initiated this thread but then lost interest. 😊

Out of pure curiosity, I spent 15 minutes writing 150 lines of code, considering the limitations you mentioned: the German language and code pages cp850, cp1252, utf8, and ASCII.

I compiled and ran your tests:

Code: Text
demo1.txt -> CP1252
demo2.txt -> CP1252
demo3.txt -> CP850
demo4.txt -> CP850

Good luck 😉

P.S. Basically, I modified the code from answer #20.
« Last Edit: February 13, 2026, 08:44:07 pm by LV »

Hartmut

  • Hero Member
  • *****
  • Posts: 1102
Re: How to determine the unknown codepage of a textfile?
« Reply #31 on: February 14, 2026, 07:00:47 pm »
Quote
It seems you initiated this thread but then lost interest. 😊

I only "lost interest" in spending more time on tools that are not reliable enough.
It would be nice to have such a (reliable) codepage detector, but I can live without one (see the solution in reply #25), and I don't want to invest an inappropriate amount of time in it.

As I said, these 4 demos for Roland57 were only a few examples.
Quote
Please accept that I do not want to spend (more) endless time to manually analyze and improve one individual file after each other and to continue this endless for future upcoming files.

Roland57

  • Hero Member
  • *****
  • Posts: 586
    • msegui.net
Re: How to determine the unknown codepage of a textfile?
« Reply #32 on: February 14, 2026, 09:42:26 pm »
Quote
P.S. Basically, I modified the code from answer #20.

Your example is interesting, but I am not sure how to modify it.

I tried this:

Code: Pascal
const
  PTChars: UnicodeString =
   //'áéíóúâêôãõçÁÉÍÓÚÂÊÔÃÕÇ';  // Portuguese special characters
   'ßäÄöÖüÜ';

I don't get good results.  :(

$ ./detect4enc
demo1.txt -> CP850
demo2.txt -> CP850
demo3.txt -> CP850
demo4.txt -> CP850


P.S. I started another test. I saved a German text in three different files and set the encoding of each one using Geany.

Result with file command:
./cp1252.txt
  [file] text/plain; charset=iso-8859-1
./cp850.txt
  [file] text/plain; charset=unknown-8bit
./utf8.txt
  [file] text/plain; charset=utf-8


Result with detect4enc program:
cp850.txt -> CP850
cp1252.txt -> CP850
utf8.txt -> UTF8
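
A reproducible alternative to saving the test files by hand from an editor is to write the bytes directly, so that the encoding of each file is known for certain. The following is only a sketch: the program name MakeSamples, the helper WriteSample, the sample sentence and the three file names are made up for illustration and are not part of the attached test set.

```pascal
program MakeSamples;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

// Writes the given bytes to a new file, replacing any existing one.
// (Hypothetical helper, not taken from the thread.)
procedure WriteSample(const FN: string; const Data: array of Byte);
var
  FS: TFileStream;
begin
  FS := TFileStream.Create(FN, fmCreate);
  try
    if Length(Data) > 0 then
      FS.WriteBuffer(Data[0], Length(Data));
  finally
    FS.Free;
  end;
end;

begin
  // "Grüße aus Köln" plus a line feed, encoded three ways.
  // cp1252: ü = $FC, ß = $DF, ö = $F6
  WriteSample('sample_cp1252.txt',
    [$47, $72, $FC, $DF, $65, $20, $61, $75, $73, $20,
     $4B, $F6, $6C, $6E, $0A]);

  // cp850: ü = $81, ß = $E1, ö = $94
  WriteSample('sample_cp850.txt',
    [$47, $72, $81, $E1, $65, $20, $61, $75, $73, $20,
     $4B, $94, $6C, $6E, $0A]);

  // UTF-8: ü = $C3 $BC, ß = $C3 $9F, ö = $C3 $B6
  WriteSample('sample_utf8.txt',
    [$47, $72, $C3, $BC, $C3, $9F, $65, $20, $61, $75, $73, $20,
     $4B, $C3, $B6, $6C, $6E, $0A]);
end.
```

Because the source contains only ASCII characters, the generated files do not depend on the encoding in which the source itself is saved.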
« Last Edit: February 14, 2026, 10:23:29 pm by Roland57 »
My projects are on Codeberg.

LV

  • Sr. Member
  • ****
  • Posts: 427
Re: How to determine the unknown codepage of a textfile?
« Reply #33 on: February 15, 2026, 08:28:40 am »
Hi, @Roland57!
This task doesn't have much practical value for me; it's just a distraction from my main work and a minor expansion of my Pascal programming skills. Oh, and also just a general reflection on the question of problem-solving under uncertainty. In other words, I didn't conduct any in-depth research, except for testing on text files with different encodings, as provided by @CM630 and @Hartmut.
Here's the modified code for checking @Hartmut's files; it narrows the alternatives down to German text and the encodings cp850, cp1252, utf8, and ASCII.
The results I posted above (Windows 11, Lazarus 3.4, FPC 3.2.2) are correct. I'd be interested to hear whether you can reproduce them.
Best regards

Code: Pascal
program Detect4Enc;

{$mode objfpc}

uses
  SysUtils, Classes;

type
  TEnc = (encASCII, encUTF8, encCP850, encCP1252);

const
  // German umlauts + ß
  DEChars: UnicodeString =
   'äöüÄÖÜß';

  // Symbols that are unlikely in German text; they lower the score
  Garbage: UnicodeString =
   '∆µ†¤';

  // Box-drawing characters; they lower the score even more
  BoxChars: UnicodeString =
   '░▒▓│┤┼┐└═';

// Reads the whole file into a byte array.
function LoadBytes(const FN: string): TBytes;
var
  FS: TFileStream;
begin
  FS := TFileStream.Create(FN, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, FS.Size);
    // Guard against empty files: Result[0] does not exist then.
    if FS.Size > 0 then
      FS.ReadBuffer(Result[0], FS.Size);
  finally
    FS.Free;
  end;
end;

// True if every byte is in the 7-bit range.
function IsASCII(const B: TBytes): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 0 to High(B) do
    if B[i] > 127 then Exit(False);
end;

// True if the data is structurally valid UTF-8 and contains
// at least one multi-byte sequence.
function IsRealUTF8(const B: TBytes): Boolean;
var
  i, n: Integer;
  HasMulti: Boolean;
begin
  i := 0;
  HasMulti := False;

  while i < Length(B) do
  begin
    if B[i] < $80 then
      Inc(i)
    else
    begin
      HasMulti := True;

      // n = number of continuation bytes expected after the lead byte
      if (B[i] and $E0) = $C0 then n := 1
      else if (B[i] and $F0) = $E0 then n := 2
      else if (B[i] and $F8) = $F0 then n := 3
      else Exit(False);

      Inc(i);
      while n > 0 do
      begin
        if (i >= Length(B)) or ((B[i] and $C0) <> $80) then Exit(False);
        Inc(i);
        Dec(n);
      end;
    end;
  end;

  Result := HasMulti;
end;

// Rates how much a decoded string looks like German text.
function Score(const S: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;

  for i := 1 to Length(S) do
  begin
    if S[i] in ['A'..'Z','a'..'z'] then Inc(Result);

    if Pos(S[i], DEChars) > 0 then Inc(Result, 6);

    if Pos(S[i], Garbage) > 0 then Dec(Result, 20);
    if Pos(S[i], BoxChars) > 0 then Dec(Result, 25);
  end;

  // German language patterns
  if Pos(' der ', S) > 0 then Inc(Result, 20);
  if Pos(' die ', S) > 0 then Inc(Result, 20);
  if Pos(' und ', S) > 0 then Inc(Result, 15);
  if Pos('sch', S) > 0 then Inc(Result, 10);
  if Pos('über', S) > 0 then Inc(Result, 25);
  if Pos('ß', S) > 0 then Inc(Result, 20);
end;

function DetectEncoding(const FN: string): TEnc;
var
  B: TBytes;
  S850, S1252: UnicodeString;
  A, B2: Integer;
begin
  B := LoadBytes(FN);

  if IsASCII(B) then
    Exit(encASCII);

  if IsRealUTF8(B) then
    Exit(encUTF8);

  // Decode the same bytes with both code pages and compare the scores.
  S850  := TEncoding.GetEncoding(850).GetString(B);
  S1252 := TEncoding.GetEncoding(1252).GetString(B);

  A  := Score(S850);
  B2 := Score(S1252);

  // A tie falls through to CP850.
  if B2 > A then
    Result := encCP1252
  else
    Result := encCP850;
end;

procedure Test(const FN: string);
begin
  Write(FN, ' -> ');
  case DetectEncoding(FN) of
    encASCII:   Writeln('ASCII');
    encUTF8:    Writeln('UTF8');
    encCP850:   Writeln('CP850');
    encCP1252:  Writeln('CP1252');
  end;
end;

begin
  Test('demo1.txt');
  Test('demo2.txt');
  Test('demo3.txt');
  Test('demo4.txt');

  ReadLn;
end.

Roland57

  • Hero Member
  • *****
  • Posts: 586
    • msegui.net
Re: How to determine the unknown codepage of a textfile?
« Reply #34 on: February 15, 2026, 05:06:06 pm »
Quote
Here's the modified code for checking @Hartmut's files; it narrows the alternatives down to German text and the encodings cp850, cp1252, utf8, and ASCII.
The results I posted above (Windows 11, Lazarus 3.4, FPC 3.2.2) are correct. I'd be interested to hear whether you can reproduce them.

Thank you for the new version of the program.

Here (Linux, FPC 3.2.2), all four files provided by Hartmut are detected as CP850; I don't know why.

./hartmut/demo1.txt
  [uchardet] ISO-8859-1
  [file] text/plain; charset=iso-8859-1
  [detect4enc] ./hartmut/demo1.txt -> CP850
./hartmut/demo2.txt
  [uchardet] WINDOWS-1252
  [file] text/plain; charset=iso-8859-1
  [detect4enc] ./hartmut/demo2.txt -> CP850
./hartmut/demo3.txt
  [uchardet] WINDOWS-1250
  [file] text/plain; charset=unknown-8bit
  [detect4enc] ./hartmut/demo3.txt -> CP850
./hartmut/demo4.txt
  [uchardet] ISO-8859-15
  [file] text/plain; charset=unknown-8bit
  [detect4enc] ./hartmut/demo4.txt -> CP850
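
The thread does not establish why the Linux build answers CP850 every time. What the posted code does show is that DetectEncoding falls back to CP850 whenever the two scores are equal, and that the scores depend both on TEncoding decoding the bytes and on non-ASCII string literals in the source. Whether either of these behaves differently on Linux is an assumption that has not been verified here. A minimal sketch that avoids both dependencies counts the raw byte values of ÄÖÜäöüß directly. The names UmlautBytes, TGuess, GuessByBytes and MakeBytes are hypothetical, and the approach only makes sense under the restriction already stated in the thread (German text, cp850 or cp1252).

```pascal
program UmlautBytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TGuess = (gUnknown, gCP850, gCP1252);

const
  // Byte values of ä ö ü Ä Ö Ü ß in each code page
  CP1252Umlauts = [$E4, $F6, $FC, $C4, $D6, $DC, $DF];
  CP850Umlauts  = [$84, $94, $81, $8E, $99, $9A, $E1];

  GuessName: array[TGuess] of string = ('unknown', 'CP850', 'CP1252');

// Counts how many bytes would be German umlauts (or ß) under each
// code page and returns the code page with the higher count.
function GuessByBytes(const B: TBytes): TGuess;
var
  i, N850, N1252: Integer;
begin
  N850 := 0;
  N1252 := 0;
  for i := 0 to High(B) do
    if B[i] in CP850Umlauts then
      Inc(N850)
    else if B[i] in CP1252Umlauts then
      Inc(N1252);

  if N850 > N1252 then
    Result := gCP850
  else if N1252 > N850 then
    Result := gCP1252
  else
    Result := gUnknown;  // tie, e.g. pure ASCII input
end;

// Copies an open array into a TBytes value (helper for the demo only).
function MakeBytes(const A: array of Byte): TBytes;
var
  i: Integer;
begin
  SetLength(Result, Length(A));
  for i := 0 to High(A) do
    Result[i] := A[i];
end;

begin
  // "Grüße" encoded as cp1252, as cp850, and plain ASCII "Gruss"
  Writeln(GuessName[GuessByBytes(MakeBytes([$47, $72, $FC, $DF, $65]))]); // CP1252
  Writeln(GuessName[GuessByBytes(MakeBytes([$47, $72, $81, $E1, $65]))]); // CP850
  Writeln(GuessName[GuessByBytes(MakeBytes([$47, $72, $75, $73, $73]))]); // unknown
end.
```

The two byte sets do not overlap, so each high byte counts for at most one code page. Unlike the posted program, a tie is reported as unknown instead of silently becoming CP850. The ASCII and UTF-8 checks from the posted program would still have to run first.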


I attach the program and all the sample files together.

I hope we won't leave this good exercise without a conclusion.  :)

"Detect cp850, cp1252, utf8, and ASCII in text files containing German text."

Roland57

  • Hero Member
  • *****
  • Posts: 586
    • msegui.net
Re: How to determine the unknown codepage of a textfile?
« Reply #35 on: February 15, 2026, 07:47:42 pm »
Quote
I had to search for some examples again because I had only noted a couple of results but not the corresponding filenames. I removed some personal content, but the results always remained the same. I attached them for you (results are from your 'Charset Detector'):

demo1.txt has 27 "Umlaute" = ÄÖÜäöüß in cp1252, but is reported as GB18030 (Chinese).
demo2.txt is also cp1252, but is reported as ISO-8859-7 (Greek).
demo3.txt has 36 "Umlaute" = ÄÖÜäöüß in cp850, but is reported as cp1252.
demo4.txt has 27 "Umlaute" = ÄÖÜäöüß in cp850, but is reported as Shift_JIS (Japanese).

Yes, same results here.

../detect4enc/hartmut/demo1.txt
  [file] iso-8859-1
  [uchardet] ISO-8859-1
  [detect4enc] ../detect4enc/hartmut/demo1.txt -> CP850
  [lconvencoding] ISO-8859-1
  [chsdet] Codepage 54936 Name GB18030 Description "Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)"
../detect4enc/hartmut/demo2.txt
  [file] iso-8859-1
  [uchardet] WINDOWS-1252
  [detect4enc] ../detect4enc/hartmut/demo2.txt -> CP850
  [lconvencoding] ISO-8859-1
  [chsdet] Codepage 28597 Name iso-8859-7 Description "ISO 8859-7 Greek"
../detect4enc/hartmut/demo3.txt
  [file] unknown-8bit
  [uchardet] WINDOWS-1250
  [detect4enc] ../detect4enc/hartmut/demo3.txt -> CP850
  [lconvencoding] ISO-8859-1
  [chsdet] Codepage 1252 Name windows-1252 Description "ANSI Latin 1; Western European (Windows)"
../detect4enc/hartmut/demo4.txt
  [file] unknown-8bit
  [uchardet] ISO-8859-15
  [detect4enc] ../detect4enc/hartmut/demo4.txt -> CP850
  [lconvencoding] ISO-8859-1
  [chsdet] Codepage 932 Name shift_jis Description "ANSI/OEM Japanese; Japanese (Shift-JIS)"


Another observation: if I open the four files in Geany, it detects all four as ISO-8859-1.  %)
« Last Edit: February 15, 2026, 07:49:28 pm by Roland57 »

valdir.marcos

  • Hero Member
  • *****
  • Posts: 1176
Re: How to determine the unknown codepage of a textfile?
« Reply #36 on: February 16, 2026, 09:22:19 am »
So, the best choice is to have everything in Unicode (like UTF-8) or know the original source.
That is my feeling, too. Convert the files manually to get it right. It may be laborious but you only need to do it once.
Local codepages are history, or at least they should be history.
Me too.

Roland57

  • Hero Member
  • *****
  • Posts: 586
    • msegui.net
Re: How to determine the unknown codepage of a textfile?
« Reply #37 on: February 16, 2026, 09:43:27 am »
Quote
So, the best choice is to have everything in Unicode (like UTF-8) or know the original source.
That is my feeling, too. Convert the files manually to get it right. It may be laborious but you only need to do it once.
Local codepages are history, or at least they should be history.
Me too.

Indeed, if the problem doesn't exist, it no longer needs to be solved.  :)

 
