Meanwhile I checked
encodingexpert-260208.zip from reply #16. It contains 4 methods:
- LConvEncoding.GuessEncoding() can only return UTF8 or the default codepage of the OS (which is replaced by ISO-8859-1 if that default is UTF8, as it is on Linux) => not usable for me
- Linux 'file' command: returns 'unknown-8bit' for both cp1252 and cp850 => not usable for me
- Linux 'uchardet' command: because cp850 is not supported (according to its documentation and the test by Lutz in reply #10) and 'Charset Detector' behaves the same, I did not install it
- Charset Detector: same as the stand-alone 'chsdet' component included in Double Commander, which I already tested and described in reply #22.
So regrettably no improvement over what was there before.
Then I tested
program Detect4Enc from reply #20. For my first file it reported cp850, although this file is cp1252 and contains more than 700 cp1252 characters, which makes it not usable for me.
But I like your idea of checking for 'Garbage' and 'BoxChars' and penalizing them. If I
were to reinvent the wheel and
create my own codepage detector, I would use something similar.
But I definitely do not want to invest all the effort of investigating / creating statistics / experimenting / testing / failing / improving etc. until such a detector, piece by piece, comes closer to reliable results.
"So what bothers you? If I have understood you correctly, for your files: if it is not UTF8 and not cp1252, then it is cp850."
The risk of treating everything that could not be detected (with sufficient certainty) as cp850.
It makes sense that if a file is too short and/or does not contain enough characteristic data, it should be reported as 'Unknown' instead of returning some nonsense.
"You might additionally check if any of #$E1 #$84 #$8E #$94 #$99 #$81 #$9A (the cp850 umlaut bytes) are present.
If there are no umlauts, there should be nothing above #$7F, so it is English or *e was used... hopefully."
Unfortunately this is not true. Besides the umlauts ("Umlaute") ÄÖÜäöüß there are many other special characters which can (but need not) occur in cp850, e.g. ø á ½ « » © µ ± § x² x³, to name only a few.
And e.g. $F6 is 'ö' in cp1252 but '÷' in cp850.
Or e.g. $FC is 'ü' in cp1252 but '³' in cp850.
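This overlap can be demonstrated with any codepage-aware decoder. Python is used here only to illustrate the byte mappings; it is not part of the project:

```python
# The same raw byte maps to different characters depending on the codepage.
for b in (0xF6, 0xFC):
    cp1252_char = bytes([b]).decode('cp1252')
    cp850_char = bytes([b]).decode('cp850')
    print(f"${b:02X}: cp1252 = {cp1252_char!r}, cp850 = {cp850_char!r}")
# $F6: cp1252 = 'ö', cp850 = '÷'
# $FC: cp1252 = 'ü', cp850 = '³'
```

So a single byte value carries no codepage information at all; only the surrounding text gives a hint.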
I think that for reliable results mature statistics for each codepage are necessary.
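To make the idea concrete, here is a hypothetical toy scorer, assuming text where most characters should be letters, digits or common punctuation. This is not the statistics any of the mentioned tools use, just a sketch of the principle:

```python
# Toy codepage scorer (illustration only, NOT any of the tools discussed):
# decode the raw bytes with each candidate codepage, count how many
# resulting characters look plausible for ordinary text, pick the best.
def guess_codepage(data: bytes, candidates=('cp1252', 'cp850')) -> str:
    def plausibility(cp: str) -> int:
        text = data.decode(cp, errors='replace')
        return sum(ch.isalnum() or ch in ' .,;:!?-\n\t' for ch in text)
    return max(candidates, key=plausibility)
```

Such a crude count already separates the two examples above, but real-world reliability would need per-language character and bigram frequencies, which is exactly the effort I want to avoid.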
Meanwhile I made some more tests with
the component 'chsdet' (contained in Double Commander) and got too many wrong results:
- a couple of files which are cp1252 were reported not as cp1252, but as GB18030 (Chinese) or ISO-8859-7 (Greek)
- a couple of files which are cp850 were reported not as 'Unknown', but as cp1252 (!) or Shift_JIS (Japanese)
It does not help me if a codepage detector is so unreliable.
Summary: Unfortunately none of the tools mentioned in this topic (so far) delivers results that are reliable enough. The component 'chsdet' (contained in Double Commander) still seems to be the best of them overall. But it does not support cp850 at all, which leads to various wrong codepages, because it has no statistics for cp850 to consider.
I thought that better tools existed, because (nearly) every text editor has to solve this problem.
If no better tool turns up soon, I will proceed with my project this way:
- if a text file is pure ASCII or UTF8 (both are simple to detect, and I found no wrong results with 'chsdet' for either), I will use that codepage
- in all other cases the user must select the codepage manually (cp1252 or cp850).
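The two cheap checks in that plan can be sketched like this (assumed logic, not the actual project code): pure ASCII means no byte above #$7F, and valid UTF8 can be verified by a strict decode; everything else is left to the user.

```python
# Sketch of the fallback plan: detect only what is cheap and reliable.
def classify(data: bytes) -> str:
    if all(b < 0x80 for b in data):
        return 'ASCII'
    try:
        data.decode('utf-8', errors='strict')  # raises on invalid UTF-8
        return 'UTF8'
    except UnicodeDecodeError:
        return 'Unknown'  # let the user choose: cp1252 or cp850
```

One caveat: a short cp1252/cp850 file can by coincidence also be valid UTF-8, so even this conservative check is not 100% safe, but in practice such collisions are rare.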
Thanks a lot to all of you for your help.