
Common use of Byte Order Mark in text files?


Hi, I'm fiddling a little with text files and UTF-8, ANSI and Unicode encodings  :)
How common do you think it is for a file to start with the UTF-8 BOM as its first 3 bytes?
I'm on Linux and the system encoding is UTF-8, but many files don't have the BOM...
Asking for opinions here  :D

Regards Benny

Well, the Lazarus IDE encodes source using UTF-8, but does not add a BOM.
How commonly you find it depends on which files you happen to open. There are potentially trillions out there. Some have a BOM and some do not... The statistical proportion of one type to the other is ...?

I guess you want to know what default to use in case the BOM is missing.

I am doing two projects now that rely on some text files which *must* be in UTF-8, but "text file" is not a standard, so you cannot trust that the BOM will be there to identify the file as UTF-8. It's just like with any other text file encoded in ANSI with a certain code page, UTF-16 or whatever. The only way to know if a file without a BOM is valid UTF-8 is to read it and check that every byte sequence in it is a valid UTF-8 sequence. And even then, you don't know if it was actually encoded as UTF-8, just that the bytes it contains happen to be valid UTF-8.
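To make that check concrete, here is a minimal sketch in Python (not the Lazarus code from my projects; the function name `is_valid_utf8` is made up for the illustration). It also shows why passing the check proves little about how the file was really written.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "caf\u00e9".encode("utf-8")   # b"caf\xc3\xa9"
ansi_bytes = "caf\u00e9".encode("cp1252")  # b"caf\xe9"

print(is_valid_utf8(utf8_bytes))  # valid UTF-8
print(is_valid_utf8(ansi_bytes))  # a lone 0xE9 byte is not valid UTF-8

# Validity says nothing about intent: these two bytes are one character
# in UTF-8, but they are also a legal two-character cp1252 text.
ambiguous = b"\xc3\xa9"
print(len(ambiguous.decode("utf-8")), len(ambiguous.decode("cp1252")))
```

The last line is the "even then, you don't know" case: both decodings succeed, they just give different text.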

So basically I found no practical solution besides what anyone should do when dealing with a text file that is expected to be in a particular encoding: document the requirement, treat every file as if it were in that encoding, and if the user doesn't take that into account, the weird results will be a good warning.
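As a rough sketch of "treat every file as if it were in that encoding", again in Python: the standard `utf-8-sig` codec accepts a file with or without the BOM, and strict error handling turns a wrongly encoded file into an immediate error rather than silent garbage. The helper name `read_required_utf8` and the sample files are invented for the example.

```python
import tempfile
from pathlib import Path


def read_required_utf8(path) -> str:
    """Read a file that is documented to be UTF-8, with or without a BOM."""
    # "utf-8-sig" drops a leading BOM if there is one and otherwise
    # behaves like plain "utf-8"; strict errors make bad input fail loudly.
    return Path(path).read_text(encoding="utf-8-sig", errors="strict")


with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / "with_bom.txt"
    without_bom = Path(tmp) / "without_bom.txt"
    ansi = Path(tmp) / "ansi.txt"
    with_bom.write_bytes(b"\xef\xbb\xbfna\xc3\xafve\n")  # BOM + UTF-8 text
    without_bom.write_bytes(b"na\xc3\xafve\n")           # same text, no BOM
    ansi.write_bytes(b"na\xefve\n")                      # same text in cp1252

    # Both UTF-8 variants give the same string; the BOM is not part of it.
    same = read_required_utf8(with_bom) == read_required_utf8(without_bom)
    print(same)

    rejected = False
    try:
        read_required_utf8(ansi)
    except UnicodeDecodeError as err:
        rejected = True
        print("not UTF-8:", err.reason)
```

Whether to abort on the error or fall back to another encoding is a policy decision for the program; the sketch only shows that the problem becomes visible.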

To give you an example, I use Notepad++ as my text editor, and to decide whether a file is UTF-8 without BOM it actually has to load and scan the text. As far as I can tell, if it finds bytes above 0x7F that do not form valid UTF-8 sequences (typical of ANSI text in some code page), it loads the file as ANSI. If the non-ASCII bytes all form valid UTF-8 sequences, it loads it as UTF-8 without BOM. But if the file contains only ASCII characters, there's simply no way of knowing, so it loads it as ANSI because it has to load it as something before allowing you to use it.
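A heuristic along those lines can be sketched in a few lines of Python. This is only an illustration of the idea, not Notepad++'s actual algorithm; the function name `guess_encoding` and the choice of `cp1252` as the "ANSI" fallback are assumptions.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a codec name for raw file contents (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # explicit BOM: the only unambiguous case
    if data.isascii():
        return fallback      # pure ASCII: undecidable, pick the default
    try:
        data.decode("utf-8")
        return "utf-8"       # non-ASCII bytes that all form valid sequences
    except UnicodeDecodeError:
        return fallback      # invalid sequences: assume a legacy code page


print(guess_encoding(b"\xef\xbb\xbfhello"))  # BOM present
print(guess_encoding(b"hello"))              # ASCII only
print(guess_encoding(b"caf\xc3\xa9"))        # valid UTF-8, no BOM
print(guess_encoding(b"caf\xe9"))            # not valid UTF-8
```

For a pure ASCII file the answer doesn't matter for reading, since the bytes decode identically either way; it only matters once you add non-ASCII text and save.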

It's pretty much like the \r, \n and \r\n line endings: what would you do if you found a text file where every line had a random line-ending combination?

If you are going to read and parse the text file heavily anyway, then when the BOM is missing, checking whether each line is valid UTF-8 won't add much overhead compared to the other parsing you do with the data. Things like ValidUTF8String or FindInvalidUTF8Character in unit lazutf8 will help you decide whether you must stop processing because the text file was, in fact, not UTF-8 without BOM. For more information, the BOM article at Wikipedia is a good entry-level overview of what the BOM does and does not do for UTF-8.
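For readers not using Lazarus, the same per-line check might look like this in Python. It is a sketch only, loosely inspired by the lazutf8 routines rather than a port of them; `find_invalid_utf8` and `first_bad_line` are hypothetical names, and returning -1 for "no error" is just a convention chosen here.

```python
from typing import Iterable, Optional, Tuple


def find_invalid_utf8(data: bytes) -> int:
    """Return the byte offset of the first invalid UTF-8 byte, or -1."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # index where decoding failed
    return -1


def first_bad_line(lines: Iterable[bytes]) -> Optional[Tuple[int, int]]:
    """Return (line number, byte offset) of the first invalid line, or None."""
    # Splitting raw bytes on "\n" is safe: a multi-byte UTF-8 sequence
    # never contains the byte 0x0A.
    for lineno, line in enumerate(lines, start=1):
        offset = find_invalid_utf8(line)
        if offset >= 0:
            return lineno, offset
    return None


sample = [b"first line\n", b"caf\xc3\xa9\n", b"caf\xe9\n"]
print(first_bad_line(sample))  # third line has a lone 0xE9 at offset 3
```

With a real file you would pass `open(path, "rb")` as `lines` and stop processing, or switch to your fallback encoding, as soon as a bad line is reported.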

Regards, JMM

