Author Topic: Mixed Utf8 and iso8859-1 file problem???  (Read 3163 times)

Robert W.B.

  • Sr. Member
  • ****
  • Posts: 328
  • Love my Wife, My Kids and Lazarus/Freepascal.
Mixed Utf8 and iso8859-1 file problem???
« on: September 21, 2017, 02:10:55 pm »
Hi friends. I have a big problem: I have a text file that mixes UTF-8 and ISO-8859-1.
I would like to convert the whole file to ISO-8859-1, but how can I do this?

What I wish is something like this:

for a:=1 to 100 do begin
  MyString:=MyTextline[a];
  If MyString contains UTF8 then begin
    convert MyString to iso8859-1
  end;
end;

Thanks in advance
Bob  :-[
Rob

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1311
    • Lebeau Software
Re: Mixed Utf8 and iso8859-1 file problem???
« Reply #1 on: September 21, 2017, 06:43:25 pm »
Quote: I have a mixed text file with UTF-8 and ISO-8859-1.

Why is it mixed in the first place?  A text file should never use more than one charset at a time.  Are you sure this is actually a text file, and not a structured binary file?

Quote: I would like to convert the whole file to ISO-8859-1, but how can I do this?

You can't convert the whole file in one go.  You need to know which portions are UTF-8 and which portions are ISO-8859-1, and then convert each portion separately as needed.

Quote: What I wish is something like this:

for a:=1 to 100 do begin
  MyString:=MyTextline[a];
  If MyString contains UTF8 then begin
    convert MyString to iso8859-1
  end;
end;

UTF-8 has a very well-defined structure, so it is very easy to detect.  The first byte of a UTF-8 sequence always has its bits set to either 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx.  If 0xxxxxxx, the byte contains the complete codepoint as-is; otherwise the number of leading 1 bits specifies the total number of bytes in the sequence (2, 3, or 4), and the extra bytes all have their bits set to 10xxxxxx.  The combined x bits from the entire sequence form the actual Unicode codepoint.
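The bit patterns described above can be sketched in plain Free Pascal roughly as follows. This is only an illustration: the names Utf8SequenceLength and IsContinuationByte are made up for this example and are not part of any library, and the check looks at the bit patterns only, so it does not reject overlong encodings or other ill-formed sequences.

```pascal
program Utf8LeadByte;

{$mode objfpc}{$H+}

// Hypothetical helper: total length of the UTF-8 sequence announced by
// lead byte B, or 0 if B cannot start a sequence.
function Utf8SequenceLength(B: Byte): Integer;
begin
  if (B and $80) = $00 then Result := 1        // 0xxxxxxx: plain ASCII
  else if (B and $E0) = $C0 then Result := 2   // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3   // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4   // 11110xxx
  else Result := 0;                            // 10xxxxxx or 11111xxx
end;

// Hypothetical helper: True if B has the form 10xxxxxx.
function IsContinuationByte(B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;

begin
  WriteLn(Utf8SequenceLength($41));   // 'A'                  -> 1
  WriteLn(Utf8SequenceLength($C3));   // lead byte of UTF-8 å -> 2
  WriteLn(Utf8SequenceLength($E2));   // 3-byte lead          -> 3
  WriteLn(Utf8SequenceLength($F0));   // 4-byte lead          -> 4
  WriteLn(Utf8SequenceLength($A5));   // continuation byte    -> 0
  WriteLn(IsContinuationByte($A5));   // TRUE
end.
```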

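ISO-8859-1 is also well-defined: every character is exactly one byte, and the byte value is the same as the Unicode codepoint (U+0000 to U+00FF).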

So, while scanning through your file data, look for these specific bit patterns, and if detected then convert between UTF-8 and ISO-8859-1 as needed.

On the other hand, ISO-8859-1 supports only a very small subset of Unicode, whereas UTF-8 supports the entire Unicode repertoire.  So, why would you want to convert from UTF-8 to ISO-8859-1 and risk data loss?  If a valid UTF-8 sequence represents a Unicode codepoint that ISO-8859-1 does not support, you will lose that codepoint during the conversion.  There is less chance of data loss if you go the other way.  Any valid ISO-8859-1 byte can be converted to UTF-8 without data loss.  And it is easier to check if a given single byte is NOT in a valid UTF-8 sequence than it is to check if a given range of bytes IS a valid UTF-8 sequence.

Either way, do note that there is some overlap (outside of the ASCII range) between UTF-8 and ISO-8859-1, so it is possible (albeit unlikely) that valid ISO-8859-1 byte sequences would ALSO be valid UTF-8 byte sequences.  Without any *context* about what the data is supposed to represent, it may not always be easy to differentiate between them correctly.  So be prepared to accept some margin of false positives while converting.
« Last Edit: September 21, 2017, 10:41:52 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Robert W.B.

Re: Mixed Utf8 and iso8859-1 file problem???
« Reply #2 on: September 21, 2017, 09:00:54 pm »
Thanks Remy for the reply.
The thing is, for many years I've been saving to a text file weekly in UTF-8, and the text file is now 600 lines of UTF-8.
Suddenly the site I pick data from changed its charset to ISO-8859-1, and I only recently discovered this, so my text file is now mixed UTF-8 and ISO-8859-1. I noticed when it couldn't show European åäö anymore. So I hoped there was a solution in the mighty Lazarus/Free Pascal that could convert my mixed-charset text file, in code, to just one charset.
Best regards
Bob 8)
Rob

Remy Lebeau

Re: Mixed Utf8 and iso8859-1 file problem???
« Reply #3 on: September 21, 2017, 10:46:03 pm »
Quote: Thanks Remy for the reply.
The thing is, for many years I've been saving to a text file weekly in UTF-8, and the text file is now 600 lines of UTF-8.
Suddenly the site I pick data from changed its charset to ISO-8859-1, and I only recently discovered this, so my text file is now mixed UTF-8 and ISO-8859-1.

Well, that was your mistake.  You should not have been saving the downloaded text as-is to begin with.  You should have detected the charset used by the text, as reported by the server or inferred otherwise, converted it if needed, and then saved it.  That way, your file would have had a consistent encoding.

Quote: I noticed when it couldn't show European åäö anymore. So I hoped there was a solution in the mighty Lazarus/Free Pascal that could convert my mixed-charset text file, in code, to just one charset.

Nope.  You have to code it manually.  Since your file was mostly UTF-8 to begin with, you should scan the file looking for invalid UTF-8 bytes, translate them to Unicode via ISO-8859-1, and encode them to UTF-8.  Then make sure subsequent downloads are always saved to file using UTF-8 only.
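A minimal sketch of that manual approach, in plain Free Pascal with no Lazarus units, might look like this. ValidSeqLen, RepairToUtf8 and HexBytes are hypothetical names invented for the example. It assumes that any byte which is not part of a structurally valid UTF-8 sequence is an ISO-8859-1 character, and it will leave alone any ISO-8859-1 byte pair that happens to look like valid UTF-8 (the false positives mentioned earlier in the thread).

```pascal
program RepairMixedEncoding;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Length of the structurally valid UTF-8 sequence starting at S[I],
// or 0 if S[I] does not start one. Surrogates and some overlong
// 3/4-byte forms are not rejected; this is a sketch, not a validator.
function ValidSeqLen(const S: string; I: Integer): Integer;
var
  B: Byte;
  K: Integer;
begin
  B := Ord(S[I]);
  if B < $80 then Exit(1);
  if (B >= $C2) and (B <= $DF) then Result := 2
  else if (B >= $E0) and (B <= $EF) then Result := 3
  else if (B >= $F0) and (B <= $F4) then Result := 4
  else Exit(0);
  if I + Result - 1 > Length(S) then Exit(0);      // sequence is cut off
  for K := 1 to Result - 1 do
    if (Ord(S[I + K]) and $C0) <> $80 then Exit(0); // not 10xxxxxx
end;

// Copies valid UTF-8 sequences unchanged and re-encodes every other
// byte as the two-byte UTF-8 form of the same ISO-8859-1 codepoint.
function RepairToUtf8(const S: string): string;
var
  I, J, K, N: Integer;
  B: Byte;
begin
  SetLength(Result, 2 * Length(S));   // worst case: every byte doubles
  I := 1;
  J := 0;
  while I <= Length(S) do
  begin
    N := ValidSeqLen(S, I);
    if N > 0 then
    begin
      for K := 0 to N - 1 do
      begin
        Inc(J);
        Result[J] := S[I + K];
      end;
      Inc(I, N);
    end
    else
    begin
      B := Ord(S[I]);                 // always >= $80 here
      Inc(J); Result[J] := Chr($C0 or (B shr 6));
      Inc(J); Result[J] := Chr($80 or (B and $3F));
      Inc(I);
    end;
  end;
  SetLength(Result, J);
end;

// Shows the bytes as hex so the demo does not depend on the console encoding.
function HexBytes(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2) + ' ';
end;

begin
  // 'a', UTF-8 å (C3 A5), ISO-8859-1 å (E5), 'b'
  WriteLn(HexBytes(RepairToUtf8('a'#$C3#$A5#$E5'b')));  // 61 C3 A5 C3 A5 62
end.
```

Working on raw bytes with index assignments, rather than string concatenation, is a deliberate choice here: it avoids any implicit code page conversion by the compiler.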
« Last Edit: September 21, 2017, 10:48:19 pm by Remy Lebeau »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Mixed Utf8 and iso8859-1 file problem???
« Reply #4 on: September 22, 2017, 01:27:56 am »
Quote: What I wish is something like this:

for a:=1 to 100 do begin
  MyString:=MyTextline[a];
  If MyString contains UTF8 then begin
    convert MyString to iso8859-1
  end;
end;

Your description implies that your file has two parts: the first part in UTF8 and the second in ISO8859-1. The Lazarus editor can convert between UTF8 and ISO8859-1 if you know (or can guess) which lines are in UTF8.

If you want to do that in code, the LazUTF8 unit has a function, FindInvalidUTF8Character, that could be used to find the first ISO8859-1 line:
Code: Pascal
uses ..., LazUTF8;

function IsValidUTF8Line(const aLine: string): boolean;
begin
  // PChar(aLine) is safe for an empty line, unlike @aLine[1]
  Result := FindInvalidUTF8Character(PChar(aLine), Length(aLine)) = -1;
end;

Use unit LConvEncoding to convert the encoding to whatever you want.
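Putting the two units together might look roughly like the sketch below. It assumes that every line is entirely in one encoding, that the project depends on the LazUtils package, and that the file names mixed.txt and fixed.txt are placeholders. LineToUTF8 is a made-up helper name; ISO_8859_1ToUTF8 comes from LConvEncoding. Newer LazUtils releases name the check FindInvalidUTF8Codepoint, so adjust the call if the compiler complains about FindInvalidUTF8Character.

```pascal
program ConvertLines;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, LazUTF8, LConvEncoding;

// Hypothetical helper: returns the line as UTF-8, treating any line
// that is not valid UTF-8 as ISO-8859-1.
function LineToUTF8(const aLine: string): string;
begin
  if FindInvalidUTF8Character(PChar(aLine), Length(aLine)) = -1 then
    Result := aLine                     // valid UTF-8 (or plain ASCII)
  else
    Result := ISO_8859_1ToUTF8(aLine);  // whole line taken as ISO-8859-1
end;

var
  Lines: TStringList;
  I: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile('mixed.txt');    // placeholder input file name
    for I := 0 to Lines.Count - 1 do
      Lines[I] := LineToUTF8(Lines[I]);
    Lines.SaveToFile('fixed.txt');      // placeholder output file name
  finally
    Lines.Free;
  end;
end.
```

If the target really has to be ISO-8859-1, as in the original question, the same loop could call UTF8ToISO_8859_1 from LConvEncoding on the lines that are valid UTF-8 instead, at the risk of losing any character that ISO-8859-1 cannot represent.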

 
