Author Topic: [solved] unparsable tab-delimited file?  (Read 2918 times)

squirreldancer

  • New member
  • *
  • Posts: 7
[solved] unparsable tab-delimited file?
« on: November 18, 2013, 08:23:56 pm »
I learned Pascal a long time ago and peaked as a hobbyist programmer with Delphi 5. I now need to create a program to merge two text files. One contains results from a testing application and the other is downloaded from our learning management system (LMS).
I want to read both files into TStringLists, merge them, and display them in a TStringGrid to confirm the merge has worked, then write the results to a tab-delimited file.
The problem is that the files from the LMS are tab-delimited, with quoted strings and Unix (LF) line endings, and I cannot figure out how to read them properly.
As a workaround, I tried to read in the file, convert it to CRLF line endings, and write it out to a temp file. This creates a file where every second line is gibberish. I have tried using TextFiles, LoadFromFile, TStreams and CSVDocument. I even downloaded the Delphi-based FileFixUp, which is supposed to correct line endings. It produces the same gibberish on every second line.
Can anyone suggest code to convert sample_tab_file to a CRLF file correctly? I attach it and the sample output from FileFixUp, which is the same as the output from all my unsuccessful attempts at creating the code.
« Last Edit: November 18, 2013, 09:06:07 pm by squirreldancer »

taazz

  • Hero Member
  • *****
  • Posts: 5365
Re: unparsable tab-delimited file?
« Reply #1 on: November 18, 2013, 08:49:07 pm »
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

  • If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
  • If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.
Your file starts with $FF $FE, which means it is Unicode text encoded as UTF-16 little-endian. TStringList is ANSI-only and does not support Unicode at this point, so you need to find a WideStringList or UnicodeStringList that can read and decode the file correctly. The JEDI Code Library seems to have an implementation that you can use.
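If you would rather not pull in another library, one alternative is to decode the file yourself and hand the result to a normal TStringList as UTF-8. The BOM would also explain the gibberish on every second line: in UTF-16LE a line feed is the two bytes $0A $00, so a byte-oriented tool that inserts a single $0D before each $0A presumably shifts everything after it by one byte, and the next insertion shifts it back into alignment. Below is a minimal, untested sketch of the decoding approach, not a definitive implementation. It assumes a little-endian machine (such as x86), that the file really starts with the $FF $FE BOM, and that no quoted field contains an embedded line break. The function name ReadUtf16LEAsUtf8 and both file names are hypothetical.

```pascal
program LoadUtf16Tab;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // Hypothetical file names, adjust to your own paths.
  InputName  = 'sample_tab_file.txt';
  OutputName = 'sample_tab_file_utf8.txt';

// Reads a UTF-16LE file that starts with a BOM and returns its text as UTF-8.
// Copying the raw bytes into a UnicodeString only works on a little-endian CPU.
function ReadUtf16LEAsUtf8(const FileName: string): string;
var
  Stream: TMemoryStream;
  Bom: array[0..1] of Byte;
  Wide: UnicodeString;
  CharCount: Integer;
begin
  Result := '';
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    if Stream.Size < 2 then
      raise Exception.Create('File is too short to contain a BOM');
    Stream.ReadBuffer(Bom, 2);
    if (Bom[0] <> $FF) or (Bom[1] <> $FE) then
      raise Exception.Create('No UTF-16LE BOM ($FF $FE) found');
    // Everything after the BOM is a sequence of 16-bit code units.
    CharCount := (Stream.Size - 2) div 2;
    SetLength(Wide, CharCount);
    if CharCount > 0 then
      Stream.ReadBuffer(Wide[1], CharCount * 2);
    Result := UTF8Encode(Wide);
  finally
    Stream.Free;
  end;
end;

var
  Lines, Fields: TStringList;
  i: Integer;
begin
  Lines := TStringList.Create;
  Fields := TStringList.Create;
  try
    // Assigning Text splits on LF, CR or CRLF, so Unix line endings are accepted.
    Lines.Text := ReadUtf16LEAsUtf8(InputName);

    // Split each line on tabs only, honouring double-quoted fields.
    Fields.Delimiter := #9;
    Fields.QuoteChar := '"';
    Fields.StrictDelimiter := True;
    for i := 0 to Lines.Count - 1 do
    begin
      Fields.DelimitedText := Lines[i];
      WriteLn(i, ': ', Fields.Count, ' fields');
    end;

    // SaveToFile uses the platform line ending (CRLF on Windows);
    // the output is UTF-8 without a BOM.
    Lines.SaveToFile(OutputName);
  finally
    Fields.Free;
    Lines.Free;
  end;
end.
```

Once the text is UTF-8 in an ordinary TStringList, it can be merged with the other file and shown in a TStringGrid in the usual way, provided the other file is read as UTF-8 (or converted to it) as well.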
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

squirreldancer

  • New member
  • *
  • Posts: 7
Re: unparsable tab-delimited file?
« Reply #2 on: November 18, 2013, 09:05:27 pm »
Wow! Thanks for the rapid reply. I was beginning to suspect something like this, and had started to scratch my head about trying widestrings and how to re-cast them to merge with the other file, if they worked.

Time to pull out the books again.

 
