Author Topic: Text file issues  (Read 6684 times)

Newmann

  • Jr. Member
  • **
  • Posts: 65
Text file issues
« on: November 11, 2015, 08:21:07 am »
Hi guys,
I have a really weird/unique problem - not sure if somebody has experienced something like this before - I'm not 100% sure if it's an encoding issue, or if I'm just missing something obvious - but my issue is this:

I need to analyze text files - basically activity log files that are queued for import to a DB - that sometimes generate import errors due to damaged files.
Each line starts with a two-letter header (and a comma) that indicates the information that will follow - based on that, the processor on our server knows what to do with the rest of the line...very straightforward.

Here's my problem - when I try to open the file and do a simple count of the different lines available in the file, I can't seem to get a match for those first two letters no matter what I try  :o

I've tried using AnsiToUTF8, SysToUtf8, and normal comparison with Pos, AnsiPos, and even Copy to match the extracted characters...it keeps on failing.

Using a normal TextFile, I use a while not EOF loop to read the file contents into a TStringList - I tried reading the lines in directly, converting them right before I use them (and also without converting them), and converting them before I add them to the TStringList.

I've exhausted all possible combinations of getting a match on those two header letters - but here is the real brainteaser:

Say I open the file with Notepad/Notepad++ and copy/paste the two letters from that line into my code - it works perfectly! But the moment I type the letters into my code, the match suddenly fails?!?
(hence the fact I'm thinking it's file encoding)

Also - I can't really ask our developers for assistance on this...my work at the moment is "unsanctioned", hehehe




Lazarus 1.0.14
FPC 2.6.0
Win7 Pro 64bit/Win XP Pro 32bit

Josh

  • Hero Member
  • *****
  • Posts: 1455
Re: Text file issues
« Reply #1 on: November 11, 2015, 08:46:59 am »
Hi,
Sounds like an encoding issue. I would open the file with a hex editor; you should see pretty quickly how the file is made up.

When analyzing unknown files that are not too large, I always read them as binary files and work from there. Even with large files I tend to use memory/file streams to work in blocks of data. It is very fast and flexible, as you have all the data present.
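
A minimal sketch of that binary approach, assuming the whole log file fits in memory. It loads the raw bytes with a TFileStream and checks for a byte order mark, which is one quick way to spot an encoding surprise. The names LoadBytes and DescribeBOM are made up for illustration, and nothing in this thread confirms that the log files actually carry a BOM.

```pascal
program DetectEncoding;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Reads the whole file into an AnsiString that is used purely as a byte buffer.
function LoadBytes(const FileName: string): AnsiString;
var
  fs: TFileStream;
  n: Integer;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    n := Integer(fs.Size);
    SetLength(Result, n);
    if n > 0 then
      fs.ReadBuffer(Result[1], n);
  finally
    fs.Free;
  end;
end;

// Reports which byte order mark (if any) the buffer starts with.
// Bytes are compared numerically, so no string conversion is involved.
function DescribeBOM(const Data: AnsiString): string;
begin
  Result := 'no BOM';
  if (Length(Data) >= 3) and (Ord(Data[1]) = $EF) and
     (Ord(Data[2]) = $BB) and (Ord(Data[3]) = $BF) then
    Result := 'UTF-8 BOM'
  else if (Length(Data) >= 2) and (Ord(Data[1]) = $FF) and (Ord(Data[2]) = $FE) then
    Result := 'UTF-16 LE BOM'
  else if (Length(Data) >= 2) and (Ord(Data[1]) = $FE) and (Ord(Data[2]) = $FF) then
    Result := 'UTF-16 BE BOM';
end;

begin
  if ParamCount < 1 then
    WriteLn('usage: DetectEncoding <logfile>')
  else
    WriteLn(DescribeBOM(LoadBytes(ParamStr(1))));
end.
```

If a BOM does show up, the first line's header will not begin at byte 1, which would make a naive comparison against the first two letters fail.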

Josh
The best way to get accurate information on the forum is to post something wrong and wait for corrections.

Newmann

Re: Text file issues
« Reply #2 on: November 11, 2015, 09:54:35 am »
I used HxD to look at the file - there doesn't seem to be anything funny about it...(see attachments)

Is there maybe some obscure function built into Lazarus somewhere that ensures the letters imported/checked/copied at runtime are converted to the same ones I use at design time?

Otherwise my only option will be to go through all the log files, and copy/paste them out one by one...which is simply not worth the time/effort... :o

rasberryrabbit

  • Full Member
  • ***
  • Posts: 151
Re: Text file issues
« Reply #3 on: November 11, 2015, 09:57:09 am »
If you are using 'FPC >= 2.7.1' with 'DefaultSystemCodePage = CP_UTF8',
the AnsiToUTF8 and UTF8ToAnsi functions don't work.

Code is long, Life is short, AI is not your enemy.

eny

  • Hero Member
  • *****
  • Posts: 1665
Re: Text file issues
« Reply #4 on: November 11, 2015, 11:24:01 am »
The problem is probably in the file itself; some corrupted line(s) somewhere.
Just search for the first line that does not start with MH; don't use NP++ but do that with your program.

BTW: TStringList has a LoadFromFile() method.
All posts based on: Win11; stable Lazarus 4_4  (x64) 2026-02-12 (unless specified otherwise...)

Newmann

Re: Text file issues
« Reply #5 on: November 11, 2015, 12:53:52 pm »
Quote from: eny
The problem is probably in the file itself; some corrupted line(s) somewhere.
Just search for the first line that does not start with MH; don't use NP++ but do that with your program.

BTW: TStringList has a LoadFromFile() method.

The MH was just an example - almost each line starts with a different combination (depending on the data contained within that line), and repeats at random.

Newmann

Re: Text file issues
« Reply #6 on: November 11, 2015, 12:57:34 pm »
Quote from: rasberryrabbit
If you are using 'FPC >= 2.7.1' with 'DefaultSystemCodePage = CP_UTF8',
the AnsiToUTF8 and UTF8ToAnsi functions don't work.

FPC 2.6.2

Where exactly is this "DefaultSystemCodePage" setting - I can't seem to find it   :-[

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Text file issues
« Reply #7 on: November 11, 2015, 03:18:02 pm »
Quote from: Newmann
I've tried using AnsiToUTF8, SysToUtf8, and normal comparison with Pos, AnsiPOS, and even Copy to match the extracted characters...keeps on failing.

Using a normal TextFile, I use a while not EOF loop to read the file contents to a TStringList - I tried both reading it in directly, and converting the lines right before I use them (also tried without converting the lines), and also converting the lines before I add it to the TStringList.

Hold on, I'm still reading here. Forget converting for a second: read a single line and hex dump it on screen. Compare the result with your hex editor; are they the same? Open the same file as an untyped file, read the same number of bytes into a buffer, and CompareMem the two; are they the same? Hex dump both on screen and look for differences. Try to define a sequence of bytes to check, and forget for a moment that they are strings; that kind of interpretation is getting in your way for the time being.

Quote from: Newmann
I've exhausted all possible combinations of getting a match on those two header letters - but here is the real brainteaser:

Really? In that case, build a sample text file and a small application that showcase the problem, and attach them to a message in this thread.

Quote from: Newmann
Say I open the file with Notapad/Notepad++, and I copy/paste the two letters from that line into my code - it works perfectly! But the moment I type the letters into my code, it suddenly doesn't exist?!?
(hence the fact I'm thinking its file encoding)

Try typing hex values instead of characters, e.g.
Code: Pascal
const
  ID1: string = #$4D#$48#$2C; // 'MH,'
and compare against ID1. Just to make things clear, hex dump your const/var data on screen next to whatever you have read from the file. You might be surprised at what you find. Just a hint: $4D, $48 and $2C are the same letters in ASCII, ANSI and UTF-8. So typing or pasting them, with or without conversion from/to UTF-8, should produce the same results, which means that you are not typing the correct letters; they only look the same on screen.
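
The hex-dump comparison described above could look roughly like this. It is only a sketch: HexDump is a made-up helper name, and the Line value is a hypothetical stand-in for a line that would really come from ReadLn on the log file.

```pascal
program HexDumpCompare;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Returns the bytes of S as space-separated two-digit hex values.
function HexDump(const S: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

const
  ID1: AnsiString = #$4D#$48#$2C; // 'MH,'

var
  Line, Header: AnsiString;
begin
  // Hypothetical sample; in practice this is a line read from the file.
  Line := 'MH,some data';
  Header := Copy(Line, 1, Length(ID1));

  // Show both byte sequences next to each other before comparing them.
  WriteLn('const: ', HexDump(ID1));
  WriteLn('file : ', HexDump(Header));
  if Header = ID1 then
    WriteLn('header matches')
  else
    WriteLn('header differs');
end.
```

If the two dumps differ, the differing bytes show directly which characters only look the same on screen.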
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

Newmann

Re: Text file issues
« Reply #8 on: November 17, 2015, 07:17:58 am »
I managed to get it working by the looks of it now - there was a stupid little logic error that messed me around whilst troubleshooting, but the winning code looks like this (lines 5 + 6):

Code: Pascal
  1. while not EOF(lFile) do                 //Scan through entire file
  2. begin
  3.   ReadLn(lFile, line);                  //Read the current line into a holding variable
  4.   lCount := lCount + 1;                 //Increase the counter by one
  5.   tempLine := line;
  6.   fBody.Add(SysToUtf8(tempLine));       //Adds the loaded line to the bulk list for processing
  7. end;//END WHILE

Inadvertently this seems to work exactly like the main processor (for which the file was intended), in that it fails to read past a line that is damaged/corrupted. Is there any other way I can detect this and force it to continue reading the rest of the file, or should I try to have a TStringList load it instead?
(I'm merely asking for suggestions)

What I figured is that I could cross-reference the size of the file against the number of lines the above code returns - I did the math on a whole bunch of logfile sizes vs the lines contained inside them - and I can safely assume 100 bytes per line for comparison - so divide the file size by 100, and compare the result with the number of lines counted.
If the number of counted lines is less than what's expected, it can use another mechanism to go through the file if necessary?

But I'm sure there is a better/more efficient way of doing this...?
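
A rough sketch of that size-versus-line-count check, using the 100 bytes per line estimate from above. The function names and the tolerance percentage are assumptions added for illustration, since real lines will not all be exactly 100 bytes long.

```pascal
program LineCountCheck;
{$mode objfpc}{$H+}

const
  AvgBytesPerLine = 100; // the per-line estimate from the post above

// Estimated number of lines for a file of the given size.
function ExpectedLines(FileSizeBytes: Int64): Int64;
begin
  Result := FileSizeBytes div AvgBytesPerLine;
end;

// True when clearly fewer lines were read than the file size suggests.
// TolerancePct is a hypothetical safety margin in percent.
function LooksTruncated(FileSizeBytes, LinesRead: Int64;
  TolerancePct: Integer): Boolean;
begin
  Result := LinesRead * 100 < ExpectedLines(FileSizeBytes) * (100 - TolerancePct);
end;

begin
  // Hypothetical numbers: a 50000-byte file should hold about 500 lines.
  WriteLn('expected lines: ', ExpectedLines(50000));
  WriteLn('120 lines read, truncated? ', LooksTruncated(50000, 120, 20));
  WriteLn('480 lines read, truncated? ', LooksTruncated(50000, 480, 20));
end.
```

This only flags a suspicious file; it does not say where reading stopped, so a byte-level pass is still needed to find the damaged line.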

eny

Re: Text file issues
« Reply #9 on: November 17, 2015, 10:08:52 am »

Bart

  • Hero Member
  • *****
  • Posts: 5721
    • Bart en Mariska's Webstek
Re: Text file issues
« Reply #10 on: November 17, 2015, 03:42:40 pm »
Quote from: Newmann
...in that it fails to read past a line that is damaged/corrupted.

Corrupted in what sense?
If ReadLn(ATextFile, AString) does not read the entire line (ended by a line ending or EOF), you need to examine the file, e.g. with a hex editor.
E.g. the line in question may have null characters in it (not very likely though).
As long as we do not know what a "corrupted line" is, we cannot help in solving the problem.

Can you attach a logfile that has a "corrupted" string?

Otherwise it's just crystal ball time.

Bart

jack616

  • Sr. Member
  • ****
  • Posts: 268
Re: Text file issues
« Reply #11 on: November 17, 2015, 05:15:46 pm »
If corrupt files are an issue, read them as binary files and deal with corruptions in code.
If it's just the one file, go to the point where it stops reading and fix it - I'd suggest PSPad as
a decent editor for this.
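
Dealing with corruption in code might start from something like the sketch below. It splits a raw byte buffer into lines on CR, LF or CRLF only, so no other byte can end the scan early. SplitRawLines is a made-up name, and the sample buffer is hypothetical; in practice the bytes would come from a TFileStream.

```pascal
program BinaryLines;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Splits a raw byte buffer into lines on CR, LF or CRLF.
// Bytes such as NUL or Ctrl-Z stay inside the line they occur in.
// The caller owns the returned list and must free it.
function SplitRawLines(const Data: AnsiString): TStringList;
var
  i, start: Integer;
begin
  Result := TStringList.Create;
  start := 1;
  i := 1;
  while i <= Length(Data) do
  begin
    if (Data[i] = #13) or (Data[i] = #10) then
    begin
      Result.Add(Copy(Data, start, i - start));
      // Treat CR immediately followed by LF as a single line ending.
      if (Data[i] = #13) and (i < Length(Data)) and (Data[i + 1] = #10) then
        Inc(i);
      start := i + 1;
    end;
    Inc(i);
  end;
  // Keep a final line that has no line ending.
  if start <= Length(Data) then
    Result.Add(Copy(Data, start, Length(Data) - start + 1));
end;

var
  Lines: TStringList;
begin
  // Hypothetical sample: the second line contains a Ctrl-Z byte (#26).
  Lines := SplitRawLines('MH,1'#13#10'XX'#26'bad'#10'TR,2');
  try
    WriteLn('lines found: ', Lines.Count);
  finally
    Lines.Free;
  end;
end.
```

Each recovered line can then be checked against the expected two-letter headers, and lines that fail the check can be logged with their index instead of stopping the import.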


Windsurfer

  • Sr. Member
  • ****
  • Posts: 368
    • Windsurfer
Re: Text file issues
« Reply #12 on: November 17, 2015, 05:30:38 pm »
Check the end-of-line characters. They may be inconsistent due to bugs in the code that originated the files. The same goes for any characters like commas and quotes that mark the end of fields.

Also check for the presence of non-printing characters like Ctrl-Z, which used to mark end of file in the Windows environment. I used to insert that after a short header so that non-programmers could not see the contents.
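
Both checks can be sketched as a single pass over the raw bytes that counts each line-ending style plus NUL and Ctrl-Z bytes. TScanResult and ScanBytes are names made up for this example, and the sample buffer is hypothetical.

```pascal
program ScanControlChars;
{$mode objfpc}{$H+}

type
  // Counters for line-ending styles and suspicious control bytes.
  TScanResult = record
    CRLF, LoneCR, LoneLF, Nulls, CtrlZ: Integer;
  end;

function ScanBytes(const Data: AnsiString): TScanResult;
var
  i: Integer;
begin
  FillChar(Result, SizeOf(Result), 0);
  i := 1;
  while i <= Length(Data) do
  begin
    case Data[i] of
      #0:  Inc(Result.Nulls);
      #26: Inc(Result.CtrlZ);
      #10: Inc(Result.LoneLF);
      #13:
        begin
          if (i < Length(Data)) and (Data[i + 1] = #10) then
          begin
            Inc(Result.CRLF);
            Inc(i); // skip the LF that belongs to this CRLF
          end
          else
            Inc(Result.LoneCR);
        end;
    end;
    Inc(i);
  end;
end;

var
  R: TScanResult;
begin
  // Hypothetical sample with mixed line endings, a NUL and a Ctrl-Z.
  R := ScanBytes('MH,1'#13#10'TR,2'#10'XX'#0#26#13);
  WriteLn('CRLF=', R.CRLF, ' CR=', R.LoneCR, ' LF=', R.LoneLF,
          ' NUL=', R.Nulls, ' Ctrl-Z=', R.CtrlZ);
end.
```

A file whose line endings are consistent should show a non-zero count in only one of the three line-ending fields, and zero for NUL and Ctrl-Z.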

rasberryrabbit

Re: Text file issues
« Reply #13 on: November 18, 2015, 12:32:57 am »
Quote from: Newmann
Quote from: rasberryrabbit
If you are using 'FPC >= 2.7.1' with 'DefaultSystemCodePage = CP_UTF8',
the AnsiToUTF8 and UTF8ToAnsi functions don't work.

FPC 2.6.2

Where exactly is this "DefaultSystemCodePage" setting - I can't seem to find it   :-[

There is no such option in 2.6  :-[

 
