Recent

Author Topic: Stringgrid vs. LoadFromCSVFile  (Read 49927 times)

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #30 on: August 28, 2012, 04:41:16 pm »
I adjusted the parsing for LoadfromCsvStream in r37598.

  • leading and trailing spaces are always part of the field
  • the exception is that trailing spaces after the end-quote of a quoted field are trimmed
  • also Tab characters are allowed as whitespace
As already indicated, the second item is not compatible with RFC4180 2.4

As FPC doesn't seem to progress with the delimitedtext patches (and as mentioned, stringlist.delimitedtext is not meant for CSV import but sdf import), it may make sense to use http://wiki.lazarus.freepascal.org/CsvDocument...

I'm working on a proof of concept for this (as I needed csvdocument anyway); perhaps additional advantages could be:
  • Excel import/export functionality: boolean switch to enable this instead of RFC4180
  • Same for ignoring outer whitespace (e.g. around quotes)=>existing code can be used
  • Support for multiline import/export. See below
  • UTF8 and local Windows codepage support=>UTF8 support for import. AFAIU, this is probably unlikely to emerge soon from FreePascal - and we're bound to 2.6 for current Laz releases anyway
  • Possible SaveToCSV/LoadFromCSV functionality in other grids?

Multiline: a form, a stringgrid and a button; this:
Code: [Select]
procedure TForm1.Button1Click(Sender: TObject);
begin
  StringGrid1.Cells[1,1]:='a'+LineEnding+'b';
  ShowMessage('Found lf at position: '+inttostr(Pos(#10,StringGrid1.Cells[1,1])));
end; 
... which indicates line ending is stored in the stringgrid though not displayed.

As said, it's proof of concept, compiles but untested, comments/suggestions welcome.
Patch attached together with csvdocument (needs to be put in $(lazarusdir)\lcl)
« Last Edit: August 28, 2012, 08:11:45 pm by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

Bart

  • Hero Member
  • *****
  • Posts: 5573
    • Bart en Mariska's Webstek
Re: Stringgrid vs. LoadFromCSVFile
« Reply #31 on: August 28, 2012, 11:34:36 pm »
As already indicated, the second item is not compatible with RFC4180 2.4

I wrote it this way, because it is compatible with the way excel reads csv files.
(My original changes were based upon an error reading a csv file that excel handled OK)
Csv files created with TStringGrid will never have these wrong fields, so it's only for "external comaptibility".

Preferrably however, we would have only one code-base to read/interpret (a single line of a) csv file.

Bart

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #32 on: August 29, 2012, 09:23:03 am »
Goedemorgen Bart,

I understand you wrote it to be Excel-compatible - and that it is for external compatibility. Writing it only to accommodate Excel would seem to be a bit limited though as e.g. LibreOffice/OpenOffice calc can deliver up multiline fields which Excel won't.

CSVDocument seems to offer good RFC4180 standards support and also seems to offer Excel compatibility mode. I therefore think using code like this (preferably in FPC, but that will take a while) for all csv read/write (as you say) would be a good thing.
There's e.g. also the fpc/Lazarus db export code that could use csvdocument ;)
« Last Edit: August 29, 2012, 09:34:17 am by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Stringgrid vs. LoadFromCSVFile
« Reply #33 on: August 29, 2012, 10:10:04 am »
as far as I can tell trimming spaces after the end quote of a field is something that RFC does not predict or talk about. It does say that the fields are either quoted or not and that spaces are part of the field data and are not to be ignored.
But if the field is quoted then its value has been marked from the quotes. What happens to data outside the quotes? are they suppose to be part of the field or not if they are why are they not inside the quotes?

I couldn't find any reference to this issue in the RFC so trimming them is an accepted behavior that does not invalidate conformance with the RFC.

As I see it if it is outside the quotes of a quoted field they are white space not data.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #34 on: August 29, 2012, 10:28:48 am »
@taazz: no, I'm sorry, that's not what the RFC4180 text:
Quote
Spaces are considered part of a field and should not be ignored.
or BNF representation says. See also ludob's earlier post.

See also the wikipedia article on this:
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules_and_examples
examples, specifically this:
Quote
In some CSV implementations, leading and trailing spaces and tabs are trimmed. This practice is controversial, and does not accord with RFC 4180
and this:
Quote
In CSV implementations that do trim leading or trailing spaces, fields with such spaces as meaningful data must be quoted.

Having said this, your view, is a view many other products implement, including Excel, if I understand Bart correctly. For this reason, as I said, I propose to add support for both RFC 4180 and Excel formats.
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Stringgrid vs. LoadFromCSVFile
« Reply #35 on: August 29, 2012, 10:50:17 am »
true that is what the RFC says but you are missing the point that it says nothing about  characters outside the quotes of a quoted field. EG

1234,   John, Edison   , test
is the same as
1235,"  John","Edison   ","test"

but is it the same with

1235,"  John"  ,"Edison   "  ,"test"

I say yes it is since the quotes specify the data area the spaces after the end-quote are a mistake and should not be there so trim them you say no include them in the value does it seem that they are part of the value to you?

EDIT:
For the last line there is nothing in the RFC that can help you decide one way or an other it simple ignores the issue altogether.

Quote
RFC 4180       Common Format and MIME Type for CSV Files    October 2005


   Interoperability considerations:

      Due to lack of a single specification, there are considerable
      differences among implementations.  Implementors should "be
      conservative in what you do, be liberal in what you accept from
      others
" (RFC 793 [8]) when processing CSV files.  An attempt at a
      common definition can be found in Section 2.

in other words import everything you can export strictly on RFC standards
« Last Edit: August 29, 2012, 11:01:17 am by taazz »
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

Bart

  • Hero Member
  • *****
  • Posts: 5573
    • Bart en Mariska's Webstek
Re: Stringgrid vs. LoadFromCSVFile
« Reply #36 on: August 29, 2012, 11:00:04 am »
Ook goedemorgen BigChimp,

I understand you wrote it to be Excel-compatible - and that it is for external compatibility. Writing it only to accommodate Excel would seem to be a bit limited though as e.g. LibreOffice/OpenOffice calc can deliver up multiline fields which Excel won't.

Multline fields are not yet supported in a StringGrid, so I did not implement this.
If we woud consider doing that, then we cannot use TStringlist internally anymore, and have to design a completey different algorithm.

Again IMO that calls for a single CSV handling routing (as in: 1 code in 1 place) and not have it all scattered around the place).
This code then should have as little depenencies as possible.

Bart

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #37 on: August 29, 2012, 11:07:42 am »
It indeed says nothing about space outside the quotes of "a field with quotes". Therefore spaces follow the spec in 2.4 "Spaces are considered part of a field and should not be ignored."
If you don't agree, please tell me where in RFC4180 it says spaces around quotes are handled differently than spaces around other characters.

It's clearer in the ABNF section:
Code: [Select]
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTEif there's no " around the field (so NO spaces), it's not a quoted field.

And repeating myself again: I understand why you want to see that last example as a mistake and that the space should be stripped. That is what many programs would do and I'm not against it, but just not when implementing RFC4180 compatible import/export.


Code: [Select]
1235,"  John"  ,"Edison   "  ,"test"would mean there's fields (* denote begin/end)
*1235*
*  John*
*"Edison   " *  => breaks 2.5 "If fields are not enclosed with double quotes, then double quotes may not appear inside the fields."=>invalid field
*test*
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Stringgrid vs. LoadFromCSVFile
« Reply #38 on: August 29, 2012, 11:14:23 am »
Quote
but is it the same with

1235,"  John"  ,"Edison   "  ,"test"
According the ABNF grammar the second and third are not valid fields. So we are talking about recovering from input errors instead of raising an exception. The fall-back can be to ignore the spaces or to include them. Since space is just a character as any other, ask yourself how you would threat
Code: [Select]
1235,"John "E,"Edison "J,"test" I'm afraid this can only be satisfactory solved with a parameter that says to ignore chars outside quotes or include them as if they where inside the quotes. This is the same situation as for recovering from other errors like
Code: [Select]
1235,"John ","He said "Hi"","test"or
Code: [Select]
1235,John E,He said "Hi",test
Important is that the RFC says nothing about how to deal with errors. So it is all personal preferences or looking at how others do it. An option parameter that uses a set of error recovery settings would be a solution.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Stringgrid vs. LoadFromCSVFile
« Reply #39 on: August 29, 2012, 11:17:47 am »
1234,   ...
is the same as
1235,  ....

Is it?,  I know, only kidding it's just a typo.   :D

I'm with you though on spaces outside of quoted fields, they should be ignored.  Somebody might use spaces outside for keeping there data looking tabulated, but still been able to CSV import with the correct spacing.

Quote
It's clearer in the ABNF section:

In that ABNF section, where does it say that AFTER an escaped field, it has to be immediately followed by the next field with no spaces or EOL?   all its telling you is what an escaped field looks like, not what's outside the escaped field.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #40 on: August 29, 2012, 11:19:59 am »
Ook goedemorgen BigChimp,

Multline fields are not yet supported in a StringGrid, so I did not implement this.
Well, they apparently are. Stringgrids retain the line ending in the cell's string. They just don't display it. See my example posted earlier - adjusted to also show cr is stored on windows:
(FPC 2.6.x, Laz trunk)

Multiline: a form, a stringgrid and a button; this:
Code: [Select]
procedure TForm1.Button1Click(Sender: TObject);
begin
  StringGrid1.Cells[1,1]:='a'+LineEnding+'b';
  ShowMessage('Found cr at position: '+inttostr(Pos(#13,StringGrid1.Cells[1,1])));
  ShowMessage('Found lf at position: '+inttostr(Pos(#10,StringGrid1.Cells[1,1])));
end; 
... which indicates line ending is stored in the stringgrid even though it is not currently displayed.

Furthermore, think a bit further. If multiline were not supported, don't you think people will try to read multiline csv files anyway and complain? Far better to support multiline import, than strip the line ending if necessary. However, that doesn't even seem necessary, see above.

If we woud consider doing that, then we cannot use TStringlist internally anymore, and have to design a completey different algorithm.
Doing what? Multiline support? Seems that already works.

Again IMO that calls for a single CSV handling routing (as in: 1 code in 1 place) and not have it all scattered around the place).
This code then should have as little depenencies as possible.
1. Have you read what I proposed? How is that "scattered around the place"?
1.1 Use csvdocument for all csv handling in Lazarus now
1.2 Try to get csvdocument in FPC and use that when it is stable (e.g. for dataset export to csv, future perhaps import from csv)
Surely better than the current situation (custom CSV code in the actual procedure)?
2. Have you looked at csvdocument and its dependencies?

Finally, if you don't want to use csvdocument... then you'd just be trying to rewrite it to deal with all exceptions/weird formats (RFC4180, Excel) etc. I don't see any reason to do that.

As previously discussed, FPC sdf format may be shoehorned into supporting csv format. If that is the case, I'd suggest that csvdocument be in FPC and we can rewrite that, to make it seamless to the user.
« Last Edit: August 29, 2012, 11:21:37 am by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Stringgrid vs. LoadFromCSVFile
« Reply #41 on: August 29, 2012, 11:23:47 am »
Quote
An option parameter that uses a set of error recovery settings would be a solution.

I would say that seems the most logical solution.  Strictly abiding by RFC rules that are open to interpretation, not only from us from people sending us data just cannot be totally done automatically.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #42 on: August 29, 2012, 11:29:06 am »
Quote
It's clearer in the ABNF section:

In that ABNF section, where does it say that AFTER an escaped field, it has to be immediately followed by the next field with no spaces or EOL?   all its telling you is what an escaped field looks like, not what's outside the escaped field.
Look up how "escaped" is used:
Code: [Select]
field = (escaped / non-escaped)So either escaped or non-escaped, nothing else.
Then how fields are used:
Code: [Select]
record = field *(COMMA field)So only fields, and one or more comma's followed by a field. Nothing else.


I know you guys don't want to accept it; and I'd rather have RFC4180 was a bit more lenient, but it isn't.
So, rather than go on about the fact that RFC4180 should accept "quoted fields with spaces around them", we could perhaps more profitably say that loadfromcsvfile should support RFC4180 as well as "quoted fields with spaces around them". See
http://wiki.lazarus.freepascal.org/CsvDocument
IgnoreOuterWhitespace

If anybody still wants to argue that RFC4180 supports quoted fields surrounded by spaces please provide a reasoned argument with actual detailed references to the standard....
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Stringgrid vs. LoadFromCSVFile
« Reply #43 on: August 29, 2012, 11:32:29 am »
Important is that the RFC says nothing about how to deal with errors. So it is all personal preferences or looking at how others do it. An option parameter that uses a set of error recovery settings would be a solution.
I also very much agree with this.
For the stringgrid, given the lack of standards for CSV files, it might even be feasible to set this option by default to the most permissive possible as the user will see any errors in the stringgrid anyway.
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Stringgrid vs. LoadFromCSVFile
« Reply #44 on: August 29, 2012, 12:03:10 pm »
Code: [Select]
record = field *(COMMA field)

Yes, this second ABNF clears things up.

Even though CSV is not a standard as such, as default being strict by this RFC might make sense.  Even if "xyx" , = >xyz < seems illogical.  And then other scenarios could be controlled with options.

 

TinyPortal © 2005-2018