
Topic: strings and modes uncertainty

jack616

  • Sr. Member
  • ****
  • Posts: 268
strings and modes uncertainty
« on: June 20, 2016, 04:26:16 pm »
As some may know, I'm in the process of updating an old Delphi console program.
Unfortunately my experience in the last several years has been more specification
than coalface coding, so I find myself a little bewildered by the range of issues
thrown up by "strings".

I could start throwing out one question or another (I've scanned the wiki and tried a few things etc)
but I thought maybe it would be more productive to ask this first:

If you were writing a small (sub 1MB) console app that made a lot of use of
character and string manipulation and command line I/O - and would like it
to be reasonably internationalised (no need to go overboard I don't think)

... how would you configure it from scratch and is there anything you would avoid?

I have currently removed the {$mode delphi} and have set "use ansistrings".
This throws up "illegal conversion short string to pchar" type errors.
These I can understand: a shortstring uses a length byte (up to 255 characters), a pchar does not, etc.
The current code makes a lot of use of pchars and (Z)strings.

Where I fall down I think is knowledge of using codepages and when to use UTF8 versions etc.
The docs seem to suggest it's all massively complicated.

Is there a simple route or guideline I can use to maintain character manipulation,
or am I just seeing ghosts with all the ifs and buts in the docs?

Basically I need a whole slew of character and string manipulation in the program
and to be able to accept and output internationalised text on the command line.
Does that sound reasonable?

If so does anyone have any advice?

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: strings and modes uncertainty
« Reply #1 on: June 20, 2016, 04:49:25 pm »
Quote
If you were writing a small (sub 1MB) console app that made a lot of use of
character and string manipulation and command line I/O - and would like it
to be reasonably internationalised (no need to go overboard I don't think)

... how would you configure it from scratch and is there anything you would avoid?

It should work out of the box with FPC >= 3.0 and using plain string (not Utf8String or the like) types.
If you need to use characters that are outside the current user's codepage (e.g. Asian on Western European Windows), using either UTF8 or UTF16 as the default encoding can be considered. However, reading from / writing to the console may then be a problem.

If your program is targeted at Western Europe (where we all share the same codepage), you should not have any trouble at all when using just string as the type.

Bart
« Last Edit: June 20, 2016, 05:57:32 pm by Bart »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: strings and modes uncertainty
« Reply #2 on: June 20, 2016, 05:02:53 pm »
Just use the Unicode solution we have:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus
and things just (mostly) work.
Even writing to console on Windows works better than before:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Writing_to_console

I feel Bart confuses things a little. UTF16 cannot be used as the default encoding without major effort because FCL does not support it yet.
Also, you don't need to worry about codepages when using Unicode.
The only exception is when your input data is encoded using those codepages. Then it needs an explicit conversion.
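A minimal sketch of such an explicit conversion, assuming FPC 3.0 or later and input bytes encoded in Windows-1252. The sample data and the variable name Raw are illustrative only, and the cwstring unit is assumed as the widestring manager on Unix-like systems:

```pascal
program CpToUtf8Demo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager, needed for conversions on Unix
  SysUtils;
var
  Raw: RawByteString;
begin
  // Pretend these four bytes came from a file or pipe encoded in Windows-1252.
  // #$E9 is "e with acute accent" in that codepage.
  SetLength(Raw, 4);
  Raw[1] := 'c';
  Raw[2] := 'a';
  Raw[3] := 'f';
  Raw[4] := #$E9;

  // Step 1: only label the bytes as Windows-1252 (Convert = False, bytes untouched).
  SetCodePage(Raw, 1252, False);

  // Step 2: convert the data to UTF-8 (Convert = True rewrites the bytes).
  SetCodePage(Raw, CP_UTF8, True);

  // The accented character now occupies two bytes instead of one.
  WriteLn('Code page: ', StringCodePage(Raw));
  WriteLn('Byte length after conversion: ', Length(Raw));
end.
```

The two-step call matters: tagging the data with its real codepage first tells the RTL what it is converting from.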
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bart

Re: strings and modes uncertainty
« Reply #3 on: June 20, 2016, 06:01:07 pm »
Quote
I feel Bart confuses things a little.

I get confused easily these days, especially with Unicode (but I'm not alone there, it's just a bit of a scary and confusing topic).

Quote
UTF16 cannot be used as the default encoding without major effort because FCL does not support it yet.

I meant using UnicodeString (or WideString) explicitly to hold and process your string data.
(Which leaves the conversion to and from console to be considered.)
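A minimal sketch of that approach, assuming FPC 3.0 or later: the text is held in a UnicodeString (UTF-16) for processing and converted from and to UTF-8 bytes at the program boundary with UTF8Decode and UTF8Encode. The sample bytes and variable names are illustrative, and what the console actually displays still depends on its codepage:

```pascal
program UnicodeStringDemo;
{$mode objfpc}{$H+}
var
  U8: UTF8String;
  Wide: UnicodeString;
begin
  // UTF-8 input, built byte by byte: U+00E9 (e with acute accent) is C3 A9.
  SetLength(U8, 2);
  U8[1] := #$C3;
  U8[2] := #$A9;

  // UTF-8 bytes -> UTF-16 code units for in-program processing.
  Wide := UTF8Decode(U8);
  WriteLn('UTF-8 bytes:       ', Length(U8));
  WriteLn('UTF-16 code units: ', Length(Wide));

  // Back to UTF-8 for output or storage.
  U8 := UTF8Encode(Wide);
  WriteLn('Round-trip bytes:  ', Length(U8));
end.
```

Length counts bytes for the UTF8String and 16-bit code units for the UnicodeString, so neither is a count of user-visible characters in general.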

Bart

jack616

Re: strings and modes uncertainty
« Reply #4 on: June 26, 2016, 10:19:06 pm »
Hi
Thanks for the replies. Sorry it's taken me so long to reply, but I've not been
able to work this week (due to health issues).

Anyway, as Bart says, it is a complex and confusing topic.
So what I've done is knocked out something that's "just about" an application
(it's very small!) that lets you see what I'm working with.

If anyone would like to download it and let me have any feedback on what
you think I should watch out for with regard to all this I think that may help a lot.

I've put up a quick web page on www.pixiesoft.co.uk - if you want
to comment on the program itself  I've put a forum up for that to keep
those away from here.

One thing I have noticed is that SpamAssassin logs contain the phrase
"unicode aware".
Does anyone know if that's a widely used term with a specific meaning
or is it just something they put out themselves? (And what does it mean?)

Bart - I like your idea of using widestrings, so I did some reading and discovered
a discrepancy in the docs about conversions between ansistrings and widestrings.
(A sentence states that no data can be lost and then contradicts this at the end of
the same sentence. Given my problems last week I haven't been able to follow
that up yet.)

I'm thinking I need to develop a standard way of handling all strings somehow
ideally with access to a pchar or equivalent.  (pchar2 ?)
 
« Last Edit: June 26, 2016, 11:06:21 pm by jack616 »

jack616

Re: strings and modes uncertainty
« Reply #5 on: July 02, 2016, 12:36:30 pm »
If you are one of the downloaders of the proglet I would appreciate
any feedback you may wish to make relevant to this thread.
Did you have any character or I/O issues? Are you using any non-Western software?
Is there anything else you wish to say?

Thanks
jack

JuhaManninen

Re: strings and modes uncertainty
« Reply #6 on: July 02, 2016, 02:04:22 pm »
Quote
One thing I have noticed is that SpamAssassin logs contain the phrase
"unicode aware"
Does anyone know if that's a widely used term with a specific meaning
or is it just something they put out themselves? (and what does it mean?)

A "unicode aware" program understands that its string data is Unicode-encoded and treats it accordingly.

Quote
Bart - I like your idea of using widestrings so I did some reading and discovered
a discrepancy in the docs about conversions between ansi-strings and widestrings
(A sentence that states no data can be lost that contradicts this at the end of
the same sentence ...

Please be more specific. Which document? You can also fix wiki documents yourself.

Quote
I'm thinking I need to develop a standard way of handling all strings somehow
ideally with access to a pchar or equivalent.  (pchar2 ?)

Lazarus already offers a standard way of handling all strings. The PChar type can also be used, no problem.
Why would you need to develop your own "standard" way?
Your download link has no source code. Please copy your problematic source and we can find solutions.
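As a rough sketch of using PChar alongside managed strings, including the "illegal conversion short string to pchar" case from the first post, and assuming {$H+} so that plain string is an AnsiString (the variable names are illustrative only):

```pascal
program PCharDemo;
{$mode objfpc}{$H+}   // {$H+} makes "string" an AnsiString
uses
  SysUtils;           // StrLen
var
  SS: ShortString;
  S: string;
  P: PChar;
begin
  SS := 'hello';

  // PChar(SS) is the cast the compiler rejects: a ShortString starts with a
  // length byte and has no guaranteed terminating #0.
  // Copying it into an AnsiString first gives #0-terminated data.
  S := SS;
  P := PChar(S);      // valid while S stays alive and is not modified

  WriteLn('Length via PChar: ', StrLen(P));
end.
```

The pointer refers to the AnsiString's own buffer, so it should not be kept around after S is changed or goes out of scope.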

See also my encoding agnostic functions for codepoints:
 http://forum.lazarus.freepascal.org/index.php/topic,33064.0.html
They allow you to write code that is fully source compatible between
  • UTF-8 solution provided by Lazarus
  • Delphi with UTF-16
  • Future UTF-16 solution by FPC/Lazarus

 
