
Author Topic: Detect string encoding  (Read 4318 times)

McClane

  • New Member
  • *
  • Posts: 44
Detect string encoding
« on: May 06, 2019, 07:04:28 pm »
Hi, I don't know how to detect the encoding of a string.

I'm making an IRC client, and if someone writes in Spanish and sends a string with á or ¿, I need to know the encoding
to convert it to UTF-8.

Thanks in advance.

Imants

  • Full Member
  • ***
  • Posts: 198
Re: Detect string encoding
« Reply #1 on: May 07, 2019, 02:27:52 pm »
I do not think it is possible to detect a string's encoding, only to guess it.

If you are using Lazarus for the client, then anything the user types into a control should be in UTF-8. If I remember correctly, all string variables in Lazarus should be UTF-8 by default.

But if the user input is not in UTF-8 and uses a local code page instead, then I think your only option is to find the local code page on the client side and convert the data to UTF-8 before sending it, so that the server only receives UTF-8.

wp

  • Hero Member
  • *****
  • Posts: 13353
Re: Detect string encoding
« Reply #2 on: May 07, 2019, 02:49:17 pm »
Try the ChsDet (CharacterSet Detector) available through Online-Package-Manager.

See also https://forum.lazarus.freepascal.org/index.php/topic,34695.msg228321.html#msg228321 and http://chsdet.sourceforge.net/

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
Re: Detect string encoding
« Reply #3 on: May 07, 2019, 03:42:39 pm »
Quote from: wp
Try the ChsDet (CharacterSet Detector) available through Online-Package-Manager.

That only works with enough entropy. On four letters, you might as well throw a die.

A BOM was designed precisely to prevent that, but programmers are stubborn and forget these "little details" all the time. They are born like that.
« Last Edit: May 07, 2019, 03:52:10 pm by Thaddy »

lucamar

  • Hero Member
  • *****
  • Posts: 4217
Re: Detect string encoding
« Reply #4 on: May 07, 2019, 05:53:14 pm »
Quote from: Thaddy
A BOM encoding was designed just to prevent that... But programmers are stubborn and forget these "little details" all the time. They are born like that....

Having a BOM wouldn't help the OP at all: the sole purpose of the BOM is to specify whether a Unicode file is LE or BE encoded and, indirectly, to let one know that a file is Unicode text. Other than that ...
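To make the point concrete, here is a minimal sketch of a BOM check. It is written in Python rather than Lazarus code, and the function name `sniff_bom` is made up for illustration. It only recognizes the Unicode encodings that define a BOM, so a line in a legacy 8-bit code page simply yields `None`.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(raw: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

A message typed in Windows-1252 carries no BOM, so `sniff_bom` returns `None` for it and says nothing about which 8-bit code page was used.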

The OP's question is unsolvable: an array of characters may represent a string in almost any conceivable encoding, and there is no simple way to "guess" which it might be. The only exceptions are the (relative) ease of guessing whether a file might be UTF-8 and the even easier guess of a pure ASCII file.

Beyond that, how do you differentiate between Windows-1252, IBM-850 and IBM-437 encoded texts? You can't ... unless you have a nice expert system that can guess which language the text is in and, by testing, the most probable encoding.
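A small sketch of why the single-byte code pages cannot be told apart from the bytes alone. It is in Python, not Lazarus code, and the helper name `decode_all` is an assumption for illustration. The same bytes are decoded with each candidate code page, and every decode succeeds, each giving a different string.

```python
def decode_all(raw: bytes, encodings=("cp1252", "cp850", "cp437")):
    """Decode the same bytes with each candidate code page.

    Returns a dict mapping encoding name to the decoded text,
    or to None if the bytes are not decodable in that code page.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # "á" as a Windows-1252 client would send it: the single byte 0xE1.
    for name, text in decode_all("á".encode("cp1252")).items():
        print(name, repr(text))
```

No decode raises an error here, so nothing at the byte level tells you which result is the intended one. Only knowledge of the language, or of the sender's settings, can do that.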

ASerge

  • Hero Member
  • *****
  • Posts: 2475
Re: Detect string encoding
« Reply #5 on: May 07, 2019, 06:21:02 pm »
In general, the IRC server knows nothing about encodings and works with a stream of bytes. Only clients can negotiate the encoding used. Many IRC clients offer an explicit choice of encoding. Some of them prefer UTF-8 by default, and use/recognize a BOM for Unicode/UTF-8 encoding.
It is clear that using Unicode or UTF-8 for languages that are very different from English significantly increases the size of the message. Therefore, users in a given country often prefer their local single-byte character set (if the alphabet allows).

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1572
    • Lebeau Software
Re: Detect string encoding
« Reply #6 on: May 07, 2019, 09:10:58 pm »
Quote from: McClane
I'm making an IRC client and if someone writes in spanish and send a string with á or ¿ i need to know the encoding
to convert it to utf8.

Check whether the string is already valid UTF-8 (easy enough to do manually). If so, send it as-is; otherwise, use the user's current locale to convert the string to UTF-8, then send it.
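As a rough illustration of that approach, here is a sketch in Python rather than Lazarus code. The function name `to_utf8` and the `cp1252` default are assumptions. In a real client the fallback would be whatever code page the user's locale or settings indicate.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it from fallback.

    `fallback` is a guess (assumed here to be Windows-1252); bytes that are
    undefined in that code page are replaced with U+FFFD instead of raising.
    """
    try:
        raw.decode("utf-8")  # strict validation only; the result is discarded
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace").encode("utf-8")
```

Pure ASCII is also valid UTF-8, so it passes through untouched. The remaining weak spot is the one already discussed in this thread: if the text is not UTF-8, the fallback code page is still only a guess.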

 
