Re: [eclipse-clp-users] UTF8 support for String

From: Joachim Schimpf <jschimpf_at_coninfer.com>
Date: Sun, 05 Jul 2015 16:40:16 +0100

Yes, unfortunately Unicode support is still somewhat limited in ECLiPSe.

Non-ISO-8859-1 characters can be used in quoted tokens like "タ" or 'タ',
and can be read and written.  However, as you have observed, the generic
atom/string predicates still always consider them as byte sequences,
assuming a fixed encoding and not recognising multi-byte characters.

When you are working at the character level, the best solution currently
is to do all your computation with lists of character codes (integers).
As I suggested in my other mail, this seems to fit your application
better anyway.  You should then

  - get your input strings
  - convert the strings to lists (using string_list/3)
  - do all computation with lists
  - convert the result lists to strings (using string_list/3 in reverse)
  - return the result strings

which means you don't need any new predicates, and the encoding/decoding
problem is limited to the input/output phases.
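
The steps above can be sketched as follows.  This is only an illustration,
not a definitive implementation: the predicate names utf8_length/2 and
utf8_prefix/3 are hypothetical, and the sketch assumes that string_list/3
accepts the utf8 format in both directions, as in the example quoted below.

```prolog
:- lib(lists).      % for append/3

% utf8_length(+String, -Length)   (hypothetical name)
% Length is the number of characters in String, not the number of bytes.
utf8_length(String, Length) :-
    string_list(String, Codes, utf8),        % decode on input
    length(Codes, Length).

% utf8_prefix(+String, +N, -Prefix)   (hypothetical name)
% Prefix consists of the first N characters of String.
utf8_prefix(String, N, Prefix) :-
    string_list(String, Codes, utf8),        % decode once, on input
    length(PrefixCodes, N),                  % list of N fresh variables
    append(PrefixCodes, _, Codes),           % all computation on code lists
    string_list(Prefix, PrefixCodes, utf8).  % encode once, on output
```

With these definitions, utf8_length("ABCターデ", L) counts characters, so
it is the list-based equivalent of the 6 reported in the question.  Only
the first and last goal of each clause deal with the encoding.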

In the longer term, of course we want to improve Unicode support and
your input is welcome.

Cheers,
Joachim



On 03/07/15 04:44, Edgaonkar, Shrirang wrote:
> Dear CLP users,
>
>     The following call binds Length to 12, because each of the three
> non-ASCII characters is counted as 3 bytes rather than as 1 character:
> 3 x 3 = 9 bytes, plus 3 ASCII characters, gives 12.
>
> string_length("ABCターデ", Length).
>
> By contrast, the following goals bind N to 6 for the same string, because
> string_list/3 supports utf8:
>
> string_list("ABCターデ", List, utf8),
> length(List, N).
>
> I have written a set of predicates for string manipulation. They use the
> existing predicates from the Strings and Atoms library, such as
> append_strings(?String1, ?String2, ?String3). To support utf8, so that
> string_length("ABCターデ", Length, utf8) gives 6, I would have to write my
> own versions, for example:
>
> string_length(STR, Length, utf8) :-
>     string_list(STR, List, utf8),
>     length(List, Length).
>
> This is just a prototype for illustration. Please let me know whether my
> understanding is right. Replacing all the Strings and Atoms predicates with
> utf8-aware versions is a sizeable task for me, given that they come from
> sepia-kernel.
>
> Thanks and Regards,
>
> Shrirang Edgaonkar
Received on Sun Jul 05 2015 - 15:40:24 CEST