Re: utf8 from java to eclipse and back?

From: Andrew John Sadler <ajs2_at_icparc.ic.ac.uk>
Date: Mon 16 Feb 2004 03:13:13 PM GMT
Message-Id: <E1AskQr-0000f2-H1@tempest.icparc.ic.ac.uk>

Hello Jacco,

> 
> I sent this mail out twice, but I don't think it ever made it to
> the list.  So let's try again, this time I replaced the problem
> character by an X..:
> --
> 
> I've problems writing utf8 data to eclipse and getting it back:

Yes, I'm sure you do :(

Unfortunately writing anything but 7-bit clean ASCII is going to be
hard if not impossible and the problems are not necessarily with
ECLiPSe.

You say that you wish to write utf8 data, that is non ASCII characters
encoded using the "Eight-bit Unicode Transformation Format".  The
problem is that utf8 is a transformation format, a way of encoding ANY
unicode character as a sequence of one or more octets (8-bit numbers).

So, if you happened to have your data as a sequence of octets which
encoded your desired characters, then everything would be fine as far
as sending them to ECLiPSe.  This may not be the case however.

There are at least three potential problems.  One with ECLiPSe, one
with your editor and another with Java itself...

Java
----
Internally Java uses 16-bit unicode, that is every character is
represented as a 16 bit number, so if you have a String in Java, all
characters are stored unambiguously as a 16-bit number. (Hence the
string that you mention in your code sample will have a single 16-bit
number for the offending non-ASCII character, and NOT the utf8 octet
sequence).

When you write java Strings to java Streams they undergo a
transformation into a particular character encoding.  The default
encoding is dependent on your system setup (OS+Localisation+JVM); on my
i386 Linux box with J2SE 1.4.2 from Sun, the default character
encoding is ISO8859_1 (a.k.a. LATIN-1).  Hence if you were to write
your single Java character to a file, by default it would be converted
to a sequence of octets according to the LATIN-1 encoding.
Importantly this sequence is DIFFERENT to the UTF-8 encoding that you
desire.
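
A minimal sketch of how to stop depending on the default encoding,
assuming Java 8 or later (StandardCharsets and UncheckedIOException did
not exist in 1.4.2).  The class and method names are made up for
illustration; the point is only that the charset is named explicitly
when the Writer is created.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class StreamEncodingDemo {
    // Push text through a Writer that uses an explicitly chosen charset
    // and return the octets that end up on the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = "caf\u00E9";
        // What the JVM would pick if no charset were named (system dependent).
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("LATIN-1 octets:  " + write(s, StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 octets:    " + write(s, StandardCharsets.UTF_8).length);
    }
}
```

The same four-character String comes out as 4 octets in LATIN-1 and 5
in UTF-8, whatever the default on the machine happens to be.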

Editor
------
The editor that you are using to write your java program must store
this "fancy" character using some encoding, when it saves the file.
It may well be using the utf-8 encoding, hence your single "e with
acute accent" character is being written as a two octet utf-8
sequence.

NOTE: The "e with acute accent" character is number 233 (decimal) in
the LATIN-1 encoding, but is the two octet sequence 195, 169
(decimal) in UTF-8.
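
The numbers in the note can be checked with a few lines of Java.  This
is only a sketch (Java 7 or later assumed, names invented for the
example); it prints the unsigned octet values of the single character
in both encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class EAcuteOctets {
    // Return the octets of text in the given charset as unsigned values (0..255).
    static int[] octets(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] result = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            result[i] = raw[i] & 0xFF; // Java bytes are signed, so mask to get 0..255
        }
        return result;
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";
        System.out.println(Arrays.toString(octets(eAcute, StandardCharsets.ISO_8859_1))); // [233]
        System.out.println(Arrays.toString(octets(eAcute, StandardCharsets.UTF_8)));      // [195, 169]
    }
}
```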

The problem here is that the Java language requires that Java source
files must be written in unicode.

"Programs are written in Unicode (§3.1), but lexical translations are
provided (§3.2) so that Unicode escapes (§3.3) can be used to include
any Unicode character using only ASCII characters. Line terminators
are defined (§3.4) to support the different conventions of existing
host systems while maintaining consistent line numbers." - Taken from
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413

Essentially this says that any character which is not plain ASCII
should be encoded using a unicode escape sequence of the form "\uXXXX".
Hence if you want to insert a non-ASCII character into your Java source
you should (according to the spec) insert the 6 character unicode
escape sequence, e.g. "caf\u00E9 bla bla" (\u00E9 being the unicode
escape sequence for the character you want, 233 decimal).

So when javac attempts to compile your Java source file it comes
across TWO octets (the utf-8 encoding of your desired character),
which it treats as TWO separate unicode characters.  Hence within Java
your string has/may-have one more character than you intended.
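
The effect can be imitated outside the compiler.  The sketch below
(Java 7 or later assumed, names invented for the example) encodes the
intended string as UTF-8, the way the editor may have saved it, and
then decodes those octets as LATIN-1, the way a reader expecting
LATIN-1 would.  It is an illustration of the mismatch, not a model of
what javac does internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class MisreadSourceDemo {
    // Encode as UTF-8 (what the editor wrote) and decode as LATIN-1
    // (what a reader assuming LATIN-1 sees).
    static String misread(String intended) {
        byte[] onDisk = intended.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String intended = "caf\u00E9";
        String seen = misread(intended);
        System.out.println(intended.length()); // 4
        System.out.println(seen.length());     // 5
        // The single e-acute has become the two characters U+00C3 and U+00A9.
        System.out.println(Integer.toHexString(seen.charAt(3)) + " "
                + Integer.toHexString(seen.charAt(4))); // c3 a9
    }
}
```

If you would rather keep the literal character in the source file,
javac has an -encoding option for naming the charset the file was
saved in; the escape sequence avoids the question altogether.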

ECLiPSe
-------
Partly due to the enormous mess of character encodings, ECLiPSe treats
strings simply as arrays of 8-bit numbers.

When transferring Strings from Java to ECLiPSe, as you have done with
your RPC, the high byte of the 16-bit unicode value is simply removed
and only the lower byte is sent.  This is fine for all ASCII
characters and quite a few others besides.  In your case ECLiPSe is
receiving two 8-bit numbers (corresponding to the two 16-bit unicode
numbers that Java has).  However, we're still not finished.
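
To make "only the lower byte is sent" concrete, here is a small
illustration.  It is NOT the code from the ECLiPSe Java interface, just
a sketch of the truncation it describes, with invented names.

```java
import java.util.Arrays;

// Hypothetical helper class; it only imitates the truncation described above.
class LowByteDemo {
    // Keep the low 8 bits of each 16-bit Java char and drop the high byte.
    static int[] lowBytes(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i) & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // Every char here is below 256, so nothing is lost.
        System.out.println(Arrays.toString(lowBytes("caf\u00E9"))); // [99, 97, 102, 233]
        // The euro sign is U+20AC; the high byte 0x20 is dropped, leaving 0xAC.
        System.out.println(Arrays.toString(lowBytes("\u20AC")));    // [172]
    }
}
```

Characters up to 255 survive this unchanged, which is why LATIN-1 text
mostly works; anything above that is silently mangled.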

When reading the (2) bytes back into Java from the ECLiPSe side, a
translation must be applied in order to get the 16-bit unicode
representation required by Java.  Currently we use the default
character encoding scheme for your JVM (as explained above, typically
LATIN-1).  For most purposes this is fine and produces the desired
results; however, in your case you are getting two DIFFERENT unicode
characters (different from the ones that you had before).
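
How much the result depends on the charset used for decoding can be
seen with the sketch below (Java 7 or later assumed, names invented).
The octet list is made up for the example: "caf", then the single
LATIN-1 octet 233, then " bla".  Decoded as UTF-8, the lone 233 is not
a valid sequence and Java substitutes U+FFFD, whose own UTF-8 encoding
is \357\277\275.  That MAY be where the octets in the quoted message
came from, if the JVM default there was UTF-8, but that is a guess on
my part.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class ReadBackDemo {
    // Turn unsigned octet values into a String using the given charset.
    static String decode(int[] octets, Charset charset) {
        byte[] raw = new byte[octets.length];
        for (int i = 0; i < octets.length; i++) {
            raw[i] = (byte) octets[i];
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumed example data: "caf", LATIN-1 e-acute (233), " bla".
        int[] octets = {99, 97, 102, 233, 32, 98, 108, 97};

        String asLatin1 = decode(octets, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(octets, StandardCharsets.UTF_8);

        System.out.println(Integer.toHexString(asLatin1.charAt(3))); // e9
        System.out.println(Integer.toHexString(asUtf8.charAt(3)));   // fffd

        // Print the UTF-8 octets of U+FFFD in octal.
        for (byte b : "\uFFFD".getBytes(StandardCharsets.UTF_8)) {
            System.out.print("\\" + Integer.toOctalString(b & 0xFF)); // \357\277\275
        }
        System.out.println();
    }
}
```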

CONCLUSION
----------
To conclude then, in order to fix your problem(s) you should:

1) Write your non-ascii characters as unicode escape sequences (as per
   the Java language spec).

2) When transferring strings to ECLiPSe, be aware that when dealing
   with strings or atoms, ECLiPSe neither knows, nor cares about
   anything more complicated than 8-bit numbers for characters. So if
   you want to preserve the 16-bit nature of your Java strings, you
   should send them as byte lists and handle the encoding/decoding
   yourself.

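   eg.  String s = "caf\u00E9 bla bla";
        byte b[] = s.getBytes("UTF-8");
        // Arrays.asList(b) would not yield a list of byte values, so build it
        List codes = new ArrayList();
        for (int i = 0; i < b.length; i++) codes.add(new Integer(b[i] & 0xFF));
        eclipse.rpc(new CompoundTermImpl("writeln", (Object) codes));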

   Once read back in to Java, the list of bytes will be seen as a
   Collection which can be decoded back into the original Java 16-bit
   unicode String by..

        Collection chars = (Collection)compoundTerm.arg(1);
        byte b[] = new byte[chars.size()];
        int i = 0;
        for(Iterator it = chars.iterator(); it.hasNext(); ) {
             Integer chr = (Integer)it.next();
             b[i++] = chr.byteValue();
        }
        String s = new String(b, "UTF-8");

   Also note that on the ECLiPSe side, you may use the "string_list/3"
   predicate to convert lists of character codes into UTF-8 encoded
   ECLiPSe strings, e.g. (from the documentation of string_list/3):

    [eclipse 2]: string_list(S, [65,66,67], utf8).
    S = "ABC"
    yes.

    [eclipse 3]: string_list(S, [65, 0, 700, 2147483647], bytes).
    out of range in string_list(S, [65, 0, 700, 2147483647])

    [eclipse 4]: string_list(S, [65, 0, 700, 2147483647], utf8).
    S = "A\000\312\274\375\277\277\277\277\277"
    yes.

3) Avoid non-ascii characters ;-)
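
For point 2, the encode/decode halves can be tried out without an
ECLiPSe connection.  The following is a minimal sketch, assuming Java 7
or later; the class and method names are invented and nothing here
touches the ECLiPSe Java interface.  It turns a String into a list of
UTF-8 octet values (the form you would send) and back again.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the ECLiPSe Java interface.
class Utf8ByteList {
    // Encode a String as a list of unsigned UTF-8 octet values (0..255).
    static List<Integer> toByteCodes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        List<Integer> codes = new ArrayList<>(raw.length);
        for (byte b : raw) {
            codes.add(b & 0xFF);
        }
        return codes;
    }

    // Decode a list of UTF-8 octet values back into a String.
    static String fromByteCodes(List<Integer> codes) {
        byte[] raw = new byte[codes.size()];
        int i = 0;
        for (Integer code : codes) {
            raw[i++] = code.byteValue();
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00E9 bla bla";
        List<Integer> codes = toByteCodes(s);
        System.out.println(codes.subList(0, 5));            // [99, 97, 102, 195, 169]
        System.out.println(fromByteCodes(codes).equals(s)); // true

        // Character code 700 from the string_list/3 example: 202 and 188
        // decimal are 312 and 274 octal, matching "\312\274" above.
        System.out.println(toByteCodes("\u02BC"));          // [202, 188]
    }
}
```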

> 
> eclipse.rpc("write(output,'café bla bla'),nl(output),flush(output).");
> 
> When I now read the corresponding stream back into java, I get the
> string "cafX bla bla"  (the é has turned into \357\277\275).
> 
> Any idea what I'm doing wrong here?  When I stick to ascii it all works
> fine.
> 
> Help would be highly appreciated,
> 
> Jacco

In the next release we will probably add options to make it easier to
preserve Java's 16-bit unicode strings when passing to/from ECLiPSe.

I hope that has helped in some way.  Frankly my fingers hurt from
typing such a long reply, so I will not be offended if you don't read
it all :)

Andrew Sadler