Hello Jacco,

> I sent this mail out twice, but I don't think it ever made it to
> the list. So let's try again, this time I replaced the problem
> character by an X..:
> --
>
> I've problems writing utf8 data to eclipse and getting it back:

Yes, I'm sure you do :(

Unfortunately writing anything but 7-bit clean ASCII is going to be hard if not impossible, and the problems are not necessarily with ECLiPSe.

You say that you wish to write utf8 data, that is non-ASCII characters encoded using the "Eight-bit Unicode Transformation Format". The problem is that utf8 is a transformation format, a way of encoding ANY unicode character as a sequence of one or more octets (8-bit numbers). So, if you happened to have your data as a sequence of octets which encoded your desired characters, then everything would be fine as far as sending them to ECLiPSe. This may not be the case, however.

There are at least three potential problems: one with ECLiPSe, one with your editor and another with Java itself...

Java
----

Internally Java uses 16-bit unicode, that is, every character is represented as a 16-bit number, so if you have a String in Java, all characters are stored unambiguously as 16-bit numbers. (Hence the string that you mention in your code sample will have a single 16-bit number for the offending non-ASCII character, and NOT the utf8 octet sequence.)

When you write Java Strings to Java Streams they undergo a transformation into a particular character encoding. The default encoding depends on your system setup (OS+Localisation+JVM); on my i386 Linux box with J2SE 1.4.2 from Sun, the default character encoding is ISO8859_1 (a.k.a. LATIN-1). Hence if you were to write your single Java character to a file, by default it would be converted to a sequence of octets according to the LATIN-1 encoding. Importantly, this sequence is DIFFERENT from the UTF-8 encoding that you desire.
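The difference between the two encodings described above can be checked with a short sketch. It assumes a more recent JDK than the one mentioned in this mail (java.nio.charset.StandardCharsets needs Java 7 or later); the class name EncodingDemo and the helper octets() are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: compare the octets produced for one Java char under two encodings.
final class EncodingDemo {

    // Encode the string with the given charset and return unsigned octet values.
    static int[] octets(String s, Charset cs) {
        byte[] raw = s.getBytes(cs);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // e with acute accent, written as a unicode escape
        // LATIN-1: a single octet
        System.out.println(Arrays.toString(octets(s, StandardCharsets.ISO_8859_1))); // [233]
        // UTF-8: two octets
        System.out.println(Arrays.toString(octets(s, StandardCharsets.UTF_8)));      // [195, 169]
    }
}
```

Naming the charset explicitly, as done here, avoids depending on whatever default the JVM happens to pick up from the system.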

Editor
------

The editor that you are using to write your Java program must store this "fancy" character using some encoding when it saves the file. It may well be using the utf-8 encoding, in which case your single "e with acute accent" character is being written as a two octet utf-8 sequence.

NOTE: The "e with acute accent" character is number 233 (decimal) in the LATIN-1 encoding, but is the two octet sequence 195, 169 (decimal) in UTF-8.

The problem here is that the Java language requires that Java source files be written in unicode.

"Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4) to support the different conventions of existing host systems while maintaining consistent line numbers."
 - Taken from http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413

Essentially this says that any character which is not simple ASCII should be encoded using a unicode escape sequence of the form "\uXXXX". Hence if you want to insert a non-ASCII character into your Java source you should (according to the spec) insert the 6 character unicode escape sequence, eg. "caf\u00E9 bla bla" (\u00E9 is the unicode escape sequence for the character you want; 00E9 hex is 233 decimal).

So when javac compiles your Java source file and decodes it with a single-byte default encoding such as LATIN-1, it comes across TWO octets (the utf-8 encoding of your desired character), which it treats as TWO separate unicode characters. Hence within Java your string has (or may have) one more character than you intended.

ECLiPSe
-------

Partly due to the enormous mess of character encodings, ECLiPSe treats strings simply as arrays of 8-bit numbers. When transferring Strings from Java to ECLiPSe, as you have done with your RPC, the high byte of each 16-bit unicode value is simply removed and only the low byte is sent. This is fine for all ASCII characters and quite a few others besides.
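
To make the "only the low byte is sent" behaviour concrete, here is a minimal sketch that imitates it in plain Java. It is not the actual interface code; the class LowByteDemo and the helper lowBytes() are hypothetical and only mimic the truncation as described above.

```java
import java.util.Arrays;

// Sketch: imitate "drop the high byte" for each 16-bit Java char.
final class LowByteDemo {

    // Hypothetical helper: keep only the low 8 bits of every char in the string.
    static int[] lowBytes(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i) & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // U+00E9 fits in 8 bits, so it survives unchanged.
        System.out.println(Arrays.toString(lowBytes("\u00E9")));       // [233]
        // U+0141 does not fit; only 0x41 remains, which is ASCII 'A'.
        System.out.println(Arrays.toString(lowBytes("\u0141")));       // [65]
        // A source file saved as UTF-8 but read as LATIN-1 yields the two
        // chars U+00C3 and U+00A9, so two octets are produced.
        System.out.println(Arrays.toString(lowBytes("\u00C3\u00A9"))); // [195, 169]
    }
}
```

Any character above U+00FF loses information this way, which is why only ASCII and the rest of the 8-bit range come through intact.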

In your case ECLiPSe is receiving two 8-bit numbers (corresponding to the two 16-bit unicode numbers that Java has).

However, we're still not finished. When reading the (2) bytes back into Java from the ECLiPSe side, a translation must be applied in order to get the 16-bit unicode representation required by Java. Currently we use the default character encoding scheme for your JVM (as explained above, typically LATIN-1). For most purposes this is fine and produces the desired results; in your case, however, you are getting two DIFFERENT unicode characters (different from the ones that you had before).

CONCLUSION
----------

To conclude then, in order to fix your problem(s) you should:

1) Write your non-ascii characters as unicode escape sequences (as per the Java language spec).

2) When transferring strings to ECLiPSe, be aware that when dealing with strings or atoms, ECLiPSe neither knows nor cares about anything more complicated than 8-bit numbers for characters. So if you want to preserve the 16-bit nature of your Java strings, you should send them as byte lists and handle the encoding/decoding yourself.

eg.

    String s = "caf\u00E9 bla bla";
    byte[] b = s.getBytes("UTF-8");
    List codes = new ArrayList();   // java.util; one Integer (0..255) per octet
    for (int i = 0; i < b.length; i++)
        codes.add(new Integer(b[i] & 0xFF));
    eclipse.rpc(new CompoundTermImpl("writeln", codes));

Once read back in to Java, the list of bytes will be seen as a Collection which can be decoded back into the original Java 16-bit unicode String by..

    Collection chars = (Collection) compoundTerm.arg(1);
    byte[] b = new byte[chars.size()];
    int i = 0;
    for (Iterator it = chars.iterator(); it.hasNext(); ) {
        Integer chr = (Integer) it.next();
        b[i++] = chr.byteValue();   // 0..255 back to a (signed) Java byte
    }
    String s = new String(b, "UTF-8");

Also note that on the ECLiPSe side, you may use the "string_list/3" predicate to format lists of character codes as UTF-8 encoded ECLiPSe strings.

eg. from the documentation of string_list/3

    [eclipse 2]: string_list(S, [65,66,67], utf8).
    S = "ABC"
    yes.

    [eclipse 3]: string_list(S, [65, 0, 700, 2147483647], bytes).
    out of range in string_list(S, [65, 0, 700, 2147483647])

    [eclipse 4]: string_list(S, [65, 0, 700, 2147483647], utf8).
    S = "A\000\312\274\375\277\277\277\277\277"
    yes.

3) Avoid non-ascii characters ;-)

> eclipse.rpc("write(output,'café bla bla'),nl(output),flush(output).");
>
> When I now read the corresponding stream back into java, I get the
> string "cafX bla bla" (the é has turned into \357\277\275).
>
> Any idea what I'm doing wrong here? When I stick to ascii it all works
> fine.
>
> Help would be highly appreciated,
>
> Jacco

In the next release we will probably add options to make it easier to preserve Java's 16-bit unicode strings when passing them to/from ECLiPSe.

I hope that has helped in some way. Frankly my fingers hurt from typing such a long reply, so I will not be offended if you don't read it all :)

Andrew Sadler

Received on Mon Feb 16 15:18:18 2004