Strings in ECLiPSe 6.2, SWI-7 and YAP

Joachim Schimpf, 2013-11-28, 2013-12-07, 2013-12-26, 2014-07-11

History

ECLiPSe (and its predecessor Sepia) has always had a string data type (which was part of early BSI standard drafts) with double-quote syntax. SWI also had strings, but up to version 6 not with double-quote syntax. With SWI-7 and ECLiPSe 6.2, string support has been harmonized, and YAP is expected to agree as well. The following is a summary of the common functionality, and a record of the related discussion.

Agreed Common Functionality

Syntax

strings are double-quoted by default

Term order

Strings fall between numbers and atoms:

?- sort([1,1.2,a,"a",X,f(a)], S).
S = [X, 1.2, 1, "a", a, f(a)]

Intuition: strings have a "more compound" flavour than numbers, but atoms and compounds must remain consecutive because atoms may be considered as compound terms with arity 0.

String-related builtins

string(?Term) is semidet

    succeeds iff Term is a string

string_length(+String, -Length) is det

    where String is of type string.

string_code(?Index, +String, ?Code) is nondet

    Index from 1 to length of String.
    Domain errors on Index and Code if negative.
    Character codes like ISO char_code/2.

get_string_code(+Index, +String, -Code) is det

    like string_code/3, but deterministic and
    strict 1..N domain checking on Index.
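
A minimal sketch of how the two variants might be used side by side, assuming the behaviour described above (string_code/3 fails quietly for an index past the end, get_string_code/3 raises an error). The helper names first_code/2 and safe_code/3 are hypothetical and not part of the agreed set.

```prolog
% Hypothetical helpers, for illustration only.

% first_code(+String, -Code): code of the first character.
% Relies on string_code/3 failing (not raising) when String is empty.
first_code(String, Code) :-
    string_code(1, String, Code).

% safe_code(+Index, +String, -Result): wraps the strict variant so that
% an out-of-range Index gives the atom none instead of an error.
safe_code(Index, String, Result) :-
    catch(get_string_code(Index, String, Code), error(_, _), fail),
    !,
    Result = Code.
safe_code(_, _, none).
```

With these definitions, safe_code(2, "abc", C) gives C = 98, while safe_code(9, "abc", R) gives R = none.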

string_char(?Index, +String, ?Char) is nondet

    analogous to string_code/3.

string_codes(?String, ?Codes)

    analogous to ISO atom_codes/2.

string_chars(?String, ?Chars)

    analogous to ISO atom_chars/2.

string_lower(+String, -Lower) is det
string_upper(+String, -Upper) is det

    Convert String to all lower or all upper case.

atom_string(+Atom, -String) is det
atom_string(-Atom, +String) is det

    where Atom is of type atom, and String of type string.

number_string(+Number, -String) is det
number_string(-Number, +String) is semidet

    Conversion between any type of number and a string.
    Fails if String can't be parsed as a number. The number syntax does not allow
    for leading or trailing spaces, nor for spaces between sign and digits.
    Both + and - are allowed as signs.  Comments etc are not allowed.
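
Because the (-,+) mode fails rather than raising an error on unparsable input, it can be used directly as a test. The following sketch shows one way to exploit that; parse_number_or_default/3 is a hypothetical helper name.

```prolog
% parse_number_or_default(+String, +Default, -Number)
% Hypothetical helper: Number is the number denoted by String,
% or Default if String cannot be parsed as a number.
parse_number_or_default(String, _Default, Number) :-
    number_string(Parsed, String),
    !,
    Number = Parsed.
parse_number_or_default(_String, Default, Default).
```

For example, parse_number_or_default("42", 0, N) gives N = 42, and parse_number_or_default("abc", 0, N) gives N = 0.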

string_concat(?String1, ?String2, ?String3) is nondet

    analogous to ISO atom_concat/3 and previous ECLiPSe append_strings/3.
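
As with atom_concat/3, the nondeterministic mode can enumerate all ways of splitting a given string. A small sketch, with splits/2 being a hypothetical helper name:

```prolog
% splits(+String, -Pairs)
% Hypothetical helper: Pairs is the list of all Prefix-Suffix pairs
% whose concatenation is String, collected via backtracking.
splits(String, Pairs) :-
    findall(Prefix-Suffix, string_concat(Prefix, Suffix, String), Pairs).
```

For a two-character string such as "ab" this yields three pairs, one for each possible split position.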

sub_string(+String, ?Before, ?Length, ?After, ?Sub) is nondet

    analogous to ISO sub_atom/5, and identical to ECLiPSe substring/5.
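
Fixing Before or After to 0 turns the general predicate into a prefix or suffix operation. The sketch below illustrates this; has_prefix/2 and last_n/3 are hypothetical helper names.

```prolog
% has_prefix(+String, +Prefix)
% Hypothetical helper: succeeds if String starts with Prefix
% (Before is fixed to 0).
has_prefix(String, Prefix) :-
    sub_string(String, 0, _, _, Prefix).

% last_n(+String, +N, -Suffix)
% Hypothetical helper: Suffix is the last N characters of String
% (After is fixed to 0); fails if String is shorter than N.
last_n(String, N, Suffix) :-
    sub_string(String, _, N, 0, Suffix).
```

For example, last_n("hello world", 5, S) gives S = "world".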

atomics_to_string(+Atomics, -String) is det

    concat list of atomic terms. Identical to previous ECLiPSe concat_string/2.

atomics_to_string(+Atomics, +Glue, -String) is det

    concat list of atomic terms, with glue between. Identical to previous ECLiPSe join_string/3.

split_string(+String, +SepChars, +PadChars, -SubStrings) is det

    as in ECLiPSe
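
The interplay of SepChars and PadChars is easiest to see in a few typical uses. The helper names below (path_components/2, trim/2, csv_fields/2) are hypothetical and serve only as illustration.

```prolog
% path_components(+Path, -Parts)
% Split at every "/", no padding removed. A leading "/" produces
% an empty first component.
path_components(Path, Parts) :-
    split_string(Path, "/", "", Parts).

% trim(+String, -Trimmed)
% No separators, so there is exactly one substring; blanks and tabs
% are stripped from both ends of it.
trim(String, Trimmed) :-
    split_string(String, "", " \t", [Trimmed]).

% csv_fields(+Line, -Fields)
% Split at commas and strip blanks around each field.
csv_fields(Line, Fields) :-
    split_string(Line, ",", " ", Fields).
```

For example, csv_fields("a, b ,c", Fs) gives Fs = ["a", "b", "c"], and path_components("/usr/local", Ps) gives Ps = ["", "usr", "local"].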

term_string(+Term, -String) is det
term_string(-Term, +String) is det

    If String was uninstantiated, it is bound to a string representation of
    Term as produced by writeq/2.  If String was instantiated, it is parsed
    as with read/2 and the resulting term unified with Term.
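
Since one direction writes with writeq and the other parses with read, a term can be converted to a string and back. A sketch, with round_trip/2 being a hypothetical helper name:

```prolog
% round_trip(+Term, -Copy)
% Hypothetical helper: write Term to a string in writeq format,
% then parse that string back into a term.
% Any variables in Term come back as fresh variables in Copy.
round_trip(Term, Copy) :-
    term_string(Term, String),
    term_string(Copy, String).
```

For a ground term such as f('A b', [1, 2.5]), Copy should be identical to Term, because writeq quotes the atom where needed.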

term_string(+Term, -String, +Options) is det
term_string(-Term, +String, +Options) is det

    If String was uninstantiated, it is bound to a string representation of
    Term as produced by write_term/3 (with options corresponding to writeq/2,
    and in addition, and potentially overridden by, the given options).
    If String was instantiated, it is parsed as with read_term/3 with the given options,
    and the resulting term unified with Term.  Inapplicable options are ignored.

text_to_string(+Text, -String) is det

    Converts different textual representations into a string.
    Text is either an atom, string, list of character codes (codes), or
    list of single-character atoms (chars).  Text==[] gives String="".
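
One use of this predicate is to compare text regardless of its representation, by normalising both sides to strings first. A sketch, with same_text/2 being a hypothetical helper name:

```prolog
% same_text(+TextA, +TextB)
% Hypothetical helper: succeeds if both arguments denote the same
% character sequence, whether given as atom, string, codes or chars.
same_text(TextA, TextB) :-
    text_to_string(TextA, String),
    text_to_string(TextB, String).
```

For example, same_text(abc, "abc") and same_text([a,b,c], "abc") both succeed.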

read_string(+Stream, +Length, -String)
read_string(+Stream, -Length, -String)

    If Length is given, read Length characters from Stream into String.
    Otherwise, read until end of stream, and bind Length to the number
    of characters read.

read_string(+Stream, +SepChars, +PadChars, -Sep, -String)

    Read a string from Stream, providing functionality similar to split_string/4.
    The predicate performs the following steps:
     * Skip all characters that match PadChars
     * Read up to a character that matches SepChars or end of file
     * Discard trailing characters that match PadChars from the collected input
     * Unify String with a string created from the input and Sep with the
       separator character read. If input was terminated by the end of the
       input, Sep is unified with -1.
    The predicate read_string/5 called repeatedly on an input until Sep is -1
    (end of file) is equivalent to reading the entire file into a string and
    calling split_string/4, provided that SepChars and PadChars are not
    partially overlapping (which would require lookahead and could cause
    unexpected blocking read).
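
The repeated-call pattern described above can be sketched as a simple recursive loop. read_fields/4 is a hypothetical helper name; the sketch assumes, as stated above, that Sep is -1 exactly when the end of the input was reached.

```prolog
% read_fields(+Stream, +SepChars, +PadChars, -Fields)
% Hypothetical helper: calls read_string/5 repeatedly until the end of
% the stream and collects the resulting strings in order.
read_fields(Stream, SepChars, PadChars, Fields) :-
    read_string(Stream, SepChars, PadChars, Sep, Field),
    (   Sep == -1
    ->  Fields = [Field]                 % last field, input exhausted
    ;   Fields = [Field|Rest],
        read_fields(Stream, SepChars, PadChars, Rest)
    ).
```

On a stream containing the text a, b ,c a call read_fields(Stream, ",", " ", Fs) should give the same list as split_string("a, b ,c", ",", " ", Fs).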

Note regarding mode notation: where mode '-' is specified, mode '+' is also allowed and affects the determinism class accordingly.

Situation before December 2013

ECL Syntax

strings are double-quoted, but back-quoted when using iso-syntax (or, more precisely, when the character classes are set accordingly)
ECL supports string token concatenation, i.e. "a" "b"=="ab"
by default in ECL also "a""b"=="ab" rather than "a""b"=="a\"b" (negotiable)

Builtins previously in both ECL and SWI

string(?Term)

    Term is a string

atom_string(?Atom, ?String)

    but SWI allows numbers as Atom in (+,-) mode, and numbers as String in (-,+) mode

string_length(+String, -Length)

    but SWI allows atoms and numbers as String

string_code(+String, +Index, ?Code) ECL
string_code(?Index, +String, ?Code) SWI

    This is very unfortunate!
    Different argument order, 0-based in SWI vs. 1-based Index in ECL,
    and nondeterministic reverse mode in SWI...
    In ECL, this is supposed to be a very fast primitive (like arg/3),
    it could even be implemented as an abstract machine instruction.
    What about renaming the nondet version string_member/3 or the like?

append_strings(?String1, ?String2, ?String3) ECL
string_concat(?String1, ?String2, ?String3) SWI

    Name is historical in ECL, could add alias.

substring(+String, ?Before, ?Length, ?After, ?Sub) ECL
sub_string(+String, ?Before, ?Length, ?After, ?Sub) SWI

    ECL ready to add underscore variant, in analogy to sub_atom/5.
    However, Quintus precedent is without underscore (and different
    argument order...)

number_string(?Number, ?String)

    Conversion between any number and a string.
    Fails if String can't be parsed as a number.

Builtins previously in SWI only

http://www.swi-prolog.org/pldoc/man?section=strings

string_codes(?String, ?Codes)

    ECL could add this, but subsumed by string_list/3.

string_chars(?String, ?Chars)

    ECL could add this, but subsumed by string_list/3.

Builtins previously in ECL only (ignoring deprecated ones)

http://www.eclipseclp.org/doc/bips/kernel/stratom/index.html

concat_strings(+String1, +String2, ?String3)

    Deterministic version of concatenation.

concat_string(++List, -Dest) [redundant]

    Succeeds if Dest is the concatenation of the atomic terms
    contained in List.

join_string(++List, +Glue, -String)

    String is the string formed by concatenating the elements of List with
    an instance of Glue between each of them (subsumes concat_string/2).

split_string(+String, +SepChars, +PadChars, -SubStrings)

    Decompose String into SubStrings according to separators SepChars
    and padding characters PadChars.

string_list(?String, ?List, +Format)

    Conversion between string in different encodings and a list
    (subsumes string_codes/2, string_chars/2, string_list/2).
    Format is bytes, codes, chars, utf8.

string_list(?String, ?List) [redundant]

    same as string_list(String,List,bytes)

substring(+String1, +String2, ?Position) [redundant]

    Quick semidet check for substring presence. 1-based position.

term_string(?Term, ?String)

    In the (+,-) direction, String is like the output of writeq.
    In the (?,+) direction, String is parsed with read.

sprintf(-String, +Format, ?ArgList)

    printf with output to string.

Other Suggestions

Further inspiration can be found in Quintus's lib(string), http://quintus.sics.se/isl/quintus/html/quintus/lib-txp.html#lib-txp, in particular the span family (but cf. split_string/4 above), http://quintus.sics.se/isl/quintus/html/quintus/lib-txp-sub-spa.html#lib-txp-sub-spa

Suggestions by Richard O'Keefe

The following is based on http://www.cs.otago.ac.nz/staffpriv/ok/pllib.htm as of 2013-12-27.

substring(String,Sub,Before,Length,After)

    with Quintus argument order, enabling omission of arguments
    for substring/2,3,4, and no underscore.

string_codes/2,3,4,5

    like substring, but Sub is of type codes.

string_chars/2,3,4,5

    like substring, but Sub is of type chars.

integer_string(Integer,String,Base,Zero) and /2,3

    taking an integer Base 2..36 and a character to be treated as zero.
    This allows alternative isomorphic 0-9 Unicode sequences to be used.

float_codes(Float,String,Format,Zero,Decimal,Exponent) and /3,4,5

    taking a format descriptor term, Zero character, Decimal point character,
    exponent marker character (characters all as string).

number_string(Number,String,Zero) and /2

    combination of the preceding two.

string_append/3

    well, string_concat/3 was chosen because of ISO.

atomics_to_string/2,3

    as above

string_leading_count(String,Set,[LengthOut,]LengthIn)

    determines maximal leading sequence of characters out of the Set,
    followed by maximal sequence of characters in the Set.

string_trailing_count(String,Set,[LengthOut,]LengthIn)

    same from other end.

Tentative suggestion for

skip_input(Stream,Set)

    skip characters in Set

read_string(+Stream,+Set,-String,+Bound,-BaseCount)

    reads all available characters in Set up to a limit of Bound.
    Also reads any diacriticals that may follow the characters.
    Bound and BaseCount count base characters only.