Previous Up Next

5.4  The String Data Type

In the Prolog community there are ongoing discussions about the need to have a special string data type. The main argument against strings is that everything that can be done with strings can as well be done with atoms or with lists, depending on the application. Nevertheless, ECLiPSe provides and heavily uses the string data type. It is familiar from other programming languages, and facilitates interfacing. It also offers programmers who are aware of the characteristics of the different data types a choice of most appropriate one. The system provides efficient built-ins for converting from one data type to another.

5.4.1  Choosing The Appropriate Data Type

Strings, atoms and character lists are written in similar ways, just distinguished by the type of quote:

    "abc"      is a string
    'abc'      is an atom
    `abc`      is a character code list, equivalent to [97,98,99]

They differ in space consumption and in the time needed for performing operations on the data.

Strings vs. Character Lists

Let us first compare strings with character lists. Maybe the main disadvantage of a character code list in an untyped language is that it is indistinguishable from a general list of small integers. This implies, for example, that the system cannot reliably decide whether to pretty-print a code list as a quoted string.

The space consumption of a string is always less than that of the corresponding list. For long strings, it is asymptotically 16 times more compact. Items of both types are allocated on the global stack, which means that the space is reclaimed on failure and on garbage collection.

For the complexity of operations it must be kept in mind that the string type is essentially an array representation, i.e., every character in the string can be immediately accessed via its index. The list representation allows only sequential access. The time complexity for extracting a substring when the position is given is therefore only dependent on the size of the substring for strings, while for lists it is also dependent on the position of the substring. Comparing two strings is of the same order as comparing two lists, but faster by a constant factor. If a string is to be processed character by character, this is easier to do using the list representation, since using strings involves keeping index counters and calling the string_code/3 predicate.

The higher memory consumption of lists is sometimes compensated by the property that when two lists are concatenated, only the first one needs to be copied, while the list that makes up the tail of the concatenated list can be shared. When two string are concatenated, both strings must be copied to form the new one.

Strings vs. Atoms

At a first glance, an atom does not look too different from a string. In ECLiPSe, many predicates accept both strings and atoms (e.g., the file name in open/3) and some predicates are provided in two versions, one for atoms and one for strings (e.g., concat_atoms/3 and concat_strings/3).

However, internally these data types are quite different. While a string is simply stored as a character sequence, an atom is mapped into an internal constant. This mapping is done via a table called the dictionary. A consequence of this representation is that copying and comparing atoms is a unit time operation, while for strings both are proportional to the string length. On the other hand, each time an atom is read into the system, it has to be looked up and possibly entered into the dictionary, which implies some overhead. The dictionary is a much less dynamic memory area than the global stack. That means that once an atom has been entered there, this space will only be reclaimed by a relatively expensive dictionary garbage collection. It is therefore in general not a good idea to have a program creating new atoms dynamically at runtime.

Atoms should always be preferred when they are involved in unification and matching. As opposed to strings, they can be used to index clauses of predicates. Consider the following example:

[eclipse 1]: [user].
 afather(mary, george).
 afather(john, george).
 afather(sue, harry).
 afather(george, edward).

 sfather("mary", "george").
 sfather("john", "george").
 sfather("sue", "harry").
 sfather("george", "edward").
user   compiled 676 bytes in 0.00 seconds

yes.
[eclipse 2]: afather(sue,X).

X = harry
yes.
[eclipse 3]: sfather("sue",X).

X = "harry"     More? (;)

no (more) solution.

The predicate with atoms is indexed: the matching clause is selected directly and the determinacy of the call is recognised (the system does not prompt for more solutions). When the names are instead written as strings, the system attempts to unify the call with the first clause, then the second and so on until a match is found. This is much slower than the indexed access. Moreover the call leaves a choicepoint behind (as shown by the More? prompt).

Conclusion

Atoms should be used for representing (naming) the items that a program reasons about, much like enumeration constants in other languages. If used like this, an atom is in fact indivisible and there should be no need to ever consider the atom name as a sequence of characters.

When a program deals with text processing, it should choose between string and list representation. When there is a lot of manipulation on the single character level, it is probably best to use the character list representation, since this makes it very easy to write recursive predicates walking through the text, and lends itself to the use of Definite Clause Grammars (see 13.3).

The string type can be viewed as being a compromise between atoms and lists. It should be used when handling large amounts of input, when the extreme flexibility of lists is not needed, when space is a problem or when handling very temporary data.

5.4.2  Built-in Support for Strings

Most ECLiPSe built-ins that deliver text objects (like getcwd/1, read_string/3,4,5 and many others) return strings.

By means of the built-ins atom_string/2, string_list/2,3, string_chars/2, number_string/2, term_string/2, text_to_string/2, atomics_to_string/2,3, strings can easily be converted to and from other data types.

String manipulation is provided by builtins like string_concat/3, string_code/3, string_codes/2, string_char/3, string_chars/2, split_string/4, and substring/5. The regular expression library library(regex) also operates on strings.

The string stream feature (cf. section 11.3.1) allows strings to be opened like I/O streams, thus providing another way of creating or analysing strings.


Previous Up Next