5.4  The String Data Type

In the Prolog community there have been ongoing discussions about the need for a special string data type. The main argument against strings is that everything that can be done with strings can equally well be done with atoms or with lists, depending on the application. Nevertheless, ECLiPSe provides a string data type, so that users who are aware of the advantages and disadvantages of the different data types can always choose the most appropriate one. The system provides efficient builtins for converting from one data type to another.

5.4.1  Choosing The Appropriate Data Type

Strings, atoms and character lists differ in space consumption and in the time needed for performing operations on the data.

Strings vs. Character Lists

Let us first compare strings with character lists. The space consumption of a string is always less than that of the corresponding list. For long strings, it is asymptotically 16 times more compact. Items of both types are allocated on the global stack, which means that the space is reclaimed on failure and on garbage collection.

When considering the complexity of operations, keep in mind that the string type is essentially an array representation, i.e. every character in the string can be accessed immediately via its index. The list representation allows only sequential access. The time needed to extract a substring at a given position therefore depends only on the size of the substring for strings, while for lists it also depends on the position of the substring. Comparing two strings is of the same order as comparing two lists, but faster by a constant factor. If a string is to be processed character by character, this is easier to do using the list representation, since using strings involves keeping index counters and calling the string_code/3 predicate.
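To make the position dependence of the list representation concrete, here is a minimal sketch of sublist extraction on a list. The predicate names sublist_at/4, skip_elems/3 and take_elems/3 are hypothetical and chosen only for this illustration, and positions are assumed to start at 1. The skipping step has to walk past every element before the requested position, which is the cost that an indexed string representation avoids.

```prolog
% sublist_at(+List, +Pos, +Len, -Sub)
% Sub is the list of Len elements of List starting at position Pos
% (hypothetical helper, positions counted from 1).
sublist_at(List, Pos, Len, Sub) :-
    Skip is Pos - 1,
    skip_elems(Skip, List, Rest),   % cost proportional to Pos
    take_elems(Len, Rest, Sub).     % cost proportional to Len

% skip_elems(+N, +List, -Rest): drop the first N elements of List.
skip_elems(0, List, List) :- !.
skip_elems(N, [_|Tail], Rest) :-
    N > 0,
    N1 is N - 1,
    skip_elems(N1, Tail, Rest).

% take_elems(+N, +List, -Prefix): Prefix is the first N elements of List.
take_elems(0, _, []) :- !.
take_elems(N, [X|Tail], [X|Prefix]) :-
    N > 0,
    N1 is N - 1,
    take_elems(N1, Tail, Prefix).
```

For example, the goal sublist_at([a,b,c,d,e], 3, 2, Sub) binds Sub to [c,d] after stepping over the first two elements.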

The higher memory consumption of lists is sometimes compensated by the property that when two lists are concatenated, only the first one needs to be copied, while the list that makes up the tail of the concatenated list can be shared. When two strings are concatenated, both strings must be copied to form the new one.
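The difference can be sketched by concatenating the same text in both representations. The predicate name concat_both/4 is hypothetical. The sketch assumes that append/3 comes from the lists library and that string_list/2 yields lists of character codes.

```prolog
:- lib(lists).    % assumed source of append/3

% concat_both(+A, +B, -List, -String)
% Concatenate the strings A and B once as code lists and once as strings
% (hypothetical helper for illustration).
concat_both(A, B, List, String) :-
    string_list(A, As),
    string_list(B, Bs),
    append(As, Bs, List),           % copies As only; List ends in Bs itself
    concat_strings(A, B, String).   % copies both A and B into a new string
```

With A = "foo" and B = "bar", String is "foobar" and List is the corresponding list of six character codes, whose last three cells are the list Bs itself.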

Strings vs. Atoms

At first glance, an atom does not look too different from a string. In ECLiPSe, many predicates accept both strings and atoms (e.g. the file name in open/3) and some predicates are provided in two versions, one for atoms and one for strings (e.g. concat_atoms/3 and concat_strings/3).

However, internally these data types are quite different. While a string is simply stored as a character sequence, an atom is mapped into an internal constant. This mapping is done via a table called the dictionary. A consequence of this representation is that copying and comparing atoms are unit time operations, while for strings both are proportional to the string length. On the other hand, each time an atom is read into the system, it has to be looked up and possibly entered into the dictionary, which implies some overhead. The dictionary is a much less dynamic memory area than the global stack. This means that once an atom has been entered there, its space will only be reclaimed by a relatively expensive dictionary garbage collection. It is therefore in general not a good idea to have a program create new atoms dynamically at runtime.
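As an illustration of this advice, a program that needs many short-lived generated names can build them as strings, so that nothing is entered into the dictionary. This is only a sketch: the predicate name label_string/3 is hypothetical, and it uses the conversion builtins number_string/2 and concat_strings/3 mentioned elsewhere in this section.

```prolog
% label_string(+Prefix, +N, -Label)
% Build a temporary label such as "item42" as a string rather than an atom
% (hypothetical helper for illustration).
label_string(Prefix, N, Label) :-
    number_string(N, NString),              % 42 -> "42"
    concat_strings(Prefix, NString, Label). % result lives on the global stack
```

The resulting string is reclaimed on failure or by ordinary garbage collection, whereas an atom built for the same purpose would stay in the dictionary until a dictionary garbage collection.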

Atoms should always be preferred when they are involved in unification and matching. As opposed to strings, they can be used for indexing clauses of predicates. Consider the following example:
[eclipse 1]: [user].
 afather(mary, george).
 afather(john, george).
 afather(sue, harry).
 afather(george, edward).

 sfather("mary", "george").
 sfather("john", "george").
 sfather("sue", "harry").
 sfather("george", "edward").
user   compiled 676 bytes in 0.00 seconds

yes.
[eclipse 2]: afather(sue,X).

X = harry
yes.
[eclipse 3]: sfather("sue",X).

X = "harry"     More? (;)

no (more) solution.
The predicate with atoms is indexed, which means that the matching clause is selected directly and the determinacy of the call is recognised (the system does not prompt for more solutions). When the names are instead written as strings, the system attempts to unify the call with the first clause, then the second and so on until a match is found. This is much slower than the indexed access. Moreover, the call leaves a choicepoint behind (as shown by the more-prompt).

Conclusion

Atoms should be used for representing (naming) the items that a program reasons about, much like enumeration constants in Pascal. If used like this, an atom is in fact indivisible and there should be no need to ever consider the atom name as a sequence of characters.

When a program deals with text processing, it should choose between the string and the list representation. When there is a lot of manipulation at the level of individual characters, it is probably best to use the character list representation, since this makes it very easy to write recursive predicates that walk through the text.
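A recursive walk of this kind might look as follows. This is a minimal sketch that converts lower-case ASCII letters to upper case; the predicate names upcase_string/2 and upcase_codes/2 are hypothetical, and the code assumes that string_list/2 works in both directions between a string and its list of character codes.

```prolog
% upcase_string(+String, -Upper)
% Upper is String with the ASCII letters a-z converted to A-Z
% (hypothetical helper for illustration).
upcase_string(String, Upper) :-
    string_list(String, Codes),     % string -> code list
    upcase_codes(Codes, UpperCodes),
    string_list(Upper, UpperCodes). % code list -> string

% upcase_codes(+Codes, -UpperCodes): walk the list one code at a time.
upcase_codes([], []).
upcase_codes([C|Cs], [U|Us]) :-
    ( C >= 0'a, C =< 0'z ->
        U is C - 32                 % distance between 'a' and 'A' in ASCII
    ;
        U = C                       % leave all other characters unchanged
    ),
    upcase_codes(Cs, Us).
```

The text is converted to a list once, processed without any index bookkeeping, and converted back to a string at the end.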

The string type can be viewed as being a compromise between atoms and lists. It should be used when handling large amounts of input, when the extreme flexibility of lists is not needed, when space is a problem or when handling very temporary data.

5.4.2  Builtin Support for Strings

Most ECLiPSe builtins that deliver text objects (like getcwd/1, read_string/3,4 and many others) return strings. Strings can be created and their contents may be read using the string stream feature (cf. section 10.3.1). By means of the builtins atom_string/2, string_list/2, number_string/2 and term_string/2, strings can easily be converted to other data types.
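The conversion builtins named above can be tried out with a small predicate like the following. This is a sketch only: the predicate name conversions/4 is hypothetical, and the results in the comments assume the default syntax, in which double quotes denote strings.

```prolog
% conversions(-Atom, -Codes, -Number, -Term)
% Convert four strings into other data types
% (hypothetical helper for illustration).
conversions(Atom, Codes, Number, Term) :-
    atom_string(Atom, "hello"),         % Atom   = hello
    string_list("hi", Codes),           % Codes  = [104, 105]
    number_string(Number, "42"),        % Number = 42
    term_string(Term, "foo(X, bar)").   % Term   = foo(_, bar)
```

Each of these builtins can also be called with the other argument instantiated to convert in the opposite direction, e.g. atom_string(hello, String).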

5.4.3  Quoted Lists

As already discussed, many Prologs use the double quotes as a notation for lists of characters. By default, ECLiPSe does not provide any syntactical support for such quoted lists. However, the user can manipulate the quotes by means of the set_chtab/2 predicate. A quote is defined by setting the character class of the chosen character to string_quote, list_quote or atom_quote respectively. To create a list quote (which is not available by default) one may use:
[eclipse 1]: set_chtab(0'`, list_quote).

yes.
[eclipse 2]: X = `text`, Y = "text", type_of(X, TX), type_of(Y, TY).

X = [116, 101, 120, 116]
TX = compound
Y = "text"
TY = string
yes.
