Chapter 3 Data Structures

In this chapter we discuss the choice of data structures for the different application parts. Next to the top-level design, this is the most important aspect that must be specified correctly right from the beginning of a project. The wrong choice of a data structure may mean significant re-work in order to change some deficiency later on, while on the other hand a good data structure design can simplify the coding process and can lead to a very efficient implementation.

3.1 External data representation

The first question is how the data will be fed to the application. We can distinguish five alternatives.

arguments: In the first alternative, all data are passed in arguments to the query. Multiple items of the same type will usually be represented as lists, with structures to hold different attributes of the different objects. This form has the advantage that each query can be run with a completely new data set without changing the database or creating a new set of files. But debugging data in this form can be more difficult, as there is not direct way to look up some data item. This method also requires work on the Java side to build all the data structures before a call to the ECLiPSe solver. A similar effort is required to develop testing code written in ECLiPSe which exercises the interface.
data files: The second alternative is to use data files in a fixed format. The ECLiPSe program then has to read these files and build the internal data structures at the same time. Depending on the format, this may require parsing the input format with definite clause grammars (DCG) (see section 6.2), adding to the development effort¹. But as the files can be read and written easily, it is quite simple to create test data sets and to analyze problems by hand. The design for the fixed format may require some extra effort if we want to use the full character set for atoms and strings. A proper quoting mechanism may be required in order to distinguish say a comma separator from a comma contained inside a data field.
prolog terms: The third alternative is to use data files as before, but to format them as valid Prolog terms that can directly read with the ECLiPSe term I/O predicates. This avoids the overhead of writing parsers in ECLiPSe, but may be difficult for the calling side of the application, unless that is also written in ECLiPSe. Note that we again may face quoting problems, in particular for single and double quotes.
EXDR: ECLiPSe also provides a binary data format called EXDR that can be used to exchange information. This can be generated and parsed quite easily in ECLiPSe and in Java, and often allows significant space savings. In addition, problems with quoting are avoided. A disadvantage is that EXDR files are not directly readable by humans, and so may require extra effort during debugging.
facts: The last alternative is to store the data as facts in the application. They can then be accessed from any part of the ECLiPSe code quite easily. Testing the code is simple by compiling some data files into the system. The Java interface can also store facts into the database quite easily. But changing the data for a new query can be rather complex, and may require recompiling some data modules.

We should note that instead of using files we can also build queues between the ECLiPSe and the Java parts of the application, avoiding the need for file system space.

Which of these methods should be used? This depends on the application. Passing data as arguments clearly is the cleanest way, but requires significant work on the interface and on code for testing. Using data files in fixed formats is simple if the format is defined correctly, but its use of the file system can cause problems when multiple queries should be run concurrently on the same machine. Using Prolog terms in data files has the same disadvantage, but is very simple to use if different ECLiPSe systems exchange data. EXDR files are the safest form to store data, but also the least intuitive. Using queues instead of files avoids problems with multiple instances running at the same time, but require some form of logging to allow debugging. Using facts is a valid alternative if most of the data do not change from one query to the next, but requires extra work to reclaim memory after each change. The following table tries to summarize the advantages and disadvantages of each method.

Property Argument Data file Terms Facts EXDR

Multiple runs ++ + + - +

Debugging - + + ++ -

Test generation effort - + + + -

Java I/O effort - + - + +

ECLiPSe I/O effort ++ + ++ ++ ++

Memory ++ - - - - -

Development effort + - + + -

Table 3.1: Data representation

3.2 Internal data representation

The next question is how the data should be represented inside the application. For this purpose we will have to introduce data structures which allow rapid access to information, which deal with multiple data sets in different parts of the application and where we can add information in a structured way. It should be clear that the built-in fact data base cannot be used for this purpose. Instead, we have to pass the information via arguments of the predicates. In the following sections, we will discuss how the data should be structured to simplify access and coding.

Note that all our data structures use single assignment, i.e. there is no destructive assignment in the language². Instead of removing or changing elements of a structuce, we will always make a near-copy with some information being removed or changed.

3.2.1 Named structures

The principal data representation feature of ECLiPSe are named structures . These are terms were each argument is linked to an argument name. We can access one or more of the arguments with the with operator. Named structures are very similar to structures in other languages, the arguments of the structure correspond to attributes of the entity represented. Different attributes can have different types, so that we can store diverse information in a named structure.

In order to use a structure, it must be defined with a struct definition. We can define a structure either local to a module or export the definition so that the same structure can be used in other modules which import the definition. As part of a systematic design we normally create a module which contains nothing but exported structure definitions. This module is then imported with a use_module directive in all other modules of the application which use the structures. If a structure is used in one module only, we should define it as a local structure in that module.

We also use comment directives to document the named structures, just like we do for exported predicates. For each attribute name, we define the data type of the attribute. Normally, these will be atomic data types (integer, real, atom, string), but that is not required. The attribute can hold any data type that we can pass as an argument to a predicate.

As an example of a named structure we use a small part of the RiskWise module flow_structures.

:-comment(struct(group),[
summary:"
this data structure describes the group object
",
fields:[
"name":"atom, name of the group",
"type":"atom, one of pop, interconntion, vpn or tariff_class",
"index":"integer, unique index of the group",
"list":"list of interface indices belonging to the group",
"nodes":"list of nodes which contain interfaces of that group"
]
]).
:-export struct(group(name,
                      type,
                      index,
                      list,
                      nodes)).

3.2.2 Placeholder variables

If we do not specify a fixed attribute value when the named structure is created, then its value will be a free variable which can be bound later on. This is useful for two main purposes. On one side we can define attributes of a structure which will hold the constraint variables of a problem, on the other side we can leave some attributes initially unassigned so that we can fill these slots with results of a computation later on.

3.2.3 Nested structures vs. key lookup

A very common data representation problem is how to access information about some structure from another structure, for example in RiskWise how to access the information about a router from an interface of the router. There are two main alternatives. The first is to insert the data of the first entity (router) directly in the representation of the second entity (interface) as an additional attribute, the second is to store a key which can be used to look up the entity. Although the first method has the advantage of avoiding the extra lookup, we do not recommend this approach. If we have recursive references to objects (in our example above if the router also contains a link to all its interfaces) then this direct representation becomes an infinite data structure, which causes problems for printing and debugging. If we use the second approach, we obviously need a way to find the entity belonging to a particular key without too much overhead. The choice of the key depends on the representation of our overall data structure, which we will discuss in the next sections.

3.2.4 Lists

A natural way to represent a collection of items of the same type is to use lists. They are very convenient to handle an arbitrary number of items by iterating on successive heads of the list, until the empty list is reached. Unfortunately, finding a particular item in a list is a very expensive operation, as we have to scan the list sequentially.

We should never use a list when we can use a structure instead. If we know that a collection will always have the same number of items (say 3), it is much better to use a structure with that number of arguments than to use a list.

3.2.5 Hash tables

Hash tables are a very useful alternative to lists, if we sometimes want to look up items rather than iterate over all of them. They are defined in the library hash. We can add items one by one, without an a priori limit on the number of items. As key we can use numbers, atoms or arbitrary terms, but atoms would be the most common key in a hash table. This is very useful when converting input data, since the external data representation often will use names (atoms) to identify objects.

While it is possible to iterate over all items of a hash table, this is not as simple as iteration over a list or an array.

3.2.6 Vectors and arrays

Vectors are another way to represent a collection of items. Each item is associated with an integer key in the range from 1 to N, where N is the size of the vector. Unfortunately, the value N must be known a priori, when we first create the vector. Accessing individual entries by index is very fast, and iterating over all entries is nearly as simple as for lists. The main drawbacks of a vector representation are that we have to know the total number of items beforehand and that the keys must be consecutive integers in the range from 1 to N.

Multi-dimensional arrays are simple nested vectors, they are created with the dim predicate for a given dimension and size. Access to an element is with the subscript predicate (see section 5.5 for an example).

3.2.7 Multi-representation structures

Each of the alternative representations given above has some advantages and disadvantages. To obtain a very flexible representation, we can choose a multi-representation structure. In this structure, a collection of items is represented as a list and as a hash table and as an array. The list representation can be used for a very simple iteration over all items, the hash table is used in the initial data input phase to find items with a given name and the array of items is used in the core routines of the solver for the fastest access by an integer index.

The memory requirements of this multi-representation scheme are quite low. The storage to hold the items themselves is shared for all representations, we only need the additional space for the list, hash table and array structures.

In RiskWise, we use the multi-representation scheme for most data structures, with special access predicates like find_interface/3 to access items with either numeric indices or atom keys. References from one entity to another are by integer key, e.g. each interface structure contains as an attribute the integer key value of the node (router) to which it belongs.

1: ECLiPSEe 5.4 contains a freeware XML (http://www.xml.org) parser which handles most of the detail of parsing XML files. This makes the use of XML as a data exchange format for ECLiPSe are very exiting new possibility. The parser is described in the “Third Party Library” section of the reference manual.
2: Destructive assignment in the hash library is hidden from the user.

Property	Argument	Data file	Terms	Facts	EXDR
Multiple runs	++	+	+	-	+
Debugging	-	+	+	++	-
Test generation effort	-	+	+	+	-
Java I/O effort	-	+	-	+	+
ECLiPSe I/O effort	++	+	++	++	++
Memory	++	-	-	- -	-
Development effort	+	-	+	+	-