In this chapter we discuss the choice of data structures for the different application parts. Next to the top-level design, this is the most important aspect that must be specified correctly right from the beginning of a project. The wrong choice of a data structure may mean significant re-work in order to change some deficiency later on, while on the other hand a good data structure design can simplify the coding process and can lead to a very efficient implementation.
The first question is how the data will be fed to the application. We can distinguish five alternatives.
We should note that instead of using files we can also build queues between the ECLiPSe and the Java parts of the application, avoiding the need for file system space.
Which of these methods should be used? This depends on the application. Passing data as arguments clearly is the cleanest way, but requires significant work on the interface and on code for testing. Using data files in fixed formats is simple if the format is defined correctly, but its use of the file system can cause problems when multiple queries should be run concurrently on the same machine. Using Prolog terms in data files has the same disadvantage, but is very simple to use if different ECLiPSe systems exchange data. EXDR files are the safest form to store data, but also the least intuitive. Using queues instead of files avoids problems with multiple instances running at the same time, but require some form of logging to allow debugging. Using facts is a valid alternative if most of the data do not change from one query to the next, but requires extra work to reclaim memory after each change. The following table tries to summarize the advantages and disadvantages of each method.
Property Argument Data file Terms Facts EXDR Multiple runs ++ + + - + Debugging - + + ++ - Test generation effort - + + + - Java I/O effort - + - + + ECLiPSe I/O effort ++ + ++ ++ ++ Memory ++ - - - - - Development effort + - + + -
The next question is how the data should be represented inside the application. For this purpose we will have to introduce data structures which allow rapid access to information, which deal with multiple data sets in different parts of the application and where we can add information in a structured way. It should be clear that the built-in fact data base cannot be used for this purpose. Instead, we have to pass the information via arguments of the predicates. In the following sections, we will discuss how the data should be structured to simplify access and coding.
Note that all our data structures use single assignment, i.e. there is no destructive assignment in the language2. Instead of removing or changing elements of a structuce, we will always make a near-copy with some information being removed or changed.
The principal data representation feature of ECLiPSe are named structures. These are terms were each argument is linked to an argument name. We can access one or more of the arguments with the with operator. Named structures are very similar to structures in other languages, the arguments of the structure correspond to attributes of the entity represented. Different attributes can have different types, so that we can store diverse information in a named structure.
In order to use a structure, it must be defined with a struct definition. We can define a structure either local to a module or export the definition so that the same structure can be used in other modules which import the definition. As part of a systematic design we normally create a module which contains nothing but exported structure definitions. This module is then imported with a use_module directive in all other modules of the application which use the structures. If a structure is used in one module only, we should define it as a local structure in that module.
We also use comment directives to document the named structures, just like we do for exported predicates. For each attribute name, we define the data type of the attribute. Normally, these will be atomic data types (integer, real, atom, string), but that is not required. The attribute can hold any data type that we can pass as an argument to a predicate.
As an example of a named structure we use a small part of the RiskWise module flow_structures.
:-comment(struct(group),[ summary:" this data structure describes the group object ", fields:[ "name":"atom, name of the group", "type":"atom, one of pop, interconntion, vpn or tariff_class", "index":"integer, unique index of the group", "list":"list of interface indices belonging to the group", "nodes":"list of nodes which contain interfaces of that group" ] ]). :-export struct(group(name, type, index, list, nodes)).
If we do not specify a fixed attribute value when the named structure is created, then its value will be a free variable which can be bound later on. This is useful for two main purposes. On one side we can define attributes of a structure which will hold the constraint variables of a problem, on the other side we can leave some attributes initially unassigned so that we can fill these slots with results of a computation later on.
A very common data representation problem is how to access information about some structure from another structure, for example in RiskWise how to access the information about a router from an interface of the router. There are two main alternatives. The first is to insert the data of the first entity (router) directly in the representation of the second entity (interface) as an additional attribute, the second is to store a key which can be used to look up the entity. Although the first method has the advantage of avoiding the extra lookup, we do not recommend this approach. If we have recursive references to objects (in our example above if the router also contains a link to all its interfaces) then this direct representation becomes an infinite data structure, which causes problems for printing and debugging. If we use the second approach, we obviously need a way to find the entity belonging to a particular key without too much overhead. The choice of the key depends on the representation of our overall data structure, which we will discuss in the next sections.
A natural way to represent a collection of items of the same type is to use lists. They are very convenient to handle an arbitrary number of items by iterating on successive heads of the list, until the empty list is reached. Unfortunately, finding a particular item in a list is a very expensive operation, as we have to scan the list sequentially.
We should never use a list when we can use a structure instead. If we know that a collection will always have the same number of items (say 3), it is much better to use a structure with that number of arguments than to use a list.
Hash tables are a very useful alternative to lists, if we sometimes want to look up items rather than iterate over all of them. They are defined in the library hash. We can add items one by one, without an a priori limit on the number of items. As key we can use numbers, atoms or arbitrary terms, but atoms would be the most common key in a hash table. This is very useful when converting input data, since the external data representation often will use names (atoms) to identify objects.
While it is possible to iterate over all items of a hash table, this is not as simple as iteration over a list or an array.
Vectors are another way to represent a collection of items. Each item is associated with an integer key in the range from 1 to N, where N is the size of the vector. Unfortunately, the value N must be known a priori, when we first create the vector. Accessing individual entries by index is very fast, and iterating over all entries is nearly as simple as for lists. The main drawbacks of a vector representation are that we have to know the total number of items beforehand and that the keys must be consecutive integers in the range from 1 to N.
Multi-dimensional arrays are simple nested vectors, they are created with the dim predicate for a given dimension and size. Access to an element is with the subscript predicate (see section 5.5 for an example).
Each of the alternative representations given above has some advantages and disadvantages. To obtain a very flexible representation, we can choose a multi-representation structure. In this structure, a collection of items is represented as a list and as a hash table and as an array. The list representation can be used for a very simple iteration over all items, the hash table is used in the initial data input phase to find items with a given name and the array of items is used in the core routines of the solver for the fastest access by an integer index.
The memory requirements of this multi-representation scheme are quite low. The storage to hold the items themselves is shared for all representations, we only need the additional space for the list, hash table and array structures.
In RiskWise, we use the multi-representation scheme for most data structures, with special access predicates like find_interface/3 to access items with either numeric indices or atom keys. References from one entity to another are by integer key, e.g. each interface structure contains as an attribute the integer key value of the node (router) to which it belongs.