Chapter 6 Input/Output

In this chapter we will discuss input and output with the ECLiPSe system. We will first discuss how to read data into an ECLiPSe program, then discuss different output methods. From this we extract some rules about good and bad data formats that may be useful when defining a data exchange format between different applications. At the end we show how to use a simple report generator library in RiskWise, which converts data lists into HTML reports.

6.1 Reading input data into data structures

The easiest way to read data into ECLiPSe programs is to use the Prolog term format for the data. Each term is terminated by a fullstop, which is a dot (.) followed by some white space. The following code reads terms from a file until the end of file is encountered and returns the terms in a list.

:-mode read_data(++,-).
read_data(File,Result):-
        open(File,read,S),
        read(S,X),
        read_data_lp(S,X,Result),
        close(S).

read_data_lp(_S,end_of_file,[]):-
        !.
read_data_lp(S,X,[X|R]):-
        read(S,Y),
        read_data_lp(S,Y,R).

This method is very easy to use if both source and sink of the data are ECLiPSe programs. Unfortunately, data provided by other applications will normally not be in the Prolog term format. For them we will have to use some other techniques¹.

6.2 How to use DCGs

In this section we describe the use of tokenizers and DCG (definite clause grammar) to produce a very flexible input system. The input routine of the NDI mapper of RiskWise is completely implemented with this method, and we use some of the code in the examples below.

In this approach the data input is split into two parts, a tokenizer and a parser. The tokenizer read the input and splits it into tokens. Each token corresponds to one field in a data item. The parser is used to recognize the structure of the data and to group all data belonging to one item together.

Using these techniques to read data files is a bit of overkill, they are much more powerful and are used for example to read ECLiPSe terms themselves. But, given the right grammar, this process is very fast and extremely easy to modify and extend.

The top level routine read_file opens the file, calls the tokenizer, closes the file and starts the parser. We assume here that at the end of the parsing the complete input stream has been read (third argument in predicate phrase). Normally, we would check the unread part and produce an error message.

:-mode read_file(++,-).
read_file(File,Term):-
        open(File,'read',S),
        tokenizer(S,1,L),
        close(S),
        phrase(file(Term),L,[]).

The tokenizer is a bit complicated, since our NDI data format explicitly mentions end-of-line markers, and does not distinguish between lower and upper case spelling. Otherwise, we might be able to use the built-in tokenizer of ECLiPSe (predicate read_token).

The tokenizer reads one line of the input at a time and returns it as a string. After each line, we insert a end_of_line(N) token into the output with N the current line number. This can be used for meaningful error messages, if the parsing fails (not shown here). We then split the input line into white space separated strings, eliminate any empty strings and return the rest as our tokens.

The output of the tokenizer will be a list of strings intermixed with end-of-line markers.

:-mode tokenizer(++,++,-).
tokenizer(S,N,L):-
        read_string(S,'end_of_line',_,Line),
        !,
        open(string(Line),read,StringStream),
        tokenizer_string(S,N,StringStream,L).
tokenizer(_S,_N,[]).

tokenizer_string(S,N,StringStream,[H|T]):-
        non_empty_string(StringStream,H),
        !,
        tokenizer_string(S,N,StringStream,T).
tokenizer_string(S,N,StringStream,[end_of_line(N)|L]):-
        close(StringStream),
        N1 is N+1,
        tokenizer(S,N1,L).

non_empty_string(Stream,Token):-
        read_string(Stream, " \t\r\n", _, Token1),
        (Token1 = "" ->
            non_empty_string(Stream,Token)
        ;
            Token = Token1
        ).

We now show an example of grammar rules which define one data file of the NDI mapper, the RouterTrafficSample file. The grammar for the file format is shown below:

file              := <file_type_line> 
                     <header_break>
                     [<data_line>]* 
file_type_line    := RouterTrafficSample <newline> 
header_break      := # <newline> 
data_line         := <timestamp> 
                     <router_name> 
                     <tcp_segments_in> 
                     <tcp_segments_out> 
                     <udp_datagrams_in> 
                     <udp_datagrams_in> 
                     <newline> 
timestamp         := <timestamp_ms> 
router_name       := <name_string>  
tcp_segments_in   := integer 
tcp_segments_out  := integer 
udp_datagrams_in  := integer 
udp_datagrams_out := integer

This grammar translates directly into a DCG representation. The start symbol of the grammar is file(X), the argument X will be bound to the parse tree for the grammar. Each rule uses the symbol --> to define a rule head on the left side and its body on the right. All rules for this particular data file replace one non-terminal symbol with one or more non-terminal symbols. The argument in the rules is used to put the parse tree together. For this data file, the parse tree will be a term router_traffic_sample(L) with L a list of terms router_traffic_sample(A,B,C,D,E,F) where the arguments are simple types (atoms and integers).

file(X) --> router_traffic_sample(X).

router_traffic_sample(router_traffic_sample(L)) --> 
        file_type_line("RouterTrafficSample"),
        header_break,
        router_traffic_sample_data_lines(L).

router_traffic_sample_data_lines([H|T]) --> 
        router_traffic_sample_data_line(H), 
        !,
        router_traffic_sample_data_lines(T).
router_traffic_sample_data_lines([]) --> [].

router_traffic_sample_data_line(
            router_traffic_sample(A,B,C,D,E,F)) --> 
        time_stamp(A),
        router_name(B),
        tcp_segments_in(C),
        tcp_segments_out(D),
        udp_datagrams_in(E),
        udp_datagrams_out(F),
        new_line.

tcp_segments_in(X) --> integer(X).

tcp_segments_out(X) --> integer(X).

udp_datagrams_in(X) --> integer(X).

udp_datagrams_out(X) --> integer(X).

Note the cut in the definition of router_traffic_sample_data_lines, which is used to commit to the rule when a complete data line as been read. If a format error occurs in the file, then we will stop reading at this point, and the remaining part of the input will be returned in phrase.

The following rules define the basic symbols of the grammar. Terminal symbols are placed in square brackets, while additional Prolog code is added with braces. The time_stamp rule for example reads one token X. It first checks that X is a string, then converts it to a number N, and then uses a library predicate eclipse_date to convert N into a date representation Date: 2017/10/25 10:03:20 , which is returned as the parse result.

file_type_line(X) --> ndi_string(X), new_line.

header_break --> 
        ["#"],
        new_line.

router_name(X) --> name_string(X).

time_stamp(Date) --> 
        [X],
        {string(X),
         number_string(N,X),
         eclipse_date(N,Date)
        }.

integer(N) --> [X],{string(X),number_string(N,X),integer(N)}.

name_string(A) --> ndi_string(X),{atom_string(A,X)}.

ndi_string(X) --> [X],{string(X)}.

new_line --> [end_of_line(_)].

These primitives are reused for all files, so that adding the code to read a new file format basically just requires some rules defining the format.

6.3 Creating output data files

In this section we discuss how to generate output data files. We present three methods which implement different output formats.

6.3.1 Creating Prolog data

We first look at a special case where we want to create a file which can be read back with the input routine shown in section 6.1. The predicate output_data writes each item in a list of terms on one line of the output file, each line terminated by a dot (.). The predicate writeq ensures that atoms are quoted, operator definitions are handled correctly, etc.

:-mode output_data(++,+).
output_data(File,L):-
        open(File,'write',S),
        (foreach(X,L),
         param(S) do
           writeq(S,X),writeln('.')
        ),
        close(S).

It is not possible to write unbound constrained variables to a file and to load them later, they will not be re-created with their previous state and constraints. In general, it is a good idea to restrict the data format to ground terms, i.e. terms that do not contain any variables.

6.3.2 Simple tabular format

Generating data in Prolog format is easy if the receiver of the data is another ECLiPSe program, but may be inconvienient if the receiver is written in another language. In that case a tabular format that can be read with routines like scanf is easier to handle. The following program segment shows how this is done. For each item in a list we extract the relevant arguments, and write them to the output file separated by white space.

:-mode output_data(++,+).
output_data(File,L):-
        open(File,'write',S),
        (foreach(X,L),
         param(S) do
            output_item(S,X)
        ),
        close(S).

output_item(S,data_item with [attr1:A1,attr2:A2]):-
        write(S,A1),
        write(S,' '),
        write(S,A2),
        nl(S).

We use the predicate write to print out the individual fields. As this predicate handles arbitrary argument types, this is quite simple, but it does not give us much control over the format of the fields.

6.3.3 Using printf

Instead, we can use the predicate printf which behaves much like the C-library routine. For each argument we must specify the argument type and an optional format. If we make a mistake in the format specification, a run-time error will result.

:-mode output_data(++,+).
output_data(File,L):-
        open(File,'write',S),
        (foreach(X,L),
         param(S) do
            output_item(S,X)
        ),
        close(S).

output_item(S,data_item with [attr1:A1,attr2:A2]):-
        printf(S,"%s %d\n",[A1,A2]).

We can use the same schema for creating tab or comma separated files, which provides a simple interface to spreadsheets and data base readers.

6.4 Good and bad data formats

When defining the data format for an input or output file, we should choose a representation which suits the ECLiPSe application. Table 6.1 shows good and bad formats. Prolog terms are very easy to read and to write, simple tabular forms are easy to write, but more complex to read. Comma separated files need a special tokenizer which separates fields by comma characters. The most complex input format is given by a fixed column format , for example generated by FORTRAN applications. We should avoid such data formats as input if possible, since they require significant development effort.

Format Input Output

Prolog terms ++ ++

EXDR ++ ++

White space separated + ++

Comma separated - ++

Fixed columns - - +

Table 6.1: Good and bad input formats

6.5 Report generation

There is another output format that can be generated quite easily. RiskWise uses a report library, which presents lists of items as HTML tables in hyper linked files. This format is very useful to print some data in a human readable form, as it allows some navigation across files and sorting of tables by different columns. Figure 6.1 shows an example from the RiskWise application.

Figure 6.1: HTML Report

1: We should at this point again mention the possibilities of the EXDR format which can be easily read into ECLiPSe, and which is usually simpler to generate in other languages than the canonical Prolog format.

Format	Input	Output
Prolog terms	++	++
EXDR	++	++
White space separated	+	++
Comma separated	-	++
Fixed columns	- -	+