Re: [eclipse-clp-users] Hunting for bottlenecks: Is factbase stored on-disk?

From: Kish Shen <kisshen_at_cisco.com> Date: Tue, 04 Sep 2012 20:36:53 +0100 · This archive was generated by hypermail 2.2.0 : Thu Sep 06 2012 - 06:18:21 CEST

Hi David,

On 04/09/2012 02:45, -dp- wrote:
> We have a Java-embedded ECLiPSe 6.1 process running on Win7 Pro 64-bit
> 2.67GHz that takes more than two weeks to run, but all the while the CPU
> and RAM utilization is quite low. We are looking for ways of increasing CPU
> utilization in order to shorten the run time.

It is difficult to be very specific without more information....

However, how are you measuring CPU utilisation? If you are doing so 
using what Windows Task manager is giving you under 'processes' tab, the 
figure is based on the total number of cores you have, and as ECLiPSe 
does not run in parallel, you can only ever use 1 core (1 hyperthread), 
so even if that core is 100% busy, the utilisation % will still be low 
if you have many cores, as in many modern CPUs? For example, I just 
tried running a program on ECLiPSe that I know will maximise the CPU 
usage, and on my 4 core laptop (8 hyperthreads or whatever Intel calls 
it), I got a utilisation of about 16%.

> One of our colleagues speculated that "you are doing lots of read or write
> to your database on the disk and that is causing the IO bottleneck." Is any
> of the ECLiPSe database kept on disk instead of RAM?

No, your program code ("database") is kept in memory. This is virtual 
memory, so bits of it may be swapped to disc at the control of the OS.
If this happens, you should see a lot of I/O to your disc, but I don't 
think this is easy to see with the Windows task manager. If you total 
memory usage is small, then this is also a little unlikely (but it does 
depend on what else you are running on your machine at the same time).

> Any suggestions on where our bottleneck might be?
>
> Regards,
> David
>
> btw, my own speculation is that, because we have 250,000+ facts with the
> same functor, that the long runtime is due to serially attempting matches
> across most or all of those facts. We are looking into storing few facts or
> using more functors, but I've wondered if Google's parallelization of graph
> search algorithms in BigQuery might someday come to ECLiPSe to allow
> parallel search for backchaining.

ECLiPSe performs indexing on one argument during the matching of your 
goal against clauses in your program. That is, in addition to the clause 
name, one of the argument is used to filter out clauses that will not 
match.

If your program is performing search, have you looked at the search 
space your program is exploring? It is very easy for your program to be 
exploring a very large search space that will take very long to explore
if it is poorly constrained. If you are using constraints, do you know 
how effective those constraints are in reducing your search space (you 
also need to take into account the work done in the constraint reasoning 
(propagation), as this can also be expensive). If you are running 
constraint program, you can reduce the search space by using the "right" 
constraint (either a pre-existing one or writing your own specialised 
constraint), and/or using the right heuristics for labelling your 
variables (if you are looking for a feasible solution only) -- again you 
might be able to develop your own labelling heuristics.

This is one reason why we recommend that you develop your ECLiPSe code 
directly, as there are profiling and visualising tools that can help 
you. Note that the time-profiling tool is not available for Windows, so 
if you want to use this, you will need to run your program on a 
different platform.

Another point: although the program space of Prolog is sometimes call 
the "database", this is not what it is really designed for. That is, if 
you really want to use database functions, i.e. to store and retrieve 
many items of information quickly and efficiently, you might be better 
off using a real modern database. ECLiPSe does have an interface to 
MySQL that may be useful for such situations.

Finally, as I started my research on parallel Prolog, I think I will 
also comment a little on parallelisation of ECLiPSe: it is quite a 
complex task, and will involve some significant changes to the 
implementation of the engine. In fact, a lot of the early research at 
ECRC (where ECLiPSe originate from) was in parallising Prolog, and 
ECLiPSe still have some commented out code in the source for an 
or-parallel version that was distributed with ECLiPSe many years ago. 
You probably need a lot of work to get this code working again.

This work on parallelising ECLiPSe was dropped before I joined IC-Parc, 
where the research concentrated on constraint and hybrid solving. I 
think one reason for this is that parallelising will mostly give you 
linear speedups (and probably at the expense of increased memory usage 
and overheads), while the right algorithm/heuristics can produce orders 
of magnitudes of speedup in solving your problem. [another consideration 
was that at the time (about 15-20 years ago), parallel processors were 
uncommon and expensive, which is of course no longer true]

Cheers,

Kish