False positive dependency checks break package install order

Tue Aug 13 20:08:38 CEST 2013

On Tue, 13 Aug 2013 12:26:38 -0400
Jeffrey Johnson <n3npq at me.com> wrote:
> Yes. The Bloom filters been in use for about 3 years with no known
> problems, including randomized installations for 5-10 linux distros
> under a CI harness.
> 
> THe performance/memory gains are significant.
> 
> > Are there any other places in the code with algos like this?
> > 
> 
> Yes. Bloom filters are quite convenient.

I am sure they are. For a _preliminary_ check. This can still a huge
performance improvement.

Let's say we have 1000 packages each 1000 files average and 
we are looking for 'a package containing given file'.

And let's say the Bloom filter gives 1 in 1000 false positives on
average and a cost of Bloom filter lookup is '2', while cost of 
file list scan comparision is '1'.

1. No filter:

1000 * 1000 = 1000000 comparisions, no false positives

cost: 1000000

2. Bloom filter only:

1000 filter lookups, 2 false positives

cost: 2000

500 times faster, but unreliable.

3. Bloom filter plus scan:

1000 filter lookups -> give us 2 'candidates' -> these require 2000
comparisions

cost: 4000

Only twice slower than Bloom filter only and still 250 times faster than
just scanning the file list.

One could also estimate b-tree index lookup cost for this case.

Greets,
	Jacek