False positive dependency checks break package install order

Jeffrey Johnson n3npq at me.com
Tue Aug 13 20:49:05 CEST 2013


On Aug 13, 2013, at 2:08 PM, Jacek Konieczny wrote:

> On Tue, 13 Aug 2013 12:26:38 -0400
> Jeffrey Johnson <n3npq at me.com> wrote:
> >> Yes. The Bloom filters have been in use for about 3 years with no known
>> problems, including randomized installations for 5-10 linux distros
>> under a CI harness.
>> 
> >> The performance/memory gains are significant.
>> 
>>> Are there any other places in the code with algos like this?
>>> 
>> 
>> Yes. Bloom filters are quite convenient.
> 
> I am sure they are. For a _preliminary_ check. This can still be a huge
> performance improvement.
> 

I've actually measured and reported the gains when implemented.

Reality is always better than an estimate, though reality also carries some
statistical probability of not achieving a desired coverage goal.

> Let's say we have 1000 packages with 1000 files each on average, and 
> we are looking for 'a package containing a given file'.
> 
> And let's say the Bloom filter gives 1 in 1000 false positives on
> average, the cost of a Bloom filter lookup is '2', and the cost of a 
> file list scan comparison is '1'.
> 
> 1. No filter:
> 
> 1000 * 1000 = 1000000 comparisons, no false positives
> 
> cost: 1000000
> 
> 2. Bloom filter only:
> 
> 1000 filter lookups, 2 false positives
> 
> cost: 2000
> 
> 500 times faster, but unreliable.
> 
> 3. Bloom filter plus scan:
> 
> 1000 filter lookups -> give us 2 'candidates' -> these require 2000
> comparisons
> 
> cost: 4000
> 
> Only twice as slow as the Bloom filter alone, and still 250 times faster than
> just scanning the file list.
> 
> One could also estimate b-tree index lookup cost for this case.
> 
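
(For reference, the arithmetic above is easy to check mechanically. A quick
Python sketch, taking the per-operation costs and the two-candidate count at
face value from the estimate:)

    packages, files_per_pkg = 1000, 1000
    lookup_cost, compare_cost = 2, 1

    # 1. No filter: scan every package's full file list.
    no_filter = packages * files_per_pkg * compare_cost             # 1000000

    # 2. Bloom filter only: one lookup per package, false positives accepted.
    filter_only = packages * lookup_cost                            # 2000

    # 3. Bloom filter plus scan: verify each candidate with a full list scan.
    candidates = 2
    filter_plus_scan = filter_only + candidates * files_per_pkg * compare_cost  # 4000

    print(no_filter, filter_only, filter_plus_scan)                 # 500x and 250x faster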

One could.

But let's start with other analysis in parallel.

False positives are perfectly acceptable if the consequences
are minimal and the gains are significant.

A falsely detected LOOP isn't fatal to RPM in any meaningful sense, just confusing
to someone used to rpm-4.5 output.

The Bloom filter implementation is also trivially tunable: I claim that there is
a threshold at which false LOOPs are indistinguishable from other
errors, including packager-induced failures. Whether that threshold is one-in-a-million
or one-in-a-billion needs to be decided through means other than the algorithm.
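
(To illustrate how tunable it is, here are the textbook Bloom filter sizing
formulas, not RPM's code: for n keys and a target false-positive rate p, the
required bits per key and hash count follow directly.)

    import math

    def bloom_parameters(n_keys, p):
        # Textbook sizing: m bits and k hash functions for n keys at
        # false-positive rate p.
        m = math.ceil(-n_keys * math.log(p) / (math.log(2) ** 2))
        k = max(1, round((m / n_keys) * math.log(2)))
        return m, k

    # 1000 packages * 1000 files, at one-in-a-thousand/million/billion.
    for p in (1e-3, 1e-6, 1e-9):
        m, k = bloom_parameters(1000 * 1000, p)
        print("p=%g: %.1f bits/key, %d hashes" % (p, m / (1000 * 1000), k))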

But let's assume PLD packaging is exactly perfect (it isn't) and that two false
positives spontaneously create a LOOP. RPM deals with that LOOP by ignoring each
of its edges while ordering, an utterly harmless failure mode.

The likelier manifestation (because it takes a single rather than a double
incidence) is that one false positive completes a LOOP where none existed.
All edges in the LOOP are ignored while ordering. The difference is that
useful information, the genuine relations in that LOOP, is discarded when
all of its edges are ignored.

The core problem isn't whether Bloom filters have false positives, but that
LOOPs are corrected by ignoring all edges in the LOOP, thereby discarding
useful information.
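
(A toy sketch of that failure mode, not RPM's ordering code: one false edge
closes a LOOP over genuine edges, and breaking the LOOP by dropping every edge
in it throws away the genuine ordering constraints as well.)

    # Genuine relations: A requires B, and B requires C.
    edges = {("A", "B"), ("B", "C")}
    # One false positive adds "C requires A", closing a LOOP over genuine edges.
    edges.add(("C", "A"))

    def find_cycle(edges):
        # Tiny depth-first cycle finder, good enough for a toy graph.
        graph = {}
        for a, b in edges:
            graph.setdefault(a, []).append(b)

        def visit(node, path):
            if node in path:
                return path[path.index(node):]
            for nxt in graph.get(node, []):
                cycle = visit(nxt, path + [node])
                if cycle:
                    return cycle
            return None

        for start in graph:
            cycle = visit(start, [])
            if cycle:
                return cycle
        return None

    cycle = find_cycle(edges)                         # some rotation of A, B, C
    loop_edges = set(zip(cycle, cycle[1:] + cycle[:1]))
    edges -= loop_edges                               # every edge in the LOOP is dropped
    print(edges)                                      # empty: genuine A->B, B->C are gone too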

The other false-positive failure modes of Bloom filters can be analyzed similarly.

Meanwhile one of the major benefits of a compact/efficient sparse Bloom filter store
for files is that the partial ordering -- which often fails because packagers
fail to add necessary relations, aka missing data -- is augmented by the rules

	Every file "requires" its parent directory
	Every symlink "requires" its end point.

The file system hierarchy is a DAG already, and using the 2 rules above
adds determinism to package ordering transparently, thereby enhancing
robustness and reliability of RPM installs.
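
(A minimal sketch of those two rules over a made-up manifest; in RPM the
resulting file-level relations become ordering edges between the packages that
own the paths.)

    import os

    # Made-up manifest: path -> symlink target (None for regular files/dirs).
    manifest = {
        "/usr/bin":           None,
        "/usr/bin/python":    "python2.7",    # symlink to a sibling file
        "/usr/bin/python2.7": None,
        "/usr/lib/python2.7": None,
    }

    def implicit_requires(manifest):
        # Rule 1: every file "requires" its parent directory.
        # Rule 2: every symlink "requires" its end point.
        for path, link_target in manifest.items():
            yield path, os.path.dirname(path)
            if link_target is not None:
                yield path, os.path.normpath(
                    os.path.join(os.path.dirname(path), link_target))

    for path, required in implicit_requires(manifest):
        print("%s requires %s" % (path, required))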

(aside)
The first person who points out that a file system with symlinks need
not be a DAG will be asked:
	Exactly what "real world" packaging purpose is served by adding
	symlinks that return ELOOP when traversed?
I.e., it's a packaging error to distribute a symlink LOOP.

73 de Jeff

> Greets,
> 	Jacek


