[Debtags-devel] AI-Tagger

Fri Sep 16 09:22:28 UTC 2005

Hello,

the promised version of the AI-Tagger, which can test all packages
against a given tag, is now available from
        svn+ssh://<username>@svn.debian.org/svn/debtags/autodebtag/trunk/ai-tagger

The following steps produce the set of packages for which the tagger
proposes the uitoolkit::qt tag:

        ./create-data.pl --max-good=100 --bad-ratio=2  uitoolkit::qt
        ./bayesian-tagger.pl --train uitoolkit::qt
        ./bayesian-tagger.pl --perform-test uitoolkit::qt
        ./bayesian-tagger.pl --test-all-packages --print-names-only uitoolkit::qt

The third step is optional. It only tests the result of the training,
but it should give you a rough idea of what to expect. For me the output
is:
        ./bayesian-tagger.pl --perform-test uitoolkit__qt/

        Tested packages: 150
        Expected to be good: 50
        Expected to be bad: 100
        Matches: 114 ^= 0.76
        Mismatches: 12 ^= 0.08
        Unsure: 24 ^= 0.16
        Expected good, but wielded bad: 0 ^= 0
        Expected good, but wielded unsure: 9 ^= 0.18
        Expected good, and wielded good: 41 ^= 0.82
        Expected bad, but wielded good: 12 ^= 0.12
        Expected bad, but wielded unsure: 15 ^= 0.15
        Expected bad, and wielded bad: 73 ^= 0.73
The line "Expected bad, but wielded good: 12 ^= 0.12" gives you an idea
of how many false positives to expect from the --test-all-packages run.
Applied to roughly 17000 packages:
        0.12 * 17000 = 2040
In addition, there will be the packages correctly classified as good.
Note that some of the false positives may actually be true positives,
because the current database may be faulty. For Qt packages, however, it
should be in pretty good shape, because the tags were autogenerated
based on the libqt dependency.
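
The estimate above can be reproduced with a few lines of Python. This is
only a sketch: the function name is made up for illustration, and 17000
is simply the package count used in the calculation above.

```python
def expected_false_positives(bad_yielded_good, bad_total, packages):
    """Extrapolate the false-positive count of a test run to `packages` packages."""
    rate = bad_yielded_good / bad_total
    return round(rate * packages)


# Counts from the test run above: 12 of the 100 packages expected to be
# bad were classified as good.
print(expected_false_positives(12, 100, 17000))
```

Working from the raw counts rather than the rounded ratio makes it easy
to redo the estimate after retraining with a different data set.
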
I was able to decrease the false-positive rate to 0.07 by using a
broader training set, created with the first command:
	./create-data.pl --max-good=400 --bad-ratio=2  uitoolkit::qt

With 
	./create-data.pl --max-good=800 --bad-ratio=2  uitoolkit::qt
I was able to achieve the following results:
        Tested packages: 1632
        Expected to be good: 326
        Expected to be bad: 1306
        Matches: 1324 ^= 0.811274509803922
        Mismatches: 50 ^= 0.0306372549019608
        Unsure: 258 ^= 0.158088235294118
        Expected good, but wielded bad: 5 ^= 0.0153374233128834
        Expected good, but wielded unsure: 25 ^= 0.0766871165644172
        Expected good, and wielded good: 296 ^= 0.907975460122699
        Expected bad, but wielded good: 45 ^= 0.0344563552833078
        Expected bad, but wielded unsure: 233 ^= 0.178407350689127
        Expected bad, and wielded bad: 1028 ^= 0.787136294027565
This is pretty good, with only 3.4% false positives and only 3.1%
mismatches. However, we do not have such a large training set for many
facets.
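
For reference, the summary figures follow directly from the six
"Expected ..." counts. The sketch below recomputes them in Python; the
function and key names are my own and are not taken from
bayesian-tagger.pl.

```python
def summarise(good_as_good, good_as_unsure, good_as_bad,
              bad_as_good, bad_as_unsure, bad_as_bad):
    """Recompute the summary ratios from the six per-class counts."""
    good = good_as_good + good_as_unsure + good_as_bad
    bad = bad_as_good + bad_as_unsure + bad_as_bad
    total = good + bad
    return {
        "tested": total,
        "matches": (good_as_good + bad_as_bad) / total,
        "mismatches": (good_as_bad + bad_as_good) / total,
        "unsure": (good_as_unsure + bad_as_unsure) / total,
        # Ratios relative to the class the package was expected to be in.
        "false_positive_rate": bad_as_good / bad,
        "false_negative_rate": good_as_bad / good,
    }


# Counts from the --max-good=800 run above.
report = summarise(296, 25, 5, 45, 233, 1028)
for name, value in report.items():
    print(name, value)
```

Note that matches, mismatches and unsure are ratios of all tested
packages, while the false-positive rate is relative to the packages
expected to be bad only.
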

In the last step,
	./bayesian-tagger.pl --test-all-packages --print-names-only uitoolkit::qt
the tagger tests each package against the given tag and, if it matches,
prints the name of the package to standard output. So it will output
every package that it "thinks" should have this tag, one per line.
Creating a debtags patch from the output should be pretty
straightforward, using your favourite script language :-)
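
As one possible starting point, here is a minimal Python sketch. It
assumes the patch format is one line per package of the form
"package: +tag"; please check that against what your debtags tools
expect. The function name and the package names in the example are
placeholders, not real tagger output.

```python
def make_patch(package_names, tag):
    """Turn package names (one per item) into 'package: +tag' patch lines."""
    lines = []
    for name in package_names:
        name = name.strip()
        if name:  # skip blank lines in the tagger output
            lines.append(f"{name}: +{tag}")
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Placeholder input standing in for the --print-names-only output.
names = ["kcalc\n", "\n", "kate\n"]
print(make_patch(names, "uitoolkit::qt"), end="")
```

In real use the names would come from the output of the last step, for
example by passing an open file or sys.stdin as package_names.
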

Don't forget to delete the generated folder (uitoolkit__qt) before
trying a new training set. Otherwise the new packages will be added to
the existing database. Play around a little to get a feeling for it, and
don't hesitate to ask questions.
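
If you retrain often, the cleanup can be scripted. The sketch below
assumes the folder name is the tag with "::" replaced by "__", which I
am only inferring from the uitoolkit::qt / uitoolkit__qt example; the
helper names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def data_dir_for(tag, base="."):
    """Guess the generated folder for a tag (assumed mapping: '::' -> '__')."""
    return Path(base) / tag.replace("::", "__")


def reset_training_data(tag, base="."):
    """Remove the generated folder if present; return True if it was removed."""
    target = data_dir_for(tag, base)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demonstration in a throwaway directory, so nothing real gets deleted.
with tempfile.TemporaryDirectory() as scratch:
    data_dir_for("uitoolkit::qt", scratch).mkdir()
    print(reset_training_data("uitoolkit::qt", scratch))
    print(reset_training_data("uitoolkit::qt", scratch))
```

Run it from the ai-tagger directory (or pass that directory as base)
before calling create-data.pl again.
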

Use "./create-data.pl --man" and "./bayesian-tagger.pl --man" to see all
the available options. You can combine options, so you can do something
like "./bayesian-tagger.pl --train --perform-test  uitoolkit::qt"; a
sane order will be ensured.

Greetings 

Ben