Further ideas for Debtags AI

Erich Schubert erich.schubert at gmail.com
Tue Jun 13 22:58:17 UTC 2006


Hi Alex,
> Neat idea! Would have to make it clear that these are merely suggestions
> though. Also, evaluating a package for all tags would take some time.
> But it would be a great feature :)

For the end user, I'd just pre-compute this information at a regular
interval and store it in a lookup database, where possible.
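A minimal sketch of that precomputation pass, assuming a weekly batch job and a plain SQLite cache (the `suggest_tags()` function is a made-up stand-in for the slow per-package AI evaluation):

```python
import sqlite3

def suggest_tags(package):
    # Placeholder for the real (expensive) per-package tag evaluation.
    return {"vim": ["use::editing"], "wget": ["protocol::http"]}.get(package, [])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suggestions (package TEXT, tag TEXT)")

# Run e.g. from a weekly cron job: evaluate once, store the results.
for pkg in ["vim", "wget"]:
    for tag in suggest_tags(pkg):
        conn.execute("INSERT INTO suggestions VALUES (?, ?)", (pkg, tag))
conn.commit()

# The end-user tool then only does a cheap lookup.
rows = conn.execute(
    "SELECT tag FROM suggestions WHERE package = ?", ("vim",)
).fetchall()
```

The point is just that the expensive evaluation and the user-facing lookup are decoupled; the real schema would of course live in the Debtags database, not in memory.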

> What was required for the database rewrite, any pointers? Although the

Sorry, not much information is available on that. Basically it's a
"design the database in a way that suits your needs" approach. The
current central database dates back to pre-faceted times.

> Yup, overfitting would indeed be a problem.

That's not overfitting, strictly speaking, but statistically invalid
use: using the same data to verify a result that you used to
"estimate" it in the first place. Statistics tells us that this
easily leads to misleading results.
Overfitting would be if we had an AI that basically put every tag
change on "reject" (trying to reproduce the trained tag set exactly),
since the tagging used in training didn't contain that change.
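The usual fix for that circular validation is a holdout split: train on one part of the tagged packages and score only against the rest. A minimal sketch (package names and tags are made up):

```python
import random

# Hypothetical tagged packages: name -> set of tags (made-up data).
tagged = {
    "vim": {"use::editing", "interface::text-mode"},
    "gimp": {"use::editing", "interface::x11"},
    "mutt": {"mail::user-agent", "interface::text-mode"},
    "wget": {"protocol::http", "interface::commandline"},
}

# Split once, up front: train on one half, evaluate on the other.
packages = sorted(tagged)
random.seed(42)
random.shuffle(packages)
half = len(packages) // 2
train_set = packages[:half]
test_set = packages[half:]

# Predictions must be scored only against the held-out half,
# never against the packages the model was trained on.
assert not set(train_set) & set(test_set)
```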

> algorithms we can do some further experimentation in this area. Maybe
> the AI tagger could randomly take a subset (half or so) of the packages
> to train on each week, instead of the whole deal.

Or it could be trained incrementally with the reviewed changes.
Naive Bayes should be easy to train iteratively; the itemset
approaches aren't easily updatable, I think.
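Naive Bayes is iteratively trainable because the model is just per-tag counts, so folding in one reviewed change is a constant-time update. A rough sketch, with invented class and field names (this is not the actual Debtags tagger):

```python
import math
from collections import defaultdict

class IncrementalNB:
    """Toy naive-Bayes-style tag scorer, trained one example at a time."""

    def __init__(self):
        self.tag_counts = defaultdict(int)   # tag -> number of packages seen
        self.word_counts = defaultdict(lambda: defaultdict(int))  # tag -> word -> count
        self.total = 0

    def learn(self, words, tags):
        """Fold one reviewed (description words, tags) pair into the counts."""
        self.total += 1
        for tag in tags:
            self.tag_counts[tag] += 1
            for w in words:
                self.word_counts[tag][w] += 1

    def score(self, words, tag):
        """Unnormalized log-likelihood of a tag, with add-one smoothing."""
        prior = (self.tag_counts[tag] + 1) / (self.total + 2)
        s = math.log(prior)
        n = sum(self.word_counts[tag].values()) + len(words)
        for w in words:
            s += math.log((self.word_counts[tag][w] + 1) / (n + 1))
        return s

nb = IncrementalNB()
nb.learn(["text", "editor"], ["use::editing"])
nb.learn(["mail", "client"], ["mail::user-agent"])
# A later reviewed change just calls learn() again -- no retraining pass.
nb.learn(["console", "editor"], ["use::editing"])
```

The itemset-mining approaches, by contrast, would typically need a full re-mining pass over the data, which is why incremental updates are harder there.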

best regards,
Erich Schubert
--
    erich@(mucl.de|debian.org)      --      GPG Key ID: 4B3A135C    (o_
  To understand recursion you first need to understand recursion.   //\
  Where friendly paths converge, the whole world looks like home     V_/_
        for an hour. --- Hermann Hesse


