[pymvpa] Train and test on different classes from a dataset

MS Al-Rawi rawi707 at yahoo.com
Tue Feb 5 10:35:58 UTC 2013


Interesting discussion, 

I have similar expectations to Michael's, and I find this 4% difference hard to justify. Could it be due to the randomization algorithm used to permute the training set? Imperfect randomization could be one reason for the discrepancy.
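
To make the two permutation schemes being compared concrete, here is a minimal,
self-contained sketch (toy data only; the classifier, split, and all parameter
choices are illustrative assumptions, not the setup from Yaroslav's script). It
contrasts a null distribution built by permuting only the training labels, as
Michael describes below, with one where the testing labels are shuffled as well:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# toy two-class data with a weak signal in one feature (purely illustrative)
n, p = 100, 20
X = rng.randn(n, p)
y = np.repeat([0, 1], n // 2)
X[y == 1, 0] += 0.5

# fixed, balanced split into training and testing halves
train, test = np.arange(0, n, 2), np.arange(1, n, 2)
clf = LinearSVC()

def null_accuracies(permute_test, n_perm=1000):
    # accuracies of classifiers trained on permuted (signal-free) labels
    accs = np.empty(n_perm)
    for i in range(n_perm):
        y_train = rng.permutation(y[train])  # destroy the signal in training
        y_test = rng.permutation(y[test]) if permute_test else y[test]
        accs[i] = clf.fit(X[train], y_train).score(X[test], y_test)
    return accs

# H0 distribution evaluated against the pristine testing labels
null_pristine = null_accuracies(permute_test=False)
# H0 distribution when the testing labels are permuted as well
null_both = null_accuracies(permute_test=True)
print(null_pristine.mean(), null_pristine.std())
print(null_both.mean(), null_both.std())

Comparing the two empirical null distributions (and the p-value each assigns to
the observed accuracy) would be one way to check how much of such a difference
comes from the permutation scheme itself rather than from the randomization.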

Cheers,
-Rawi



>________________________________
> From: Michael Hanke <mih at debian.org>
>To: pkg-exppsy-pymvpa at lists.alioth.debian.org 
>Sent: Tuesday, February 5, 2013 8:11 AM
>Subject: Re: [pymvpa] Train and test on different classes from a dataset
> 
>On Mon, Feb 04, 2013 at 07:37:51PM -0500, Yaroslav Halchenko wrote:
>> I have tortured your script a bit to bootstrap multiple cases and add
>> plotting ROCs (even cheated and used scikit-learn for that since we
>apparently need an overhaul of ROCCurve).  As you see, keeping the
>testing portion intact results in lower detection power
>
>Thanks for the update!
>
>Lower detection power? Do you mean the 4% difference from the
>theoretical maximum? I can live with that. And that is because it is,
>conceptually, a quite different test IMHO.
>
>Let us ignore the cross-validation case for a second and focus on two
>datasets for simplicity (although it doesn't change things).
>
>Why are we doing permutation analysis? Because we want to know how
>likely it is to observe a specific prediction performance on a
>particular dataset under the H0 hypothesis, i.e. how good a
>classifier can get at predicting our empirical data when the training
>did not contain the signal of interest -- aka chance performance.
>
>We assume that both training and testing datasets are rather similar --
>generated by random sampling from the underlying common signal
>distribution. If we permute the training dataset, we hope that this will
>destroy the signal of interest. Hence if we train a classifier on
>permuted datasets it will, in the long run, not have learned what we are
>interested in -- no signal: H0.
>
>For all these classifiers trained on permuted data we want to know how
>well they can discriminate our empirical data (aka testing data) -- more
>precisely the pristine testing data. Because, in general, we do _not_
>want to know how well a classifier trained on no signal can discriminate
>any other dataset with no signal (aka permuted testing data).
>
>It is true that, depending on the structure of the testing data,
>permuting the testing data will destroy the contained signal with
>varying degrees of efficiency. But again, we want to answer the question
>how well a classifier could perform by chance when predicting our actual
>empirical test data -- not any dataset that could be generated from it
>by degrading the potentially contained signal of interest.
>
>
>Michael
>
>-- 
>Michael Hanke
>http://mih.voxindeserto.de
>


