[pymvpa] Train and test on different classes from a dataset

Tue Feb 5 14:32:39 UTC 2013

On Tue, 05 Feb 2013, Michael Hanke wrote:

> On Mon, Feb 04, 2013 at 07:37:51PM -0500, Yaroslav Halchenko wrote:
> > I have tortured your script a bit to bootstrap multiple cases and add
> > plotting ROCs (even cheated and used scikit-learn for that since we
> > apparently need overhaul of ROCCurve).  As you see, keeping
> > testing  portion intact results in lower detection power

> Thanks for the update!

> Lower detection power? Do you mean the 4% difference from the
> theoretical maximum? I can live with that. And that is because it is,
> conceptually, a quite different test IMHO.

> Let us ignore the cross-validation case for a second and focus on two
> datasets for simplicity (although it doesn't change things).

> Why are we doing permutation analysis? Because we want to know how
> likely it is to observe a specific prediction performance on a
> particular dataset under the H0 hypothesis, i.e. how good can a
> classifier get at predicting our empirical data when the training
> did not contain the signal of interest -- aka chance performance.

Actually, I would like to step back even further and forget not only
about cross-validation but about prediction altogether...  Why are
we carrying out some estimation and then permutation testing on such
estimates?  It is not about cross-validation, and not even about
prediction performance per se -- it is all about "detection" (unless you
dive further into analyzing the constructed model): rejecting the H0 of
having no signal of interest.

We are using multivariate methods primarily in the same vein as we had
been using GLM/ANOVA etc. -- to figure out whether the data at hand
carries the signal of interest.  We started to use multivariate methods
since they were capable of detecting signal buried in multi-voxel
patterns.  With a (cross-validated) prediction accuracy as our measure,
we ended up at first without a reliable reference "ground truth"
distribution to judge whether it is a trustworthy indication of having
detected the signal.  Permutation testing gives us an opportunity to
characterize our estimates, to state how likely it is that any given
dataset carries the signal of interest.  And that is why "detection
power" is actually of interest for us.  And that is why a 4% difference
from the theoretical maximum with a HUGE SNR is somewhat "suboptimal",
to say the least.

NB But I guess there is also a glitch with my use of ROC for this
"analysis", since we are not per se interested in the whole
"distribution" of obtained p-values, thus not in the full curve but
rather only in the power of making correct decisions at commonly
accepted levels (e.g. 0.05 and 0.01), so I would include that in later
simulations (if those are to come).
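
To make "power at commonly accepted levels" concrete, here is a minimal
sketch in plain Python.  The function names are made up for illustration
(this is not PyMVPA API), and the "+1" correction in the p-value is my
assumption of the usual Monte Carlo convention, not something taken from
the script discussed in this thread.

```python
def permutation_p_value(observed, null_values):
    """Monte Carlo p-value: share of null estimates at least as large as observed.

    The +1 in numerator and denominator counts the observed value itself,
    so the p-value can never be exactly 0 (an assumed convention).
    """
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


def detection_power(p_values, alphas=(0.05, 0.01)):
    """Fraction of simulated datasets (with signal) declared significant per level."""
    return {
        alpha: sum(p <= alpha for p in p_values) / len(p_values)
        for alpha in alphas
    }
```

The idea would be to call permutation_p_value once per simulated dataset
and then pass the collected p-values to detection_power, which replaces
the full ROC curve with just the two numbers we actually care about.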

> We assume that both training and testing dataset are rather similar --
> generated by random sampling from the underlying common signal
> distribution. If we permute the training dataset, we hope that this will
> destroy the signal of interest. Hence if we train a classifier on
> permuted datasets it will, in the long run, not have learned what we are
> interested in -- no signal: H0.

> For all these classifiers trained on permuted data we want to know how
> well they can discriminate our empirical data (aka testing data) -- more
> precisely the pristine testing data. Because, in general, we do _not_
> want to know how well a classifier trained on no signal can discriminate
> any other dataset with no signal (aka permuted testing data).

> It is true that, depending on the structure of the testing data,
> permuting the testing data will destroy the contained signal with
> varying degrees of efficiency. But again, we want to answer the question
> how well a classifier could perform by chance when predicting our actual
> empirical test data -- not any dataset that could be generated from it
> by degrading the potentially contained signal of interest.

And -- as we discussed probably a year or two ago -- I concur with such
reasoning in general.  I just got intrigued by the quite huge widening
of such "chance distribution" in your plots, thus decided to
investigate.  Altogether, I think that what we are battling with now is
how to deal with divergence from the main assumption behind MC
permutation testing -- independence of samples.  If samples are
independent, and we "know" that -- regular "permute all" would be the
most powerful technique.  If we need to battle some dependence
structure -- we need to alter the permutation scheme to account for it
(and that is what we are doing here).  So here is a summary of the cases
so far from the simulations (so nothing theoretical, just empirical
observations):

1. samples are independent
   possible scenario in fMRI data:
   - independent samples per label (or the degenerate case of 1
     sample/label per chunk) -- e.g. betas from a GLM

   regular permutation of both training and testing sets provides
   the best power for detecting the signal

2. samples are not independent
   possible scenario in fMRI data:
   - dependent samples per label, e.g. multiple volumes from
     the same trial/block ending up as multiple samples with the
     same label

   permutation of training only, keeping testing intact, provides
   better power so far.
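
For reference, the two permutation schemes compared above can be
sketched as follows.  This is only a toy illustration under loud
assumptions: a hypothetical 1-D nearest-class-mean classifier stands in
for a real one, the data are independent Gaussian samples (so the
dependence structure of case 2 is NOT reproduced), and none of the names
come from PyMVPA or from the script in this thread.

```python
import random


def nearest_mean_accuracy(train_x, train_y, test_x, test_y):
    """Fit a 1-D nearest-class-mean classifier; return accuracy on the test set."""
    means = {}
    for label in sorted(set(train_y)):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    hits = sum(
        min(means, key=lambda lab: abs(x - means[lab])) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def null_accuracies(train_x, train_y, test_x, test_y, scheme, n_perm, rng):
    """Null distribution of accuracies under one of two permutation schemes.

    scheme="both":       shuffle training and testing labels
    scheme="train_only": shuffle training labels, keep testing labels intact
    """
    if scheme not in ("both", "train_only"):
        raise ValueError("unknown scheme: %r" % (scheme,))
    null = []
    for _ in range(n_perm):
        perm_train_y = rng.sample(train_y, len(train_y))
        if scheme == "both":
            perm_test_y = rng.sample(test_y, len(test_y))
        else:
            perm_test_y = list(test_y)
        null.append(
            nearest_mean_accuracy(train_x, perm_train_y, test_x, perm_test_y)
        )
    return null


# Toy data (assumption): two classes, 20 independent samples each per set.
rng = random.Random(0)
labels = [0, 1] * 20
train_x = [rng.gauss(2.0 * y, 1.0) for y in labels]
test_x = [rng.gauss(2.0 * y, 1.0) for y in labels]

observed = nearest_mean_accuracy(train_x, labels, test_x, labels)
null_both = null_accuracies(train_x, labels, test_x, labels, "both", 200, rng)
null_train = null_accuracies(train_x, labels, test_x, labels, "train_only", 200, rng)
```

Comparing observed against null_both and null_train then gives one
p-value per scheme; to say anything about case 2 the toy data would have
to be replaced by samples with within-block dependence.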

Now, one more thing to complement the above -- the idea of "reassigning"
the labels (instead of permuting them all), which pretty much boils down
to regular permutation if we have only 1 sample/label per chunk.
Obviously, in the case of only 2 chunks with 2 labels it becomes
completely degenerate, since we would end up with only 2 possible
"permutations", thus the picture looks quite ugly:

http://www.onerussian.com/tmp/permutation_test_rocs_nonindep5_reassign.png
so note the only 2 possible "green" values.  Such a strategy is thus not
applicable to such cases.  If we simulate 5 chunks, the # of possible
permutations "grows" to 2^4 = 16 (2^5 / 2 for the mirroring case), also
quite a small number, which is why the histogram still looks unpleasing,
BUT detection power grows to match the "keep testing intact" case:

http://www.onerussian.com/tmp/permutation_test_rocs_nonindep5_reassign_5chunks.png
I guess I am now doomed to try a case with some reasonable number of
chunks (e.g. 10) to see where we stand there.  Note: such a strategy
could also be used in tandem with "keep testing intact" while also
accounting for "structure" within the training data -- it might be the
ultimate opportunistic test for "heavily structured" data ;-)
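
The counting above (2 labels, each chunk either keeps or swaps its
labels, mirrored patterns counted once) can be checked with a few lines
of plain Python.  The function name is made up for illustration, and
representing a reassignment as a tuple of per-chunk "swapped?" flags is
my assumption about how one would encode it.

```python
from itertools import product


def label_reassignments(n_chunks):
    """Enumerate per-chunk label swap patterns for 2 labels.

    Each pattern is a tuple of booleans (True = labels swapped in that chunk).
    Fixing the first chunk as "not swapped" keeps exactly one pattern out of
    every mirrored pair, which gives 2**(n_chunks - 1) patterns in total
    (the all-False pattern is the original labeling).
    """
    return [
        (False,) + rest
        for rest in product((False, True), repeat=n_chunks - 1)
    ]


counts = {n: len(label_reassignments(n)) for n in (2, 5, 10)}
```

Here counts holds the number of distinct reassignments for 2, 5 and 10
chunks, i.e. the size of the largest null distribution such a scheme
could ever produce for each design.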

-- 
Yaroslav O. Halchenko
Postdoctoral Fellow,   Department of Psychological and Brain Sciences
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik