[pymvpa] Train and test on different classes from a dataset

Michael Hanke mih at debian.org
Fri Feb 1 10:21:16 UTC 2013


On Thu, Jan 31, 2013 at 02:13:14PM -0600, J.A. Etzel wrote:
> Why do you say in the tutorial that "Doing a whole-dataset
> permutation is a common mistake ..." ? I don't see that permuting
> the test set labels hurts the inter-sample dependencies ... won't I
> still have (say) 5 A and 5 B in my test set?

I am attaching some code and a figure. This is a modified version of

http://pymvpa.org/examples/permutation_test.html

I ran 24 permutation analyses covering 12 combinations of number of
chunks/runs/... and SNR. The figure shows MC sample histograms for all
of these combinations (always using 200 permutations). The greenish
bars represent the permutation results from permuting both the training
and the testing portion of the data (note that only within-chunk
permutation was done -- although this should have no effect on this
data). The blueish histogram is the same analysis, but with only the
training set permuted (I can't think of any good reason why one would
permute only the testing set -- except for speed ;-).
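
For concreteness (the attached script got scrubbed by the archive),
here is a minimal sketch of the two schemes in PyMVPA 2.x, following
the pattern of the linked example -- `clf` is an arbitrary placeholder
choice:

from mvpa2.suite import (AttributePermutator, ChainNode, CrossValidation,
                         LinearCSVMC, MCNullDist, NFoldPartitioner,
                         Repeater, mean_sample)

clf = LinearCSVMC()
partitioner = NFoldPartitioner()

# Greenish scheme: permute the targets of the *whole* dataset (here
# restricted to swaps within each chunk) and rerun the full
# cross-validation on every permuted dataset.
whole_ds_permutator = AttributePermutator('targets', count=200,
                                          limit='chunks')
cv = CrossValidation(clf, partitioner, postproc=mean_sample())
null_whole = MCNullDist(whole_ds_permutator, tail='left', measure=cv,
                        enable_ca=['dist_samples'])

# Blueish scheme: permute the targets of the *training* portion only.
# The permutator is chained after the partitioner and limited to
# partition 1 (the training set) of each fold; Repeater drives the
# 200 Monte Carlo iterations.
train_permutator = AttributePermutator('targets', count=1,
                                       limit={'partitions': 1})
null_cv = CrossValidation(
    clf, ChainNode([partitioner, train_permutator],
                   space=partitioner.get_space()),
    postproc=mean_sample())
null_train = MCNullDist(Repeater(count=200), tail='left',
                        measure=null_cv, enable_ca=['dist_samples'])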

The input data is pure noise, plus a bit of univariate signal (scaled
according to the SNR) added to two of three features. All simulations
have 200 samples in the dataset, grouped into either 2, 3, or 5 chunks.

I am using the SNR parameter in this simulation as a way to increase
within-category similarity. In a real dataset, inter-sample similarity
could of course have many causes.
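
A plausible way to generate one cell of that simulation grid with
PyMVPA's own data generator -- treat the exact parameter values as my
assumptions, and `make_sim_dataset` is just a hypothetical helper name:

from mvpa2.misc.data_generators import normal_feature_dataset

def make_sim_dataset(nchunks, snr):
    # 200 samples (100 per category), 3 features of which 2 carry
    # the univariate signal, grouped into `nchunks` chunks at the
    # given SNR.
    return normal_feature_dataset(perlabel=100, nlabels=2,
                                  nfeatures=3, nchunks=nchunks,
                                  nonbogus_features=[0, 1],
                                  snr=snr)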

The dashed line shows the theoretical chance performance at 0.5; the
red line shows the empirical performance for the unpermuted dataset.
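
Both the histograms and the red line then come straight out of the
estimator's conditional attributes; roughly, reusing the names from
the sketches above:

# Run the analysis once on the unpermuted data; the null distribution
# is estimated on the side from the 200 MC iterations.
ds = make_sim_dataset(nchunks=5, snr=3.0)
cv_mc = CrossValidation(clf, partitioner, postproc=mean_sample(),
                        null_dist=null_train, enable_ca=['null_prob'])
err = cv_mc(ds)                           # red line: empirical error
mc_samples = null_train.ca.dist_samples   # histogram: MC error samples
p = cv_mc.ca.null_prob                    # left-tail permutation p-value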

Now tell me that it doesn't make a difference what portion of the data
you permute ;-) Depending on the actual number of chunks and the data
consistency, the "permutability" of the dataset varies quite a bit --
but this is only reflected in the distributions when the testing
portion is not permuted as well. For example, look at the upper right
panel (high sample similarity, smallish training portion): in a
significant fraction of all permutations the training dataset isn't
"properly" permuted at all (labels get swapped only within a category),
while in the opposite extreme the labels are swapped entirely between
categories. This can happen with small datasets and large chunks --
however, the green histogram doesn't tell me about it at all.

[BTW sorry for the poor quality of the figure, but I was hoping to be
 gentle to the listserver. If you run the attached code, it will generate
 a more beautiful one]

Please point me to any conceptual or technical mistake you can think of
-- this topic comes up frequently, so the more critical feedback the
better...

Cheers,

Michael

-- 
Michael Hanke
http://mih.voxindeserto.de
-------------- attachments --------------
permutation_sim.jpg (image/jpeg, 72505 bytes):
<http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20130201/db26cf36/attachment-0001.jpg>
permutation_test.py (text/x-python, 3134 bytes):
<http://lists.alioth.debian.org/pipermail/pkg-exppsy-pymvpa/attachments/20130201/db26cf36/attachment-0001.py>

