[pymvpa] Performance distribution with random labels

Nick Oosterhof n.n.oosterhof at googlemail.com
Mon Dec 12 17:42:33 UTC 2016


> On 12 Dec 2016, at 18:02, Raúl Hernández <raul at lafuentelab.org> wrote:
> 
> Hi all,
> 
> I’m having trouble getting my head around something and I was wondering if you can give me a hand.
> 
> I’m running a classification with 4 possible categories, 10 runs. My data is balanced and I’m using CSVM and a leave one out cross-validation.
> 
> Just for fun, I wanted to build a distribution of the performance obtained with randomized labels across runs, so I was expecting a performance around 0.25. After 12,000 repetitions I got 0.200; I don't get it. Do you have any idea?
> 
> 
> 
> This is part of the code I used:
> 
> 
> 
> clf = LinearCSVMC()
> 
> fsel = SensitivityBasedFeatureSelection(OneWayAnova(), FractionTailSelector(0.01, mode='select', tail='upper'))
> 
> fclf = FeatureSelectionClassifier(clf, fsel)
> 
> cvte = CrossValidation(fclf, NFoldPartitioner(), errorfx=lambda p, t: np.mean(p == t), enable_ca=['stats'])
> 
> for k in range(0, rndReps):
> 
>     np.random.shuffle(fds.sa.targets)
> 
>     cv_results = cvte(fds)

I'm not sure if this explains the below-chance performance, but the way you shuffle the labels does not take the chunk (run) structure into account. This messes up the (in)dependence structure between folds. It also means that datasets with shuffled targets can become un-balanced, unlike the original data.

I would suggest randomly re-assigning targets within each chunk (run) separately.
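A minimal numpy sketch of per-chunk shuffling, assuming the targets and chunks sample attributes are plain array-likes (the function name and the toy data are made up for illustration; in a real script you would pass fds.sa.targets and fds.sa.chunks):

```python
import numpy as np

def shuffle_targets_within_chunks(targets, chunks, seed=None):
    """Permute targets separately within each chunk, so per-chunk
    label balance and the chunk structure are both preserved."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets).copy()
    chunks = np.asarray(chunks)
    for chunk in np.unique(chunks):
        mask = chunks == chunk
        targets[mask] = rng.permutation(targets[mask])
    return targets

# Toy example: 2 runs, 4 balanced categories per run
chunks = np.array([0, 0, 0, 0, 1, 1, 1, 1])
targets = np.array(['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'])
shuffled = shuffle_targets_within_chunks(targets, chunks, seed=42)
# Each run still contains exactly one sample of each category,
# only the assignment of labels to samples within a run changes.
```

PyMVPA also provides a generator for this kind of permutation (AttributePermutator, which can limit permutations to within chunks), which may be preferable to shuffling attributes by hand.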






More information about the Pkg-ExpPsy-PyMVPA mailing list