[pymvpa] Balancing with searchlight and statistical issues.

Jo Etzel jetzel at wustl.edu
Tue Mar 1 21:29:04 UTC 2016



On 3/1/2016 11:43 AM, Roberto Guidotti wrote:
> So, thank you Jo for your response, and sorry that I didn't explain my
> strategy clearly.
>
> I balanced the dataset within runs, so if I have 8A and 2B, after
> balancing I will have 2A and 2B, chosen randomly (by pymvpa). Since some
> runs can be highly unbalanced (leaving only 2A vs 2B after balancing), I
> decided to use a leave-two-runs-out cross-validation, in order to have
> more samples in the testing set and thus a less biased accuracy estimate
> (with 2 samples per class, I can only get accuracies of 0, 0.5, or 1).
> But I did not replicate the balancing process, because that would
> definitely increase the computational time (especially with a
> leave-two-runs-out cross-validation).
>
> So do you suggest using more balanced-dataset replications and a
> leave-one-run-out cross-validation?
Oh, so that's good, you're already balancing. No, if you get closer 
balancing with leave-two-runs-out, you should use it. Running 
replications of the balancing can give you an idea of how much it's 
affecting your results. Hopefully the replications will produce fairly 
similar results, but if they're very different, you probably need a 
different balancing strategy.
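
For concreteness, here is a minimal sketch of that replication idea, assuming
PyMVPA 2's Balancer, NFoldPartitioner, and CrossValidation are all available
through mvpa2.suite; the function name and defaults are mine, not something
from this thread:

import numpy as np
from mvpa2.suite import (Balancer, NFoldPartitioner, CrossValidation,
                         LinearCSVMC, mean_match_accuracy)

def balanced_cv_accuracy(ds, n_replications=10, cvtype=2):
    """Average cross-validated accuracy over several balanced copies of ds."""
    # Randomly subsample within each run ('chunks') so the classes have equal
    # counts; generate() yields one balanced copy of ds per replication.
    balancer = Balancer(attr='targets', limit='chunks',
                        count=n_replications, apply_selection=True)
    cv = CrossValidation(LinearCSVMC(),
                         NFoldPartitioner(cvtype=cvtype),  # cvtype=2: leave-two-runs-out
                         errorfx=mean_match_accuracy)
    accs = [np.mean(cv(balanced)) for balanced in balancer.generate(ds)]
    return np.mean(accs), np.std(accs)

If the standard deviation over replications is small, the particular random
subsetting isn't driving your results.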


> Do you think I could use a data-driven balancing instead (e.g. removing
> beta images that are not similar to the average image), or would I be
> introducing some other bias?
Sounds risky ... whether it actually introduces bias would depend on the 
details, but I think you'll need to be very careful.


> OT: I always thought that SVMs were not so sensitive to imbalance,
> because they use only a few samples as support vectors!
It seems like they shouldn't be, but in practice I've found them highly
sensitive to imbalance. I suspect it's related to the typically low
signal and high dimensionality of MVPA datasets.
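
Here is a quick illustrative sketch (mine, not from the thread) of that
sensitivity: train a linear SVM on pure Gaussian noise with a 21-vs-20
training imbalance and look at which label it hands out on new noise. The
dataset sizes and names are arbitrary.

import numpy as np
from mvpa2.suite import dataset_wizard, LinearCSVMC

rng = np.random.RandomState(0)
n_features = 500   # high-dimensional, no signal: roughly the hard case in MVPA

# Training set: pure noise, 21 "A" examples vs 20 "B" examples
train = dataset_wizard(samples=rng.randn(41, n_features),
                       targets=['A'] * 21 + ['B'] * 20)
# Test set: more pure noise; true chance accuracy is 50% for either label
test = dataset_wizard(samples=rng.randn(100, n_features),
                      targets=['A'] * 50 + ['B'] * 50)

clf = LinearCSVMC()
clf.train(train)
predictions = np.asarray(clf.predict(test))
print("proportion of test items labeled 'A':", np.mean(predictions == 'A'))
# With no signal, predictions tend to lean toward the majority class, so even
# a small training imbalance can pull apparent accuracy away from chance.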

Jo


>
> Thank you,
> Roberto
>
> On 1 March 2016 at 16:00, Jo Etzel <jetzel at wustl.edu> wrote:
>
>     Here's a response to the second part of your question:
>
>     On 2/29/2016 11:30 AM, Roberto Guidotti wrote:
>
>              Also, you say the dataset is unbalanced, but has 12 runs,
>              each with 10 trials, half A and half B. That sounds balanced
>              to me.
>
>         In a few subjects I classified the motor response with good
>         accuracies, but now I would like to decode the decision (it is a
>         decision task), which is the main reason why my dataset is
>         unbalanced. The stimuli are balanced, since the subject views half
>         A and half B, but he has to respond whether the stimulus is A or
>         B, so I could have runs with unbalanced conditions (e.g. 8 A vs
>         2 B).
>
>
>     I see; you're classifying decisions, not stimuli, and the people's
>     decisions were unbalanced. (As far as the classifier is concerned,
>     the balanced stimuli are totally irrelevant; it's the labels
>     (decisions, here) that matter.)
>
>     Classifying with an imbalanced training set is not at all a good
>     idea in most cases; you'll need to balance it so that you have equal
>     numbers of each class. I'll try to get a demo up with more
>     explanation, but the short version is that linear SVMs (and many
>     other common MVPA algorithms) are exquisitely sensitive to
>     imbalance: a training set with 21 of one class and 20 of the other
>     can produce seriously skewed results.
>
>     While there are ways to adjust example weighting, etc., with fMRI
>     datasets I generally recommend subsetting examples for balance
>     instead. Since you have 12 runs, you might find that the balance is
>     a bit closer if you do leave-two-runs-out (or even three or four)
>     instead of leave-one-run-out cross-validation.
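
A hedged sketch of how you might compare those schemes, assuming PyMVPA 2's
NFoldPartitioner and a dataset with sa.targets and sa.chunks; the helper name
is hypothetical. It reports the worst within-training-set class imbalance for
a leave-k-runs-out partitioning, so you can pick the k that minimizes it.

from collections import Counter
from mvpa2.suite import NFoldPartitioner

def worst_training_imbalance(ds, cvtype):
    """Largest class-count difference over the training sets of a
    leave-`cvtype`-runs-out partitioning."""
    worst = 0
    for fold in NFoldPartitioner(cvtype=cvtype).generate(ds):
        # partition value 1 marks training samples, 2 the left-out run(s)
        train_targets = fold.targets[fold.sa.partitions == 1]
        counts = Counter(train_targets)
        worst = max(worst, max(counts.values()) - min(counts.values()))
    return worst

# e.g.: for k in (1, 2, 3): print(k, worst_training_imbalance(ds, k))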
>
>     Say you have 21 of one class and 20 of the other in a training set.
>     You'll then want to remove one of the larger class (at random), so
>     that there are 20 examples of both classes. To make sure you didn't
>     happen to remove a "weird" example (so that your results don't depend
>     entirely on which example was removed), the balancing process
>     should be repeated several times (e.g. 5, 10, depending on how
>     serious the imbalance is) and results averaged over those replications.
>
>     I don't know how to set it up in pymvpa, but when dealing with
>     imbalanced datasets my usual practice is to look at how many
>     examples are present for each person, and figure out a
>     cross-validation scheme that will minimize the imbalance as much as
>     possible. Then I precalculate which examples will be omitted for each
>     person in each replication (e.g., in the first replication leave out
>     the 3rd "A" in run 2; in the second, omit the 5th).
>     Ideally, I omit examples before classifying, so that all
>     cross-validation folds will be fully balanced, then do the
>     classification with that balanced dataset. (This parallels the idea
>     of dataset-wise permutation testing - first balance the dataset,
>     then do the cross-validation.)
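
A rough sketch of that precalculated-omission scheme (mine, not from the
thread); the omission index lists are hypothetical placeholders that would be
worked out per person in advance so that every run keeps equal class counts:

import numpy as np
from mvpa2.suite import (CrossValidation, NFoldPartitioner, LinearCSVMC,
                         mean_match_accuracy)

def accuracy_over_omission_lists(ds, omit_per_replication, cvtype=2):
    """Drop a precomputed set of sample indices per replication, run
    cross-validation on the fully balanced remainder, and average."""
    cv = CrossValidation(LinearCSVMC(),
                         NFoldPartitioner(cvtype=cvtype),
                         errorfx=mean_match_accuracy)
    accs = []
    for omit in omit_per_replication:
        keep = np.setdiff1d(np.arange(len(ds)), omit)
        accs.append(np.mean(cv(ds[keep])))   # balanced before any fold is formed
    return np.mean(accs)

# hypothetical omission lists, e.g.:
# omit_per_replication = [[12, 57], [14, 60], [19, 63]]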
>
>     hope this makes sense,
>     Jo
>
>
>
>
>
>     ____________________________________________
>
>         Pkg-ExpPsy-PyMVPA mailing list
>         Pkg-ExpPsy-PyMVPA at lists.alioth.debian.org
>         http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa


