<div dir="ltr"><div>Thank you, Jo, for your response, and sorry that I didn't explain my strategy clearly.</div><div><br></div><div>I balanced the dataset within runs, so if a run has 8A and 2B, after balancing it will have 2A and 2B, chosen randomly (by PyMVPA). Since some runs could be highly unbalanced (leaving 2A vs 2B after balancing), I decided to use leave-two-runs-out cross-validation in order to have more samples in the testing set, and thus a less biased accuracy (with 2 samples per class, I can only get accuracies of 0, 0.5, or 1). I did not replicate the balancing process, though, because that would definitely increase the computational time (on top of using leave-two-runs-out cross-validation).</div><div><br></div><div>So do you suggest using more balanced dataset replications and leave-one-run-out cross-validation?</div><div>Do you think that data-driven balancing (e.g. removing beta images that are not similar to the average image) would work, or would I be introducing some other bias?</div><div><br></div><div>OT: I always thought that SVM was not so sensitive to imbalance, because it uses only a few samples as support vectors!</div><div><br></div><div>Thank you,</div><div>Roberto</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 1 March 2016 at 16:00, Jo Etzel <span dir="ltr"><<a href="mailto:jetzel@wustl.edu" target="_blank">jetzel@wustl.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Here's a response to the second part of your question:<span class=""><br>

<br>

On 2/29/2016 11:30 AM, Roberto Guidotti wrote:<br>

</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

    Also, you say the dataset is unbalanced, but has 12 runs, each with<br>

    10 trials, half A and half B. That sounds balanced to me<br>

<br></span><span class="">

In a few subjects I classified the motor response with good accuracy, but<br>

now I would like to decode the decision, since it is a decision task, which is<br>

the main reason why my dataset is unbalanced. Stimuli are balanced,<br>

since the subject views half A and half B, but he has to respond whether the<br>

stimulus is A or B, thus I could have runs with unbalanced<br>

conditions (e.g. 8 A vs 2 B, etc.).<br>

</span></blockquote>

<br>

I see; you're classifying decisions, not stimuli, and the people's decisions were unbalanced. (As far as the classifier is concerned, the balanced stimuli are totally irrelevant; it's the labels (decisions, here) that matter.)<br>

<br>

Classifying with an imbalanced training set is not at all a good idea in most cases; you'll need to balance it so that you have equal numbers of each class. I'll try to get a demo up with more explanation, but the short version is that linear SVMs (and many other common MVPA algorithms) are exquisitely sensitive to imbalance: a training set with 21 of one class and 20 of the other can produce seriously skewed results.<br>

<br>

While there are ways to adjust example weighting, etc, with fMRI datasets I generally recommend subsetting examples for balance instead. Since you have 12 runs, you might find that the balance is a bit closer if you do leave-two-runs-out (or even three or four) instead of leave-one-run-out cross-validation.<br>

<br>

Say you have 21 of one class and 20 of the other in a training set. You'll then want to remove one of the larger class (at random), so that there are 20 examples of both classes. To make sure you didn't happen to remove a "weird" example (and so your results were totally dependent on which example was removed), the balancing process should be repeated several times (e.g. 5, 10, depending on how serious the imbalance is) and results averaged over those replications.<br>
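A minimal sketch of that subsample-and-repeat idea, in plain Python rather than PyMVPA. The names balanced_indices, replicated_balanced_score and score_fn are hypothetical; score_fn stands in for whatever cross-validated classification you would run on the kept examples:<br>

```python
import random
from collections import defaultdict


def balanced_indices(labels, rng):
    """Return sorted indices of a random subset with equal counts per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Every class is cut down to the size of the smallest class.
    n_keep = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_keep))
    return sorted(keep)


def replicated_balanced_score(labels, score_fn, n_replications=10, seed=0):
    """Average score_fn over several independent random balanced subsets."""
    rng = random.Random(seed)
    scores = [score_fn(balanced_indices(labels, rng))
              for _ in range(n_replications)]
    return sum(scores) / len(scores)


# 21 examples of one class and 20 of the other, as in the example above.
labels = ["A"] * 21 + ["B"] * 20
subset = balanced_indices(labels, random.Random(0))


def fraction_a(keep):
    """Placeholder score: the share of class A among the kept examples."""
    return sum(labels[i] == "A" for i in keep) / len(keep)


mean_fraction = replicated_balanced_score(labels, fraction_a, n_replications=5)
```

Here the placeholder score only confirms that each replication is balanced; in a real analysis score_fn would train and test the classifier on the kept examples and return its accuracy.<br>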

<br>

I don't know how to set it up in pymvpa, but when dealing with imbalanced datasets my usual practice is to look at how many examples are present for each person, and figure out a cross-validation scheme that will minimize the imbalance as much as possible. Then I precalculate which examples will be omitted in each person for each replication (e.g., the first replication leaves out the 3rd "A" in run 2, the second replication omits the 5th). Ideally, I omit examples before classifying, so that all cross-validation folds will be fully balanced, then do the classification with that balanced dataset. (This parallels the idea of dataset-wise permutation testing - first balance the dataset, then do the cross-validation.)<br>
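One way the precalculation could look, again as a plain-Python sketch and not PyMVPA code. The function plan_omissions and the toy runs/labels lists are hypothetical, and balancing within each run is an assumption made here for illustration; the same bookkeeping would work for any other grouping:<br>

```python
import random


def plan_omissions(runs, labels, n_replications=5, seed=0):
    """For each replication, list the example indices to omit so that
    every run ends up with equal counts of every class."""
    rng = random.Random(seed)
    all_labels = sorted(set(labels))
    # Group example indices by run, then by label.
    grouped = {}
    for idx, (run, label) in enumerate(zip(runs, labels)):
        grouped.setdefault(run, {}).setdefault(label, []).append(idx)
    plan = []
    for _ in range(n_replications):
        omit = []
        for run in sorted(grouped):
            per_label = grouped[run]
            # A class missing from a run counts as zero examples, so
            # such a run would be dropped entirely.
            n_keep = min(len(per_label.get(lab, [])) for lab in all_labels)
            for lab in all_labels:
                idxs = per_label.get(lab, [])
                omit.extend(rng.sample(idxs, len(idxs) - n_keep))
        plan.append(sorted(omit))
    return plan


# Toy data: run 1 has 8 A and 2 B, run 2 has 5 A and 5 B.
runs = [1] * 10 + [2] * 10
labels = ["A"] * 8 + ["B"] * 2 + ["A"] * 5 + ["B"] * 5
plan = plan_omissions(runs, labels, n_replications=3)
```

Each entry of plan is the list of examples to drop for one replication, fixed before any classification is run, so every cross-validation fold in that replication sees the same balanced dataset.<br>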

<br>

hope this makes sense,<br>

Jo<div class="HOEnZb"><div class="h5"><br>

<br>

<br>

<br>

<br>

____________________________________________<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Pkg-ExpPsy-PyMVPA mailing list<br>

<a href="mailto:Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org" target="_blank">Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org</a><br>

<a href="http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa" rel="noreferrer" target="_blank">http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa</a><br>

<br>

</blockquote>

<br>

-- <br>

Joset A. Etzel, Ph.D.<br>

Research Analyst<br>

Cognitive Control & Psychopathology Lab<br>

Washington University in St. Louis<br>

<a href="http://mvpa.blogspot.com/" rel="noreferrer" target="_blank">http://mvpa.blogspot.com/</a><br>

<br>


</div></div></blockquote></div><br></div>