<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<br>

<br>

Date: Tue, 06 Aug 2013 16:04:12 -0500<br>

From: "J.A. Etzel" &lt;<a href="mailto:jetzel@artsci.wustl.edu" target="_blank">jetzel@artsci.wustl.edu</a>&gt;<br>

To: <a href="mailto:pkg-exppsy-pymvpa@lists.alioth.debian.org" target="_blank">pkg-exppsy-pymvpa@lists.alioth.debian.org</a><br>

Subject: Re: [pymvpa] No samples of a class in a chunk<br>

Message-ID: &lt;<a href="mailto:520164CC.9050203@artsci.wustl.edu" target="_blank">520164CC.9050203@artsci.wustl.edu</a>&gt;<br>

Content-Type: text/plain; charset=ISO-8859-1; format=flowed<br>

<br>

It doesn't look like anyone's replied to this yet, so here's my two cents.<br>

<br>

I think of this sort of situation as a case of imbalance - there aren't<br>

equal numbers of examples of each class in each training/testing set<br>

(aka chunk). This happens in all sorts of situations, such as when which<br>

trials are included depends upon participant behavior (e.g.<br>

correctly-performed trials). </blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

There isn't a universally appropriate strategy to regain balance, but<br>

either the chunks or the examples will need to be changed.<br>

<br>

For example, in one dataset we wanted to do leave-one-run-out<br>

cross-validation, but the imbalance was too great (e.g. some runs with<br>

very few examples), so we combined runs, for leave-three-runs-out<br>

cross-validation. We combined temporally adjacent runs (e.g. 1-3, 4-6,<br>

7-9) to make sure we didn't somehow inflate the accuracy.<br></blockquote><div><br></div><div>Could you explain the reasoning behind combining temporally adjacent runs to avoid inflating accuracy? I'm assuming the claim is that adjacent runs are likely to be more similar to one another than runs farther apart, so leaving out adjacent runs together might give lower accuracy for those left-out runs. But why is that more desirable than, say, leaving out a random 1/3 of the data? Unless you had a specific reason to look at how classification relates to previous and/or subsequent runs, it's not clear to me why this is needed. I may just be unsure what the pipeline was, but I don't fully follow the reasoning.<br><br>Also, leaving out runs 1-3 or 7-9 (with 9 being the last run) fits that assumption fairly well, but leaving out 4-6 fits it less well: run 4 is just as close in time to run 3 as to run 5, and run 6 is just as close to run 7 as to run 5. The same holds for any triple of adjacent runs other than the first three or the last three: 2 of its 3 runs are as close to a non-left-out run as to a run within the left-out triple. In the first triple only run 3 is temporally close to a non-left-out run, and in the last triple only run 7. So why not just use a random split of 1/3 of the data, or split-halves (or something similar), for the held-out data?</div>
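<div><br></div><div>To make the comparison concrete, here is a minimal sketch in plain Python (not PyMVPA code; the helper names <code>adjacent_folds</code>, <code>random_folds</code>, and <code>boundary_runs</code> are made up for illustration, and the 9-run, 3-fold layout is taken from the example above). For each held-out fold it lists the runs that have a temporal neighbor in the training set.</div>

```python
import random


def adjacent_folds(runs, size):
    """Group temporally adjacent runs into folds of `size` runs each."""
    return [runs[i:i + size] for i in range(0, len(runs), size)]


def random_folds(runs, size, seed=0):
    """Shuffle the runs, then cut them into folds of `size` runs each."""
    shuffled = runs[:]
    random.Random(seed).shuffle(shuffled)
    return [sorted(shuffled[i:i + size]) for i in range(0, len(shuffled), size)]


def boundary_runs(fold, runs):
    """Held-out runs whose previous or next run is in the training set."""
    held_out = set(fold)
    training = set(runs) - held_out
    return [r for r in fold if (r - 1) in training or (r + 1) in training]


runs = list(range(1, 10))  # runs 1..9, assumed to be in temporal order

schemes = [
    ("adjacent", adjacent_folds(runs, 3)),
    ("random", random_folds(runs, 3)),
]
for name, folds in schemes:
    for fold in folds:
        print(name, fold, "borders training:", boundary_runs(fold, runs))
```

<div>With adjacent triples, the folds 1-3 and 7-9 each have one such run (3 and 7) and the fold 4-6 has two (4 and 6); which runs border the training set under a random split depends on the shuffle.</div>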


</div></div><div class="gmail_extra"><div><br>---<br>Jason Gors<br>Dartmouth College<br>Dept. of Psychological and Brain Sciences<br>6207 Moore Hall<br>Hanover, NH 03755<br>Phone: <a href="tel:%28603%29%20646-9689" value="+16036469689" target="_blank">(603) 646-9689</a>    Fax: <a href="tel:%28603%29%20646-1419" value="+16036461419" target="_blank">(603) 646-1419</a></div>


</div></div>