Thanks,

CrossValidation(clf,
                Balancer(amount=0.8, limit=None, attr='targets', count=3),
                splitter=Splitter('balanced_set', [True, False]))

This seems like a good idea, but I get this error message:

TypeError: Unexpected keyword argument splitter=<Splitter> for
<CrossValidation>. Valid parameters are ['datasets', 'training_stats',
'raw_results', 'calling_time', 'training_time', 'null_t', 'null_prob',
'stats', 'repetition_results']

I think I understand the role of the 'chunks' attribute, and I see how I
should use it. I guess my samples are not all independent...

Regards
Brice


On Fri, Apr 8, 2011 at 3:29 AM, Yaroslav Halchenko <debian@onerussian.com> wrote:
> This should run 5 evaluations, using 1/5 of the available data each time
> to test the classifier. Correct?

Correct, in that it should generate 5 partitions for you, where in the first
one you would get the first nsamples/5 samples (with the corresponding
"chunks", unique per sample in your case).

> Now, for this to work properly, it requires that targets are properly
> randomly distributed in the dataset...

Well... theoretically speaking, if you have lots of samples, you might get
away with classical leave-one-out cross-validation. That would be implemented
by using NFoldPartitioner on your dataset (i.e. without NGroupPartitioner).
But such a cross-validation would take a while -- probably not desirable
unless coded explicitly for it (e.g. for SVMs, using CachedKernel to avoid
recomputing the kernel, or even more trickery...).
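
With one chunk per sample, leave-one-out is then just NFoldPartitioner with
its defaults -- a sketch under the same assumptions (and the same 'clf') as
above:

from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation

# one fold per chunk, i.e. per sample here -> nsamples trainings,
# which is why this can get slow (CachedKernel and the like would only
# cut down the kernel recomputation cost for SVMs)
cv_loo = CrossValidation(clf, NFoldPartitioner(), enable_ca=['stats'])
loo_results = cv_loo(ds)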

> for instance if the last 1/5 of
> the samples only contain target 2, then it won't work...

Yep -- that is the catch ;)

You could use NFoldPartitioner(cvtype=2), which generates a partitioning for
every possible combination of 2 chunks, together with a consecutive Sifter
(recently introduced) to keep only those partitions which carry labels from
both classes. But, once again, that would be A LOT to cross-validate (roughly
(nsamples/2)^2 folds), so I guess it is not a solution for you either.
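
For completeness, such a chain might look like this (a sketch; 'target1' and
'target2' stand for your actual target values, and module paths assume
PyMVPA 2.x):

from mvpa2.base.node import ChainNode
from mvpa2.generators.base import Sifter
from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation

# every pair of chunks becomes a testing set (partition label 2); the Sifter
# then keeps only those partitionings whose testing set carries both targets
partitioner = ChainNode([NFoldPartitioner(cvtype=2),
                         Sifter([('partitions', 2),
                                 ('targets', ['target1', 'target2'])])],
                        space='partitions')
cv_pairs = CrossValidation(clf, partitioner, enable_ca=['stats'])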

> What do you suggest to solve this problem?

If you have some certainty that the samples are independent, then, to get a
reasonable generalization estimate, just assign np.arange(nsamples/2)
(assuming initially balanced classes) as the chunks of the samples of each
condition. Then each chunk is guaranteed to contain a pair of conditions ;)
And then you are welcome to use NGroupPartitioner to bring the number of
partitions down to some more cost-effective number, e.g. 10 -- see the
sketch below.
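
A sketch of that chunk assignment (assuming a dataset 'ds' with two target
values and equally many samples of each, plus the 'clf' from the earlier
sketch; adjust names to your data):

import numpy as np
from mvpa2.generators.partition import NGroupPartitioner
from mvpa2.measures.base import CrossValidation

ds.sa['chunks'] = np.zeros(len(ds), dtype=int)
for target in ds.uniquetargets:
    mask = ds.targets == target
    # samples of this condition get chunks 0, 1, 2, ... in order,
    # so chunk k ends up holding one sample of each condition
    ds.sa.chunks[mask] = np.arange(mask.sum())

# then fold the nsamples/2 chunks into e.g. 10 partitions
cv10 = CrossValidation(clf, NGroupPartitioner(10), enable_ca=['stats'])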

> I have tried to use a ChainNode,
> chaining the NGroupPartitioner and a Balancer but it didn't work,

If I see it right, it should have worked (roughly along the lines of the
sketch below), unless you had a really degenerate case, e.g. one of the
partitions contained samples of only 1 category.
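
Roughly, the chain I would expect to work looks like this (a sketch -- the
NGroupPartitioner and Balancer settings here are guesses, adjust them to what
you actually used):

from mvpa2.base.node import ChainNode
from mvpa2.generators.partition import NGroupPartitioner
from mvpa2.generators.resampling import Balancer
from mvpa2.measures.base import CrossValidation

# first partition into 5 groups, then balance targets within each
# partitioning; apply_selection=True makes the Balancer actually subsample
# the dataset rather than just mark the selected samples
partitioner = ChainNode([NGroupPartitioner(5),
                         Balancer(attr='targets', count=1,
                                  limit='partitions',
                                  apply_selection=True)],
                        space='partitions')
cv_bal = CrossValidation(clf, partitioner, enable_ca=['stats'])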

> apparently due to a bug in Balancer (see another mail on that one).

Oops -- I need to check my email then...

> My main question though is: it seems weird to add a chunks attribute like
> this. Is it the correct way?

Well... if you consider your samples independent from each other, then
yes -- it is reasonable to assign each sample to a separate chunk.
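
In code, that per-sample assignment is just (same 'ds' as in the sketches
above):

import numpy as np
ds.sa['chunks'] = np.arange(len(ds))   # one unique chunk per sample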

> Btw, is there a way to pick at random 80% of the data (with equal
> number of samples for each target) for training and the remaining 20%
> for testing, and repeat this as many times as I want to obtain a
> consistent result?

Although I do not think we have tried it, this should do:

CrossValidation(clf,
                Balancer(amount=0.8, limit=None, attr='targets', count=3),
                splitter=Splitter('balanced_set', [True, False]))

It should run the cross-validation over 3 such random balanced selections
(raise 3 to whatever number you like).

What we do here: we tell the Balancer to balance the targets, take 80% of the
samples and mark them True, and the other 20% False. Then we proceed to the
cross-validation. That uses an actual Splitter, which splits the dataset into
training/testing parts. Usually such a splitter is not specified explicitly
and is constructed by CrossValidation under the assumption that it operates
on partitions labeled 0, 1 (and possibly 2), as usually provided by
Partitioners. But now we want to split based on the balanced_set attribute --
and we can do that, instructing the Splitter to take the 80% marked True for
training, and the rest (False) for testing.

limit=None is there to say that the subsampling should not be limited to
(performed within) any particular attribute (commonly chunks), so in this
case you do not even need to have chunks at all.
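
Put together, a self-contained version of that snippet might look like this
(a sketch, assuming PyMVPA 2.x module paths and that your version's
CrossValidation accepts a splitter argument):

from mvpa2.clfs.svm import LinearCSVMC
from mvpa2.generators.resampling import Balancer
from mvpa2.generators.splitters import Splitter
from mvpa2.measures.base import CrossValidation

clf = LinearCSVMC()
# per repetition: take a balanced 80% of the samples and mark them True in
# the 'balanced_set' sample attribute, the rest False; count=3 -> 3 random
# selections
balancer = Balancer(amount=0.8, limit=None, attr='targets', count=3)
# split on 'balanced_set' instead of the default 'partitions' attribute:
# True -> training, False -> testing
cv = CrossValidation(clf, balancer,
                     splitter=Splitter('balanced_set', [True, False]),
                     enable_ca=['stats'])
results = cv(ds)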

Is that what you needed?

--
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa