Thanks,

CrossValidation(clf,
                Balancer(amount=0.8, limit=None, attr='targets', count=3),
                splitter=Splitter('balanced_set', [True, False]))

This seems like a good idea, but I get this error message:

TypeError: Unexpected keyword argument splitter=<Splitter> for
<CrossValidation>. Valid parameters are ['datasets', 'training_stats',
'raw_results', 'calling_time', 'training_time', 'null_t', 'null_prob',
'stats', 'repetition_results']

I think I understand the role of the 'chunks' attribute, and I see how I
should use it. I guess my samples are not all independent...

Regards
Brice


On Fri, Apr 8, 2011 at 3:29 AM, Yaroslav Halchenko <debian@onerussian.com> wrote:
> This should run 5 evaluations, using 1/5 of the available data each time
> to test the classifier. Correct?

Correct, in that it should generate 5 partitions for you, where in the first
one you would get the first nsamples/5 samples (with the corresponding
"chunks", unique per sample in your case).

> Now, for this to work properly, it requires that targets are properly
> randomly distributed in the dataset...

Well... theoretically speaking, if you have lots of samples, you might get
away with classical leave-one-out cross-validation. That would be implemented
by using NFoldPartitioner on your dataset (i.e. without NGroupPartitioner).
But such a cross-validation would take a while -- probably not desirable
unless coded explicitly for it (e.g. for SVMs, using CachedKernel to avoid
recomputing the kernel, or even more trickery...).
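
With one chunk per sample, leave-one-out is then just NFoldPartitioner with
its defaults -- a sketch under the same assumptions (and the same 'clf') as
above:

from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation

# one fold per chunk, i.e. per sample here -> nsamples trainings,
# which is why this can get slow (CachedKernel and the like would only
# cut down the kernel recomputation cost for SVMs)
cv_loo = CrossValidation(clf, NFoldPartitioner(), enable_ca=['stats'])
loo_results = cv_loo(ds)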

> for instance if the last 1/5 of
> the samples only contain target 2, then it won't work...

Yep -- that is the catch ;)

You could use NFoldPartitioner(cvtype=2), which generates a partitioning for
every possible combination of 2 chunks, together with a consecutive Sifter
(recently introduced) to keep only those partitions which carry labels from
both classes. But, once again, that would be A LOT to cross-validate (roughly
(nsamples/2)^2 folds), so I guess it is not a solution for you either.
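
For completeness, such a chain might look like this (a sketch; 'target1' and
'target2' stand for your actual target values, and module paths assume
PyMVPA 2.x):

from mvpa2.base.node import ChainNode
from mvpa2.generators.base import Sifter
from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation

# every pair of chunks becomes a testing set (partition label 2); the Sifter
# then keeps only those partitionings whose testing set carries both targets
partitioner = ChainNode([NFoldPartitioner(cvtype=2),
                         Sifter([('partitions', 2),
                                 ('targets', ['target1', 'target2'])])],
                        space='partitions')
cv_pairs = CrossValidation(clf, partitioner, enable_ca=['stats'])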

> What do you suggest to solve this problem?

If you have some certainty that the samples are independent, then, to get a
reasonable generalization estimate, just assign np.arange(nsamples/2)
(assuming initially balanced classes) as the chunks of the samples of each
condition. Then each chunk is guaranteed to contain a pair of conditions ;)
And then you are welcome to use NGroupPartitioner to bring the number of
partitions down to some more cost-effective number, e.g. 10 -- see the
sketch below.
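
A sketch of that chunk assignment (assuming a dataset 'ds' with two target
values and equally many samples of each, plus the 'clf' from the earlier
sketch; adjust names to your data):

import numpy as np
from mvpa2.generators.partition import NGroupPartitioner
from mvpa2.measures.base import CrossValidation

ds.sa['chunks'] = np.zeros(len(ds), dtype=int)
for target in ds.uniquetargets:
    mask = ds.targets == target
    # samples of this condition get chunks 0, 1, 2, ... in order,
    # so chunk k ends up holding one sample of each condition
    ds.sa.chunks[mask] = np.arange(mask.sum())

# then fold the nsamples/2 chunks into e.g. 10 partitions
cv10 = CrossValidation(clf, NGroupPartitioner(10), enable_ca=['stats'])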

> I have tried to use a ChainNode,
> chaining the NGroupPartitioner and a Balancer but it didn't work,

If I see it right, it should have worked (roughly along the lines of the
sketch below), unless you had a really degenerate case, e.g. one of the
partitions contained samples of only 1 category.
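
Roughly, the chain I would expect to work looks like this (a sketch -- the
NGroupPartitioner and Balancer settings here are guesses, adjust them to what
you actually used):

from mvpa2.base.node import ChainNode
from mvpa2.generators.partition import NGroupPartitioner
from mvpa2.generators.resampling import Balancer
from mvpa2.measures.base import CrossValidation

# first partition into 5 groups, then balance targets within each
# partitioning; apply_selection=True makes the Balancer actually subsample
# the dataset rather than just mark the selected samples
partitioner = ChainNode([NGroupPartitioner(5),
                         Balancer(attr='targets', count=1,
                                  limit='partitions',
                                  apply_selection=True)],
                        space='partitions')
cv_bal = CrossValidation(clf, partitioner, enable_ca=['stats'])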

> apparently due to a bug in Balancer (see another mail on that one).

Oops -- I need to check my email then...

> My main question though is: it seems weird to add a chunks attribute like
> this. Is it the correct way?

Well... if you consider your samples independent from each other, then
yes -- it is reasonable to assign each sample to a separate chunk.
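
In code, that per-sample assignment is just (same 'ds' as in the sketches
above):

import numpy as np
ds.sa['chunks'] = np.arange(len(ds))   # one unique chunk per sample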

> Btw, is there a way to pick at random 80% of the data (with equal
> number of samples for each target) for training and the remaining 20%
> for testing, and repeat this as many times as I want to obtain a
> consistent result?

Although I do not think we have tried it, this should do:

CrossValidation(clf,
                Balancer(amount=0.8, limit=None, attr='targets', count=3),
                splitter=Splitter('balanced_set', [True, False]))

It should run the cross-validation over 3 such random balanced selections
(raise 3 to whatever number you like).

What we do here: we tell the Balancer to balance the targets, take 80% of the
samples and mark them True, and the other 20% False. Then we proceed to the
cross-validation. That uses an actual Splitter, which splits the dataset into
training/testing parts. Usually such a splitter is not specified explicitly
and is constructed by CrossValidation under the assumption that it operates
on partitions labeled 0, 1 (and possibly 2), as usually provided by
Partitioners. But now we want to split based on the balanced_set attribute --
and we can do that, instructing the Splitter to take the 80% marked True for
training, and the rest (False) for testing.

limit=None is there to say that the subsampling should not be limited to
(performed within) any particular attribute (commonly chunks), so in this
case you do not even need to have chunks at all.
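
Put together, a self-contained version of that snippet might look like this
(a sketch, assuming PyMVPA 2.x module paths and that your version's
CrossValidation accepts a splitter argument):

from mvpa2.clfs.svm import LinearCSVMC
from mvpa2.generators.resampling import Balancer
from mvpa2.generators.splitters import Splitter
from mvpa2.measures.base import CrossValidation

clf = LinearCSVMC()
# per repetition: take a balanced 80% of the samples and mark them True in
# the 'balanced_set' sample attribute, the rest False; count=3 -> 3 random
# selections
balancer = Balancer(amount=0.8, limit=None, attr='targets', count=3)
# split on 'balanced_set' instead of the default 'partitions' attribute:
# True -> training, False -> testing
cv = CrossValidation(clf, balancer,
                     splitter=Splitter('balanced_set', [True, False]),
                     enable_ca=['stats'])
results = cv(ds)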

Is that what you needed?

--
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa