[pymvpa] RFE & Permutation

Matthias Ekman Matthias.Ekman at nf.mpg.de
Tue Feb 2 19:40:52 UTC 2010


Thanks for sharing this information!

Yaroslav Halchenko wrote:
> btw -- a few hints.
>
> if you have some assumptions about the chance distribution (e.g. you
> indeed have independent samples in testing, etc.), then you could
> resort to parametric testing... e.g., if I think that it should be
> close to a binomial distribution, then with a sufficient number of
> trials (like in your case) it is relatively well approximated by a
> normal.

Independent samples -- like if you train/predict on different subjects?
Otherwise it's hard to argue what defines independent samples, right?
(It might be difficult even with different subjects scanned on the same
fMRI machine, but let's neglect this for a moment :))

Or would you say that long null events between chunks are sufficient to
justify a binomial/normal distribution?


> Then, instead of using the
> default Nonparametric distribution estimator in MCNullDist, you can use
> something like
>
> null_dist = MCNullDist(scipy.stats.norm, permutations=100, tail='left')
>
> That would fit a normal distribution to the data from 100 permutations and
> assess the p-value from it.
>
> NB: the normal approximates the binomial quite well for a reasonable number
> of trials.  The function above doesn't do the continuity correction
> (http://en.wikipedia.org/wiki/Continuity_correction) yet, but that is
> negligible for a reasonable sample size.
>
> Moreover, let's say I know that by chance the mean performance should be
> 0.5; then I can help it out by fixing the distribution at that mean
> (unfortunately, for that you would need to use maint/0.4
> or yoh/0.4 or yoh/master with the fix I submitted yesterday):
>
> null_dist = MCNullDist(rv_semifrozen(scipy.stats.norm, loc=0.5),
>                        permutations=100, tail='left')
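
Regarding the continuity-correction remark above: a quick check with plain
scipy (the numbers below are made up purely for illustration -- 40 errors out
of 100 trials at a chance level of 0.5) shows that the corrected normal
approximation sits closer to the exact binomial tail, and that the difference
is indeed small at this kind of sample size:

    import scipy.stats

    n, k, p = 100, 40, 0.5           # trials, observed errors, chance level
    mu = n * p
    sd = (n * p * (1 - p)) ** 0.5
    print scipy.stats.binom.cdf(k, n, p)             # exact binomial tail
    print scipy.stats.norm.cdf((k - mu) / sd)        # plain normal approximation
    print scipy.stats.norm.cdf((k + 0.5 - mu) / sd)  # with continuity correction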

Thanks a lot. I am currently behind a stupid firewall, but I'll try it
as soon as I am at home.
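
For reference, a minimal, untested sketch of how the suggested parametric null
distribution could be wired into a cross-validation run. The import locations,
the module holding rv_semifrozen, and the null_prob state name are assumptions
based on the 0.4-era API; dataset, clf and splitter are placeholders for
whatever you already use:

    import scipy.stats
    from mvpa.suite import CrossValidatedTransferError, TransferError, \
         MCNullDist
    # module holding rv_semifrozen is an assumption (0.4-era layout)
    from mvpa.clfs.stats import rv_semifrozen

    # normal null distribution with its mean fixed at the chance level of a
    # binary problem (0.5), tested against the left (low-error) tail
    null_dist = MCNullDist(rv_semifrozen(scipy.stats.norm, loc=0.5),
                           permutations=100, tail='left')

    cv = CrossValidatedTransferError(TransferError(clf), splitter,
                                     null_dist=null_dist,
                                     enable_states=['confusion'])
    error = cv(dataset)
    # p-value of the observed error under the fitted normal null
    # (state name null_prob assumed from the 0.4 API)
    print cv.null_prob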

> The advantage of those parametric tests is exactly for your case -- very low
> p-values, where you simply don't have enough power from
> non-parametric testing (e.g. to get a p-value as low as 10^(-x) you
> would need to do 10^x permutations); in your case you simply can't
> get a p-value below 0.001 since you are doing 1000 permutations.

OK, very important point!
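
To make the resolution limit concrete: with N permutations the smallest
p-value a non-parametric test can ever report is roughly 1/N, e.g.:

    # smallest reportable p-value from a permutation test is roughly
    # 1 / (number of permutations)
    for n_perm in (100, 1000, 10000):
        print n_perm, 1.0 / n_perm
    # 100 -> 0.01, 1000 -> 0.001, 10000 -> 0.0001; reporting something
    # like p < 1e-5 non-parametrically would already require 100000
    # permutations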

> On the other hand, parametric testing approximates non-parametric
> results even when the tested value (the error) lies in the heavy part of the
> distribution.
>
> I hope this is of some value ;)

yes, definitely :)

Thanks,
 Matthias


>
> On Wed, 27 Jan 2010, Yaroslav Halchenko wrote:
>
>> could you also enable storing all estimates from MC... i.e.
>
>>         cv = CrossValidatedTransferError(
>>             TransferError(clf),
>>             splitter,
>>             null_dist=MCNullDist(permutations=no_permutations,
>>                                  tail='left',
>>                                  enable_states=['dist_samples']),
>>             enable_states=['confusion'])
>
>> weird enough -- either I do not think straight or something is strange -- the chance
>> distribution after permutation on our test data is indeed quite biased towards high
>> values (which are errors), although I would expect its mean to be at chance
>> (i.e. 0.5 since I did binary classification).
>
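
A hedged sketch of how the stored permutation errors from the snippet above
could be pulled out and sanity-checked afterwards. Attribute-style access to
the dist_samples state and the import path follow the usual 0.4 conventions
but are assumptions here; no_permutations, clf, splitter and dataset are the
same placeholders as in the quoted snippet:

    import numpy as N
    from mvpa.suite import CrossValidatedTransferError, TransferError, \
         MCNullDist

    # keep a handle on the null distribution so its dist_samples state can
    # be read back after the cross-validation has run
    null_dist = MCNullDist(permutations=no_permutations, tail='left',
                           enable_states=['dist_samples'])
    cv = CrossValidatedTransferError(TransferError(clf), splitter,
                                     null_dist=null_dist,
                                     enable_states=['confusion'])
    error = cv(dataset)

    # raw permutation errors kept by the enabled dist_samples state;
    # for an unbiased binary problem their mean should sit near 0.5
    samples = N.asarray(null_dist.dist_samples)
    print samples.mean(), samples.min(), samples.max()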

