[pymvpa] Alternatives to Cross-validation

Yaroslav Halchenko debian at onerussian.com
Thu Oct 25 13:23:08 UTC 2012


On Thu, 25 Oct 2012, Jacob Itzhacki wrote:

>    "e.g. �some super-ordinate category (e.g. �animate-vs-inanimate) �you
>    would like to cross-validate not across functional runs BUT across
>    sub-ordinate stimuli categories (e.g. train on
>    humans/reptiles/shoes/scissors to discriminate animacy and
>    cross-validate into bugs/houses, then continue with another pair to take
>    out)."
>    BTW, this is exactly what I would like to do, but I still can't figure out
>    how to leave the test trials out of the training trials, so they don't get
>    classified into themselves.

ok then -- the point is to craft such an interesting partitioner.  And there
are actually 2 approaches to this.  Let's first look at

https://github.com/PyMVPA/PyMVPA/blob/HEAD/mvpa2/tests/test_usecases.py#L50

which I am citing here with some additional comments, omitting the import
statement(s) -- it is a bit more cumbersome since it uses 6 subordinate
categories and 3 superordinate ones (not 2, which would have made the
explanation easier):

    # Let's simulate the beast -- 6 categories total, grouped into 3
    # super-ordinate ones, and actually without any 'superordinate' effect
    # since the subordinate categories are independent

# in your case I hope you would have a true superordinate effect, like in the
# example study I am referring to below

    ds = normal_feature_dataset(nlabels=6,
                                snr=100,   # pure signal! ;)
                                perlabel=30,
                                nfeatures=6,
                                nonbogus_features=range(6),
                                nchunks=5)
    ds.sa['subord'] = ds.sa.targets.copy()

# Here I am creating a new 'superord' category as the remainder of division by 3
# of the original 6 categories (in 'subord')

    ds.sa['superord'] = ['super%d' % (int(i[1])%3,)
                         for i in ds.targets]   # 3 superord categories
    # let's override original targets just to be sure that we aren't relying on them
    ds.targets[:] = 0

    npart = ChainNode([
    ## so we split based on superord

# So this NFold partitioner would take out 3 subord categories at a time
# (possibly even multiple from the same superord category)

        NFoldPartitioner(len(ds.sa['superord'].unique),
                         attr='subord'),
        ## so it should select only those splits where we took 1 from
        ## each of the superord categories leaving things in balance
        Sifter([('partitions', 2),
                ('superord',
                 { 'uvalues': ds.sa['superord'].unique,
                   'balanced': True})
                ]),
        ], space='partitions')

# And with that NFold + Sifter combination we achieve the desired effect: we
# keep only those splits where testing gets 3 different subord categories, one
# from each superord category
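
# (not in the original test -- just a sketch to convince yourself that the
# partitioner does what we want: iterate over the generated partitions and
# check which subord categories end up in the testing partition, and that all
# superord categories are represented there)

    for pds in npart.generate(ds):
        testing = pds[pds.sa.partitions == 2]   # 2 marks the testing partition
        print("testing subord: %s  superord: %s"
              % (testing.sa['subord'].unique, testing.sa['superord'].unique))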

    # and then do your normal cross-validation, with the clf operating in
    # space='superord'
    clf = LinearCSVMC(space='superord')

    cvte_regular = CrossValidation(clf, NFoldPartitioner(),
                                   errorfx=lambda p,t: np.mean(p==t))

# below we use our NFold + Sifter partitioner instead of a simple NFold on chunks

    cvte_super = CrossValidation(clf, npart, errorfx=lambda p,t: np.mean(p==t))

# apply as usual ;)

    accs_regular = cvte_regular(ds)
    accs_super = cvte_super(ds)
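
BTW, I omitted the imports above; if you actually want to run the snippet,
something along these lines should pull in everything needed (just my sketch of
the explicit module paths -- they might differ across versions, and a plain
"from mvpa2.suite import *" would do the job as well):

    import numpy as np
    from mvpa2.misc.data_generators import normal_feature_dataset
    from mvpa2.base.node import ChainNode
    from mvpa2.generators.partition import NFoldPartitioner
    from mvpa2.generators.base import Sifter
    from mvpa2.clfs.svm import LinearCSVMC
    from mvpa2.measures.base import CrossValidation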


If you are interested in how that would affect the results -- I would invite
you to look at my recent poster at SfN 2012:
http://haxbylab.dartmouth.edu/publications/HGG+12_sfn12_famfaces.png
(2nd column, scatter plot "Why across identities?"),

where the x-axis shows z-scores of the CV statistics for searchlight
classification of personal familiarity with faces when cross-validating across
functional runs, while the y-axis shows the same when cross-validating across
pairs of individuals (identities).  Both results are in high agreement EXCEPT
in the "blue areas" -- early visual cortex -- where, if we cross-validate
across functional runs, the classifier might just learn identity information.
Since the identity of a face (subordinate category) here has a clear
association with familiarity (superordinate), it would yield significant
classification results wherever there is strong identity information in the
stimuli (in our case in early visual cortex, since the faces were actually
different ;) ) but possibly no (strong) superord effects (let's forget for now
about possible attention/engagement etc effects).  By cross-validating across
identities (subord), we can easily get rid of those subord-specific effects
and capture the superord category effects more clearly.

An alternative, even stricter cross-validation scheme would involve
cross-validating across runs BUT then also bootstrapping additional folds for
each such split by generating all the splits across identities as well.  For
that we have ExcludeTargetsCombinationsPartitioner, the docs for which are at
http://www.pymvpa.org/generated/mvpa2.generators.partition.ExcludeTargetsCombinationsPartitioner.html?highlight=excludetargetscombinationspartitioner
and a unittest at
https://github.com/PyMVPA/PyMVPA/blob/HEAD/mvpa2/tests/test_generators.py#L266
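
Just to sketch how that could look for the toy example above (my guess at
reasonable parameters, not the exact code from the paper -- k=2 and
targets_attr='subord' are assumptions for this illustration, so check the
docs/unittest above for the authoritative usage):

    # leave-one-run-out, and within each run-wise split additionally exclude
    # one combination (pair) of subord categories from the training partition
    npart_strict = ChainNode([
        NFoldPartitioner(attr='chunks'),
        ExcludeTargetsCombinationsPartitioner(k=2,
                                              targets_attr='subord',
                                              space='partitions'),
        ], space='partitions')
    cvte_strict = CrossValidation(clf, npart_strict,
                                  errorfx=lambda p, t: np.mean(p == t))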

This one was used in the original hyperalignment paper
(http://haxbylab.dartmouth.edu/publications/HGC+11.pdf) so as not to fall into
the trap of run-order effects...

I would be glad to see people report back comparing these 3 cross-validation
schemes (just across runs, across subord, across runs+subord) on their data
with a hierarchical category design. Thanks in advance for sharing -- it would
be great to get a dialog going instead of my one-way blurbing... doh --
sharing! ;)

Cheers,

>    On Wed, Oct 24, 2012 at 5:15 PM, Jacob Itzhacki <[1]jitzhacki at gmail.com>
>    wrote:

>      Please do!

>      and thank you for all the responses :D
>      Don't want to come across as lazy but I'm not a master coder at all so
>      sometimes figuring out what one line of code does can be quite the
>      ordeal, in my case.
>      J
>      On Wed, Oct 24, 2012 at 3:54 PM, Yaroslav Halchenko
>      <[2]debian at onerussian.com> wrote:

>        On Wed, 24 Oct 2012, MS Al-Rawi wrote:
>        >    Cross-validation is fine even in this case, you'll just need to
>        rearrange
>        >    your data in a way to leave-a-set-of-stimuli out, instead of
>        >    leave-one-run-out. Perhaps PyMVPA has some functionality to do
>        this.

>        now it is getting interesting -- I think you got close to what I
>        thought
>        the question was about: to investigate the conceptual/true effect of
>        e.g. some super-ordinate category (e.g. animate-vs-inanimate) you
>        would like to cross-validate not across functional runs BUT across
>        sub-ordinate stimuli categories (e.g. train on
>        humans/reptiles/shoes/scissors to discriminate animacy and
>        cross-validate into bugs/houses, then continue with another pair to
>        take
>        out). And that is what I thought for a moment the question was
>        about ;)

>        This all can be (was) done with PyMVPA although would require 3-4
>        lines of code instead of 1 to accomplish ATM. If anyone interested I
>        could provide an example ;)... ?
-- 
Yaroslav O. Halchenko
Postdoctoral Fellow,   Department of Psychological and Brain Sciences
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        


