[pymvpa] Dataset with multidimensional feature vector per voxel

Yaroslav Halchenko debian at onerussian.com
Tue Nov 17 13:36:20 UTC 2015


On Tue, 17 Nov 2015, Ulrike Kuhl wrote:

> Dear Yaroslav, dear all,

> thanks for your explanations - generating the dataset is working out. :-) 

> I have been playing around with classification for a while now. Unfortunately, there is one really basic thing I don't get when I look at the result: the 'accuracy' is perfect on perfect data (that does not contain any noise). That means that the output accuracy map has a 1 at those voxels within the searchlight space that contain the signal difference between two groups - nice! 
> All other voxels in the original data are identical for both groups. Classification accuracy at these voxels is 0, but shouldn't accuracy be at the 0.5 level, showing 'random' classification performance?

First of all: on behalf of the whole community, thank you for "testing"
your analysis scripts to validate their correct operation on fake data
before going full steam ahead and applying them to empirical data! ;-)

And indeed, in the case of two-class classification it should be around 0.5
(if classes are balanced, etc.)
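
For reference, a minimal self-contained sketch (toy data only, nothing from
your setup) of what chance level should look like on pure-noise features
with balanced classes:

import numpy as np
from mvpa2.suite import Dataset, CrossValidation, NFoldPartitioner, LinearCSVMC

# pure-noise toy data: 20 samples x 10 features, balanced two-class targets
np.random.seed(0)
ds = Dataset(np.random.randn(20, 10))
ds.sa['targets'] = np.tile([0, 1], 10)
ds.sa['chunks'] = np.repeat(np.arange(10), 2)   # 10 chunks of 2 samples each

cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
                     errorfx=lambda p, t: np.mean(p == t))  # accuracy
res = cv(ds)
print np.mean(res.samples)   # should hover around 0.5, not at 0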

> If I use other error functions within the cross-validation, like 'mean_mismatch_error', the situation is completely reversed - the error at the signal-encoding voxels is 0, and 1 everywhere else. Again, I would have expected 0.5, representing random classification error.

> Am I just confusing how accuracy and error are computed, or is there still something wrong with my code? Likewise, if I add random noise to the toy data, classification fails completely (all voxels at 0 for accuracy, or at 1 for error).

something is funky indeed... most probably in the data itself

could you provide the output of

print DS_clean.summary()
print res_clean.summary()

?

> # Here's the code for classification using the 'ds_setup' function we talked about so far:

> [...]

> DS_clean = ds_setup(sub_list, param_list, data_path)

> clf_clean = LinearCSVMC()
> cvte_clean = CrossValidation(clf_clean, NFoldPartitioner(attr='subject'),
>                              errorfx=lambda p, t: np.mean(p == t),  # accuracy
>                              enable_ca=['stats'],
>                              postproc=mean_sample())
> # alternatively, for computing an error instead of an accuracy:
> # cvte_clean = CrossValidation(clf_clean, NFoldPartitioner(attr='subject'),
> #                              errorfx=mean_mismatch_error,
> #                              enable_ca=['stats'],
> #                              postproc=mean_sample())

> sl  = Searchlight(
>    cvte_clean,
>    IndexQueryEngine(                                 # so we look for neighbors in both space and across modality_index
>        voxel_indices=Sphere(3),                      # that is pretty much what sphere_searchlight(radius=3) does
>        modality_index=Sphere(len(param_list))),      # cover all parameter values we have
>    postproc=mean_sample(),
>    enable_ca=['stats'],
>    roi_ids=np.where(DS_clean.fa.modality_index == 0)[0]  # restrict searchlight centers to features with modality_index == 0
> )

> res_clean = sl(DS_clean)

> # get accuracy values per voxel
> sphere_accuracy = res_clean.samples
> out_fname = '/path/to/output.nii.gz'  # a NIfTI filename, not the input data_path
> map2nifti(DS_clean, sphere_accuracy).to_filename(out_fname)



> Thanks again for your help!
> Ulrike



> ----- Original Message -----
> From: "Yaroslav Halchenko" <debian at onerussian.com>
> To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa at lists.alioth.debian.org>
> Sent: Monday, 9 November, 2015 22:26:45
> Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel

> A few fixups listed below, inline in the code.

> On Mon, 09 Nov 2015, Ulrike Kuhl wrote:

> > Hello again!

> > I'm still stuck on my analysis using multiple diffusion-derived parameter values for the same voxel.
> > Thanks to your help, I think I've got the dataset together, but I run into an error when attempting the final searchlight approach.


> > This is what I've got by now for setting up the dataset (defined as a function):


> > def ds_setup(sub_list, param_list, data_path):
> >     "Setting up the dataset, loading data etc."

> >     # make space:
> >     dss = []
> >     modality_all = []
> >     modality_idx_all = []
> >     voxel_idx_all = []
> >     subject_all = []
> >     targets_all = []
> >     chunks_all = []

> >     for sub_index, sub in enumerate(sub_list):
> >         if sub.startswith('sub1'):
> >             learn = 1
> >         elif sub.startswith('sub0'):
> >             learn = 0
> >         else:
> >             raise ValueError("Do not know what to do with %s" % sub)

> >         # ds_param will contain the datasets for the different parameters of the same subject
> >         ds_param = []
> >         # idea: collect the feature attributes and glue it all together in the end
> >         modality_param = []
> >         modality_idx_param = []
> >         subject_param = []
> >         targets_param = []
> >         chunks_param = []
> >         voxel_idx_param = []

> >         for suf_index, suf in enumerate(param_list):
> >             ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub,sub,suf))

> wasn't there some kind of a mask?

> >             ds.fa['modality'] = [suf]             # for each feature, set the modality attribute
> >             ds.fa['modality_index'] = [suf_index] # this as numeric might come handy for searchlights later
> >             ds.fa['targets'] = [learn]
> >             ds.fa['chunks'] = [sub_index]

> those targets and chunks should go into '.sa' (sample attributes),
> not '.fa' (feature attributes)...  and you don't need all those manually
> constructed _param lists -- everything should simply be contained in the
> dataset(s).  That is somewhat important, since then the necessary checks
> get done while hstack/vstack-ing those datasets, and if some fa or sa
> values don't match (i.e. you don't have exactly aligned features or
> samples) -- those fa/sa would get dropped... might that be what you
> observed?
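
> e.g., a minimal sketch of that (mask_fname here is hypothetical -- use
> whatever mask you have):

> ds = fmri_dataset(data_path + '/%s/%s_%s.nii.gz' % (sub, sub, suf),
>                   mask=mask_fname)                     # per-subject/parameter volume
> ds.fa['modality'] = [suf] * ds.nfeatures               # per-feature -> .fa
> ds.fa['modality_index'] = [suf_index] * ds.nfeatures
> ds.sa['targets'] = [learn]                             # per-sample -> .sa
> ds.sa['chunks'] = [sub_index]
> ds.sa['subject'] = [sub]
> ds_param.append(ds)

> then hstack(ds_param) per subject and vstack(dss) across subjects should
> carry those attributes through without any of the manual bookkeeping.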


> >             modality_param.extend(ds.fa['modality'].value)
> >             modality_idx_param.extend(ds.fa['modality_index'].value)
> >             subject_param.extend(ds.fa['subject'].value)
> >             targets_param.extend(ds.fa['targets'].value)
> >             chunks_param.extend(ds.fa['chunks'].value)
> >             voxel_idx_param.extend(ds.fa['voxel_indices'].value)
> >             ds_param.append(ds)

> >         ds_subj = hstack(ds_param)
> >         # collect future feature attributes:
> >         modality_all.append(modality_param)
> >         modality_idx_all.append(modality_idx_param)
> >         voxel_idx_all.append(voxel_idx_param)

> >         # collect future sample attributes
> >         targets_all.append(learn)
> >         chunks_all.append(sub_index)
> >         subject_all.append(sub)

> >         dss.append(ds_subj)

> >     dsall = vstack(dss)

> >     # create the actual dataset
> >     DS = Dataset(dsall)
> >     DS.fa['modality'] = modality_all[0]
> >     DS.fa['modality_index'] = modality_idx_all[0]
> >     DS.fa['voxel_indices'] = voxel_idx_all[0]
> >     DS.sa['targets'] = targets_all
> >     DS.sa['chunks'] = chunks_all
> >     DS.sa['subject'] = subject_all

> >     return DS



> > This gives me the following: DS is a dataset with dimensions 'no_subjects' * ('no_voxels' * 'no_parameters'). Importantly, each feature has the corresponding voxel indices associated with it - since the different parameters are just concatenated, the sequence of these index arrays repeats for each parameter.
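
> a quick way to sanity-check that layout (just illustrative -- attribute
> names as in your code):

> print DS.shape                          # (n_subjects, n_voxels * n_parameters)
> print DS.fa.voxel_indices.shape         # (n_voxels * n_parameters, 3)
> print np.unique(DS.fa.modality_index)   # one entry per parameter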

> > I've simulated a little set of perfect two-group toy data (= 100% SNR) and fed the corresponding dataset into an SVM leave-one-out cross-validation like this:

> > clf = LinearCSVMC()
> > cvte = CrossValidation(clf, NFoldPartitioner(),

> you could say explicitly here what you would like to cross-validate
> over -- it makes the code easier to read (and then you don't need to
> define the overgeneric 'chunks'):

> NFoldPartitioner(attr='subject')
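
> and if you want to eyeball the resulting folds -- each one should leave out
> exactly one subject:

> for fold in NFoldPartitioner(attr='subject').generate(DS):
>     print np.unique(fold.sa.subject[fold.sa.partitions == 2])  # the left-out subject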

> > ...                        errorfx=lambda p, t: np.mean(p == t),
> > ...                        enable_ca=['stats'])
> > cv_results = cvte(DS)

> > The result is indeed perfect - classification performance is 100%. :-) So far, so good. 



> > However, when I want to use this dataset for a searchlight approach, I run into an error. Here's the code:

> > sl = sphere_searchlight(cvte, radius=3, postproc=mean_sample())
> > res = sl(DS)
> > print res

> > ValueError: IndexQueryEngine has insufficient information about the dataset spaces. It is required to specify an ROI generator for each feature space in the dataset (got: ['voxel_indices'], #describable: 125000, #actual features: 375000).

> > So the code picks up on the fact that there are multiple instances of the same voxel-index-array within one sample (three parameters = three times the same index-array). I wrongly expected that it would take all values with the corresponding indices into account for MVPA. 

> > How do I "glue" these multiple values per sample for the same voxel index-array together such that the searchlight algorithm works? Am I on the right track at all or did I get carried away?

> that is why I defined that 'modality_index' ... Let me briefly explain
> what is happening, in the hope that it makes sense ;-) -- our "search for
> the neighborhood" by default relies only on .fa.voxel_indices and assumes
> that those are unique.  Now that you have 3 different values for each
> physical voxel, it gets confused.  So we just need to tell it more about
> what the neighborhood really is, and not rely on the sphere_searchlight
> helper:

> sl  = Searchlight(
>    cvte, 
>    IndexQueryEngine(
>        voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
>        modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
>    postproc=mean_sample())

> if you run this sl(DS) it will work, BUT it would pretty much do the same
> analysis 3 times.  So we can restrict it to search for "modality_index"
> neighbors only when we are looking at modality_index == 0:

> sl  = Searchlight(
>    cvte, 
>    IndexQueryEngine(   # so we look for neighbors in both space and across modality_index
>        voxel_indices=Sphere(3),    # that is pretty much what sphere_searchlight(radius=3) does
>        modality_index=Sphere(10)), # 10 is just big enough to cover all 3 values you have
>    postproc=mean_sample(),
>    roi_ids=np.where(DS.fa.modality_index == 0)[0]
> )

> This should produce results of the necessary "1 brain" dimensionality ;)
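
> a quick way to verify (the result should have exactly one feature per
> voxel, i.e. as many features as there are modality_index == 0 entries):

> print res.nfeatures                       # one value per searchlight center
> print np.sum(DS.fa.modality_index == 0)   # should match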

> I know -- this is all a bit unnatural.  Such functionality was primarily
> created to carry out "hybrid" spatio-temporal searchlighting, where you
> could explore both space and time.  But it seems that this use case is
> becoming popular enough that we had better support it without requiring
> so much code/magic ;)

> Let me know how it goes
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        


