[Neurodebian-upstream] [Nipy-devel] Standard dataset

Michael Hanke michael.hanke at gmail.com
Wed Sep 22 14:15:26 UTC 2010


On Tue, Sep 21, 2010 at 06:11:09PM -0700, Matthew Brett wrote:
> Right - but I wanted to try and separate the machinery for the testing
> (code) from providing the data.  In the particular example we're
> thinking of, DICOM, I was thinking of providing a dcm2nii or SPM (or
> whichever was most successful) nifti version with each set of DICOM
> files, so I could test that we got close to that, or differed in a
> sensible way.   But that test itself would be in nibabel machinery.

I got that and it makes sense. Moreover, it is perfectly compatible with
what we aim at. I guess the difference is one of perspective. You view
the problem from the "upstream" point of view, i.e. what can I do to
trust/test my software in a sensible way (no unnecessary duplication of
logic and data). We are aiming at the system level, where we have a lot
of "untrusted" code that can behave in unforeseen ways when combined
into a single environment. Our major focus lies on extensive testing of
the whole thing, not only its pieces -- a data sharing/deployment
framework is part of that problem.

That being said, I'm confident that whatever data deployment solution
is appropriate for NiPy will also have significant benefits for us --
even if we might want alternative "backends" for deployment that let us
use all the infrastructure we have in Debian.

I guess we should fade the testing aspect out of this
discussion and focus on the data.


> > Getting it shouldn't be a problem -- although it would need to have ways
> > for robust distribution. The whole neuroimaging world downloading TB
> > datasets from a single machine probably doesn't do well.
> 
> That's true, but maybe that's a second level problem once we have
> worked out how it should be correctly done.  Maybe I'm just saying
> that because I don't want to think about it ;)

I understand ;-) but it would be good to keep alternatives for a
"transport layer" in mind and not exclusively focus on system that are
permanently connected to the web and get everything via HTTP GET.

> The way we were trying to solve this with the nipy data packages was
> just to have a predictable algorithm for how to find the data packages
> - specifically - they'd by default be in certain places, and if not in
> those places, they can be in places pointed to by an environment
> variable or user configuration file.  Once you have such an algorithm,
> the code to implement it is really easy in whatever language you're
> running it.
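
Just to make that concrete, a rough Python sketch of such a lookup
could look like the following (the environment variable and default
directories are made up for illustration):

    import os

    def find_data_package(name):
        # hypothetical search order: environment variable first, then
        # a couple of default locations (a user config file could be
        # consulted in the same way)
        roots = [p for p in
                 os.environ.get('DATA_PATH', '').split(os.pathsep) if p]
        roots += ['/usr/share/data', os.path.expanduser('~/data')]
        for root in roots:
            candidate = os.path.join(root, name)
            if os.path.isdir(candidate):
                return candidate
        return None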

Right, we might want to aim for an abstraction layer for such a data
package registration system. Something like:

$ whereisdata colin27
/usr/share/data/mni-colin27

$ showdatacontent colin27
colin27_t1_tal_lin_headmask.nii.gz
colin27_t1_tal_lin_mask.nii.gz
colin27_t1_tal_lin.nii.gz

plus some metadata magic like

$ showdatacontent colin27 T1
colin27_t1_tal_lin.nii.gz

$ showmealldata T1
/usr/share/data/mni-colin27/colin27_t1_tal_lin.nii.gz
/usr/share/data/spm8/canonical/avg305T1.nii

(or similar stuff in Python, Perl, C, Haskell, ...)
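
For example, the same queries from Python might look roughly like this
(the module and function names are made up, just to illustrate the
abstraction):

    >>> import datareg   # hypothetical registry module
    >>> datareg.locate('colin27')
    '/usr/share/data/mni-colin27'
    >>> datareg.contents('colin27', tag='T1')
    ['colin27_t1_tal_lin.nii.gz']
    >>> datareg.find(tag='T1')
    ['/usr/share/data/mni-colin27/colin27_t1_tal_lin.nii.gz',
     '/usr/share/data/spm8/canonical/avg305T1.nii']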

That would conveniently split the underlying system into:

1. A facility to get the data onto a system

   Several readily usable solutions are available.

2. A system to register the data (once available on a system)

   I bet there is something that already provides this facility. It
   would need to be something with a system-level config/content that is
   extensible by user-provided data (just like any reasonable
   application allows for customizations), and simply knows about
   datasets and their content, where each piece (whole dataset and
   individual content item) is tagged with whatever metadata -- see the
   rough sketch below.
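
To illustrate the second part: such a registry could be as simple as a
plain-text file shipped by the system that users can extend from their
home directory. A rough Python sketch (file locations and line format
are made up):

    import os

    def load_registry():
        # hypothetical layered registry: system-wide entries plus
        # per-user additions; one line per item:
        #   <dataset> <file> [<tag> ...]
        registry = {}
        for path in ['/etc/data-registry',
                     os.path.expanduser('~/.data-registry')]:
            if not os.path.exists(path):
                continue
            for line in open(path):
                fields = line.split()
                if len(fields) < 2:
                    continue
                dataset, item, tags = fields[0], fields[1], fields[2:]
                registry.setdefault(dataset, []).append((item, set(tags)))
        return registry

    def find_all(registry, tag):
        # all items (across datasets) carrying a given tag
        return [item for items in registry.values()
                for item, tags in items if tag in tags]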

This is not to say that there couldn't be something that sits above both
and obtains the necessary data when it is requested but not yet present.
But having these two mechanisms clearly separated in the design would
allow for more efficient alternative implementations/data sources
whenever they are available -- think: torrents, system package manager,
local repositories, ...

> Maybe we should start a conversation with the okfn people (datapkg)
> for their thoughts?

+1

Michael

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de


