[Neurodebian-upstream] [Nipy-devel] Standard dataset

Matthew Brett matthew.brett at gmail.com
Wed Sep 22 01:11:09 UTC 2010


Hi,

> On Tue, Sep 21, 2010 at 01:54:09PM -0700, Matthew Brett wrote:
>> It seems to me that 3) - the tests - have to be configured by the
>> individual software packages.
>
> If they are package-specific unit tests (or regression tests). But we
> also want to have comparative tests that test multiple implementations
> regarding similar or identical output -- think: 15 DICOM->NIfTI
> conversion implementations should ideally all be identical, but right
> now are often not even in the same ballpark. That is a meta-test that
> lives outside of a single 'upstream' package.

Right - but I wanted to try and separate the machinery for the testing
(the code) from providing the data.  In the particular example we have
in mind, DICOM, I was thinking of providing a dcm2nii or SPM (or
whichever was most successful) nifti version with each set of DICOM
files, so I could test that we got close to that, or differed in a
sensible way.  But that test itself would live in the nibabel machinery.
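
To make that concrete, the nibabel-side test might look something like
this (a rough, untested sketch - the file paths and tolerance are made
up; nib.load and np.allclose are the only real API calls):

import numpy as np
import nibabel as nib

def assert_close_to_reference(our_nifti, reference_nifti, tol=1e-4):
    # 'reference_nifti' is the dcm2nii / SPM conversion shipped alongside
    # the DICOM files in the data package; 'our_nifti' is what we produced.
    ours = nib.load(our_nifti)
    ref = nib.load(reference_nifti)
    # The affines should agree to within rounding error ...
    assert np.allclose(ours.affine, ref.affine, atol=tol)
    # ... and so should the voxel data, allowing for scaling differences.
    assert np.allclose(ours.get_fdata(), ref.get_fdata(), atol=tol)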

>> a) The user has the name and version of a data package they want from
>> the internet.  They can install a data package matching that name and
>> version
>
> Getting it shouldn't be a problem -- although it would need to have ways
> for robust distribution. The whole neuroimaging world downloading TB
> datasets from a single machine probably doesn't do well.

That's true, but maybe that's a second level problem once we have
worked out how it should be correctly done.  Maybe I'm just saying
that because I don't want to think about it ;)

>> b) If the package is installed, there is an algorithm for the system
>> to find what package is installed, and the version of that package.
>
> Big problem on platforms without a 'package manager'. In general, one
> is available once you limit the scope to some form of environment, e.g.
> Python, MacPorts, Cygwin, ... But for a global solution this is a major
> problem.

The way we were trying to solve this with the nipy data packages was
just to have a predictable algorithm for how to find the data packages
- specifically, they'd by default be in certain places, and if not in
those places, they could be in places pointed to by an environment
variable or a user configuration file.  Once you have such an
algorithm, the code to implement it is really easy in whatever
language you're working in.
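
For concreteness, a minimal sketch of that algorithm in Python - the
environment variable name and the directories here are only
illustrative, not what any project actually uses:

import os

def get_data_path():
    # Search order: environment variable first, then a per-user location,
    # then system-wide defaults.  All the names here are illustrative.
    paths = []
    env = os.environ.get('NIPY_DATA_PATH')
    if env:
        paths.extend(env.split(os.pathsep))
    # A user configuration file could add entries here; a fixed per-user
    # directory stands in for that in this sketch.
    paths.append(os.path.expanduser('~/.nipy/data'))
    paths.extend(['/usr/local/share/nipy', '/usr/share/nipy'])
    return paths

def find_data_package(name):
    # Return the first directory on the search path containing the package.
    for root in get_data_path():
        candidate = os.path.join(root, name)
        if os.path.isdir(candidate):
            return candidate
    return None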

The versions we just stored in an .ini file with the packages
themselves.  I guess I hadn't yet considered what to do about
multiple versions of the same package.  We could also have a
directory scheme for that, I suppose.
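
Reading the version back out is then a one-liner with the standard
library (sketch - the 'config.ini' filename and section layout are just
assumptions here).  A directory scheme for multiple versions could be
as simple as root/name/version/:

import configparser
import os

def package_version(pkg_dir):
    # Each data package carries its version in an .ini file it ships with;
    # 'config.ini' and the DEFAULT section are assumptions of this sketch.
    cfg = configparser.ConfigParser()
    cfg.read(os.path.join(pkg_dir, 'config.ini'))
    return cfg.get('DEFAULT', 'version', fallback=None)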

>> c) The user can install the package into any named location, and tell
>> the system where to look for the data package
>> d) The user can install the data package as root so that the algorithm
>> in b) can find the package
>> e) ditto as non-root
>
> I take those as: it should be a non-chaotic system. Or did they aim at
> something specific?

Non-chaos I guess is the aim...

>> f) A user can create their own data package locally
>
> This is a must. Also related to the extensibility of the system.
>
>> g) The user can install their local data package in the same way as a
>> remote package
>
> Not sure if it has to be exactly identical -- think: APT vs dpkg. But
> the way those 'packages' are 'registered' in the system should make no
> difference between local and remote.
>
>> h) The user can allow the system to find the local package without
>> installation (develop mode)
>
> I don't see how this would work -- or I have a different concept of an
> 'installation'. If you set the PYTHONPATH to a python module source
> tree, you effectively install it (just no copying into system paths).
> How could any system know about the presence of a package without
> installation?

I should rephrase that as: 'it should be possible for the user to
point the data package system at a local, developing set of files, as
a package, without having to copy the files to another location'.
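
With the search-path scheme sketched above that falls out almost for
free - something like this (the variable name is hypothetical again):

import os

# Point the (illustrative) search path at a working source tree.  As long
# as the tree has the same layout as an installed data package - including
# its config.ini - find_data_package() above will pick it up, with no
# copying or install step.
tree = '/home/me/code/my-data-pkg'  # hypothetical develop-mode checkout
existing = os.environ.get('NIPY_DATA_PATH', '')
os.environ['NIPY_DATA_PATH'] = os.pathsep.join([tree, existing]) if existing else tree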

>> i) The user can upload their package somewhere such that another user
>> can find the package as in a)
>
> This is a must. Although not anybody may be allowed to upload to any
> location (obviously) -- but there has to be a common distribution
> format/channel.
>
>> So, actually, our original draft was meant to try and deal with at
>> least some of these problems in a language-neutral way.  By language
>> neutral, I mean you might need python installed to install the
>> package, but you can use the package from any language.
>
> Hmm, so you are aiming at some form of package manager for python?
> That would have to be written? How could you implement the link between
> data versions and software versions? For example: AFNI as of yesterday,
> needs a dataset with NIfTI files that have a new magic header for its
> regression tests (hypothetical use case) -- previous versions would be
> fine. That would only work (IMHO) if the regression test is part of AFNI
> and AFNI is aware of the data package manager and knows how to make it
> get the right data? And it would also be AFNI's duty to do that in a
> platform-appropriate way.

Well, we left the link between data versions and code versions as
stuff in the 'info.py' file for the particular distribution.  For
example, in:

http://github.com/nipy/nipy/blob/master/nipy/info.py

you see:

DATA_PKGS = {'nipy-data': {'version':'0.2'},
             'nipy-templates': {'version':'0.2'}}

That is, once it's algorithmically easy to check whether you have a
package, it's relatively easy to check at install time, whether you
are using Python or some other language.  I mean, we could have a
reference implementation in Python, and other implementations.
Maybe.
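
Concretely, the install-time check is just a loop over DATA_PKGS,
reusing the find/version helpers sketched earlier (and assuming an
exact version match, where a real check would probably want 'at least
this version'):

def check_data_packages(required):
    # 'required' is a dict shaped like DATA_PKGS above.  Returns the names
    # of packages that are missing or at the wrong version, using the
    # find_data_package() and package_version() sketches from earlier.
    problems = []
    for name, info in required.items():
        pkg_dir = find_data_package(name)
        if pkg_dir is None or package_version(pkg_dir) != info['version']:
            problems.append(name)
    return problems

Running check_data_packages(DATA_PKGS) at install (or import) time
would then tell you what needs fetching.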

Maybe we should start a conversation with the okfn people (datapkg)
for their thoughts?

See you,

Matthew


