[Neurodebian-upstream] [Nipy-devel] Standard dataset

Tue Sep 21 19:26:52 UTC 2010

On Tue, Sep 21, 2010 at 11:23:07AM -0700, Matthew Brett wrote:
> I guess, what I'm asking, is, are you Michael or anyone else out there
> interested in trying to work out a generic data packaging first-pass
> draft thing?

Yes. Well, ... (being worried about generic) ... probably ...

Let me start small by giving some background and what we want to
achieve:

1. Need a way to share data between different software packages.
2. Need more standard data to be able to write and run more complex tests.
3. Need tests to verify that individual software and whole
   heterogeneous analysis pipelines on a particular system are working.
4. No preconditions on programming language, etc.

Let me describe a tentative solution that would work in Debian (not
claiming that it is optimal!). Maybe we can somehow link that into a
generic concept -- although I see many problems (see end notes).

- We use a 'standard location' for system data: right now
  '/usr/share/data'.

- We create 'data packages' that are modular (generally not containing anything
  that is only useful for one specific piece of software).

- We create 'test packages' that specify (versioned) dependencies on data and
  software. A test implementation can be anything that can be executed
  and has a meaningful return value. This can be a wrapper for a proper
  testing framework (e.g. nose) or a simple script like FSL's feeds
  test.

- We will provide a common tool that can run all (or a subset of all)
  tests on a system, collect logs and report results.
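
To make the last point a bit more concrete, here is a minimal sketch of what
such a runner could look like in Python. Nothing here exists yet: the names
run_tests and report, the "name -> command" mapping, and the convention that
exit status 0 means "passed" are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def run_tests(tests, log_dir):
    """Run each command in `tests` (name -> argv list); return name -> exit status.

    Combined stdout/stderr of every test is written to <log_dir>/<name>.log.
    Treating exit status 0 as "passed" is an assumption of this sketch.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in sorted(tests.items()):
        # A test can be anything executable: a nose wrapper, a shell script, ...
        proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        (log_dir / f"{name}.log").write_bytes(proc.stdout)
        results[name] = proc.returncode
    return results


def report(results):
    """Return a summary string with one PASS/FAIL line per test."""
    lines = [
        f"{'PASS' if status == 0 else 'FAIL'}  {name}"
        for name, status in sorted(results.items())
    ]
    return "\n".join(lines)
```

Since the runner only looks at the exit status, it does not care which
language or framework a test is written in, which fits goal 4 above.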

Regarding the way data is packaged we only have the following
requirements:

1. Software needs to know where its own data is [easy]
2. Other software needs to know where other software's data is (or at
   least users need to know where all data is) [may need social
   engineering, or distribution patches, but doable]
3. Tests need to be written to be able to find all required input and
   output data [easy, once there is an idea where to put data --
   virtually no tests need to be changed, since there are hardly any
   (by definition this rant is excluding all readers of the email)]
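
As an illustration of requirements 1-3, a test could locate its input with
a helper along these lines. This is only a sketch: the function name
find_data and the layout <root>/<package>/<file> are assumptions, and only
the '/usr/share/data' root comes from the proposal above.

```python
from pathlib import Path

# The tentative 'standard location' for system data.
DATA_ROOT = Path("/usr/share/data")


def find_data(package, relpath, root=DATA_ROOT):
    """Return the path of `relpath` inside data package `package`.

    Assumes (hypothetically) that each data package installs into its own
    directory below the root. Raises FileNotFoundError if the file is absent,
    so a test fails loudly instead of running on missing input.
    """
    candidate = Path(root) / package / relpath
    if not candidate.exists():
        raise FileNotFoundError(
            f"{relpath!r} not found in data package {package!r} under {root}"
        )
    return candidate
```

With a fixed root, the same lookup is trivial to reimplement in any other
language, so nothing here depends on Python.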

When integrating such a facility into Debian we have a relatively easy
job. We have a uniform way to package, a uniform way to assure system
states regarding (versioned) package dependencies, we can have a
standard system-wide location for all data. Moreover, we will soon have
a big machine that can distribute various data packages world-wide
without having to come up with a different management technology (and a
large mirror network).

That looks relatively straightforward to me. BUT if you want a generic
solution that cannot rely on any of the facilities that we have in
Debian, one would have to replace them with equivalents in other
environments or with a global solution to everything (OMG!).

Probably people would also want to have support for data installed in
places where users have write permissions. That is possible, but one
would lose the ability to easily ensure proper versioned dependencies
between data and system-level packages (at least in environments where
this was possible before).
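
A sketch of how a lookup could support user-writable locations: search an
ordered list of roots and take the first hit. The user-level directory
'~/.local/share/data', the function name locate and the layout
<root>/<package>/<file> are assumptions for illustration only.

```python
import os
from pathlib import Path

# Hypothetical search order: user-writable location first, then system-wide.
DEFAULT_ROOTS = [
    Path(os.path.expanduser("~/.local/share/data")),
    Path("/usr/share/data"),
]


def locate(package, relpath, roots=DEFAULT_ROOTS):
    """Return the first existing <root>/<package>/<relpath>, or None.

    Note: a hit in a user-writable root bypasses the package manager, so
    nothing guarantees that its version matches what the software expects.
    """
    for root in roots:
        candidate = Path(root) / package / relpath
        if candidate.exists():
            return candidate
    return None
```

The search order decides whether a user copy overrides the system copy,
and that override is exactly where the versioned dependency guarantee is lost.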

To be honest, I cannot easily see a solution that works with all software
in any language and all data in all environments. I can see how to do it
just for Python, and I can see how to do it for everything in Debian,
though.

--- Quoting: http://knowledgeforge.net/ckan/doc/datapkg/design.html ---
One possibility is to just treat data packages as a software package and
reuse existing packaging systems such as:

* apt (debian/ubuntu)
* distutils/easy_install/pypi (python)

While one would definitely want to reuse such existing infrastructure as
far as possible are there any modifications/additions one need to make
to such system?

It would be unfortunate if a data package system were directly linked to
a particular language or system. Better if specified by a standard that
can be implemented inside any system.  Metadata specs for some of these
systems are a) software oriented b) not obviously available (e.g. apt).
We have therefore chosen to build on top of the python distutils
approach.
-------------------------------------------------------------------------

Looks like they also considered APT. I guess the major concern was that
APT doesn't run on all platforms (because I fail to see what relevant
meta data could not be implemented).

Just a dump of my initial thoughts on this topic. I'd love to get this
sorted out.

Michael

-- 
GPG key:  1024D/3144BE0F Michael Hanke
http://mih.voxindeserto.de