[Pkg-exppsy-pynifti] [Nipy-devel] Example data - a proposal

Mon Jul 13 18:44:55 UTC 2009

On Mon, Jul 13, 2009 at 10:42:56AM -0700, Christopher Burns wrote:
> On Sat, Jul 11, 2009 at 11:28 AM, Matthew Brett<matthew.brett at gmail.com> wrote:
> > As we're working on stuff, the problem of example data keeps coming
> > up.  We often need example data for

> > 1) tests
> > 2) examples

> I think we need to handle these two cases separately.
> 1) Data used for tests:  Set of small files, less than 100K, committed
> to the source repository.
> 2) Data used for examples:  normal data sets that can be run through
> an entire processing stream.

Yes, Matthew and I have come to the very same conclusion,
with the addition of a third set of files: templates and atlases.

> #1 was the original intention of the functional and anatomical files
> in:  <nipy>/testing/

> These should be replaced with a matching set of sub-sampled images.
> Jonathan and I hacked those together in a hurry last year at a sprint.

I think these data files are good. I am not sure if there is any need to
replace them right now (remember: whatever we do to our bzr repo,
we will have to live with it forever :O ).

> But it's important that the test suite be fast and lean, otherwise
> it's a burden to run and as a result gets run less often.

> Also, Debian packaging.  There are two problems with our current test
> data in regard to Debian packaging.
> 1) We require an active network connection to download the data.  Not
> all of the test machines have active networks.
> 2) We store the data in $HOME.  Not all test machines have this.

Yes, both aspects are really problematic. Matthew and I tried to
address both of them over the weekend. This has resulted in the
data-refactor branch:
https://code.launchpad.net/~nipy-developers/nipy/nipy-datarefactor
The two main ideas are that there are two site.cfg entries to
specify locations for templates and example data, and that the
template tarball (which is the current nipy_data.tar.gz) can be given
at install time. This should make packaging possible.
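
To make the site.cfg idea concrete, here is a minimal sketch of how two
such entries could be read. The section name "data" and the keys
"template_path" and "example_data_path" are hypothetical, chosen only
for illustration; they are not necessarily the names used in the branch.

```python
import configparser

# Hypothetical site.cfg content; section and key names are assumptions.
SITE_CFG = """
[data]
template_path = /usr/share/nipy/templates
example_data_path = /usr/share/nipy/examples
"""


def read_data_paths(cfg_text):
    """Return (template_path, example_data_path); None for missing entries."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback=None covers both a missing section and a missing key.
    template = parser.get("data", "template_path", fallback=None)
    examples = parser.get("data", "example_data_path", fallback=None)
    return template, examples


template_path, example_data_path = read_data_paths(SITE_CFG)
```

A packager could then point both entries at system-wide directories,
so that neither $HOME nor a network connection is needed.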

> If the tests ran on a couple small test files committed to the
> repository, these 2 problems would be solved.

In the data-refactor branch, we have changed the tests so that they no
longer require files that are not checked into the repo. You can now run
nipy without downloading data!

> Currently, nipy.test() is a memory hog and takes too long.

Yes, we (as a community) need to work on that. Writing good tests for
statistical algorithms is hard, but I have found that you learn a lot
about the numerical and mathematical properties of your algorithms when
trying to write robust and fast tests for them.
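
As an illustration of what I mean, here is a minimal sketch of a fast
test on small, seeded surrogate data. The one-sample t statistic is
only a toy stand-in for a real algorithm, and all function names are
made up for this example.

```python
import math
import random
import statistics


def one_sample_t(values):
    """One-sample t statistic against a null mean of zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def make_surrogate(n, effect, seed):
    """Unit-variance Gaussian noise plus a constant effect (seeded)."""
    rng = random.Random(seed)
    return [effect + rng.gauss(0.0, 1.0) for _ in range(n)]


def test_t_detects_effect():
    # Strong effect relative to the noise: the statistic should be
    # clearly positive, with a wide margin so the test is robust.
    data = make_surrogate(n=200, effect=1.0, seed=0)
    assert one_sample_t(data) > 5.0


def test_t_is_scale_invariant():
    # A mathematical property of the statistic: rescaling the data
    # must leave it unchanged (up to floating-point error).
    data = make_surrogate(n=50, effect=0.5, seed=1)
    scaled = [10.0 * v for v in data]
    assert math.isclose(one_sample_t(data), one_sample_t(scaled), rel_tol=1e-9)
```

Both tests run in well under a second and need no data files; the
second one checks an invariance rather than a hard-coded number.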

> And it takes a while to run our tests.  Below are the results of
> running the tests on our cluster:

> numpy.test()
> Ran 2027 tests in 4.739s
> OK (KNOWNFAIL=1, SKIP=2)

> nipy.test()
> Ran 1869 tests in 58.990s
> FAILED (SKIP=1, errors=13, failures=3)

That's not fair, though: nipy is much more similar to scipy than to
numpy:

scipy.test()
Ran 3569 tests in 58.387s
FAILED (KNOWNFAIL=2, SKIP=28, errors=29, failures=79)

> The examples can rely on a larger dataset, which can be packaged
> independently (option C), but the larger dataset would not be part of
> the test suite, and therefore is not required to run the tests.

Well, Matthew and I have been leaning towards the idea that most tests
should not require data of any significant size (we can always use
surrogate data:
http://neuroimaging.scipy.org/site/doc/manual/html/neurospin/simul_activation.html
). I also favor not shipping the big datasets used for examples (e.g.
fractions of the FIAC database) with any nipy package, but downloading
them as we run the examples. That's how we do it in Mayavi:
http://code.enthought.com/projects/mayavi/docs/development/html/mayavi/auto/example_mri.html
They can be stored in a cache, though (under Unix, /var/tmp seems to
be a good place).
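
A minimal sketch of such a download-on-demand cache could look like the
following. The function name and the "downloader" argument are
hypothetical (the latter is only there so the logic can be tested
without a network); this is not the code used in Mayavi or in the
branch.

```python
import os
import urllib.request


def fetch_cached(url, cache_dir, downloader=urllib.request.urlretrieve):
    """Return a local path for url, downloading only if not cached yet."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Download under a temporary name first, so an interrupted
        # transfer never leaves a truncated file under the final name.
        partial_path = local_path + ".part"
        downloader(url, partial_path)
        os.replace(partial_path, local_path)
    return local_path
```

An example script would call it with the dataset URL and a cache
directory such as /var/tmp/nipy_data (an illustrative path), and only
the first run would hit the network.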

> I'm happy to assist in moving to this sort of a split if folks decide
> this is the right way to go.

Do you want to review our branch? We are trying to provide the tools to
address these problems. We are not claiming that we are providing the
perfect answer, just a step toward something better, in order to
iteratively solve the problem in a satisfactory way.

Gaël