Saturday, April 10, 2010

better than it is

An interesting isotope is detected in the CRU report fall-out plume. Apart from the very high concentrations of concern-troll, tone-troll, and pure drivel, there is something in it worth learning from.

For this reason, many software professionals encountering science software for the first time may be horrified. How, they ask, can we rely on this crude software, developed in primitive conditions - by amateurs, working with such poor tools and such poor understanding of the field? This is a common reaction to GISTEMP, and is exactly the reaction which many critics have had, some very publicly, to the software published with the CRU emails. Such critics do have a point. Science software should be better than it is. Scientists should be provided with more training, and more support. But consider the uses to which science software is put. Most software written by scientists:

* consists of tiny programs;
* which will only ever be run a small number of times;
* over the course of a few weeks as it is being developed;
* by the scientist who wrote it;
* on data gathered by that scientist's team;
* concerning a scientific field in which that scientist is expert;
* to perform data processing on which that scientist is expert;

and will be discarded, never to be used again, as soon as the paper containing the results is accepted for publication.

There are hardly any scientists today who don't do some programming of some sort; there's not much science that doesn't involve churning through really big data sets. As a result, there's a lot of this kind of code about. Which reminds me of this Eric Sink post from 2006, about the distinctions between "me-ware, us-ware, and them-ware". Me-ware is software that you write and only you use; us-ware is software that is used by the same organisation that produces it; them-ware is software that is produced by a software company or open-source project for the general public.

There's a gradient of difficulty; the further from you the end-user is, the less you know about their needs. On the other hand, if you're just trying to twiddle the chunks to fit through the ChunkCo Chunkstrainer without needing to buy a ChunkCo Hyperchunk, well, although you know just how big they are, you're unlikely to spend time building a pretty user interface or doing code reviews. Which only matters up to a point, because nobody else would bother solving your problem anyway.
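
To make the "me-ware" end of that gradient concrete, here is a minimal sketch of the sort of throwaway script the paragraph has in mind. The ChunkCo kit is of course the joke above; the CSV layout, the size_mm column, and the 40mm cut-off are all invented for illustration. No argument parsing, no tests, no interface, because the only user is the person who wrote it, this afternoon.

```python
import csv
import sys

MAX_CHUNK_MM = 40.0  # whatever happens to fit through the (imaginary) Chunkstrainer

with open(sys.argv[1]) as f:
    reader = csv.DictReader(f)
    # keep only the chunks small enough to strain
    kept = [row for row in reader if float(row["size_mm"]) <= MAX_CHUNK_MM]
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
```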

But this can bite you on the arse, which is what happened to the climate researchers. It's fair to say that if you're processing a scientific data set, what actually matters is the data, or the mathematical operation you want to do to it. You won't get the paper into Nature because you hacked up a really elegant list comp or whatever; they won't refuse it because the code is ugly. Anyone who wants to replicate your results will probably roll their own.
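
As a rough illustration of what "rolling their own" buys a replicator, here is a toy cross-check: the same operation (a baseline-relative anomaly, with invented numbers) implemented twice, independently, and compared. If the two disagree, either the description in the paper or one of the implementations is wrong, which is exactly the sort of thing you want to find out before publication rather than after.

```python
# Toy cross-check: the "paper" version and an independent re-implementation of the
# same operation (a baseline-relative anomaly). All the numbers are invented.
series = [14.1, 14.3, 14.0, 14.6, 14.8]   # hypothetical yearly means
baseline = sum(series[:3]) / 3            # first three years as the baseline period

def anomalies_original(xs, base):
    # the version that produced the figures in the paper
    return [x - base for x in xs]

def anomalies_reimplemented(xs, base):
    # a replicator's independent version of the same thing
    out = []
    for x in xs:
        out.append(x - base)
    return out

a = anomalies_original(series, baseline)
b = anomalies_reimplemented(series, baseline)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b)), "implementations disagree"
print(a)
```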

This is OK, but the failure mode is when the political equivalent of Brian Coat comes snooping around your #comments or lack of them. Perhaps I should tidy up the Vfeed scripts while I'm at it.

2 comments:

Adam Sampson said...

The problem with the "roll their own" approach is when your software's sufficiently buggy that it's no longer a valid implementation of the algorithms or analysis processes you've described in your paper -- so your results can't be reproduced. Nobody checks this when reviewing a paper, of course.

I work for a research project that's building a complex systems simulation framework, and as part of that we sat down and reimplemented the algorithms from a number of existing peer-reviewed papers. The failure mode above has come up about one time in three so far...

gawp said...

Agreed, it doesn't have to be brilliant code; however, as you say, pretty much everyone has to program now and it should be possible to review and validate the code. If you're claiming some algorithm, you should be able to make it clear that you have a reasonable implementation.

As such, you should have to provide clear and literate (à la Knuth) code, just like you have to provide derivations for equations in CS journals.

I believe the fact that you could have crapulent (and unpublished!) code at the core of your publication, even in Nature, is a relic of a previous age when people couldn't run or understand the code. Now pretty much everyone has access to sufficient computation (if only through something like Amazon EC2). So it's time to learn that your code is as much a part of the presentation of your idea as the peer-reviewed paper. Time to learn to code well.

The best implementation I've seen of this idea is this paper:
http://mbe.oxfordjournals.org/cgi/content/full/23/3/523
where all the data is online and the analyses (in R) can be executed online too, here: http://pbil.univ-lyon1.fr/datasets/SemonLobryDuret2005/

Scientists should aspire to this level of professionalism and transparency.

It may take a while.
