Can we guess how he's going to do it? Yes, that's right. With a big database and a stunning ignorance of the rules of statistics.
Senior police officers within Ceop - the child exploitation and online protection agency - appealed yesterday for anyone who had been on holiday in Praia de la Luz in the two weeks to May 3, the day Madeleine disappeared, to send in photographs taken in the area of the Ocean Club complex, where the McCann family was staying. Jim Gamble, chief executive of Ceop, said police were looking for pictures with people in the background who were not connected to the photographer...
Of course, we're faced with the problem of distinguishing national press crap from police crap here. But it's mental suicide to funk the data one has, Watson. Apparently, they want to collect a lot of photos from the area and run them past an image hash algorithm that compares them with a database of...well, it would seem to be their collection of Internet filth, right?
The technology, known as the Child Base, uses image recognition to analyse and compare pictures of online abuse and abusers in a fraction of the time it takes to do so manually. The system can scan and analyse 1,000 images per hour.
Officers believe whoever abducted Madeleine must have been watching children at the complex run by tour operator Mark Warner for some time. They hope by scanning holiday snaps they might be able to match up the perpetrator with their online library of paedophiles.
The flaw is obvious. The comparison sample will be dominated by a few people who took a lot of photos. Taking lots of paedophile photos and putting them on the Web is a fundamentally strange thing to do - after all, it's as good as walking into a police station with your photos. So, unless you are Cesare Lombroso, this isn't going to be very informative. The chance that any one person in their sample, drawn from the whole online world, was on the spot is not that great. Given that the hash algorithm is fallible, this is basically an automated version of my mother's principle that "His eyes are too close together - he must be guilty!" After all, the people they have culled from whatever horrible images they have on file will be representative of at least some of one group - humans. People with Internet access tend to be quite self-similar - between 15 and 65 and white, sad to say. Like people who go to posh resorts in Portugal, really.
This gives us two failure modes - either the killer doesn't look remotely like anyone on file, and therefore the search is trivially defeated, or a lot of people in the photographs look enough like the ones on file to drown the investigators in false positives. Worse, it looks like they're trying to investigate a priori this way - look for anyone who looks weird and therefore must be a suspect.
Further, it goes without saying that the upload form is not protected by SSL or STARTTLS.
Update: To put this a little more formally - given that Cesare Lombroso was wrong and not all criminals look the same, and that there is a large number of people who appear in the photos, there's a significant chance that someone in the large set of people looks like someone in the database because they are both sets of people. You have to add the false positive rate of the comparison to this, too. This actually gets worse the more photos the public send in. The chance that the killer (realistically) happens to look like someone in the database is likely to be tiny.
This illuminates the fundamental difference between starting with someone who is linked to the crime in some way and looking for them, and looking for someone the crime can be linked to in some way.
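To put rough numbers on the argument in the update, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names, the database size, the per-comparison false positive rate and the chance that the real perpetrator is both photographed and on file are all made up, and the comparisons are treated as independent, which real face matching would not be. It only shows the shape of the problem: false matches scale with the number of faces sent in, while the chance of a genuine hit does not.

```python
# Hypothetical back-of-envelope model; none of these figures come from Ceop.

def expected_false_matches(n_faces: int, db_size: int, fpr: float) -> float:
    """Expected number of innocent face/database pairs flagged as a match."""
    return n_faces * db_size * fpr


def prob_any_false_match(n_faces: int, db_size: int, fpr: float) -> float:
    """Chance of at least one false match, assuming independent comparisons."""
    return 1.0 - (1.0 - fpr) ** (n_faces * db_size)


def share_of_hits_that_are_real(p_real: float, n_faces: int,
                                db_size: int, fpr: float) -> float:
    """Ratio of expected genuine hits to all expected hits.

    p_real is the assumed probability that the perpetrator is in the photos,
    is in the database, and is correctly matched.
    """
    total = p_real + expected_false_matches(n_faces, db_size, fpr)
    return p_real / total if total > 0 else 0.0


if __name__ == "__main__":
    DB_SIZE = 5_000   # assumed number of faces on file
    FPR = 1e-6        # assumed false positive rate per comparison
    P_REAL = 0.01     # assumed chance of a genuine, detected match

    for n_faces in (1_000, 10_000, 100_000):
        print(
            n_faces,
            expected_false_matches(n_faces, DB_SIZE, FPR),
            prob_any_false_match(n_faces, DB_SIZE, FPR),
            share_of_hits_that_are_real(P_REAL, n_faces, DB_SIZE, FPR),
        )
```

Whatever figures you plug in, the last column shrinks as the number of faces grows, which is the sense in which the search gets worse the more photos the public send in.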