So I was trying to parse the London Diplomatic List (this month's edition has yet to make an appearance). Cian suggested pulling out the fontspec tags, on the grounds that they're often redundant and it might be possible to identify groups among them. So I did just that, followed by a little bit of data reduction.
25 tag declarations squash to 11 unique font/size/colour declarations. Mmm, compression. The bad news is that, for example, countries and ambassadors (or rather, chiefs of mission - not all of them are ambassadors) are in font 1 - but font 1 is actually identical to fonts 2, 7, and 8, which include diplomats' names, spouses, and styles. The good news is that at least font-grouping will help to filter the crap like lists of national days and page numbers and obvious MS Word copy-paste artefacts.
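For the curious, the squashing step can be sketched roughly like this. It assumes the fontspec tags look like the ones poppler's `pdftohtml -xml` emits, with `id`, `size`, `family` and `color` attributes; the function names and the sample XML are made up for illustration and are not the actual data or the script I ran.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_fontspecs(xml_text):
    """Map each unique (family, size, color) triple to the fontspec ids sharing it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for spec in root.iter("fontspec"):
        key = (spec.get("family"), spec.get("size"), spec.get("color"))
        groups[key].append(spec.get("id"))
    return dict(groups)


def canonical_ids(groups):
    """Map every font id to the first id seen in its group."""
    return {fid: ids[0] for ids in groups.values() for fid in ids}


# Invented sample, shaped like the assumed pdftohtml -xml output.
SAMPLE = """<pdf2xml>
<page number="1">
<fontspec id="0" size="11" family="Times" color="#000000"/>
<fontspec id="1" size="13" family="Arial" color="#000000"/>
<fontspec id="2" size="13" family="Arial" color="#000000"/>
<fontspec id="3" size="8" family="Arial" color="#808080"/>
</page>
</pdf2xml>"""

groups = group_fontspecs(SAMPLE)
aliases = canonical_ids(groups)
```

With the alias map in hand, every text element's font id can be rewritten to its group's canonical id, which makes it easy to drop whole groups (page numbers, national-day lists) in one go. It won't, of course, separate things that genuinely share a font declaration.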
(wordpress.com still eats embedded spreadsheets: here's a link.)