Thursday, November 22, 2007

Python and CSV; know your limits

Right, I've got this honking great MySQL dump file, and I'd like to use the data in it without needing a MySQL db server; so I thought I'd turn it into a SQLite db, as python has native sqlite3 support. Various suggestions are on offer around the web; SQLiteBrowser, for example, claims to import and export from various SQL flavours and CSV files. Nice; but it chokes on the file.

So I saved a copy of the thing as a plain text file with the mysql tags trimmed off and tried a few options; a posting on the UK Python list reminded me that the csv module in Python can take arbitrary characters as delimiters, not just commas, which sounded useful. After all, I couldn't just split it at the commas because the contents are basically a lot of tuples, like this: (data, data, data),(data, data, data) And I need them in groups.

I thought I was being clever when I did a global find/replace, taking out the ),( because the csv module doesn't support multiple characters as delimiters, and replacing it with \t; then I wrote this script:
#! usr/bin/env python

import sqlite3
import csv

f = open('/home/yorksranter/Documents/Geekery/airports.txt')
csv.field_size_limit(100000) #see below!
input = csv.reader(f, delimiter='\t')
conn = sqlite3.connect('/home/yorksranter/Desktop/airport.sql')
curse = conn.cursor()

curse.execute('''CREATE TABLE airports ('id', 'country', 'latitude', 'longitude', 'name', 'timezone', 'shortname')''')

for item in input:
........curse.execute('INSERT INTO airports VALUES (%s)' % item)
Each item should be a tuple of seven values, and they should be in the same order they were in the original db, so this ought to recreate the data in an SQLite 3 file.

Then my problems began; I got the following error message:
_csv.error field larger than field limit
. Google found me this and this; it seems as far as I understand that the DB is too big for the csv module; there does seem to be a way of altering the limit, going by the module source code.

Thoughts? Update: There is; csv.field_size_limit(), and I altered it until the thing ran properly; but there's still no data in the db!


Duane said...

I'm no pythonista and the docs I find with a quick google look incomplete, but a few thoughts nonetheless:

1) You should probably be using SQL parameters instead of a formatted string. It will be more robust in the presence of dodgy data. Exactly what does item.str() return anyway? Is it properly quoted and escaped?

2) Does execute return an error in case of failure? If so, check it. If not, perhaps check whether the INSERT had an effect by checking conn.total_changes or something.

3) From the commit at the end (and a quick look at the docs) it appears those INSERTs are all happening in a single transaction. Perhaps that is getting too large and hitting limits somewhere. Try using autocommit mode or committing every n INSERTs.

David said...

Another possibility: the string you're passing is of the form "INSERT INTO a VALUES (['1', '2', '3', '4', '5', '6', '7'])", and the brackets might be confusing sqlite.

curse.execute('INSERT INTO airports VALUES (?,?,?,?,?,?,?)', item)

kostenloser Counter