The world changes when you get past a few million rows in a database table. Mine changed because of a bug that reprocessed the same data over and over, causing a table that should have held a few million rows to grow way out of proportion. The duplicate inserts happened because the table had no unique index. (Why the table lacked one is another topic altogether.) Forget about the bug itself; it has been fixed. Pruning the duplicate data, though, was a separate job.
The application that uses this database is written in Python. My first thought was to open the offending tables one at a time and, for each record, look up its duplicates, delete all of them, and re-insert a single preserved copy. Well, that did not work: MySQLdb blew up. I have never had a problem with MySQLdb before; it is well written, but it probably was not written to handle a full select on fifty-five million rows or so.
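In hindsight, the full select was the core problem: by default the client buffers the entire result set in memory. A hedged sketch of the batched alternative, written against the stdlib sqlite3 module so it runs anywhere (the table name and batch size are illustrative, not from the original code):

```python
import sqlite3

def iter_rows(conn, table, batch_size=10000):
    """Yield rows in fixed-size batches instead of one huge fetchall()."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM %s" % table)  # assumes a trusted table name
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row
```

With MySQLdb specifically, batching on the client is not enough on its own; you would also use `MySQLdb.cursors.SSCursor` so the server streams rows instead of the client library buffering the whole result set.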
It finally dawned on me: MySQL is a great database, and it was working fine; my approach was the problem. I dumped the two tables that contained duplicates into .csv files, merged the two files into one, and wrote a small Python program using the csv module to read the merged file and write out exactly one copy of each full record. I used a temporary lookup table to track each record the first time it appeared: if a record was not already in the lookup table, I wrote one copy to the output .csv file; if it was already recorded, I skipped it.
What lessons did I learn from this?
- If you have robust logging with email alerts available, use it. I would have seen months ago that files were being reprocessed.
- Use unique indexes wherever you can, and take the time to find out why inserts fail once the index is unique. I took a shortcut because the project was in full swing, even though it would not have hurt me if the database applications had sat idle for a few days.
- Be careful around large database tables. Operations that are trivial at thousands of rows can fall over at tens of millions.
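On the unique-index lesson: the failure mode is cheap to reproduce. A minimal sketch using the stdlib sqlite3 module for portability (with MySQLdb the duplicate insert raises `MySQLdb.IntegrityError` in the same way; the table and file names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_files (name TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_name ON processed_files (name)")

conn.execute("INSERT INTO processed_files VALUES ('batch_001.csv')")
try:
    # A reprocessed file now fails loudly instead of silently duplicating.
    conn.execute("INSERT INTO processed_files VALUES ('batch_001.csv')")
    duplicate_caught = False
except sqlite3.IntegrityError:
    duplicate_caught = True
```

Handling that exception (log it, email it, skip the file) is the small upfront cost that would have prevented the whole cleanup.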