Sometimes a MySQL dump file contains UTF-8 text that was treated as latin-1 somewhere along the way and re-encoded, so the text ends up double-encoded. Each character outside the ASCII range is then garbled and displayed as two or three characters. This is called mojibake. E.g. “L’Oreal” instead of “L’Oreal”.
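As a minimal sketch of how this double encoding comes about (the string here is just an example): Python can reproduce it by encoding a string as UTF-8 and decoding the bytes as cp1252, which is the character set MySQL actually uses for its latin1.

# The UTF-8 bytes of the curly quote, misread as cp1252 (MySQL's "latin1")
text = "L’Oreal"
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)  # prints: L’Oreal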
When you dump such a database using:
mysqldump --opt -h localhost -u dbuser -p db > dump.sql
You get the mojibake in this dump file.
How to correct the mojibake
Luckily there is a Python 3 tool called ftfy that removes mojibake and replaces it with the originally intended characters. Ftfy is a nifty piece of programming and educated guesswork, because ftfy must guess what the original encoding was and which transformation to apply to the dump file. To make this guessing work, ftfy looks for encoding mistakes line by line. However, this approach does not work with standard mysqldump files, because they can contain extremely long lines. The guessing algorithm assumes that lines are reasonably short, so on very long lines it cannot effectively detect wrong character sequences.
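To see the problem, you can check the longest line in your dump with a small ad-hoc script (assuming the dump is named dump.sql); with --opt, a single extended INSERT statement holding all rows of a table can easily run to megabytes:

# Find the longest line in the dump; extended INSERTs can be megabytes long
with open('dump.sql', encoding='utf-8', errors='replace') as f:
    longest = max((len(line) for line in f), default=0)
print('longest line:', longest, 'characters')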
The right mysqldump
Reduce the length of the lines in the dump by using the command line parameter --skip-extended-insert. With this parameter mysqldump writes one INSERT statement per row, each on its own line, instead of one huge INSERT per table. The drawback is that dumping and restoring the database is slower.
mysqldump --opt --skip-extended-insert -h localhost -u dbuser -p db > dump.sql
This dump still contains the mojibake and is now ready to be processed further.
Using ftfy
First install ftfy:
virtualenv -p python3 venv
source ./venv/bin/activate
pip install ftfy
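Once installed, you can do a quick sanity check on a single garbled value (using the example string from above) before converting the whole dump:

import ftfy
# uncurl_quotes=False keeps the curly quote instead of replacing it with '
print(ftfy.fix_text("L’Oreal", uncurl_quotes=False))  # prints: L’Oreal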
Then prepare a Python script, e.g. dbconvert.py, that calls ftfy on your MySQL dump file (with thanks to Pielo):
import ftfy

# Open the dump with mojibake for reading and a new file for the fixed output
input_file = open('dump.sql', 'r', encoding='utf-8')
output_file = open('dump.utf8.sql', 'w', encoding='utf-8')

# Create fixed output stream; ftfy.fix_file yields the fixed text line by line
# (keyword names as used by ftfy 5.x; newer ftfy versions renamed some options)
stream = ftfy.fix_file(
    input_file,
    encoding=None,
    fix_entities=False,
    remove_terminal_escapes=False,
    fix_encoding=True,
    fix_latin_ligatures=False,
    fix_character_width=False,
    uncurl_quotes=False,
    fix_line_breaks=False,
    fix_surrogates=False,
    remove_control_chars=False,
    remove_bom=False,
    normalization='NFC'
)

# Save the stream to the output file
for line in stream:
    output_file.write(line)

input_file.close()
output_file.close()
Then you just need to call:
python dbconvert.py
Thereafter you can restore the fixed dump file into MySQL:
mysql -h localhost -u dbuser -p db < dump.utf8.sql
Beware of duplicates introduced by removing mojibake in SQL
Sometimes removing the mojibake can result in duplicate rows in the database, even where a UNIQUE KEY constraint or UNIQUE index exists. The reason is that two different character sequences in the dump can be fixed by ftfy to the same corrected text. Rows that were distinct before the fix then become identical, which leads to duplicates or errors when restoring.
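A small illustration of this effect (the strings are just examples): one row holds the correct text, another holds its mojibake twin, and after fixing both rows carry the same value.

import ftfy
# One correct row and one mojibake row -- distinct strings in the dump
rows = ["L’Oreal", "L’Oreal"]
fixed = [ftfy.fix_text(r, uncurl_quotes=False) for r in rows]
print(fixed)  # both are now "L’Oreal": a UNIQUE index on this column breaks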