Tag: character encoding

mysqldump in PowerShell corrupts non-Latin characters when exporting database (SOLVED)

mysqldump is a MySQL utility for creating database and table backups. Unlike phpMyAdmin, which, although it offers a web interface, is a slower tool due to the limitations of intermediates such as PHP and Apache, mysqldump is a much more efficient tool without limitations for backing up very large data.

But on Windows, mysqldump has some nuances. Due to the peculiarities of PowerShell for working with encodings, all non-Latin characters can be corrupted in exported databases. This issue is not seen in CMD, but recent versions of Windows use PowerShell by default, so the issue in question affects all users who run mysqldump to back up databases in Windows.

The following command, executed in PowerShell 7:

.\mysqldump.exe -u root --all-databases > all-databases_ps7.sql

Creates a UTF-8 encoded all-databases_ps7.sql file into which all MySQL databases will be exported using mysqldump (for backup purposes). But in these databases all non-Latin characters will be irretrievably corrupted!

That is, instead of Cyrillic, it will be something like this:

'╨Р╤А╨▒╨╕╤В╤А╨░╨╢╨╜╤Л╨╣ ╨┐╤А╨╛╤Ж╨╡╤Б╤Б: ╤Г╤З╨╡╨▒╨╜╨╕╨║ / ╨Ъ.╨Ь. ╨Р╤А╤Б╨╗╨░╨╜╨╛╨▓, ╨Ф.╨е. ╨Т╨░╨╗╨╡╨╡╨▓, ╨а.╨Э. ╨У╨╕╨╝╨░╨╖╨

To avoid this problem, use mysqldump with the --result-file option. The following command will save the database in the correct encoding:

.\mysqldump.exe -u root --all-databases --result-file=all-databases.sql

You can also use the following two-command construct to fix the encoding problem:

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("UTF-8")
.\mysqldump.exe -u root --all-databases > all-databases.sql

Output encoding issues in PowerShell and third-party utilities running in PowerShell (SOLVED)

What encoding is used in PowerShell by default. How to change the default output encoding to UTF-8 in PowerShell

If you run the following command in PowerShell 5:

"Testing" > test.file

And check the encoding in the newly created test.file, it turns out that it is UTF-16LE.

If you run the following command in PowerShell 7:

"Testing" > test.file

And check the encoding in the newly created test.file, it turns out that it is UTF-8.

The following command, executed in PowerShell 5:

.\mysqldump.exe -u root --all-databases > all-databases_ps5.sql

Creates a UTF-16LE encoded all-databases_ps5.sql file into which all MySQL databases will be exported using mysqldump (for backup purposes). In these databases, ALL non-Latin characters will be irretrievably corrupted!

That is, instead of Cyrillic, it will be something like this:

'╨Р╤А╨▒╨╕╤В╤А╨░╨╢╨╜╤Л╨╣ ╨┐╤А╨╛╤Ж╨╡╤Б╤Б: ╤Г╤З╨╡╨▒╨╜╨╕╨║ / ╨Ъ.╨Ь. ╨Р╤А╤Б╨╗╨░╨╜╨╛╨▓, ╨Ф.╨е. ╨Т╨░╨╗╨╡╨╡╨▓, ╨а.╨Э. ╨У╨╕╨╝╨░╨╖╨

The following command, executed in PowerShell 7:

.\mysqldump.exe -u root --all-databases > all-databases_ps7.sql

Creates a UTF-8 encoded all-databases_ps7.sql file into which all MySQL databases will be exported using mysqldump (for backup purposes). But in these databases, ALL non-Latin characters will be irretrievably corrupted AGAIN!

That is, it would seem that the default encoding has changed to UTF-8, but the problem with completely corrupted database backups has not gone away.

How to save output from third-party programs in UTF-8 encoding in PowerShell

The above behavior, which corrupts the output of commands in PowerShell, is not acceptable. Let's look at how to ensure that output in PowerShell is saved in UTF-8 encoding.

How to save PowerShell output in UTF-8 encoding, use the Out-File cmdlet

Consider the following command:

.\mysqldump.exe -u root --all-databases > all-databases.sql

As already shown above, it corrupts non-Latin characters due to incorrect encoding.

In some sources, as a solution to the problem, it is recommended to replace the “>” character with “Out-File”, and also specify the encoding using the “-Encoding UTF8” option. That is, use the following command:

.\mysqldump.exe -u root --all-databases | Out-File -Encoding UTF8 all-databases_fixed.sql

In fact, both in PowerShell 7 and in PowerShell 5, this command changes little. First, the command is analogous to the output redirection symbol “>”. Secondly, by default, the Out-File cmdlet uses UTF-8 encoding, that is, it is not necessary to specify it specifically.

But the most important thing is that despite the fact that the data is saved to a file with UTF-8 encoding (in the previous command, this is the all-databases_fixed.sql file), non-Latin characters in this file are still corrupted! The thing is that the Out-File cmdlet initially processes the received data in the wrong encoding. Therefore, it no longer matters how exactly Out-File saves the data – the data is corrupted already at the time of receipt in this cmdlet.

The problem was solved by explicitly specifying the encoding for the received data:

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("UTF-8")

The subsequent run of the command saved all the data in the correct encoding:

.\mysqldump.exe -u root --all-databases | Out-File -Encoding UTF8 all-databases_fixed.sql

This method works equally well in PowerShell 7 and PowerShell 5.

By the way, the original command using this method also saves the data in the correct encoding:

[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("UTF-8")
.\mysqldump.exe -u root --all-databases > all-databases.sql

Saving output by third-party utilities

In all previous commands, we redirected the output of the mysqldump utility to a file, or to the Out-File cmdlet. But the mysqldump utility has a --result-file option after which you can specify a filename to output. That is, as a result of using this option, you do not need to use output redirection or PowerShell cmdlets.

The following command will save the database in the correct encoding:

.\mysqldump.exe -u root --all-databases --result-file=all-databases.sql

How to find and remove non-UTF-8 characters from a text file

Filtering invalid UTF-8 characters

Files that, in addition to ordinary characters, contain characters that are invalid from the UTF-8 point of view, cause a problem both when they are processed by utilities and when opened in text editors.

An example of an error in Python 3 when trying to process a file with non-UTF-8 characters:

[-] Exception as following:

Traceback (most recent call last):
  File "/home/mial/bin/pydictor/lib/fun/decorator.py", line 21, in magic
    for item in unique(func()):
  File "/home/mial/bin/pydictor/lib/fun/fun.py", line 30, in unique
    for item in seq:
  File "/home/mial/bin/pydictor/tools/uniqifer.py", line 31, in uniqifer
    for item in o_f.readlines():
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
None

An example of an error in Perl:

Malformed UTF-8 character (fatal)

Sample message when trying to open in gedit text editor:

There was a problem opening the file. The file you opened has some invalid characters. If you continue editing this file you could corrupt this document. You can choose another character encoding and try again.

How to find non-UTF-8 characters in a file

To find lines with unreadable characters, use a command like this:

grep -axv '.*' FILE

In the UTF-8 locale, you will get strings containing at least one invalid UTF-8 sequence (at least this works with GNU Grep).

As you can see in the screenshot, there are indeed strange symbols.

If you check the contents of these files manually, you will also find some strange characters in them.

Is there a way to remove them all automatically?

What are non-UTF-8 characters?

You might think that all the characters in a well-formed UTF-8 string are UTF-8 (actually Unicode) characters! Some of them are UTF-8 encoded in several consecutive bytes.

But in fact, there are malformed UTF-8 characters. This means that a byte has appeared that cannot be part of a valid UTF-8 file. It is not difficult; it could be 0xC0 or 0xC1 byte, or 0xF5..0xFF, or a sequence problem with bytes that would otherwise be valid.

How to automatically remove garbled characters (not UTF-8) from a file

This command:

iconv -f utf-8 -t utf-8 -c FILE.txt

will clean up your UTF-8 file by skipping any invalid characters.

  • -f - original format
  • -t - target format
  • -c - skips any invalid sequence

On Mac, use the command:

iconv -f utf-8 -t utf-8 -c file.txt

That is, a hyphen is needed between 'f' and '8'.

By default, the cleared data will be written to standard output. Therefore, use one of the following commands to save the results to a new file:

iconv -f utf-8 -t utf-8 -c FILE.txt -o NEW_FILE.txt
iconv -f utf-8 -t utf-8 -c FILE.txt > NEW_FILE.txt

Examples:

iconv -f utf-8 -t utf-8 -c ~/rockyou.txt > ~/rockyou_clean.txt
iconv -f utf-8 -t utf-8 -c ~/rockyou.txt -o ~/rockyou_clean.txt
Loading...
X