Determining text encoding in PHP instead of mb_detect_encoding

There are several encodings of Cyrillic characters.

When building websites, the usual choices are UTF-8 and Windows-1251.

Other encodings you still run into: KOI8-R, ISO-8859-5, IBM866, MacCyrillic.

This is probably not an exhaustive list; these are just the encodings I encounter most often.

Sometimes you need to determine the encoding of a text. And PHP even has a function for this:

mb_detect_encoding()
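For example (the candidate list and the strict flag here are my own choice for illustration):

```php
<?php

$text = 'Какой-то текст в неизвестной кодировке';

// Try to detect among the encodings we expect to meet.
// With the third argument set to true, mb_detect_encoding()
// returns false instead of guessing when nothing matches.
$encoding = mb_detect_encoding($text, ['UTF-8', 'Windows-1251', 'KOI8-R'], true);

var_dump($encoding); // "UTF-8", "Windows-1251", "KOI8-R" or false
```

The catch: in single-byte Cyrillic encodings almost any byte sequence is formally valid, so the function has little to go on.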

But, as m00t wrote in the article "Detecting text encoding in PHP: an overview of existing solutions plus another bicycle", in short, it does not work.

After reading m00t's articles, I was not inspired by his method and found a different solution: "Determining the text encoding in PHP and Python". As m00t would put it, "character codes again".

I tested that character-code detection function, the results satisfied me, and I used the function for a couple of years.
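The idea behind detection by character codes, in simplified form: in each single-byte Cyrillic encoding the lowercase letters occupy a different byte range, so you can count which range dominates. A rough sketch of the approach (the ranges come from the encoding tables; the function itself is my illustration, not the actual code I used):

```php
<?php

/**
 * Simplified sketch: count the bytes falling into the lowercase
 * Cyrillic range of each encoding and pick the encoding with
 * the most hits.
 */
function detectByCharCodes(string $text): string
{
    // Lowercase Cyrillic ranges in the respective code tables:
    $ranges = [
        'windows-1251' => [0xE0, 0xFF], // а-я
        'koi8-r'       => [0xC0, 0xDF], // lowercase block
        'iso-8859-5'   => [0xD0, 0xEF], // а-я
    ];

    $scores = array_fill_keys(array_keys($ranges), 0);

    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        $byte = ord($text[$i]);
        foreach ($ranges as $encoding => [$min, $max]) {
            if ($byte >= $min && $byte <= $max) {
                $scores[$encoding]++;
            }
        }
    }

    arsort($scores);

    return (string) array_key_first($scores);
}
```

Note how the ranges overlap (0xE0 to 0xEF is lowercase both in windows-1251 and in iso-8859-5); that overlap is what limits accuracy on short strings.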

Recently I decided to rewrite the project where I used this function, and found a ready-made package on packagist.org, cnpait/detect_encoding, in which the encoding is detected using m00t's method.

That package has been installed more than 1200 times, so I am clearly not the only one who periodically needs to detect text encoding.

I could have just installed that package and called it a day, but I decided to go the extra mile.

In the end, I made my own package: onnov/detect-encoding.

How to use it is described in its README.md.
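I will not retell the README here; hypothetically, using a Composer package like this looks something like the sketch below. The class and method names are my guesses based on the package name, so check the README for the actual API:

```php
<?php

use Onnov\DetectEncoding\EncodingDetector;

require __DIR__ . '/vendor/autoload.php';

// Hypothetical usage; see the package README for the real API.
$detector = new EncodingDetector();
$encoding = $detector->getEncoding('Текст в неизвестной кодировке');
```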

Below is how I tested it and compared it with the cnpait/detect_encoding package.

Testing methodology


Take a large text: Tolstoy, "Anna Karenina".
In total: 1'701'480 characters.

Remove everything unnecessary, leaving only the Cyrillic letters:

 $text = preg_replace('/[^а-яё]/ui', '', $text); 

This leaves 1'336'252 Cyrillic characters.

In a loop, we take a piece of the text (5, 15, 30, ... characters), convert it to a known encoding, and try to detect that encoding with the script, then check whether the answer is correct. A sketch of such a loop is shown below.
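Something like this (a sketch; detectEncoding() stands in for whichever detector is being tested, and the encoding names are the ones iconv understands):

```php
<?php

$lengths   = [5, 15, 30, 60, 120, 180, 270];
$encodings = ['Windows-1251', 'KOI8-R', 'ISO-8859-5', 'IBM866'];

$textLength = mb_strlen($text, 'UTF-8');

foreach ($encodings as $target) {
    foreach ($lengths as $length) {
        $hits = $total = 0;

        // Walk over the text in non-overlapping chunks of $length characters.
        for ($offset = 0; $offset + $length <= $textLength; $offset += $length) {
            $chunk   = mb_substr($text, $offset, $length, 'UTF-8');
            $encoded = iconv('UTF-8', $target, $chunk);

            if (strcasecmp(detectEncoding($encoded), $target) === 0) {
                $hits++;
            }
            $total++;
        }

        printf("%s, %d chars: %.2f%%\n", $target, $length, 100 * $hits / $total);
    }
}
```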

Here is a table: the encoding is on the left, the number of characters used for detection is across the top, and the cells show detection reliability in %:
letters ->      5      15     30     60     120    180    270
windows-1251    99.13  98.83  98.54  99.04  99.73  99.93  100.0
koi8-r          99.89  99.98  100.0  100.0  100.0  100.0  100.0
iso-8859-5      81.79  99.27  99.98  100.0  100.0  100.0  100.0
ibm866          99.81  99.99  100.0  100.0  100.0  100.0  100.0
mac-cyrillic    12.79  47.49  73.48  92.15  99.30  99.94  100.0

The worst accuracy is with mac-cyrillic: you need at least 60 characters to detect this encoding with 92.15% accuracy. The windows-1251 encoding is also somewhat less accurate than the rest. This is because the character codes of these encodings overlap heavily in the code tables.

Fortunately, the mac-cyrillic and ibm866 encodings are not used to encode web pages.

Let's try without them:
letters ->      5      10     15     30     60
windows-1251    99.40  99.69  99.86  99.97  100.0
koi8-r          99.89  99.98  99.98  100.0  100.0
iso-8859-5      81.79  96.41  99.27  99.98  100.0

Detection accuracy is high even on short strings of 5 to 10 letters, and for strings of 60 letters it reaches 100%. Detection is also very fast: a text of more than 1,300,000 Cyrillic characters is checked in 0.00096 seconds (on my computer).
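The timing itself is easy to reproduce (a sketch; detectEncoding() is again a stand-in for the detector being measured):

```php
<?php

// Rough timing of one detection pass over the whole text.
$start = microtime(true);

$encoding = detectEncoding($text);

printf("Detected %s in %.5f s\n", $encoding, microtime(true) - $start);
```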

Now let's see what results the statistical method described by m00t gives:
letters ->      5      10     15     30     60
windows-1251    88.75  96.62  98.43  99.90  100.0
koi8-r          85.15  95.71  97.96  99.91  100.0
iso-8859-5      88.60  96.77  98.58  99.93  100.0

As you can see, the detection results are good. The script is fast on short texts, but on huge texts it is noticeably slower: a text of more than 1,300,000 Cyrillic characters is checked in 0.32 seconds (on my computer).

My findings

Both methods detect the encoding reliably, even on short strings. The character-code method is noticeably faster on large texts (0.00096 s versus 0.32 s on the same text).

Which method to use is up to you. In principle, you can use both at once.

Source: https://habr.com/ru/post/466113/

