Determining text encoding in PHP instead of mb_detect_encoding

There are several encodings of Cyrillic characters.

When building websites, the usual choices are UTF-8 and Windows-1251.

Other encodings you still run into: KOI8-R, ISO-8859-5, IBM866, MacCyrillic.

This is probably not an exhaustive list; these are just the encodings I encounter most often.

Sometimes you need to determine the encoding of a text. And PHP even has a function for this:

mb_detect_encoding()
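For example (the candidate list and the strict flag here are my own choice for illustration):

```php
<?php

$text = 'Какой-то текст в неизвестной кодировке';

// Try to detect among the encodings we expect to meet.
// With the third argument set to true, mb_detect_encoding()
// returns false instead of guessing when nothing matches.
$encoding = mb_detect_encoding($text, ['UTF-8', 'Windows-1251', 'KOI8-R'], true);

var_dump($encoding); // "UTF-8", "Windows-1251", "KOI8-R" or false
```

The catch: in single-byte Cyrillic encodings almost any byte sequence is formally valid, so the function has little to go on.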

But, as m00t wrote in the article "Detecting text encoding in PHP: an overview of existing solutions plus another bicycle", in short, it does not work.

After reading m00t's articles, I was not inspired by his method and found a different solution: "Determining the text encoding in PHP and Python". As m00t would put it, "character codes again".

I tested that character-code detection function, the results satisfied me, and I used the function for a couple of years.
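The idea behind detection by character codes, in simplified form: in each single-byte Cyrillic encoding the lowercase letters occupy a different byte range, so you can count which range dominates. A rough sketch of the approach (the ranges come from the encoding tables; the function itself is my illustration, not the actual code I used):

```php
<?php

/**
 * Simplified sketch: count the bytes falling into the lowercase
 * Cyrillic range of each encoding and pick the encoding with
 * the most hits.
 */
function detectByCharCodes(string $text): string
{
    // Lowercase Cyrillic ranges in the respective code tables:
    $ranges = [
        'windows-1251' => [0xE0, 0xFF], // а-я
        'koi8-r'       => [0xC0, 0xDF], // lowercase block
        'iso-8859-5'   => [0xD0, 0xEF], // а-я
    ];

    $scores = array_fill_keys(array_keys($ranges), 0);

    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        $byte = ord($text[$i]);
        foreach ($ranges as $encoding => [$min, $max]) {
            if ($byte >= $min && $byte <= $max) {
                $scores[$encoding]++;
            }
        }
    }

    arsort($scores);

    return (string) array_key_first($scores);
}
```

Note how the ranges overlap (0xE0 to 0xEF is lowercase both in windows-1251 and in iso-8859-5); that overlap is what limits accuracy on short strings.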

Recently I decided to rewrite the project where I used this function, and found a ready-made package on packagist.org, cnpait/detect_encoding, in which the encoding is detected using m00t's method.

That package has been installed more than 1200 times, so I am clearly not the only one who periodically needs to detect text encoding.

I could have just installed that package and called it a day, but I decided to go the extra mile.

In the end, I made my own package: onnov/detect-encoding.

How to use it is described in its README.md.
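I will not retell the README here; hypothetically, using a Composer package like this looks something like the sketch below. The class and method names are my guesses based on the package name, so check the README for the actual API:

```php
<?php

use Onnov\DetectEncoding\EncodingDetector;

require __DIR__ . '/vendor/autoload.php';

// Hypothetical usage; see the package README for the real API.
$detector = new EncodingDetector();
$encoding = $detector->getEncoding('Текст в неизвестной кодировке');
```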

Below is how I tested it and compared it with the cnpait/detect_encoding package.

Testing methodology


Take a large text: Tolstoy, "Anna Karenina".
In total: 1'701'480 characters.

Remove everything unnecessary, leaving only the Cyrillic letters:

 $text = preg_replace('/[^а-яё]/ui', '', $text); 

This leaves 1'336'252 Cyrillic characters.

In a loop, we take a piece of the text (5, 15, 30, ... characters), convert it to a known encoding, and try to detect that encoding with the script, then check whether the answer is correct. A sketch of such a loop is shown below.
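Something like this (a sketch; detectEncoding() stands in for whichever detector is being tested, and the encoding names are the ones iconv understands):

```php
<?php

$lengths   = [5, 15, 30, 60, 120, 180, 270];
$encodings = ['Windows-1251', 'KOI8-R', 'ISO-8859-5', 'IBM866'];

$textLength = mb_strlen($text, 'UTF-8');

foreach ($encodings as $target) {
    foreach ($lengths as $length) {
        $hits = $total = 0;

        // Walk over the text in non-overlapping chunks of $length characters.
        for ($offset = 0; $offset + $length <= $textLength; $offset += $length) {
            $chunk   = mb_substr($text, $offset, $length, 'UTF-8');
            $encoded = iconv('UTF-8', $target, $chunk);

            if (strcasecmp(detectEncoding($encoded), $target) === 0) {
                $hits++;
            }
            $total++;
        }

        printf("%s, %d chars: %.2f%%\n", $target, $length, 100 * $hits / $total);
    }
}
```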

Here is a table: the encoding is on the left, the number of characters used for detection is across the top, and the cells show detection reliability in %:
letters ->      5      15     30     60     120    180    270
windows-1251    99.13  98.83  98.54  99.04  99.73  99.93  100.0
koi8-r          99.89  99.98  100.0  100.0  100.0  100.0  100.0
iso-8859-5      81.79  99.27  99.98  100.0  100.0  100.0  100.0
ibm866          99.81  99.99  100.0  100.0  100.0  100.0  100.0
mac-cyrillic    12.79  47.49  73.48  92.15  99.30  99.94  100.0

The worst accuracy is with mac-cyrillic: you need at least 60 characters to detect this encoding with 92.15% accuracy. The windows-1251 encoding is also somewhat less accurate than the rest. This is because the character codes of these encodings overlap heavily in the code tables.

Fortunately, the mac-cyrillic and ibm866 encodings are not used to encode web pages.

Let's try without them:
letters ->      5      10     15     30     60
windows-1251    99.40  99.69  99.86  99.97  100.0
koi8-r          99.89  99.98  99.98  100.0  100.0
iso-8859-5      81.79  96.41  99.27  99.98  100.0

Detection accuracy is high even on short strings of 5 to 10 letters, and for strings of 60 letters it reaches 100%. Detection is also very fast: a text of more than 1,300,000 Cyrillic characters is checked in 0.00096 seconds (on my computer).
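The timing itself is easy to reproduce (a sketch; detectEncoding() is again a stand-in for the detector being measured):

```php
<?php

// Rough timing of one detection pass over the whole text.
$start = microtime(true);

$encoding = detectEncoding($text);

printf("Detected %s in %.5f s\n", $encoding, microtime(true) - $start);
```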

Now let's see what results the statistical method described by m00t gives:
letters ->      5      10     15     30     60
windows-1251    88.75  96.62  98.43  99.90  100.0
koi8-r          85.15  95.71  97.96  99.91  100.0
iso-8859-5      88.60  96.77  98.58  99.93  100.0

As you can see, the detection results are good. The script is fast on short texts, but on huge texts it is noticeably slower: a text of more than 1,300,000 Cyrillic characters is checked in 0.32 seconds (on my computer).

My findings

Both methods detect the encoding reliably, even on short strings. The character-code method is noticeably faster on large texts (0.00096 s versus 0.32 s on the same text).

Which method to use is up to you. In principle, you can use both at once.

Source: https://habr.com/ru/post/466113/

