Here's an example of CAPTCHA recognition using PHP. The recognition process consists of image binarization, noise reduction, compensation, segmentation, skew correction, database construction, and matching. Sample code that can run the recognition directly is provided at the end.
The CAPTCHA to be recognized is relatively simple: the characters do not overlap, but they may be bolded to varying degrees and skewed by roughly 0 to 30 degrees, and the number of characters varies between 4 and 5. CAPTCHA recognition is generally relatively simple in Python. For further reference, you can read the following articles:
CAPTCHA Recognition in Qiangzhi Education System using OpenCV
CAPTCHA Recognition in Qiangzhi Education System using Tensorflow CNN
Images are composed of individual pixels, each with quantifiable RGB color values. Based on the colors used in the CAPTCHA, thresholds on the three color channels are tuned to separate the background from the characters, setting background pixels to 1 and character pixels to 0.
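As a minimal sketch of this step, the function below binarizes a grid of RGB triples. The function name, the array layout, and the single cutoff of 128 are assumptions made for illustration; the original thresholds are not given here and would have to be tuned to the real CAPTCHA colors.

```php
<?php
/**
 * Binarize an RGB pixel grid: background => 1, character => 0.
 * $pixels is a 2D array of [r, g, b] triples (hypothetical layout).
 * $threshold is an illustrative per-channel cutoff, not the author's value.
 */
function binarize(array $pixels, int $threshold = 128): array
{
    $out = [];
    foreach ($pixels as $y => $row) {
        foreach ($row as $x => [$r, $g, $b]) {
            // Assumption: character pixels are darker than the background
            // in all three channels.
            $isChar = $r < $threshold && $g < $threshold && $b < $threshold;
            $out[$y][$x] = $isChar ? 0 : 1;
        }
    }
    return $out;
}
```

With the GD extension, the RGB triples could be read from a truecolor image via imagecolorat(), whose return value packs the channels as ($rgb >> 16) & 0xFF, ($rgb >> 8) & 0xFF, and $rgb & 0xFF.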
CAPTCHAs often contain noise, typically isolated points or, occasionally, interference lines one pixel wide, and both need to be eliminated. My approach is to examine the four pixels surrounding each pixel: if two or more of them are background (i.e., 1), the pixel is considered noise and set to background.
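The rule above can be sketched as follows. This is an illustration under two assumptions: the "four surrounding pixels" are taken to be the ones directly above, below, left, and right, and pixels outside the image are treated as background. The function names are hypothetical.

```php
<?php
/** Count how many of the four direct neighbors of ($y, $x) equal $value. */
function countNeighbors(array $grid, int $y, int $x, int $value): int
{
    $count = 0;
    foreach ([[-1, 0], [1, 0], [0, -1], [0, 1]] as [$dy, $dx]) {
        // Assumption: anything outside the image counts as background (1).
        $neighbor = $grid[$y + $dy][$x + $dx] ?? 1;
        if ($neighbor === $value) {
            $count++;
        }
    }
    return $count;
}

/** Turn character pixels with two or more background neighbors into background. */
function denoise(array $grid): array
{
    $out = $grid; // read from $grid, write to $out, so changes do not cascade
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $pixel) {
            if ($pixel === 0 && countNeighbors($grid, $y, $x, 1) >= 2) {
                $out[$y][$x] = 1;
            }
        }
    }
    return $out;
}
```

Note that, as written, this rule also removes strokes that are one pixel wide and the outer corners of thicker strokes, so it suits bold fonts better than thin ones.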
During binarization, some small character pixels are inevitably filtered into the background, so the characters need to be compensated. I used the same four-neighbor approach: if two or more of the surrounding pixels are character pixels (i.e., 0), the pixel is considered a character pixel and set to 0.
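A corresponding sketch of the compensation pass is shown below, under the same assumptions as before (direct up, down, left, and right neighbors; out-of-image pixels count as background). The function name is hypothetical.

```php
<?php
/** Turn background pixels with two or more character neighbors into character pixels. */
function compensate(array $grid): array
{
    $offsets = [[-1, 0], [1, 0], [0, -1], [0, 1]];
    $out = $grid; // read from $grid, write to $out
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $pixel) {
            if ($pixel !== 1) {
                continue; // only background pixels can be filled in
            }
            $charNeighbors = 0;
            foreach ($offsets as [$dy, $dx]) {
                // Assumption: pixels outside the image are background (1).
                if (($grid[$y + $dy][$x + $dx] ?? 1) === 0) {
                    $charNeighbors++;
                }
            }
            if ($charNeighbors >= 2) {
                $out[$y][$x] = 0;
            }
        }
    }
    return $out;
}
```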
Because the characters in this CAPTCHA are not connected, segmentation is relatively simple. Scanning column by column, I record the start and end columns of each character, cut the characters out, and place them into an array. The blank rows are then removed in the same way: the first and last rows containing a '0' are located and each character is cropped to them, keeping only the character itself.
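The column scan and row trimming could look roughly like this. It is a sketch that assumes a rectangular grid of integers with 0 for character pixels; both function names are hypothetical.

```php
<?php
/** Split a binarized grid into characters at blank columns, then crop blank rows. */
function segmentCharacters(array $grid): array
{
    $height = count($grid);
    $width = $height > 0 ? count($grid[0]) : 0;
    $chars = [];
    $start = null;
    // Go one column past the right edge so a character touching the edge is closed.
    for ($x = 0; $x <= $width; $x++) {
        $hasInk = $x < $width && in_array(0, array_column($grid, $x), true);
        if ($hasInk && $start === null) {
            $start = $x; // first column of a new character
        } elseif (!$hasInk && $start !== null) {
            $slice = [];
            foreach ($grid as $row) {
                $slice[] = array_slice($row, $start, $x - $start);
            }
            $chars[] = trimBlankRows($slice);
            $start = null;
        }
    }
    return $chars;
}

/** Remove leading and trailing rows that contain no character pixel. */
function trimBlankRows(array $rows): array
{
    $hasInk = fn(array $row): bool => in_array(0, $row, true);
    while ($rows && !$hasInk($rows[0])) {
        array_shift($rows);
    }
    while ($rows && !$hasInk($rows[count($rows) - 1])) {
        array_pop($rows);
    }
    return $rows;
}
```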
I tried two approaches to skew correction: one based on linear regression, the other on projection.
With linear regression, I took the midpoint of the character pixels on each row, fitted a straight line through these midpoints using the least squares method, and obtained a slope, which corresponds to the skew angle of the character. I then corrected the skew based on that slope. This method works quite well for characters like 'n', but not so well for characters like 'j'.
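The slope estimate can be sketched as below. Two choices here are assumptions rather than details from the original: the "midpoint" of a row is taken as the mean of its leftmost and rightmost character pixels, and the line is fitted as x against y (horizontal drift per row), which keeps the slope finite for upright characters. The function name is hypothetical.

```php
<?php
/** Least-squares slope dx/dy of the per-row midpoints of a character grid. */
function estimateSkewSlope(array $char): float
{
    $ys = [];
    $xs = [];
    foreach ($char as $y => $row) {
        $cols = array_keys($row, 0, true); // x positions of character pixels
        if ($cols) {
            $ys[] = $y;
            $xs[] = (min($cols) + max($cols)) / 2; // midpoint of this row
        }
    }
    $n = count($ys);
    if ($n < 2) {
        return 0.0; // not enough rows to fit a line
    }
    $meanY = array_sum($ys) / $n;
    $meanX = array_sum($xs) / $n;
    $num = 0.0;
    $den = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $num += ($ys[$i] - $meanY) * ($xs[$i] - $meanX);
        $den += ($ys[$i] - $meanY) ** 2;
    }
    return $den == 0.0 ? 0.0 : $num / $den;
}
```

Under these assumptions the skew angle from the vertical is atan() of the returned slope, and shifting each row by minus the slope times its row index would straighten the character.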
Because direct linear fitting does not work well for some characters, I adopted the projection method. A skewed character is generally wider than its upright form, so I rotated the character within a certain range and took the result with the minimum width as the corrected character. Rotating an upright character directly by slope is not feasible, because tan 90° is undefined, which makes the counterclockwise rotation range hard to define. Instead, I first transpose the character array, rotate it clockwise with the slope ranging from -0.5 to 0.5, and then transpose it back. The implementation involves a considerable amount of repetitive calculation, most of which comes down to mathematical derivation. Additionally, if the character width starts to increase during the rotation, the search can reverse direction or stop, similar to gradient descent. I did not use matrix operations; using them would make the implementation simpler. PHP has machine learning libraries such as PHP-ML that provide matrix operations. Of course, you can also use PHP-ML directly for neural network training.
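The minimum-width search can be illustrated with the sketch below. It is a simplification, not the implementation described above: instead of transposing and rotating, it applies a horizontal shear (each row is shifted by slope times its row index) and scans the slopes from -0.5 to 0.5 exhaustively. The function names and the step of 0.05 are assumptions.

```php
<?php
/** Width in columns of the character after shifting row $y by $slope * $y. */
function shearedWidth(array $char, float $slope): int
{
    $minX = PHP_INT_MAX;
    $maxX = PHP_INT_MIN;
    foreach ($char as $y => $row) {
        foreach (array_keys($row, 0, true) as $x) {
            $shifted = (int) round($x + $slope * $y);
            $minX = min($minX, $shifted);
            $maxX = max($maxX, $shifted);
        }
    }
    return $maxX >= $minX ? $maxX - $minX + 1 : 0;
}

/** Try slopes in [-$range, $range] and return the one giving the narrowest character. */
function findBestShear(array $char, float $range = 0.5, float $step = 0.05): float
{
    $best = 0.0;
    $bestWidth = shearedWidth($char, 0.0);
    $steps = (int) round($range / $step);
    for ($i = -$steps; $i <= $steps; $i++) {
        $slope = $i * $step;
        $width = shearedWidth($char, $slope);
        if ($width < $bestWidth) { // keep the first slope reaching a new minimum
            $bestWidth = $width;
            $best = $slope;
        }
    }
    return $best;
}
```

An exhaustive scan is used here for clarity; stopping or reversing once the width starts to grow, as described above, would reduce the number of candidate slopes evaluated.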
After correcting the CAPTCHA, a feature matching database needs to be built. I used the binarized array, converted into a string, as the whole feature and wrote it into a feature matching array. I then entered the correct codes manually; whenever a recognized character did not match the manually entered one, its feature was added to the feature matching array. The array was then serialized, and the serialized string was compressed and stored in a file. I extracted about 150 character feature codes, occupying about 8 KB. Note that I ran PHP as a command-line script: after configuring the environment variables and inserting empty data, I ran php Build.php to start extracting the feature codes.
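A sketch of how such a feature file might be built and stored is given below. The entry layout (label plus feature), the function names, and the use of gzcompress() from the zlib extension are assumptions; the original only states that the array was serialized and compressed.

```php
<?php
/** Flatten a binarized character grid into a string such as "1001...". */
function featureString(array $char): string
{
    return implode('', array_map(fn(array $row): string => implode('', $row), $char));
}

/** Add the character to the feature array only when recognition got it wrong. */
function learn(array &$features, array $char, string $label, ?string $recognized): void
{
    if ($recognized !== $label) {
        $features[] = ['label' => $label, 'feature' => featureString($char)];
    }
}

/** Serialize, compress (zlib, an assumption), and write the feature array. */
function saveFeatures(array $features, string $path): void
{
    file_put_contents($path, gzcompress(serialize($features)));
}

/** Read the feature array back; a missing file yields an empty database. */
function loadFeatures(string $path): array
{
    if (!is_file($path)) {
        return [];
    }
    return unserialize(gzuncompress(file_get_contents($path)));
}
```

A build script would loop over sample images, call the recognizer, compare the result with the manually entered code, and call learn() for each character before saving.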
Because all the feature information is stored directly in the file, matching can be done with a simple loop that compares the strings. To improve accuracy, I align the first 0 of the two strings being compared and then iterate through them, counting the number of identical characters. In addition, because the strings differ in length, I multiply the length information by a certain weight and include it in the similarity. PHP also provides the similar_text function for computing string similarity. Using it improves the recognition rate, but because the feature strings are long, matching is relatively slow. To balance time consumption and accuracy, I kept my own matching method.
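The matching idea can be sketched as follows. The exact scoring formula is not given in the text, so the one used here (number of matching positions minus a weighted length difference), the weight of 0.5, and the function names are all assumptions for illustration.

```php
<?php
/** Similarity of two feature strings after aligning them on their first '0'. */
function similarity(string $a, string $b, float $lengthWeight = 0.5): float
{
    // Drop everything before the first character pixel in each string.
    $a = (string) strstr($a, '0');
    $b = (string) strstr($b, '0');
    $overlap = min(strlen($a), strlen($b));
    $matches = 0;
    for ($i = 0; $i < $overlap; $i++) {
        if ($a[$i] === $b[$i]) {
            $matches++;
        }
    }
    // Assumed scoring: penalize length differences by a fixed weight.
    return $matches - $lengthWeight * abs(strlen($a) - strlen($b));
}

/** Return the label of the best-scoring stored feature, or null if there is none. */
function recognize(string $feature, array $features): ?string
{
    $bestLabel = null;
    $bestScore = -INF;
    foreach ($features as $entry) {
        $score = similarity($feature, $entry['feature']);
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestLabel = $entry['label'];
        }
    }
    return $bestLabel;
}
```

Each comparison is a single linear pass over the two strings, which is the reason a hand-written loop like this can be faster than similar_text on long feature strings.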