Bug: the masquerade
A dot on the radar
At first, this bug looked like any other: a plain 500 error showed up on the page. Since the page in question was the homepage, it was rather serious. One annoying thing about it, however, was that after a reload the error was gone. Luckily for us, we had a log entry which stated clearly: “Warning: imagecreatefrompng(): gd-png: fatal libpng error: Read Error: truncated data”. The stack trace pointed to the module that converts images to the webp format.
Let me fix that quickly
The initial investigation focused on finding the files that caused the problem. With a quick log added to the webp processor, I found three files that were generating errors. They loaded correctly in the browser, which was strange, but it could be that the browser was more fault-tolerant than the image processing library. I saved the files locally, edited them in GIMP, verified that the webp processor handled them correctly, and re-uploaded them to the server. All fine and dandy, case closed, another bug squashed… or so I thought.
A little game of whack-a-mole
The next day I checked the logs and more files were there. Moreover, the ones that I had fixed appeared again. Upon checking the application logs, I noticed that the original “truncated data” error had been showing up for over a month. That was weird, so I summed up what I was seeing:
- the error appeared for the first time around a month prior
- it was not reproducible in any straightforward way
- the issue was related to a file not being fully read from the disk
- the issue was not in the files themselves, as replacing them had not resolved it
The random pattern of the error, combined with the fact that it related directly to reading files from the disk, made my conclusion clear: it must be a disk error. After all, if it looks like a duck, swims like a duck and quacks like a duck, then it probably is a duck. I moved the task to the ops team and asked them to run a disk analysis to check for errors, and to replace the disk if necessary.
The check came back negative. There were no apparent errors with the disk and everything should have been working fine. Like hell it should, when it evidently did not! However, it is hard to argue with raw performance data. My next suspect was the system configuration. Since there is a limit to how many files a process can open, maybe in some cases we were hitting it. Our production limits were set to 1024 and 524,288 for the soft and hard limits respectively. The hard limit is rather big, but we were dealing with a Magento installation with additional modules, so who knows. Just for testing, I increased it to see whether this would resolve, or at least reduce, the issue, which had been appearing more and more often.
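For reference, a process can inspect (and, within bounds, raise) these limits itself. Here is a minimal sketch using Python's standard resource module; the real application is PHP, so this only illustrates the mechanism, not our actual change:

```python
import resource

# RLIMIT_NOFILE holds the soft and hard caps on open file
# descriptors (the 1024 / 524,288 pair mentioned in the text).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft: {soft}, hard: {hard}")

# An unprivileged process may raise its own soft limit, but only
# up to the hard limit set by the administrator:
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```

On the server itself the same numbers come from `ulimit -Sn` / `ulimit -Hn` for the shell, or the limits configured for the PHP-FPM service.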
This is ridiculous
As anyone would expect, this did not help at all. The problem still appeared and its frequency had not decreased. The only thing left was to roll up my sleeves and try to recreate the conditions under which the problem occurs. I wrote a simple script which searched for every png file in the upload folder and tried to load it in the same way the webp library does. As expected, when it ran, everything went smoothly and no errors appeared. So I ran it again, with the same result. And again, because the third time’s the charm. Unfortunately, not this time. The fourth time, then; the fourth would surely be different. But it wasn’t. At this point the problem became personal. I stood there like John McClane checking the explosives while this little Hans Gruber tried to kill me on the roof of Nakatomi Plaza.
So I looped it. I ran the full test a hundred times and printed the results to see if any error was detected while loading the files. I waited patiently as the lines appeared on the screen one by one, each showing no errors. And finally, there it was. The 33rd line showed an error. This little rascal was there. I knew it. The script continued running and everything was again OK. Once it finished, there were a total of 3 occurrences of the error, which means it happened about once every 30 iterations. At this point the only thing I could think of was, again, a disk error. The pattern did not make any other sense. To be sure, I copied the files from the media folder to a different location on the disk and ran the script there. As expected, after 300 iterations the problem had not appeared. What else could cause such a random error in reading a file through a native library?
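The test looked roughly like the following. This is a Python re-creation, not the original script: the real one loaded each file through the same GD/libpng path as the webp library, while this sketch only checks that each PNG on disk is structurally complete (correct signature and a trailing IEND chunk), which is exactly what a partially written file fails:

```python
import glob
import os

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def png_looks_complete(path):
    """Heuristic check: the file starts with the PNG signature and
    ends with an IEND chunk. A truncated write fails this."""
    with open(path, "rb") as f:
        data = f.read()
    # The last 8 bytes of a valid PNG are the IEND chunk type + CRC.
    return data.startswith(PNG_MAGIC) and data[-8:-4] == b"IEND"

def scan(folder, iterations=100):
    """Re-scan every .png repeatedly; a transient failure only shows
    up on some iterations, as happened in this bug."""
    failures = []
    for i in range(iterations):
        pattern = os.path.join(folder, "**", "*.png")
        for path in glob.glob(pattern, recursive=True):
            if not png_looks_complete(path):
                failures.append((i, path))
    return failures
```

Run against a stable folder, `scan()` returns an empty list every time; against the media folder it surfaced the intermittent failures.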
A blind spot
As it turned out, there was a crucial piece of information that I had missed. In all this chasing, I had not noticed which files the issue actually pertained to. If I had been a little more focused and a little less entrenched in my initial assessment, I would have caught it. I mean, it was right there in front of my eyes. Even when I copied the folder to another location, I did not copy the entire upload folder but only a single subfolder: the one containing the files which appeared in my log when I first created it. The path alone should have given me a hint. Namely, the error appeared only when dealing with files from [vendor]/resized/[digits].
Resized. This was not a regular folder where some module dumped all the files that were uploaded. It was a subfolder generated by the module when it created resized versions of the uploaded files. So those files were never uploaded by the administrator; they were created by the module.
But wait… this module does not create webp files, it simply resizes the original file and saves it in the exact same format. So how could resizing a png or a jpg file cause an issue with the webp processor?
Good question, my dear Watson. Let’s look at the code. We have two modules. One provides the administrator with an easy-to-use graphical interface for preparing content for the site. The other makes sure that the produced HTML contains images optimised for the web, which means using webp, as it is more efficient and widely supported. On their own, the two modules work without any problems. However, the way they were built made them unsuitable for use in tandem.
The left-hand side of the equation
Let’s start with the webp module. It was designed to be a plug’n’play module for a Magento store. As such, it had to make sure that every possible image used on a page would have a webp version. This is fairly easy to do with product images, as Magento uses a specialised class to generate the URLs to those images; you can easily hook into this class and produce a link to the webp version. Similarly, content provided by the CMS module uses a dedicated function to generate the URL to a file, which can also be intercepted and replaced with the webp one. But there is a catch: this approach only allows generating a single link, so all images would point to the webp version. Since this was intended as a general-purpose module, it needed to remain compatible with sites whose users’ browsers do not support the webp format. To achieve that, instead of producing a single <img> tag, it opted to replace it with a <picture> tag containing two sources: the original link and the webp link. For such a change there is no good way to hook into the system. So the way this module works is that it hooks into the generated HTML content just before it is sent to the browser, scans for <img> tags and replaces them with <picture> tags. It is not an elegant solution and it has some performance drawbacks, but it is the only way to do it without requiring a programmer to add webp support by hand.
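A heavily simplified sketch of that rewriting step, in Python for illustration (the names are mine, and a real implementation would use a proper HTML parser rather than a regex):

```python
import re

# Match <img> tags whose src ends in .png / .jpg / .jpeg.
# Deliberately naive: attribute order, quoting styles and nested
# markup are not handled, unlike in a production rewriter.
IMG_RE = re.compile(r'<img\s+[^>]*src="([^"]+)\.(png|jpe?g)"[^>]*/?>')

def to_picture(html: str) -> str:
    """Wrap each matched <img> in a <picture> tag that offers a
    webp source first, keeping the original <img> as fallback."""
    def wrap(match: re.Match) -> str:
        img_tag = match.group(0)
        base = match.group(1)
        return (f'<picture>'
                f'<source srcset="{base}.webp" type="image/webp">'
                f'{img_tag}'
                f'</picture>')
    return IMG_RE.sub(wrap, html)
```

Browsers that understand `<picture>` pick the webp source; older ones fall back to the inner `<img>`, which is what makes this approach work without any per-site programming.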
However, this approach is not what was causing the issue. When generating the webp source, the module created the webp version from the original file on the fly. This is one of the performance problems, as it extends the time needed to process the request. A better design would have been to generate only the URL from the original one and hook into the direct request for the file: if the file was missing, the module would generate the webp version then and save it for future requests. This would not have prevented the error, but instead of a blank page with an internal error, the customer would have seen the actual page with, at worst, some missing images.
The right-hand side of the equation
As I said before, this module was not the sole culprit. If it had been the only one installed, there would have been no problem, maybe a slight performance drop, but a rather negligible one, as most requests are cached anyway. To get to the root of the problem we must examine the second module, the one that provided the editorial team with an easy-to-use page builder.
This module is rather complex, and it saves the entire structure of a page as a tree of elements. Some of those elements deal with images. One of their features is that you can not only provide the image but also specify the desired dimensions. When the module renders the page, it resizes the image to the required dimensions to reduce the amount of data sent over the network. This is good practice which helps with page rank, so it is nothing unusual. However, one unfortunate thing was done in this module: the rendering engine always calls the resize function, because it not only generates the miniature version but, most importantly, returns the URL to the file. Here again we see the same pattern as in the webp module: the miniature is generated on the fly rather than when directly requested. That alone is not a big deal; the file would be generated once, saved to disk and then simply reused. In this case, however, an important piece of code was missing: the check whether the resized version had already been created. So in reality, every time this module rendered part of a page, it regenerated the resized image whether it already existed or not.
Looking at the above, there is no obvious way in which these two modules would be problematic. After all, the webp processor runs only after the page builder has generated all its miniatures. Given that we use full page cache extensively, there should not be many occasions for the two to clash. It could certainly happen that two requests for the same page, e.g. the homepage, were handled in parallel by two PHP workers. That could lead to the webp processor on the first request trying to generate the webp version of a file exactly when the page builder on the second request was regenerating that same file. This, however, should be very rare, and it should only happen when the cache has just been cleared. We were seeing the issue far more often than that.
The final piece of the puzzle is the way both modules dealt with generating on-the-fly versions of a file. As I mentioned, the page builder module did not check whether the resized version had already been created, so it duplicated the work. The webp processor did not make this mistake: it checked whether the webp version of the file was already there. But to handle the case where the original file had been updated, it also checked whether the original was newer than the webp version on disk; if so, the webp version had been created from an older file and might not correctly represent the current one. Since there was no other way to invalidate the webp versions or clear them from the backend panel, this was a reasonable approach.
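The freshness rule just described can be sketched like this (Python for illustration; the function name is mine, not the module's):

```python
import os

def webp_is_fresh(original: str, webp: str) -> bool:
    """Reuse the cached .webp only if it exists AND is at least as
    new as the original file; an updated original invalidates the
    cached webp, which is then regenerated on the next request."""
    if not os.path.exists(webp):
        return False
    return os.path.getmtime(webp) >= os.path.getmtime(original)
```

On its own this is sound. The trouble is that the page builder kept bumping the original file's mtime by rewriting it, so this check kept returning False and the webp was regenerated far more often than intended.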
This, however, is where the two modules clash. With the page builder constantly recreating the resized version, and the webp module constantly checking whether the existing file is older than the original, the odds rose sharply that one PHP worker would be generating the webp version at the exact moment another was rewriting the source file. When that happened, the webp processor read a file that was still being written, which is precisely the “truncated data” error libpng reported.
Once I realised all of this, the fix was extremely easy. I just added a check to the page builder’s image processor which prevents recreating miniatures that already exist.
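In sketch form, again in Python with hypothetical names (`_generate_resized_file` stands in for the real resize routine):

```python
import os
import shutil

def _generate_resized_file(source: str, target: str,
                           width: int, height: int) -> None:
    # Stand-in for the real resize; just copies the file here.
    shutil.copyfile(source, target)

def resize_image(source: str, target: str,
                 width: int, height: int) -> str:
    """Page-builder resize entry point. The one-line fix: bail out
    early when the miniature already exists, so the renderer stops
    rewriting the file on every request (which is what raced with
    the webp processor)."""
    if os.path.exists(target):  # the check that was missing
        return target
    _generate_resized_file(source, target, width, height)
    return target
```

With the early return in place, the resized file's mtime stays put, the webp module's freshness check starts passing, and the window for the race disappears.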
The moral of this story: if it looks like a duck, swims like a duck and quacks like a duck, but the duck is a disk error, then it probably is not a duck.
Additionally, always beware of other people using your code, especially when you are dealing with data that you do not generate yourself.