Ascendro Blog


Handling large strings in PHP

We recently needed to write an algorithm to unpack a specific file from a proprietary archive format.

The fun part is that the initial task quickly transformed into a research task, as our colleague Michael wanted to dig deeper into the topic. Here is the story behind the performance boost.

I had never worked with binary files in PHP before, so I decided to first get it working and care about optimisations later.

My straightforward approach of using file_get_contents() and normal string operations resulted in a very slow and memory-hungry process.

Getting a 10KB file out of a 2MB archive took the algorithm ~1,200ms and ~14MB of memory at its peak.

This had to be optimized ... and I succeeded in bringing it down to ~30ms and ~1MB of memory at its peak ...

Analyzing the Algorithm

As I was too lazy to set up serious debugging tools on the system, I just used microtime() at the start and end of the script to measure the elapsed time, as well as memory_get_usage() and memory_get_peak_usage() to track memory usage.
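For reference, the harness looked roughly like this (a sketch - PBOExtractor::extractFromString() and the variable names are stand-ins for the actual extraction code):

// Crude measurement harness (sketch): times ten extraction runs and prints
// memory usage before and after. The extractor call below is an assumed name.
echo "MEM: " . memory_get_usage() . "\n";
echo "PEAK: " . memory_get_peak_usage() . "\n";

$start = microtime(true);
for ($i = 0; $i < 10; $i++) {
    echo "Processing $i\n";
    $result = PBOExtractor::extractFromString($fileToExtract, $pboString);
}
$elapsed = microtime(true) - $start;

echo "MEM: " . memory_get_usage() . "\n";
echo "PEAK: " . memory_get_peak_usage() . "\n";
echo "Total Elapsed: $elapsed seconds\n";
echo "Average time: " . ($elapsed / 10) . " seconds\n";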

The first bulk of memory was allocated when the file contents were read and stored in memory as a PHP string. The memory used matched the size of the file, which was not unexpected and OK so far.

$pboString = file_get_contents($pboFile, FILE_USE_INCLUDE_PATH);

Further, smaller chunks of memory were allocated due to copying the string or parts of it - by passing parameters by value in function calls, by creating copies for further processing, or simply through the use of PHP's string operations.

public static function createByConsumeString(&$string,$header) {
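That signature is just a fragment; roughly, the consume-by-reference idea looked like the sketch below (class name, field layout and the unused $header parameter are illustrative, not the real PBO header format):

class PBOEntry
{
    public $name;
    public $dataSize;

    // Sketch: the caller's string is taken by reference and shortened in place,
    // so passing it into the function does not create a copy by itself.
    public static function createByConsumeString(&$string, $header)
    {
        $entry = new self();

        // Read a null-terminated file name from the front of the string
        $end = strpos($string, "\0");
        $entry->name = substr($string, 0, $end);
        $string = substr($string, $end + 1);   // consume the name + terminator

        // Read a 4-byte little-endian length field
        $unpacked = unpack('Vsize', substr($string, 0, 4));
        $entry->dataSize = $unpacked['size'];
        $string = substr($string, 4);          // consume the length field

        return $entry;
    }
}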

After switching to references, the performance improved slightly:

MEM: 411896
PEAK: 428544
Processing 0
Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
MEM: 777064
PEAK: 8887368
Total Elapsed: 10.239467144012 seconds
Average time: 1.0239467144012 seconds

A slight improvement of 200ms and a saving of 6MB was a nice step forward, but still far too much considering that the input file was just 2.5MB and we were still using more than three times that amount of memory.

The real performance killers were the string functions of PHP, especially substr(), which I used extensively to pull out information and get rid of the already processed data. Every call resulted in a copy of the complete remaining file.

Trying to reduce the usage of substr() by taking out bulks of data, instead of calling it for every byte that needed to be read, gained a small advantage in speed, but the peak memory consumption stayed the same. Also, the existence of null-terminated C strings made it impossible to always predict the correct size of the bulks.
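To make the cost concrete, here is a simplified illustration (not the original code) of the worst case - consuming one byte at a time; for an N-byte string this shuffles roughly N*(N+1)/2 bytes around:

// Per-byte consume pattern: every substr() call below allocates a fresh copy
// of everything that has not been processed yet.
function consumeCString(&$string)
{
    $result = '';
    while ($string !== '' && $string[0] !== "\0") {
        $result .= $string[0];
        $string = substr($string, 1);        // copies the whole remaining string
    }
    $string = (string) substr($string, 1);   // drop the "\0" terminator - another copy
    return $result;
}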

A better solution was needed.

Finding better alternatives

As consuming the already processed part of the string was an approach I liked, using streams instead of strings came to mind very quickly - but the description on php.net for file_get_contents() says

"file_get_contents() is the recommended way to read the contents of a file into a string. Memory mapping techniques are used to increase performance, if the operating system supports this." (translated from the German php.net documentation)

meaning that it uses specific techniques to achieve better performance (surprisingly, the English version of php.net phrases this less euphorically), and everybody knows that file system access is much slower than in-memory operations and should be avoided as much as possible.

Using fread() on the file stream would have resulted in a lot of micro-reads, which would surely have hurt performance.

The idea of implementing a fake stream came to mind. Using array indexes and looping from the beginning to the end seemed possible and memory friendly. Unfortunately, it meant a lot of work and changes to the current system.

No native PHP string functions could be used anymore, every byte would need to be processed manually, and all classes would need additional parameters such as the current position in the string.

Dismissing that idea, I went back to replacing substr(), but couldn't find much useful information. I couldn't be the only person who had to work with large binary files. Thinking about solutions in other languages, I came back to streams and how they might be used in memory. The first searches turned up some libraries - which I would rather avoid, as they extend the server requirements - but finally I found the knowledge I needed:

// Open the archive and copy it in one go into an in-memory stream
$pboHandle = fopen($pboFile, 'rb');
$pboStream = fopen("php://memory", 'r+');
stream_copy_to_stream($pboHandle, $pboStream);
rewind($pboStream);
fclose($pboHandle);

// Extract from the memory stream instead of from a string
$result = PBOExtractor::extractFromStream($fileToExtract, $pboStream);
fclose($pboStream);
return $result;

Being rather easy to set up, I was surprised to find so few mentions of php://memory as a solution. It provides exactly the same interface as a file resource (as it is handled as a resource) and is therefore fast to learn and get used to.

Furthermore, the stream_copy_to_stream() function provides a quick and performant way to copy one resource to another. Replacing the substr() calls with fread() calls was now easy. No expensive hard drive operations involved, no feeling of guilt.

The code change wasn't big either - strings became streams, taking out bunches of data became fread(), and that was basically it.
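For illustration, the stream-based counterparts of the consume helpers could look roughly like this (a sketch - the real extractor keeps more state than shown):

// Read a null-terminated string byte by byte from the stream
function readCString($stream)
{
    $result = '';
    while (($char = fgetc($stream)) !== false && $char !== "\0") {
        $result .= $char;
    }
    return $result;
}

// Read a 4-byte little-endian unsigned integer from the stream
function readUInt32($stream)
{
    $unpacked = unpack('Vvalue', fread($stream, 4));
    return $unpacked['value'];
}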

MEM: 411400
PEAK: 430016
Processing 0
Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
MEM: 777504
PEAK: 3375064
Total Elapsed: 0.34113693237305 seconds
Average time: 0.034113693237305 seconds

Miraculous! We got down to 34ms! That is roughly 30 times faster than with the PHP string operations. The memory peak also went down to ~3.4MB, so we save well over half of the memory we needed before. The memory usage makes sense as well, since we loaded the 2.5MB file into memory and also stored the extracted file on each run.

Back to the roots

Having the system working with a resource now, it is rather easy to replace the memory stream with a file stream - just to check how badly the hard drive operations really impact the performance.
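The change boils down to handing the file handle to the extractor directly instead of copying it into php://memory first - roughly:

// Open the archive and let the extractor read from the file stream directly
$pboHandle = fopen($pboFile, 'rb');
$result = PBOExtractor::extractFromStream($fileToExtract, $pboHandle);
fclose($pboHandle);
return $result;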

MEM: 412872
PEAK: 430656
Processing 0
Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
MEM: 778984
PEAK: 1373752
Total Elapsed: 0.28629994392395 seconds
Average time: 0.028629994392395 seconds

HAHA! It is even faster (by about 6ms) than the memory stream. How is that? The explanation is pretty simple: using the file stream directly saves us copying the complete file into memory once, which takes time as well.

Either way requires the expensive hard drive operation of reading the complete file. Only the expectation that one bulk read is much faster than lots of smaller read operations led to the decision that a memory stream was needed - and that is presumably already covered by the caching mechanisms of the hard drive and the OS anyway.

The other real improvement is the memory usage: at only ~1.3MB at its peak, it uses less memory than the opened file itself would need if held in memory.

Lessons learned

The lessons learned are that whenever you have to work with huge binary data or even text: do not use PHP string functions! You will end up wasting memory and processing power by creating copies of your data.

Use resources from the beginning, and with php://memory you are free to decide later where you want to keep your data.

In the end I could use php://memory there as well, by saving the unpacked file not as a string but as an in-memory resource - so that later processing of the file can be done with streams too.
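A sketch of that last idea - $pboStream, $dataOffset and $dataSize are assumed names for wherever the archive index says the packed file lives:

// Copy the packed file's byte range straight from the archive stream into a
// fresh php://memory resource, so no intermediate string is ever built.
$extracted = fopen('php://memory', 'r+');
fseek($pboStream, $dataOffset);
stream_copy_to_stream($pboStream, $extracted, $dataSize);
rewind($extracted);
return $extracted;   // later processing can stay stream-based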

Comments


Michael says:

That's the point - I called them resources, but I was referring to them in the context of streams. Reading a complete file at once or with a stream still requires reading the full file from the hard disk. Here file_get_contents() might bring an improvement, as it already knows that you want the complete file and can therefore optimize for it (as described in its documentation). Anyway, the performance difference between reading the complete file into a memory stream first and reading it as a stream was just 6ms, and not the real reason why the algorithm took 1,200ms in the beginning. What was the problem with the substr() I used in order to "consume" the string? Each call makes a complete copy of the leftover string. For example, if you have 1,000 characters and take out 1 character at a time, this results in (1000*(1000+1))/2 = 500,500 bytes copied around. The goal of the article is to promote streams and show by what factor the performance can be impacted. Streams aren't used a lot (compared to just taking the string and working on that data), and PHP developers tend to use what they are more comfortable with.

Adrian says:

"The lessons learn are that whenever you have to work with huge binary data or even text: do not use PHP string functions! " Not sure if this is what you actually mean. If I understood what you wrote before, you changed your code from reading the whole file at once using file_get_contents() to a stream using fread(), correct? In that case lessons learned are: If you work with big files, use streams instead of reading/writing the whole file at once. Maybe I misunderstood your problem. Especially useful are streams when combined with PHPs stream wrappers like compression/decompression, encrypting/decrypting... see http://www.php.net/manual/en/wrappers.php for a complete list. You could also think about writing your own stream wrapper for your data format.