Scripts Timing Out? Try this!

Data import scripts are some of the most common, important, and challenging tasks I perform. Timeouts, memory limits, and boredom are just the most frequent of the frustrations. Once a development project enters its final stages, the conversation invariably begins to shift from functionality to data. Most clients have a lot of existing content and data that needs to be incorporated into their brand-new site. The information may reside in an old database, large volumes of text files, spreadsheets, or something else entirely. Regardless of the storage method, any transfer of a large amount of data will require some automation in the form of a script. After you’ve written a few of these processes, you’ll undoubtedly run into the same issues with each one.

I wanted to write about this because I was recently tasked with importing some rather complex data into WordPress. The data is for cascade filters that are attached to products in WooCommerce. For the sake of discussion, the products for this client use two sets of cascade filters: one for a unique combination of values and another for various combinations of values. The first set of filters has a one-to-one relationship with a product, but the second set may have a many-to-one relationship with the same product. Finally, the filters relate to each other in a many-to-many fashion. Needless to say, I doubt this makes sense to anyone other than me.

This data was presented to me in the form of an Excel file. In spite of the complicated nature of the data relationships, the file only had seven columns and about 70,000 rows. The amount of data is not what I would normally classify as large. Quite the contrary. I’ve built scripts to process hundreds of thousands and even millions of rows and had them run to completion in just a few minutes.

Not “Big Data”, but “Big Processing”

Getting back to this most recent project, the amount of data posed no performance problems, but processing it did. Long story short, the individual cascade filter values are stored as hierarchical terms and term taxonomies. When a combination is attached to a product, serialized post meta is updated.[1] Suffice it to say, this stuff required some processing time and, as it turned out, a good bit of it.
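
To illustrate footnote 1’s complaint, here’s a hypothetical meta query against that serialized data; the filter_combos key is made up for this example. Matching a value inside a serialized array forces a LIKE comparison with leading wildcards, which can’t use an index:

// Hypothetical example: find products whose serialized 'filter_combos'
// meta contains a given filter term ID.
$query = new WP_Query( array(
    'post_type'  => 'product',
    'meta_query' => array(
        array(
            'key'     => 'filter_combos',            // made-up meta key
            'value'   => '"' . $filter_id . '"',     // assumes IDs stored as quoted strings
            'compare' => 'LIKE',                     // slow: scans the meta table
        ),
    ),
) );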

I created a quick plugin that would read a CSV version of the Excel file and process the data. When I started the process, it quickly timed out. I increased the time limit but soon realized that would not do. This script needed hours to run, and I didn’t want to open a can of worms by messing with php.ini. So, I had to come up with another solution: batch processing and AJAX.
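
For the record, the dead end looked something like this. Raising the limits at runtime only postpones the problem, and many hosts cap or ignore these settings anyway:

// What I tried first: raising PHP's limits at runtime. Many hosts
// cap or ignore these, and a multi-hour run will still die eventually.
set_time_limit( 0 );                 // remove the execution time limit
ini_set( 'memory_limit', '512M' );   // ask for more memory, too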

Instead of processing the entire data set at once, I updated the code to process only a set number of rows at a time. Once one batch of rows finished, the next batch kicked off. I created a “Block Size” option to change the number of rows processed in each batch, which allowed for flexibility in finding the most efficient batch size. I also converted the form to AJAX, which allowed for seamless processing and displaying progress to the user.

The Gory Bits

The code to handle this is pretty simple. The key is setting a couple of pointers that tell the PHP side which rows of the CSV file to process. Before we get into the PHP side of things, let’s look at the form HTML.

<form method="post" action="<?php echo esc_url( admin_url() ) ; ?>admin-post.php" id="script-form" enctype="multipart/form-data">         
    <p class="submit">
        <a id="script-submit" class="sync_button">Run Script</a>
    </p>
        
    <input type="hidden" name="action" value="run_script">
    <input type="hidden" name="current_row" value="0">
    <input type="hidden" name="ajax-nonce" id="ajax-nonce" value="<?php echo wp_create_nonce('ajax-nonce'); ?>">
  </form>

This is a bare-bones form with a few hidden fields. The action field specifies which custom AJAX action will be called; with a value of process_script, WordPress fires the wp_ajax_process_script hook. We also have our nonce, created by wp_create_nonce. Finally, we have the current_row field, which starts at 0. This value is the crucial bit that makes our batch processing work.
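
For context, this form lives on a custom admin page. Here’s a minimal sketch of how such a page might be registered; the slug and callback name are hypothetical, not from my actual plugin:

// Hypothetical admin page that renders the form shown above.
add_action( 'admin_menu', function () {
    add_menu_page(
        'CSV Import',             // page title
        'CSV Import',             // menu title
        'manage_options',         // capability required to see the page
        'my-csv-import',          // menu slug (made up)
        'my_render_import_form'   // callback that echoes the form HTML
    );
} );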

Now let’s take a look at the JavaScript side of things.

(function ($) {
    "use strict";

    $(document).ready(function () {
        $("#script-submit").on('click', function (e) {
            e.preventDefault();

            var formData = $("FORM#script-form").serialize();

            processScript(formData);
        });
    });

    function processScript(formData) {
        // ajaxurl is defined by WordPress on admin pages and points to admin-ajax.php.
        $.getJSON(ajaxurl, formData)
        .done(function (data) {
            switch (data.result) {
                case 'COMPLETE':
                    alert("Done!");
                    break;
                case 'NEXT':
                    var newFormData = JSON.parse(data.form_data);

                    // Keep the hidden field in sync, then kick off the
                    // next batch with the updated current_row pointer.
                    $("INPUT[name='current_row']").val(newFormData.current_row);

                    processScript(newFormData);
                    break;
                case 'FAIL':
                    alert("Fail!");
                    break;
            }
        }).fail(function (data, status, error) {
            alert('jQuery AJAX Failed!');
        });
    }
}(jQuery));

Things are pretty simple here as well. We bind a click handler to the submit button, which serializes the form data and passes it to our processScript function defined below. The processScript function sends the form data to ajaxurl using jQuery’s $.getJSON function. The response is expected to be JSON containing at least one key named “result”. The value for “result” is one of three possibilities: “NEXT”, “COMPLETE”, or “FAIL”. “FAIL” is reported if something went wrong on the PHP side, and “COMPLETE” means the entire CSV file has been processed. “NEXT” means there are additional CSV rows left to process. When this result is received, we update our “current_row” form value to the first row of the next batch. Then, our processScript function is run again. All of this results in a loop where processScript is run until either a “COMPLETE” or “FAIL” result is received.

Finally, we’ll take a look at the PHP bits. For the sake of simplicity, we’ll put these in our theme’s functions.php file.[2]

<?php
add_action( 'wp_ajax_process_script', 'my_process_script' );

function my_process_script() {
    try {
        $response  = array();
        $form_data = $_GET;

        $nonce_check = check_ajax_referer( 'ajax-nonce', 'ajax-nonce', false );

        if ( current_user_can( 'administrator' ) && $nonce_check !== false ) {
            $file   = 'file.csv';
            $handle = fopen( $file, 'r' );

            $block_size  = 200;
            $total_rows  = 0;
            $current_row = (int) $form_data['current_row'];

            // The row number (exclusive) where the current batch stops.
            $row_limit = $current_row + $block_size;

            // Count the rows once and cache the result in a transient.
            if ( false === ( $total_rows = get_transient( 'ajax_csv_total_rows' ) ) ) {
                while ( ( $line = fgetcsv( $handle, 0, ',' ) ) !== false ) {
                    ++$total_rows;
                }

                set_transient( 'ajax_csv_total_rows', $total_rows, 0 );
            }

            if ( $row_limit > (int) $total_rows ) {
                $row_limit = (int) $total_rows;
            }

            rewind( $handle );

            // Walk the file, processing only the rows that fall inside the
            // current batch. $row_index tracks our position in the file;
            // $current_row advances as rows are actually handled.
            $row_index = 0;

            while ( ( $data = fgetcsv( $handle, 0, ',' ) ) !== false ) {
                if ( $row_index >= $current_row && $row_index < $row_limit ) {
                    // DO YOUR ROW HANDLING HERE...
                    $current_row++;
                }

                $row_index++;

                // Nothing left in this batch, so stop reading.
                if ( $row_index >= $row_limit ) {
                    break;
                }
            }

            fclose( $handle );

            $form_data['current_row'] = $current_row;

            if ( $current_row >= $total_rows ) {
                delete_transient( 'ajax_csv_total_rows' );

                $response['result'] = 'COMPLETE';
                wp_send_json( $response );
            }

            $response['result']    = 'NEXT';
            $response['form_data'] = json_encode( $form_data );
        } else {
            $response['result'] = 'FAIL';
            wp_send_json( $response );
        }
    } catch ( Exception $e ) {
        $response['result'] = 'FAIL';
    }

    wp_send_json( $response );
}

There’s quite a bit going on here, so let’s run through it. First, a custom AJAX action is created that references our PHP function my_process_script. Inside that function, we first check the nonce and make sure the request comes from an administrator.[3] Then we grab the CSV file, which is static in this case.

Now we have to set some defaults that help with our batch processing. The $block_size represents the number of rows to be processed in a single request, and $total_rows contains the total number of rows in the current CSV file.[4] Then, $current_row houses our pointer that is passed back and forth between the client side and the server side. The $row_limit is provisionally set by adding our pointer to the block size; this establishes the last row number to be processed in the current batch.
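
As footnote 4 mentions, the plugin version makes the block size a form field rather than a constant. A quick sketch of reading it safely, assuming a hypothetical block_size field and clamping to sane bounds:

// Read a user-supplied block size, falling back to 200 and clamping
// the value so a typo can't stall the import or exhaust the request.
$block_size = isset( $form_data['block_size'] ) ? (int) $form_data['block_size'] : 200;
$block_size = max( 1, min( $block_size, 1000 ) );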

Getting the Total Number of CSV Rows

Normally, to determine the number of rows in the current CSV file, we’d have to loop through the entire file on each request. By setting a quick transient, though, we only need to perform this potentially slow count during the first batch. The transient stores the total row count, and we can read it back on subsequent batches. Since we don’t really know how long our script may run, we set the $expiration parameter to 0 so the transient never expires. We do, however, delete it once the script has run to completion.
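
If that first counting pass ever becomes a bottleneck, an alternative I didn’t use here is SplFileObject, which can jump to the last line without pulling each row through fgetcsv. Note that it counts physical lines, so it assumes no multi-line quoted fields:

// Alternative row count: seek past the end and read the line index.
$spl = new SplFileObject( 'file.csv', 'r' );
$spl->seek( PHP_INT_MAX );       // lands on the last line
$total_rows = $spl->key() + 1;   // key() is the zero-based line number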

Once there is a value for $total_rows, we check whether $row_limit needs to be adjusted down to that absolute upper bound. If $total_rows is less than the current batch’s $row_limit, we make them equal. This typically only happens during the final batch, when the number of rows left to process is less than $block_size.

Processing the Current Batch

With the current batch determined, it’s time to process the rows. We loop through the CSV file using fgetcsv. On each iteration, we compare the file’s row index against $current_row and $row_limit to decide whether the row falls inside our batch. If it does, we do whatever processing the row requires and increment the $current_row pointer.

Incrementing the $current_row value is the critical part of this method of batch processing. Where $form_data['current_row'] is updated, the new value gets sent back to our JavaScript. It bounces back and forth, increasing by $block_size each time until the last batch. The comparison of $current_row against $total_rows is the business logic that determines whether the current batch is the last one or more batches are necessary. If we have rows left after the current batch, we pass “NEXT” as the value for “result” in our JSON response. On the other hand, if there are no more rows left to process, we pass “COMPLETE”.[5] Finally, once we’ve processed all the data, we delete the transient we created earlier.
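
If you want the richer failure reporting footnote 5 describes, the PHP side might look something like this; the messages key is my own convention, not part of the script above:

// Hypothetical richer failure: the JavaScript 'FAIL' case could read
// data.messages and show the user exactly which row went wrong.
$response['result']   = 'FAIL';
$response['messages'] = array(
    sprintf( 'Row %d: missing a required column.', $current_row ),
);
wp_send_json( $response );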

I use wp_send_json to create properly formatted JSON from our $response array and safely exit.

Looking back at our JavaScript, it’s easy to see what happens when “NEXT” is received as the “result”. Since the incremented current_row is included in the JSON response, it can be fed right back into the next AJAX request. Once the JavaScript receives “COMPLETE” from the PHP side, the loop stops, and the user is alerted.

Wrapping Up

The TL;DR version of this is to pass an incrementing pointer between the client and the server. That value, along with a static batch size, allows for partial processing of a data set. This technique is certainly not novel, but it comes in handy when dealing with large data sets or resource-intensive processing. Although my example above uses a CSV file as the data source, other sources work too. As long as the data has some sort of indexing or sorting that’s consistent between batches, it can be processed in this manner.[6] For the WooCommerce plugin I created, I was able to add more features. Since I already had a current_row pointer coming from the PHP, I included the $total_rows and $row_limit values in each response. Combined with $block_size, that let me add an overall progress bar and messages with information about the current batch being processed. I also added an AJAX file upload and a template of the acceptable CSV format.
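
As a sketch of footnote 6’s database variant, the same current_row pointer maps directly onto a SQL OFFSET; the table and column names here are hypothetical:

// Hypothetical database-backed batch: same pointer, but the "file"
// is a query ordered by a unique key and chunked with LIMIT/OFFSET.
global $wpdb;
$rows = $wpdb->get_results( $wpdb->prepare(
    "SELECT * FROM {$wpdb->prefix}legacy_products ORDER BY id ASC LIMIT %d OFFSET %d",
    $block_size,
    $current_row
) );

foreach ( $rows as $row ) {
    // DO YOUR ROW HANDLING HERE...
    $current_row++;
}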

If you found this article useful, please leave a comment below. Happy scripting!


References

[1] Serialized post meta that stores an array of one-to-many values just sucks. The only way to run a meta query against it is with “LIKE”, which is slow.
[2] Since I made this script as part of a plugin, all of my PHP code actually resides in the plugin files, not functions.php.
[3] Your script may be available to other roles, so you can change this capability check to whatever you need.
[4] In the plugin I built, the block size is a form element that the user can adjust. I also included an AJAX file upload, so the user can upload a new file or select one that already exists.
[5] I’m skipping error handling to keep this post from getting even longer. Passing “FAIL” in our JSON response along with an array of error messages can let the user know exactly what went wrong.
[6] A database could be another possible data source. Query some data, sort by a unique key, and use “LIMIT” to handle the data in chunks.
