Import Hundreds of Thousands of Records in 15 Seconds with Custom Entities

Recently we needed to import a fairly large dataset into Drupal. There were 200,000 items to import, the data changed relatively frequently, and we needed them to be Drupal entities so that we could make them searchable with Apache Solr. We would normally import data using one of

  • Feeds, which provides a UI for importing a wide variety of data formats
  • Migrate, which makes it easy to write code to import data
  • a custom Drush script, calling entity_save() for each record

We soon realized that no matter which of these options we chose, entity_save() or node_save() would be called 200,000 times. After some experimentation, we decided that this was just too slow for our purposes. Instead, we wrote a Drush script that writes the data directly to the database, which brought the total import time down to about 15 seconds. Obviously this approach isn’t appropriate in most circumstances, so I’ll explain how and why we did it this way.

When (not) to take this approach

You can do anything you want to your Drupal database. However, editing the node table manually will almost certainly leave you in tears, because the node table is closely tied to the field tables and the node revision table, and contrib modules need a chance to react in hooks like hook_node_insert() and hook_node_update(). So it’s a bad idea to take this approach to import

  • Nodes
  • Users
  • Taxonomy
  • Anything from core
  • Most contrib entities, unless you understand them really really well

But! If you create your own custom entity type which isn’t fieldable, then all of its entity data lives in a single table in your Drupal database. That means you can do anything you want to that table without getting hurt, as long as no contrib module you depend on needs to react in hook_entity_insert() or hook_entity_update(). The only constraint on the table is that it must have a numeric primary key to serve as the entity ID.
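For illustration, a minimal non-fieldable custom entity along these lines might look like the sketch below. The `myrecord` module, table, and column names are hypothetical, not the entity we actually built:

```php
<?php
// Hypothetical sketch of a minimal, non-fieldable Drupal 7 entity type
// whose data lives entirely in one table.

/**
 * Implements hook_entity_info().
 */
function myrecord_entity_info() {
  return array(
    'myrecord' => array(
      'label' => t('My Record'),
      'base table' => 'myrecord',
      'entity keys' => array(
        // The only hard constraint: a numeric primary key for the entity ID.
        'id' => 'rid',
      ),
      'fieldable' => FALSE,
      'controller class' => 'DrupalDefaultEntityController',
    ),
  );
}

/**
 * Implements hook_schema().
 */
function myrecord_schema() {
  $schema['myrecord'] = array(
    'fields' => array(
      'rid' => array('type' => 'serial', 'not null' => TRUE),
      'title' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      'start_date' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
    ),
    'primary key' => array('rid'),
  );
  return $schema;
}
```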

So this might be for you if you have a large amount of trusted data to import into a custom, non-fieldable entity type that you fully control.

Why we decided to do this

We started out by using the Migrate module together with Migrate Extras, since we were already using it to import content into Drupal 7 from a Drupal 6 site. The Migrate module was reasonably easy to use, and I enjoyed using it. It’s even reasonably fast, but the fact of the matter is: running entity_save() 200,000 times takes a long time.

We’d already designed our entity so that it used just one database table, and we trusted the data that we were importing to not need any validation, so we decided to import directly to the database. Imports went from 40 minutes to 15 seconds.

How we did it

In a nutshell, the approach we took was to

  1. dump all the data into a table in the Drupal database $temp_table
  2. use that table to populate the base table $entity_table for our custom entity type.

We were importing CSV files, so we used a ‘LOAD DATA LOCAL INFILE’ statement to load the data into a temporary table. This statement runs very quickly. After that we ran other SQL statements to copy the data from our temporary table into the table for our custom entity type. We did all the data processing in MySQL, so everything ran extremely quickly. The downside to doing all the processing in SQL statements is that, well, you only have SQL statements available to work with. However, if there isn't much processing to do, then this may not be an issue.
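Stripped of the Drupal API, the whole import boils down to roughly the SQL below. The table names, column names, file path, and delimiters are illustrative, not our actual schema:

```sql
-- Load the raw CSV into a scratch table (one very fast statement).
LOAD DATA LOCAL INFILE '/tmp/records.csv'
INTO TABLE import_temp
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

-- Copy (and transform) the rows into the entity's base table in one pass.
INSERT INTO myrecord (title, start_date, language)
SELECT title,
       TIMESTAMPDIFF(SECOND, FROM_UNIXTIME(0), start_date),
       'en'
FROM import_temp;

-- Clean up the scratch table.
DROP TABLE import_temp;
```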

It isn’t advisable to write your import script in raw PHP and hand-built SQL strings, though — Drupal provides lots of great database helper functions, such as db_create_table(), db_query(), db_select(), db_insert(), and db_drop_table().

You can combine select and insert queries to avoid having to iterate through all of the records in PHP. Here's a PHP code fragment which illustrates our strategy:

// Create the temporary table from a Drupal schema definition.
db_create_table($temp_table, $schema);

// Load the CSV directly into the temporary table.
db_query("LOAD DATA LOCAL INFILE '$csv_filename' INTO TABLE $temp_table
  IGNORE $lines_to_skip LINES");

// Build a select query against the temporary table.
$select_query = db_select($temp_table, 'csv');
// (Add the plain pass-through columns here with $select_query->fields().)

// MySQL's UNIX_TIMESTAMP() doesn't support negative timestamps, so do this instead.
$select_query->addExpression('TIMESTAMPDIFF(SECOND, FROM_UNIXTIME(0), start_date)', 'start_date');

// Add a few constant fields to each row.
$select_query->addExpression("'$language'", 'language');
$select_query->addExpression("'$updated_time'", 'updated_time');

// Move the data into the entity table with a single INSERT ... SELECT.
db_insert($entity_table)
  ->from($select_query)
  ->execute();

// Drop the temporary table.
db_drop_table($temp_table);

Dealing with your data post-import

Once you have your data in a table, you'll probably need to display it on your site somehow. You can install the excellent Entity API module to expose your entity type to Views. If you need to make your data searchable with Apache Solr you can tell Apache Solr about your entity type.
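With the Entity API module installed, Views integration can be generated straight from your entity's schema by a couple of keys in hook_entity_info(). A hedged sketch, again using the hypothetical `myrecord` names rather than our real entity:

```php
<?php
// Hypothetical sketch: hook_entity_info() keys that let the Entity API
// contrib module build default Views integration for the entity type.

/**
 * Implements hook_entity_info().
 */
function myrecord_entity_info() {
  return array(
    'myrecord' => array(
      'label' => t('My Record'),
      'base table' => 'myrecord',
      'entity keys' => array('id' => 'rid'),
      'fieldable' => FALSE,
      // These three keys are provided by / consumed by the Entity API module.
      'module' => 'myrecord',
      'controller class' => 'EntityAPIController',
      'views controller class' => 'EntityDefaultViewsController',
    ),
  );
}
```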

Dumping data directly into the database is a closer-to-the-metal approach and you have to be much more careful about validation. It also means that you have to do a bit more work if you want to update the imported data. It's extremely fast, though, which may make the extra care you need to take worth it.