Recently we needed to import a fairly large dataset into Drupal. There were 200,000 items to import, the data changed relatively frequently, and we needed them to be Drupal entities so that we could make them searchable with Apache Solr. We would normally import data using one of
- Feeds, which provides a UI for importing a wide variety of data formats
- Migrate, which makes it easy to write code to import data
- a custom Drush script, calling
entity_save()for each record
We soon realized that no matter which of these options we chose,
node_save() would be called 200,000 times. After some experimentation, we decided that this was just too slow for our purposes. We instead decided to write a Drush script to write the data directly to the database. This brought the time to import everything down to about 15 seconds. Obviously this approach isn’t appropriate in most circumstances, so I’ll explain how and why we we did it this way.
When (not) to take this approach
You can do anything you want to your Drupal database. However, editing the node table manually will almost certainly leave you in tears. This is because the node table is closely tied to the field tables and the node revision table, and contrib modules need a chance to run on
hook_node_save(). So it’s a bad idea to take this approach to import
- Anything from core
- Most contrib entities, unless you understand them really really well
But! If you create your own custom entity which isn’t fieldable, then all of its entity data lives in a single database table in your Drupal database. That means you can do anything you want to it without getting hurt, as long as there aren't any contrib modules that you're depending on to do things for you using
hook_entity_save(). The only constraint on this table is that it has to have a numeric primary key for the entity ID.
So this might be for you if
- you need to import large amounts of data extremely quickly;
- you can make a custom entity type for your data. (For help with this, see also our blog posts Understanding the Entity API Module and Drupal 7 Entities: What are they, and what are they good for?);
- you’re comfortable writing a Drush script with a bit of SQL in it;
- you’re not worried about validating the data you’re importing.
Why we decided to do this
We started out by using the Migrate module together with the Migrate Extras, since we were already using it to import content into Drupal 7 from a Drupal 6 site. The Migrate module was reasonably easy to use, and I enjoyed using it. It’s even reasonably fast, but the fact of the matter is: running
entity_save() 200,000 times takes a long time.
We’d already designed our entity so that it used just one database table, and we trusted the data that we were importing to not need any validation, so we decided to import directly to the database. Imports went from 40 minutes to 15 seconds.
How we did it
In a nutshell, the approach we took was to
- dump all the data into a table in the Drupal database
- use that table to populate the as a base table
$entity_tablefor your custom entity type.
We were importing CSV files, so we used a ‘LOAD DATA LOCAL INFILE’ statement to load the data into a temporary table. This statement runs very quickly. After that we ran other SQL statements to copy the data from our temporary table into the table for our custom entity type. We did all the data processing in MySQL, so everything ran extremely quickly. The downside to doing all the processing in SQL statements is that, well, you only have SQL statements available to work with. However, if there isn't much processing to do, then this may not be an issue.
It isn’t advisable to just write pure PHP code and SQL for your import script, though — Drupal provides lots of great database helper functions, such as:
You can put together select and insert queries to avoid having to iterate through all of the records in PHP. Here's a PHP code fragment which illustrates our strategy:
db_create_table($temp_table, $schema); db_query(" LOAD DATA LOCAL INFILE '$csv_filename' INTO TABLE $temp_table FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\\n' IGNORE $lines_to_skip LINES; "); // Move the data from the temporary table into the entity table $select_query = db_select($temp_table, 'csv') ->groupby('country'); $insert_query = db_insert($entity_table) ->from($select_query); // MySQL's UNIX_TIMESTAMP() doesn't support negative timestamps, so do this instead $select_query->addExpression('timestampdiff(second, from_unixtime(0) , start_date)', 'start_date'); // Add a few constant fields to each row $select_query->addExpression("'$language'", 'language'); $select_query->addExpression("'$updated_time'", 'updated_time'); $insert_query->execute(); // Drop the temporary table db_drop_table($temp_table);
Dealing with your data post-import
Once you have your data in a table, you'll probably need to display it on your site somehow. You can install the excellent Entity API module to create views of your data. If you need to make your data searchable with Apache Solr you can tell Apache Solr about your entity type as described in this blog post.
Dumping data directly into the database is a closer-to-the-metal approach and you have to be much more careful about validation. It also means that you have to do a bit more work if you want to update the imported data. It's extremely fast, though, which may make the extra care you need to take worth it.