Using Real Content: Data Import

Some content issues escape the planning process because they don't show up in sample content or in discussions of the site's major use cases. Working with real content during the data import phase exposes these issues early enough in development to deal with them properly.

Content Outliers

Outliers in the data often surface during the import process. While 95% of the concerts posted on a website might have a single date, the other 5% might have repeat performances every week. The site needs to handle these multiple dates without storing duplicate content.
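As a rough illustration of the concert example, the sketch below collapses source rows that differ only by date into a single record carrying a list of dates. The field names (title, venue, date) and the sample rows are assumptions made for the example, not the schema of any particular site.

```python
from collections import defaultdict


def merge_repeat_performances(rows):
    """Collapse rows that differ only by date into one record per concert."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: title plus venue is enough to identify "the same concert".
        grouped[(row["title"], row["venue"])].append(row["date"])
    return [
        {"title": title, "venue": venue, "dates": sorted(set(dates))}
        for (title, venue), dates in grouped.items()
    ]


# Hypothetical source rows: one single-date concert, one weekly repeat.
rows = [
    {"title": "Spring Recital", "venue": "Town Hall", "date": "2010-04-02"},
    {"title": "Jazz Night", "venue": "Blue Room", "date": "2010-04-03"},
    {"title": "Jazz Night", "venue": "Blue Room", "date": "2010-04-10"},
]
concerts = merge_repeat_performances(rows)
```

Each merged record could then become one piece of content with a multi-value date field, rather than one copy per performance.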

Content outliers sometimes consist of content that is missing fields required by the site architecture, either because the design will break without them or because they're integral to the meaning of the content itself. When a large amount of content is affected, the best solution is sometimes to revise the design so that it is robust enough to work without those fields. If the fields are required because they're integral to the meaning of the content, it's often best to leave those pieces of content unpublished and flag them as "needs review".
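One way to apply the "needs review" approach during import is to check each record before it is saved. This is only a sketch: the list of required fields and the status values are hypothetical and would depend on the site's own content types.

```python
# Assumption: these are the fields the content cannot do without.
REQUIRED_FIELDS = ("title", "body")


def triage(record, required=REQUIRED_FIELDS):
    """Return a copy of the record with a status and a list of missing fields."""
    missing = [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]
    status = "needs review" if missing else "published"
    return {**record, "status": status, "missing": missing}


complete = triage({"title": "Jazz Night", "body": "Doors open at 8."})
incomplete = triage({"title": "Untitled scan", "body": "   "})
```

Records marked "needs review" would be imported unpublished, so an editor can fill in the gaps later without holding up the rest of the import.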

Content Irregularities

Sometimes, content issues need to be resolved not in the site architecture, but in the content itself. This is the case with duplicate content, references to pieces of content that no longer exist, malformed content, and spam content from a previous website.

Testing the import or synchronization with real content makes it possible to catch these irregularities before they are carried over into a brand new site.
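A pre-import check along the following lines can report two of the irregularities mentioned above: duplicate content and references to content that no longer exists. It is a minimal sketch that assumes each record has an id, a body, and an optional list of referenced ids. Those names are invented for the example.

```python
import hashlib


def find_irregularities(records):
    """Report duplicate bodies and references to ids that are not in the data."""
    known_ids = {record["id"] for record in records}
    first_seen = {}   # body digest -> id of the first record with that body
    duplicates = []   # (duplicate id, original id)
    dangling = []     # (referring id, missing id)
    for record in records:
        # Ignore case and whitespace differences when comparing bodies.
        normalized = " ".join(record["body"].lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append((record["id"], first_seen[digest]))
        else:
            first_seen[digest] = record["id"]
        for ref in record.get("references", []):
            if ref not in known_ids:
                dangling.append((record["id"], ref))
    return duplicates, dangling


# Hypothetical records: 2 duplicates 1 and points at a record that is gone.
records = [
    {"id": 1, "body": "Hello  World"},
    {"id": 2, "body": "hello world", "references": [1, 99]},
    {"id": 3, "body": "Something else", "references": [2]},
]
duplicates, dangling = find_irregularities(records)
```

The two lists can be reviewed by hand or used to skip records automatically. Malformed content and spam generally need checks specific to the source data.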

Drupal Settings

At the node level, Drupal expects content to have certain properties. Importing content into Drupal sometimes reveals that these are missing or undefined in the source data. For example, assigning a node to a particular author might be important for workflow reasons. Websites with extensive menus will require that content be assigned to a parent menu item as part of the import process, which also means that the parent item has to be created before the child item. Some content, such as image nodes, doesn't have an obvious node title, and a decision will have to be made on what to use as its value.
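Two of these decisions can be prepared before anything is written to Drupal. The sketch below orders items so that each parent comes before its children, and derives a fallback title from a file name when none is supplied. It works on plain dictionaries and does not call any Drupal API. The id, parent, title, and filename keys are assumptions for illustration.

```python
import os


def parents_first(items):
    """Order items so that every parent precedes its children."""
    by_id = {item["id"]: item for item in items}
    ordered, done = [], set()

    def visit(item, trail=()):
        if item["id"] in done:
            return
        if item["id"] in trail:
            raise ValueError(f"cycle in menu hierarchy at {item['id']!r}")
        parent = by_id.get(item.get("parent"))
        if parent is not None:
            # Emit the parent (and its ancestors) before this item.
            visit(parent, trail + (item["id"],))
        done.add(item["id"])
        ordered.append(item)

    for item in items:
        visit(item)
    return ordered


def fallback_title(record):
    """Use the given title, else a cleaned-up file name, else 'Untitled'."""
    title = (record.get("title") or "").strip()
    if title:
        return title
    stem = os.path.splitext(os.path.basename(record.get("filename", "")))[0]
    return stem.replace("_", " ").replace("-", " ").strip() or "Untitled"


# Hypothetical menu items listed child-first in the source data.
menu = [
    {"id": "concerts/2010", "parent": "concerts"},
    {"id": "concerts", "parent": None},
]
import_order = [item["id"] for item in parents_first(menu)]
image_title = fallback_title({"filename": "img/sunset_over-bay.jpg"})
```

Whether a file name makes an acceptable title is an editorial decision. The point is to make the choice explicit in the import rather than discovering untitled nodes afterwards.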

Content Import Methods

While most large sites require writing an import script or using the Migrate and Table Wizard modules, others have so little content that it can be imported manually. In this case, most of the advice above still holds. Testing the site with fake content to anticipate outliers might also be useful.