Migrating data from a Drupal site to Rails - Part 2

Posted by jon

A month or two ago I threw up some notes on how I migrated all my data from a legacy Drupal 5 PHP codebase to my nice new Rails based site.

The first post focused on the fundamental logistics of connecting to both databases at the same time and moving the data from one to the other. In this post I’ll go over some more specific situations I encountered which might be of use to others using Drupal (or PHP in general for that matter).

General Strategy

Primary keys in Drupal are of the form nid, vid, uid, cid (node id, version id, user id, comment id) so you can generally use these to keep track of everything.

At first I toyed with the idea of directly reusing the primary keys from the old database, however I soon found this difficult to keep track of once the data structure started to change. It was easy enough following simple data structures which mapped one to one, but as soon as I had a situation like threaded (tree structure) comments, which had all sorts of extra fields being populated, it got nasty. Because I was trying to manually populate all the data, I would have had to work out how to fill those extra fields in myself.

So, the approach I chose was to simply add columns to my new tables which would hold the legacy primary keys, making it very straightforward to link the new records to their legacy counterparts. A simple find_by_uid or find_by_nid and I knew exactly where I was.

General Tips

First it’s probably good to go over some general things you could do with knowing before looking at specific areas of your site.

Times / Dates

In the case of timestamps, the Drupal convention is to store all its dates as epoch dates in an integer field. Once you have your epoch value, it’s as straightforward as:


timestamp = Time.at(my_epoch_date)

And you now have a Time object in timestamp.
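One thing to watch is a column holding 0 or NULL for a value that was never set: Time.at(0) quietly gives you 1 January 1970 and Time.at(nil) raises an error. Here’s a minimal sketch of a guard, assuming you’d rather have nil in those cases. epoch_to_time is a name made up for illustration, not something from Rails or Drupal.

```ruby
# Hypothetical helper: converts a legacy epoch value to a Time,
# treating nil and 0 as "never set".
def epoch_to_time(value)
  return nil if value.nil?
  seconds = Integer(value) # also copes with "1199145600" if the driver hands back strings
  seconds.zero? ? nil : Time.at(seconds)
end

created = epoch_to_time(1199145600)
puts created.utc.strftime("%Y-%m-%d %H:%M:%S UTC") # prints 2008-01-01 00:00:00 UTC
```

In the rake task this would simply take the place of the bare Time.at call.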

PHP serialized fields

Quite a few fields in a Drupal DB use PHP’s serialization so for that you just need a copy of Thomas Hurst’s ruby implementation. Download the php_serialize.rb file, plonk it in lib/ and require ‘php_serialize’ in your rake file then it’s as easy as:


date = PHP.unserialize(serialized_date_string)

And there you have a native Ruby object in date (typically a Hash or Array, depending on what was serialized).

Boolean fields

I had a bit of confusion with boolean fields. Looking back over my old DB all the boolean fields were tinyint, but I found myself having to treat them as strings when doing any comparisons. Memory is a bit hazy as it’s a while ago now, so maybe some experimentation might be needed on your part. While it’s expected that find_by_sql might not return a boolean object from a boolean field I would then expect integers to be returned from a tinyint field, however I distinctly recall getting strings and on looking at the old DB, the fields are definitely tinyint.

Anyway, just a warning really, best test this one for yourself rather than just going on my advice.
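If you’d rather not care which type comes back, one option is to normalise the value before comparing. This is only a sketch, and legacy_true? is a hypothetical helper name rather than anything Rails provides.

```ruby
# Hypothetical helper: treats 1, "1" and true as true and everything else
# as false, so it doesn't matter whether the driver returns an Integer or a String.
def legacy_true?(value)
  value == true || value.to_s.strip == "1"
end

puts legacy_true?("1") # prints true
puts legacy_true?(0)   # prints false
```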

Use save!

A rake task gives you little feedback when a save quietly fails, so using save! (which raises an exception when validation fails) is an obvious precaution. It will also help you make sure your data matches your new site’s validation rules and highlight any areas where you may have forgotten you need to modify your data to fit the new site. Occasionally though you might find it handy to use save_without_validation, such as when you’ve got validation rules in place that aren’t there specifically for data integrity. My example would be my private messaging, which I toyed with having a size limit on: rather than truncate the old messages I left them as they were and left the rule in place for any new ones.

I’m sure it goes without saying though that if you’ve got other important validation rules on your model then it’s probably a bit reckless to ignore validation entirely, and a good idea to look into how to override just the validation in question.

Converting from BBCode to Textile

In my Drupal days I used BBCode for my user friendly markup, but since joining the Rails camp I’ve naturally adopted Textile. I simply have a helper m() that I use much like h(), which uses RedCloth to create html from the Textile source, then I run all that through white_list.

For the conversion I found the BBCodeizer plugin. With this I simply ran my fields through it and saved them in the new DB with the basic html in them like so:


string_with_html = BBCodeizer::bbcodeize(string_with_bbcode)

RedCloth and Hard Breaks

Another markup issue you might well encounter is hard breaks – keeping line breaks as <br /> tags.

The most basic Drupal filter doesn’t add much markup other than simple addition of line breaks and paragraphs, so for most Drupal people this would be expected.

RedCloth adopts a similar behaviour by default for paragraphs, and line breaks should work too by passing the :hard_breaks option, however that appears to be broken atm. Adding the following to my environment.rb (after the Initializer.run block) fixed it (source: Rails Wiki).


class RedCloth
  def hard_break( text ) 
    text.gsub!( /(.)\n(?!\n|\Z| *([#*=]+(\s|$)|[{|]))/, "\\1<br />" ) if hard_breaks 
  end
end
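To see what that regular expression actually does without loading RedCloth, it can be tried on its own. This sketch wraps the same pattern in a standalone method; add_hard_breaks and HARD_BREAK are names invented for the demo.

```ruby
# Same regular expression as the patch above, in a standalone method
# (hypothetical name) so it can be tried without RedCloth.
HARD_BREAK = /(.)\n(?!\n|\Z| *([#*=]+(\s|$)|[{|]))/

def add_hard_breaks(text)
  text.gsub(HARD_BREAK, "\\1<br />")
end

puts add_hard_breaks("line one\nline two")   # single newline becomes <br />
puts add_hard_breaks("para one\n\npara two") # blank line is left alone for paragraphs
```

Newlines followed by Textile list or table markers, and a trailing newline at the very end, are also left untouched by the lookahead.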

Passing local files into attachment_fu

I found a helpful blog post from Ben Reubenstein, attachment_fu Now With Local File Fu, which told me all I needed to know.

His guide takes you through creating a model called LocalFile which you can then use like so:


avatar = Avatar.new()
avatar.uploaded_data = LocalFile.new(FULL_PATH_TO_FILE)
avatar.save

Pretty straightforward.

Tips for specific parts of Drupal / modules

User Accounts

I used restful_authentication for my main User model and using the general pointers described up to now it’s a very standard transfer of data. The only thing that gets in your way is the passwords which are encrypted differently, so you have no choice but to force a reset.

One thing to bear in mind here is that you need to create the salts yourself. restful_authentication / aaa only create these when the account is first created, so unless you do this yourself all your migrated users will have no salt. They’ll work fine – an empty salt is allowed – it’s just a bit pointless from a security point of view.

You can find the code to do this towards the bottom of the user model file, in the encrypt_password method:


self.salt = Digest::SHA1.hexdigest("--#{Time.now.to_s}--#{login}--") if new_record?
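For the migration that expression can be lifted into a small helper and called for each user as it’s created. This is a minimal sketch using only Ruby’s standard library; legacy_salt is a made-up name, and it assumes your user model lets you assign salt directly.

```ruby
require "digest/sha1"

# Same expression as the encrypt_password line above, pulled out into a
# hypothetical helper so it can be called for each migrated user.
def legacy_salt(login, now = Time.now)
  Digest::SHA1.hexdigest("--#{now}--#{login}--")
end

salt = legacy_salt("jon")
puts salt.length # prints 40
```

In the rake task that would be something along the lines of new_user.salt = legacy_salt(old_user.name) before saving.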

Custom user profile fields

A standard module in Drupal offers the ability to add extra fields to the user accounts. Administrators can define them with any name and select various formats (text field, text area, check box, etc) for the user input.

Behind the scenes this is made up of two tables with fairly self explanatory names – profile_fields and profile_values. So, let’s say you’re creating a new user having already populated it with the data from the Drupal user table, you would then look up the data for a profile like so:


data = Drupal.find_by_sql "SELECT * FROM profile_values WHERE uid = #{old_user.uid} AND fid = 20" 
user.sex = data.first.value unless data.first.nil?

This works on the assumption that you’re going through the fields and writing the code for each profile item by hand. If you want you could dynamically loop through the profile_fields table, but doing it by hand was much quicker due to its simplicity, not to mention that for the dynamic approach your internal naming, etc, has to be identical.
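A possible middle ground is a small hash mapping each legacy fid to an attribute on the new model, so the naming doesn’t have to match. The sketch below uses plain Structs as stand-ins for the find_by_sql rows and the new user. Only fid 20 comes from the example above; the other fid, the attribute names and apply_profile_values itself are invented for illustration.

```ruby
# Maps legacy profile_fields.fid values to attribute names on the new model.
# 20 => :sex comes from the example above; 21 => :location is invented.
PROFILE_MAP = { 20 => :sex, 21 => :location }

# rows: objects responding to fid and value (stand-ins for profile_values rows)
# user: any object with matching attribute writers
def apply_profile_values(user, rows)
  rows.each do |row|
    attribute = PROFILE_MAP[row.fid.to_i]
    next if attribute.nil? || row.value.nil? # unmapped or empty field
    user.send("#{attribute}=", row.value)
  end
  user
end

Row     = Struct.new(:fid, :value)
NewUser = Struct.new(:sex, :location)

rows = [Row.new(20, "male"), Row.new("21", "Leeds"), Row.new(99, "ignored")]
user = apply_profile_values(NewUser.new, rows)
puts user.sex
puts user.location
```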

Comments on nodes

I was a little apprehensive about this one, what with the tree structure, I could see it getting a tad confusing. No need to worry in the end though.

I’m a big fan of better_nested_set, which builds on the acts_as_nested_set code that comes with Rails (or is a plugin now I think). I won’t go into detail on how it works; all you need to know is that once you’ve added the better_nested_set declaration to your model, all you have to do when you create an object is first save it, then if it’s a child of another record move it with the method move_to_child_of().

So, because I’d decided to keep the legacy primary keys in the new database, all I had to do was loop through the legacy comments, creating them as I went, then check each one for a legacy parent id. On finding one I looked the parent up by its legacy cid and did a move_to_child_of(), and the comment was in its right place.

Couldn’t have turned out easier.

Oh, one more little thing to mention: having a field named comment clashes with a reserved word, so you’ll need to alias it in your query with something like SELECT comment AS some_other_name.
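The loop can be sketched in plain Ruby to show the order of operations. The assumptions here are that the Drupal comments table uses cid and pid (with pid 0 for a top-level comment) and that parents always have a lower cid than their replies. The Structs stand in for the find_by_sql rows and the new model, the hash plays the part of a find_by_legacy_cid lookup, and setting parent stands in for move_to_child_of(). All of those names, and the body alias, are illustrative only.

```ruby
# Aliasing the comment column avoids the name clash mentioned above;
# "body" is just an illustrative alias.
COMMENT_SQL = "SELECT cid, pid, nid, uid, comment AS body FROM comments ORDER BY cid"

LegacyComment = Struct.new(:cid, :pid, :body)          # stand-in for a find_by_sql row
NewComment    = Struct.new(:legacy_cid, :body, :parent) # stand-in for the new model

def migrate_comments(legacy_rows)
  by_legacy_cid = {} # plays the role of NewComment.find_by_legacy_cid
  legacy_rows.sort_by { |row| row.cid }.each do |row|
    comment = NewComment.new(row.cid, row.body, nil) # create and "save" first...
    by_legacy_cid[row.cid] = comment
    parent = by_legacy_cid[row.pid] unless row.pid.to_i.zero?
    comment.parent = parent if parent                # ...then move_to_child_of(parent)
  end
  by_legacy_cid.values
end

rows = [LegacyComment.new(2, 1, "reply"), LegacyComment.new(1, 0, "first")]
migrate_comments(rows).each do |c|
  puts "#{c.legacy_cid} -> parent #{c.parent ? c.parent.legacy_cid : 'none'}"
end
```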

Moving from taxonomy to tags

Drupal’s beloved taxonomy system is quite comprehensive, so depending on how you use it, this may or may not be enough to get you by.

I never used anything more than a few vocabularies with a few terms in each, so I decided to go to tagging, adopting acts_as_taggable_on_steroids and literally was only concerned with keeping the appropriate tags linked to my new models which had taken place of various nodes (all of one type though).

So, there are a good few tables but the main ones I was concerned with were term_data and term_node, and this was the basic idea:

  • Loop through my (to be) tagged model
  • Select all the appropriate rows in term_node (which funnily enough, links terms to nodes), with a join to term_data so I can get the term names
  • Loop through all the term_node rows and on each one use tag_list.add(old_term_name)

Again, pretty straightforward.
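The steps above can be sketched roughly like this. The join assumes the Drupal 5 layout where term_node links nid to tid and term_data holds the term name. terms_sql and tag_names are made-up helper names, and the Struct stands in for the rows Drupal.find_by_sql would return.

```ruby
# Builds the lookup for one legacy node. Integer() guards the value
# being interpolated into the SQL.
def terms_sql(nid)
  "SELECT td.name FROM term_node tn " \
  "INNER JOIN term_data td ON td.tid = tn.tid " \
  "WHERE tn.nid = #{Integer(nid)}"
end

# rows stand in for what Drupal.find_by_sql(terms_sql(nid)) would return.
def tag_names(rows)
  rows.map { |row| row.name.to_s.strip }.reject { |name| name.empty? }.uniq
end

TermRow = Struct.new(:name)
puts terms_sql(42)
puts tag_names([TermRow.new("rails"), TermRow.new(" drupal "), TermRow.new("rails")]).inspect
```

Each name that comes back would then go through tag_list.add on the new record before saving it.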

Buddy Lists

My Drupal site used the BuddyList module (5.x-1.x-dev – 2007/02/25), and my Rails one uses has_many_friends.

To simplify things I warned people that any outstanding invites would be wiped (it wasn’t hugely used anyway). In addition, if you used the “Buddy Groups” feature of BuddyList you’ll have to do some more coding as has_many_friends doesn’t have an equivalent feature.

So, assuming then you just want to move the friendships, it’s nice and easy – one table in Drupal, one in has_many_friends. “uid” becomes “user_id”, “buddy” becomes “friend_id”, and then it’s just a few timestamps.

Like everything before – loop….. create…. loop.
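One wrinkle is that the uids in the buddylist table are legacy ids, so they need translating to the primary keys of the migrated users (via the legacy uid column) before the friendship is created. Here’s a rough sketch of the column mapping. The timestamp column name and the created_at attribute are assumptions, friendship_attributes is a hypothetical helper, and the hash stands in for a find_by_uid lookup.

```ruby
# legacy_row: responds to uid, buddy and timestamp (assumed buddylist columns).
# new_ids: maps legacy Drupal uids to the primary keys of the migrated users.
def friendship_attributes(legacy_row, new_ids)
  user_id   = new_ids[legacy_row.uid.to_i]
  friend_id = new_ids[legacy_row.buddy.to_i]
  return nil if user_id.nil? || friend_id.nil? # one side wasn't migrated
  { :user_id    => user_id,
    :friend_id  => friend_id,
    :created_at => Time.at(legacy_row.timestamp.to_i) }
end

BuddyRow = Struct.new(:uid, :buddy, :timestamp)
new_ids  = { 3 => 1, 7 => 2 }
puts friendship_attributes(BuddyRow.new(3, 7, 1199145600), new_ids).inspect
```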

Guestbooks

In Drupal I used the Guestbook (5.x-1.0) module but for my new site I just rolled my own – a simple threaded tree using better_nested_set in pretty much the same way as I’d done for comments.

It has to be said, the way the Drupal module stored things really was pretty crude – one table with each row containing not just the guestbook post, but the data for the one and only possible reply to it too.

To migrate it I simply went through each user and for each one selected all rows in the guestbooks table related to them and created a new guestbook post for it. I then looked for a reply, and if I found one did the same and used move_to_child_of() to establish it as the child of the other one I’d just created.

I also took this chance to make them all my friend, what with me being the equivalent of ‘Tom’ for my site.

Private Messages

I’ve left this one to the end because I’m not going to go into much detail on what I did as, 1 – there’s a reasonable chance you might want to use a different plugin with more features (such as folders), and 2 – if you don’t, I can’t honestly recommend the plugin I used.

My Drupal site used the Privatemsg module (5.x-1.7) and for my new Rails site I chose restful easy messages.

If you’re literally only using basic functionality (no sent or trash folders, no user folders, only single recipients) there are a few choices to go for and it’s simply a case of using the techniques used above to transfer the data – it’s all the same, just slightly different format. Loop…. create…. loop.

So, why don’t I recommend restful easy messages? Well, I picked it as I wasn’t keen on the non restful design of acts_as_emailable and easy_messages so at first sight this looked like a nice option. But I soon ran into problems while migrating my data, populating fields such as receiver_deleted & receiver_purged. I naturally set those to true or false, and after a whole load of head scratching realised that both were effectively being evaluated true. Looking in the plugin the select conditions being used were “IS NULL” or “IS NOT NULL” rather than using AR to pass true or false in.

:-s

I also then had problems with the helpers in the generated views. Personally I feel using a helper for a simple link_to isn’t really necessary unless it’s being done say five or ten times, but I figured since it was in the generated view I’d leave it. Even they didn’t work though – the anchor text wasn’t sanitized and the user instance var isn’t passed so the generated path is wrong.

Anyway, I don’t want to go on and don’t mean this to be some sort of flame, but has to be stated that the code is pretty funky.

If you’re only using basic messaging, it’s so bloody simple it’s hardly worth using a plugin anyway as it’s not a complex creation, which is why I just rolled my own next time around. If you want something more capable, the one I thought looked very complete and nicely done was Phil Sergi’s acts_as_messageable. I almost used it for another project but eventually ditched it for my own creation as it just seemed like overkill. has_messages sounded quite nice, but I didn’t like the look of having to install half a dozen other plugins just to get it working.

But what about nodes?

Oh yeah, nearly forgot them :-p

I didn’t actually migrate them as my site’s focus was entirely on the functionality provided by my custom module and there was very little else (a good indication as to why Drupal wasn’t suitable for me). Still, if you’ve understood everything so far it’ll be no different to everything else. If you need to preserve versioning, I’d imagine Rick Olson’s acts_as_versioned will no doubt be of use (can’t say from experience, but you can’t go wrong with the ‘weenie).

Wrapping up

Long post, no wonder I put it off for ages.

Hopefully if you’re trying to migrate from Drupal it’ll be of some help. Even if I’ve not covered the specific modules you use, it should give you enough info to tackle any other part of it.

Oh, and if you want to see the finished product, the site’s pearsontowers.com.

Quick heads up for anyone wanting to use libtorrent-ruby

Posted by jon

A tale of two libraries….

To make some use of some of my machines which don’t use all of their monthly bandwidth allowances I’ve installed the torrentflux web based Bit Torrent client and I just leave it helping seed some torrents. A while back I came across a Rails equivalent (there don’t seem to be many) so this morning I figured I’d give it a try.

So, I grabbed a copy of TorrentKeeper from their svn repository and went about making sure I had all the prerequisites. One of them was libtorrent-ruby, so I grabbed a copy of that too and went about installing it. Inevitably that had various dependencies too – g++, swig and of course libtorrent – which is where the confusion came from.

It required version 0.10 which as far as I was concerned I had satisfied ok (running Debian Etch, package showed 0.10.4-1), but on running setup.rb it kept on barfing looking for various files that weren’t there.


- snip
libtorrent_wrap.cpp:1830:34: error: libtorrent/bencode.hpp: No such file or directory
libtorrent_wrap.cpp:1831:32: error: libtorrent/entry.hpp: No such file or directory
libtorrent_wrap.cpp:1856:32: error: libtorrent/alert.hpp: No such file or directory
libtorrent_wrap.cpp:1857:38: error: libtorrent/alert_types.hpp: No such file or directory
libtorrent_wrap.cpp:1922:37: error: boost/filesystem/path.hpp: No such file or directory
libtorrent_wrap.cpp:1932:36: error: libtorrent/ip_filter.hpp: No such file or directory
libtorrent_wrap.cpp:1951:34: error: libtorrent/peer_id.hpp: No such file or directory
libtorrent_wrap.cpp:2027:36: error: libtorrent/peer_info.hpp: No such file or directory
libtorrent_wrap.cpp:2047:39: error: libtorrent/peer_request.hpp: No such file or directory
libtorrent_wrap.cpp:2050:38: error: libtorrent/fingerprint.hpp: No such file or directory
libtorrent_wrap.cpp:2097:39: error: libtorrent/torrent_info.hpp: No such file or directory
libtorrent_wrap.cpp:2125:33: error: libtorrent/hasher.hpp: No such file or directory
libtorrent_wrap.cpp:2128:34: error: libtorrent/storage.hpp: No such file or directory
libtorrent_wrap.cpp:2131:41: error: libtorrent/torrent_handle.hpp: No such file or directory
libtorrent_wrap.cpp:2226:43: error: libtorrent/session_settings.hpp: No such file or directory
libtorrent_wrap.cpp:2229:34: error: libtorrent/session.hpp: No such file or directory
- big snip

After a whole load of head scratching, a bit of Googling and a whole load of `find`ing I eventually twigged.

Turns out that there are two libtorrent projects around – Rakshasa and Rasterbar. Rakshasa is the one you’ll find packaged for Debian, however the Rasterbar version is the one libtorrent-ruby is written for.

So, if you come across this while having problems trying to do anything that requires libtorrent and can’t work out why it’s saying it’s missing when you think it’s definitely there, make sure you’ve got the right version.

A Facebooker tutorial going a bit beyond 'hello world'

Posted by jon

A few weeks back David Clements released his Facebooker tutorial which he’s actually created as a Facebook application thereby offering the ability to show you live demonstrations, and with your own personal data too.

It looks like a work in progress that’ll grow and change as the API does, but it already covers a lot of ground and really quite comprehensively, so if you get through this I imagine you should be in a good position to find your own way through more advanced functionality without too much trouble.

Find it here.

A bit of an update on Facebooker

Posted by jon

So, my foray into the world of Facebook applications was stuck on the back burner a week or two after my last illustrious post. Would like to say I had to reprioritise but really it’s because I can never stick with coding on the same project for more than a few days and instead keep flitting around countless half finished apps to keep me safe from boredom (but completely lacking in the satisfaction of putting something into production).

The matter of routing

Still, there was a practical reason for waiting on my app, and that was because the situation with regards to routing was pretty up in the air and (I think) still is. Since I wanted to do a combined FB/non FB app my routing wasn’t so straightforward and I was struggling with the lack of clarity on how I should have been proceeding, along with what options I had and what might at the time simply not have been possible. I figured I might as well hold off for the first official release for this to be clarified.

It turns out though that this has been one of the bigger topics of conversation over the last few weeks and getting it right is involving a fair amount of discussion, the issue being clarifying, well, the sorts of things I mentioned I had – what issues are down to how people should be approaching the design of their app and what needs to be built into the plugin’s functionality.

As I said before though, despite some of these matters, Facebooker itself is quite stable and has for some time been used in production by a good few people (including combined FB/non FB apps). I guess you’ve just got to be ready for a bit of refactoring as new code is checked in and better ways of doing things become available. All the core functionality of the API has long been implemented.

More users…

We’re also starting to see a few new faces and a lot more activity on the mailing lists, with a good few messages most days and people generally getting responses to queries within a few hours tops. I think a large part of this is the fact that Matt Pizzimenti is now promoting Facebooker to potential new users as he no longer has time to actively develop rFacebook and is likely to cease development on rFacebook once Facebooker gets its first official release. We’re now starting to see more blogging and tutorial writing for Facebooker too, which is great and surely only going to increase over the next few months.

Some places to get started

Here are a few tutorials that have surfaced over the last few weeks which will help a new user get going.

Bebo support coming soon

I suppose the last main thing to mention is Bebo support, which a good few people have expressed an interest in and a few have already started working on. I think I heard somewhere that rFacebook has already implemented this functionality so one would assume implementing this with Facebooker’s not going to be far off.

Migrating data from a Drupal site to Rails - Part 1

Posted by jon

A couple of weeks ago I finished moving a site of mine from a Drupal 5.1 codebase to a new one written in Rails so I figured I’d throw up a quick run down of the issues encountered and the solutions I employed to deal with them. What’s covered in this post is the basic mechanics involved in the migration and contains very little Drupal specific stuff. I’ll follow it up shortly with some other bits that will be handy for anyone wanting to get their data out of a Drupal schema.

All in all though, it was remarkably straightforward and painless. Having had it in the back of my mind for the previous month or two, on the morning I sat down to start working on it I had a pretty good idea of how I planned to tackle the problem and by the evening I had a group of 7 rake tasks that did the whole job in about 15 minutes.

What’s involved

I’m not sure how many different ways one might consider attacking this, but I figured since it’s a one off task in a controlled environment I’d sacrifice any concern for speed and efficiency in favour of simplicity. I therefore chose to make use of the similarities between the way Drupal models its data and the way ActiveRecord does, to iterate through the data I wanted in a predictable manner.

Drupal’s node system works in such a way that each node object relates to a row in a table, so having this basic property in common with AR is enough to know that you’ll most likely be able to just select a whole load of rows from your Drupal site, loop through them and create new objects in your new app as you go.

Once you know this, assuming your old app’s fundamental modelling isn’t vastly different to your new one’s all you then have to work out is how you’re going to connect to both databases and then just iterate through the old data and create the new data as you go.

The basic rake task

After a bit of research into different methods I found out how to open up a connection to my legacy database by creating an AR class which I could use as my way to get the data out of it, while the rest of my app remained the same around it.

The rake task shown below is the one I used to transfer all my user accounts, albeit with a bit of extra stuff removed which dealt with all my custom user profile data. All the other tasks were based on the same code with the only changes being to the sql statement and the contents of the for loop which iterates through what that pulls back.

BTW, if you’re not familiar with rake tasks, check out the Railscast linked to at the bottom of the article.


namespace :db do
  namespace :legacy do
    desc "Transfer user data"
    task :migrate_users => :environment do
      # Throwaway AR class used purely as a handle on the legacy database
      class Drupal < ActiveRecord::Base
        establish_connection "legacy"
        set_table_name "users"
      end

      # Don't send registration mails, and keep the original timestamps
      ActionMailer::Base.perform_deliveries = false
      User.record_timestamps = false

      # uid 0 is Drupal's anonymous user, so skip it
      @old_users = Drupal.find_by_sql "SELECT * FROM users WHERE uid > 0"
      for old_user in @old_users
        new_user = User.new(:login => old_user.name,
                            :email => old_user.mail,
                            :sig => old_user.signature)
        new_user.created_at = Time.at(old_user.created)
        new_user.updated_at = Time.at(old_user.changed)
        new_user.save
      end
    end
  end
end

So, if you’ve been working with Rails for a little while then you’ll probably get all you need just from reading that snippet of code, but it’s still worth reading on to see why I did things a certain way. If you don’t understand what’s going on, here’s an in depth account of what’s happening.

What we’re doing is creating a new class descended from ActiveRecord::Base which, if we left it like that, would be no different from if we’d created a model called Drupal. However, that’s not what we want as we’re using this as our way to connect to the legacy database, so we have to change where it plans to connect to. For this we have the following lines:

establish_connection "legacy" 
set_table_name "users" 

The first line here overrides the connection it inherited from ActiveRecord::Base, which will already be set to connect to your production or development database, and tells it to connect to ‘legacy’ which I have defined in my database.yml just like the others. In addition to this though, because it’s not a typical AR model, when it looks for a table called drupals it’s not going to find it, which is the reason for the second line.

This tells AR that the corresponding table for this class is users, which I’ve merely set to satisfy AR. In practice I only ever plan to use find_by_sql so this has no influence on the way the migration works, however if you now did a Drupal.find(:all), you’d get the contents of the users table returned.

I did think there was a better way though as from what I’ve gathered, setting abstract_class to true (i.e. use self.abstract_class = true instead of setting the table name) should tell AR that your class doesn’t have a corresponding table however when I tried this it still looked for a table named drupals, so I just went with setting the table name.
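For reference, the ‘legacy’ entry sits alongside the usual environments in config/database.yml. The sketch below embeds what such a file might look like and loads it with Ruby’s standard YAML library just to show that it parses. The adapter, database names and credentials are placeholders, not my real settings.

```ruby
require "yaml"

# What the extra entry in config/database.yml might look like; the adapter,
# database names and credentials below are placeholders.
DATABASE_YML = <<-YAML
development:
  adapter: mysql
  database: newsite_development
  username: root
  password:
  host: localhost

legacy:
  adapter: mysql
  database: drupal_legacy
  username: root
  password:
  host: localhost
YAML

config = YAML.load(DATABASE_YML)
puts config["legacy"]["database"] # prints drupal_legacy
```

The key name is what matters: it has to match the string passed to establish_connection.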

Anyhoo…. after that there are these two lines:

ActionMailer::Base.perform_deliveries = false
User.record_timestamps = false

The first one turns off mail delivery so that things like registration mails working off observers don’t get sent (as will be the case if you’re using AAA or Restful Authentication).

The second one turns off timestamps on the new users I’m creating so that I can preserve these from the old records. Incidentally I experienced some strange behavior with this having initially set this on ActiveRecord::Base which seemed to work for created_at but not the updated_at field (and in addition to this, I only observed this with the User model). Setting it directly on the model in question had the desired effect.

So, once you’ve done this, you can use the Drupal class to do a find_by_sql against your legacy database and pull back any info you like.

It is possible to go into more depth to connect to a legacy schema through AR, defining several classes and defining has_one and has_many relationships by overriding table and foreign key names, but like I said before, for simplicity’s sake I decided I’d rather iterate through tens or even hundreds of individual statements rather than pull my hair out making it more complicated just for the sake of a few minutes gained by the joins.

And so, once you’ve done that find, you have all your legacy data in a nice orderly collection and you can loop through it and create the entries in your new database as you would do normally.

And that, pretty much, is that.

Based on this, you have full access to all your legacy data and can perform whatever sql is needed to pull back other data, nest multiple loops / statements within each other to create your models with has_many relationships, etc. In my next post I’ll cover a few specifics which you’re likely to run into if you’re migrating from a typical Drupal install.

Reference

A few notes on Facebooker

Posted by jon

A few weeks ago I made a start on my first Rails Facebook application and after some playing around with Rfacebook, what I thought was the only Facebook plugin available, I came across Chad Fowler’s Facebooker.

Here are a few handy bits of information for anyone else who’s interested in giving Facebooker a try before its first official release.

Firstly, hats off to Matt Pizzimenti for stepping up and releasing Rfacebook quickly after the F8 keynote but as far as I know his plugin has remained essentially a straight Ruby port of the PHP libraries. Facebooker is the Rubyist’s answer to the Facebook API and wraps the api into classes and methods that are intuitive to the typical Ruby developer. Whether this abstraction is a good thing or not has been the point of a small amount of debate but personally, I like it.

Anyway, here’s a run down of the current state of affairs.

The implementation of the API itself is now by and large complete and the developers are currently focusing on implementing the Rails side of things. When I started using it a few weeks ago all that was implemented were the basic Rails features:

  • Methods to create and work with Facebook sessions.
  • Before filters to check that your application has been installed in Facebook
  • Ability to create an FBML mime type so you can use respond_to to separate logic in your controllers and use separate views for fbml and plain html.

Since then the amount of Rails oriented code has greatly increased. A few days ago a whole load of helpers to ease creation of FBML were committed by Mike Mangino, and just today there was a commit to address the sessions issue (due to the way that Facebook proxies FBML canvas requests, sessions don’t work off the bat).

Facebooker is yet to produce its first official release, however at this rate it seems pretty close and if you’re feeling adventurous it’s quite usable as it is. Very little material is available so far but for the time being there is at least the readme which now contains the fundamental usage information.

In addition there’s the rdoc, which I’ve put online here and will do my best to keep up to date. Things are quite quiet over at Rubyforge at the moment and it seems for now the developers are coordinating their efforts directly, but they do appear to keep their code documented well.

I’ve made quite a few notes over the course of this so I plan on putting together a few articles on the experience. My initial intention was to focus on writing apps that also exist as external applications but I’ve since decided to also put together a few general pointers on writing Facebook apps for people already up to speed with Rails, since the main well known Rails Facebook tutorial (by Stuart Eccles) is oriented more towards people new to Rails. Should there not be any significant writing about Facebooker by then I’ll put something together on it too.