With the help from Jonathan Dursi and osgx, I've now done the "row decomposition" among the processes:
row http://img535.imageshack.us/img535/9118/ghostcells.jpg
You can always make do without a datatype by creating a buffer, copying the data into it, and sending the buffer as a count of the underlying type; that's conceptually the simplest. On the other hand, it's slower and it actually involves a lot more lines of code. Still, it can be handy when you're trying to get something to work, and then you can implement the datatype-y version alongside it and make sure you're getting the same answers.
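As a rough sketch of the buffer-copy approach, assuming a row-major `n` x `m` array of doubles stored in one contiguous block (the helper names `pack_column` and `unpack_column` are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Copy column j of a row-major n x m array into a contiguous buffer of n values. */
static void pack_column(const double *a, int n, int m, int j, double *buf)
{
    assert(j >= 0 && j < m);
    for (int i = 0; i < n; i++)
        buf[i] = a[(size_t)i * m + j];   /* element (i, j) */
}

/* Copy a contiguous buffer of n values back into column j of the array. */
static void unpack_column(double *a, int n, int m, int j, const double *buf)
{
    assert(j >= 0 && j < m);
    for (int i = 0; i < n; i++)
        a[(size_t)i * m + j] = buf[i];
}
```

The sender would then pass `buf` to `MPI_Send` with a count of `n` and type `MPI_DOUBLE`, and the receiver would `MPI_Recv` into its own buffer and unpack that into the ghost column.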
For the ghost-cell filling, in the i direction you don't need a type, as it's similar to what you had been doing; but you can use one, MPI_Type_contiguous, which just specifies a count of some type (which you can do anyway in your send/recv).
For ghost-cell filling in the j direction, probably easiest is to use MPI_Type_vector. If you're sending the rightmost column of (say) a C (row-major) array with i=0..N-1, j=0..M-1, you want to send a vector with count=N, blocklength=1, stride=M. That is, you're sending count chunks of 1 value each, with the starts of consecutive chunks M values apart in the array.
You can also use MPI_Type_create_subarray to pull out just the region of the array you want; that's probably a little overkill in this case.
Now, if as in your previous question you want to be able at some point to gather all the sub-arrays onto one processor, you'll probably be using subarrays, and part of the question is answered here: MPI_Type_create_subarray and MPI_Gather. Note that if your array chunks are of different sizes, though, then things start getting a little trickier.
(Actually, why are you doing the gather onto one processor, anyway? That'll eventually be a scalability bottleneck. If you're doing it for I/O, once you're comfortable with data types, you can use MPI-IO for this.)