Some of our recent work has made my colleagues interested in hardware again, and it also shows how far we have come in being able to manipulate and move large amounts of data.
WorldCat is our union catalogue of about 56 million bibliographic records, which represent approximately a billion holdings. It is about 50 gigabytes in MARC Communications format (100+ gigabytes in XML) and about 23 gigabytes compressed.
OCLC Research recently acquired a 24-node (48-CPU) Beowulf cluster with 96 gigabytes of memory. According to my colleague Thom Hickey, whose team has been working on the machine, the cluster speeds up most bibliographic processing by about a factor of 30. This means that what might have taken a minute now takes two seconds, what might have taken an hour takes two minutes, and what might have taken a month takes a day. For jobs that will fit entirely in memory (e.g. a 'grep' of WorldCat), avoiding disk I/O gives another factor of about 20, reducing what used to be a 1-hour job to about 6 seconds. We can 'frbrize' WorldCat on the cluster in about an hour.
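The arithmetic behind those figures is easy to check. The following is only a back-of-the-envelope sketch in Python: the two factors are the approximate ones quoted above, while the function name `on_cluster`, the `SECONDS` table, and the 30-day month are assumptions made for illustration.

```python
# Approximate speed-up factors quoted in the post.
CLUSTER_SPEEDUP = 30     # cluster versus a single machine
IN_MEMORY_SPEEDUP = 20   # further gain when a job avoids disk I/O

# Durations in seconds. The 30-day month is an assumption for illustration.
SECONDS = {
    "minute": 60,
    "hour": 60 * 60,
    "month": 30 * 24 * 60 * 60,
}


def on_cluster(seconds, in_memory=False):
    """Estimate the run time in seconds after applying the quoted speed-ups."""
    factor = CLUSTER_SPEEDUP
    if in_memory:
        # Jobs that fit entirely in memory get both factors multiplied together.
        factor *= IN_MEMORY_SPEEDUP
    return seconds / factor


if __name__ == "__main__":
    for name, secs in SECONDS.items():
        print(f"a {name} becomes {on_cluster(secs):.0f} seconds")
    print(f"an in-memory hour becomes {on_cluster(SECONDS['hour'], in_memory=True):.0f} seconds")
```

The two factors multiply, so an in-memory job runs roughly 600 times faster than it did on a single machine, which is where the 6-second figure comes from.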
WorldCat is also now more mobile. Thom has a 40-gigabyte iPod which can accommodate the compressed WorldCat on its disk with room left for 5,000 song tracks. Now, you can't do much with the data on the iPod, but you can certainly carry it around. Again, it takes about an hour to get it on and off the iPod.
Thom adds, "much as 'quantity has a quality all its own', such a speed-up changes both the approaches that can be taken to solving problems and the type of problems that can be tackled at all. Google typically works with clusters of 3,600 CPUs, raising performance to yet another level."