Time to Downsize

One of the biggest challenges in modern Imaging Research lies in how to handle big datasets. This is a particular issue when undertaking multidimensional acquisition. In this post, I’ll be covering some ideas on how to sensibly work with large datasets as well as some neat tricks on how to downsize the biggest ones.

Multidimensional datasets can have multiple timepoints, channels, slices, positions or combinations of all of these.

Multidimensional Datasets

To fully appreciate the magnitude of big datasets, it’s worth considering dimensionality. On a hard disk, a single image takes up the number of bits (binary digits) needed to record the information for every pixel in the image (we’ll assume arbitrary 512 x 512 image dimensions). If the image is 8-bit (remember, that’s 8 bits of data per pixel), the image will take up:

`x * y * bitdepth = 512 * 512 * 8 = 2097152 bits = 262144 bytes ≈ 0.25 MB`

(It will actually take up a bit more, as the file also contains header information.)

We can extend this thought-experiment to a time course:

So now, the file size of the image will be multiplied by the number of frames (tn). But why stop there? Everyone loves 4D imaging, so let’s add a 20-slice z-stack into the mix:

Now every time point has a corresponding stack in the z-plane.

Multi-channel imaging is very common, even if it’s just recording the transmitted light in addition to your fluorescence. The real fun starts when you have access to spectral detection, which can add up to 32 channels across a slightly-wider-than-visible spectrum:

Even with conservative values (50 time points, 20 z-planes and 4 channels), multiply it all together and you’re getting single files in the region of a gigabyte:

`tn * zn * λn * x * y * bitdepth = 50 * 20 * 4 * 512 * 512 * 8 bits ≈ 1000 MB`

Of course, this can be extended to multiple positions on an XY stage, higher bit depths and so on, but hopefully the point is made.
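If you want to play with the numbers for your own experiments, the arithmetic above is easy to wrap in a small helper. This is only a minimal Python sketch: the function name `dataset_size_mb` and its parameter names are my own invention for illustration, it ignores header and metadata overhead, and it takes 1 MB to mean 1024 x 1024 bytes.

```python
def dataset_size_mb(x, y, bit_depth, timepoints=1, slices=1, channels=1):
    """Estimate uncompressed pixel data size in MB (1 MB = 1024 * 1024 bytes).

    Header and metadata overhead are ignored, so real files will be a bit larger.
    """
    total_bits = timepoints * slices * channels * x * y * bit_depth
    total_bytes = total_bits / 8  # 8 bits per byte
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Single 512 x 512, 8-bit image
    print(dataset_size_mb(512, 512, 8))  # 0.25
    # 50 time points, 20 z-planes, 4 channels
    print(dataset_size_mb(512, 512, 8, timepoints=50, slices=20, channels=4))  # 1000.0
```

Swapping in a 16-bit depth or extra stage positions (as another multiplier) shows quickly how a "modest" experiment ends up as several gigabytes.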

Not only do you have to store all of these data, but even moving the files around and opening them becomes time-consuming.

When this might be an issue

In and of itself, there is no real problem with multidimensional datasets. If that’s what your experiment requires, then you can’t really make do with fewer channels or a smaller z-stack.

Downsizing becomes more relevant in some specific circumstances, though:

• If during your time course the sample drifts, perhaps the first half of your movie is useful but the latter half is unusable.
• The same goes for bubbles, debris, cells dying or anything else obscuring your data but leaving enough time points to be worth keeping.
• We’ve occasionally seen that when you run long, multidimensional time courses and cancel before the end of the acquisition, Zen (the Zeiss acquisition software) writes out “Phantom Images” to fill in the un-acquired time points. The images are not empty, but have very specific (and within an experiment, identical) noise.
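For that last case, one way to spot where the real data ends is to look for a run of identical frames at the end of the movie. The following is a rough Python sketch, not something Zen or Bio-Formats provides: the function name `count_trailing_duplicates` is hypothetical, and it assumes that the phantom frames are byte-for-byte identical to one another and that you already have each frame's raw pixel buffer as a bytes-like object.

```python
import hashlib


def count_trailing_duplicates(frames):
    """Count how many frames at the end of a movie are byte-identical to the last one.

    `frames` is a sequence of bytes-like objects, one per time point
    (for example, the raw pixel buffer of each frame).
    Assumption: phantom frames are exact copies of each other.
    """
    if not frames:
        return 0
    digests = [hashlib.sha256(bytes(f)).hexdigest() for f in frames]
    last = digests[-1]
    count = 0
    # Walk backwards from the end until a frame differs from the last one.
    for digest in reversed(digests):
        if digest != last:
            break
        count += 1
    # A run of one just means the final frame is unique: nothing suspicious.
    return count if count > 1 else 0
```

If this returns, say, 300 for a 1000-frame movie, the first 700 frames are the ones worth keeping. Real acquisitions essentially never produce two identical frames because of detector noise, so an exact-match test is a reasonable first check.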

What to do about it: An ode to Bio-Formats

As I have said before, one of the most useful things about Fiji is the integration of the Bio-Formats library. Not only does this allow you to open a huge number of proprietary file formats from within the comfort of Fiji, but it also has a bunch of smart Importing options. Let’s take a look:

To open files with the Bio-Formats library in Fiji you have two options:

1) On the menu, run [Plugins > Bio-Formats > Bio-Formats Importer] and you’ll get a normal Open File Dialog.

2) The alternative (and my preference) is to run [Plugins > Bio-Formats > Bio-Formats Shortcut Window] which will (unsurprisingly) open the Shortcut Window:

Not only is this an easy way to access the commands (Import, Export &c) but you can drag files onto the shortcut window to open them.

Here’s where the awesome begins. Once you drag a file in or open a file, you are presented with the Import Window which looks something like this:

You may not realise this but this is an incredibly useful and powerful interface. Here are a few highlights:

Display Metadata / ROIs: Metadata deserve their own post (although a great place to start is the OME blog). Needless to say if you want to find out the details of your acquisition, check these boxes.

Split Channels/Timepoints/Focal planes: Really useful if you were going to split the channels anyway.

Specify Range to Open: Check this box and you’ll be presented with a further dialog asking you to specify the ranges you’d like to import. Only need the first 10 frames of a 1000-frame movie? You got it! Only want one channel on which to do your analysis? No problem.

This last option is the key to subsetting large datasets. If you only want the first half of a massive movie, why bother opening all the frames only to discard half of them? Open the first half and save them as a subset… which neatly leads me onto the last point:
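To get a feel for how much a given range selection saves, here is a small Python sketch that counts the planes that would be opened. It is only an illustration: the function names are hypothetical, and it assumes each range is given as a 1-based, inclusive begin/end/step triple per dimension, with any dimension you leave out opened in full.

```python
def planes_in_range(begin, end, step=1):
    """Number of indices selected by a 1-based, inclusive begin/end/step range."""
    if begin < 1 or end < begin or step < 1:
        raise ValueError("expected 1 <= begin <= end and step >= 1")
    return (end - begin) // step + 1


def subset_fraction(sizes, ranges):
    """Fraction of planes kept when opening only `ranges` of a dataset.

    sizes:  dict like {"c": 4, "z": 20, "t": 50} with the full dimension sizes
    ranges: dict like {"t": (1, 25, 1)}; dimensions left out are opened in full
    """
    total = 1
    kept = 1
    for dim, size in sizes.items():
        begin, end, step = ranges.get(dim, (1, size, 1))
        if end > size:
            raise ValueError(f"{dim} range ends at {end} but only {size} available")
        total *= size
        kept *= planes_in_range(begin, end, step)
    return kept / total


if __name__ == "__main__":
    full = {"c": 4, "z": 20, "t": 50}
    # First half of the movie, first channel only
    print(subset_fraction(full, {"t": (1, 25, 1), "c": (1, 1, 1)}))  # 0.125
```

So keeping the first half of the time course and a single channel from the example dataset above means reading only an eighth of the planes, which takes that roughly 1000 MB file down to about 125 MB of pixel data.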

Saving Time

Much as I like to moan about proprietary file formats, they do have some benefits. Each one perfectly saves all of the acquisition metadata for that acquisition system, because they’re written to do that. Using these formats can become a problem when you start using different software and other file formats as they all save different amounts of metadata in different ways and call them different things (again, the OME blog has a great explanation).

Anyway, as you can’t save back into the original file format, your next best option is OME-TIFF, which is a fantastically metadata-aware file format. Furthermore, as of Bio-Formats version 5.1.2, the OME specification now supports files larger than 4 GB.

To save your file in this format, simply use the Bio-Formats Exporter, available in the same places as the Importer (see above).