Storing All of Your Ph.D. Data in One File

August 6, 2018


One of the challenges of publishing academic manuscripts is the sheer amount and variety of information required to do so. In my (limited) experience, one of the best ways to combat this information overload is the Hierarchical Data Format (HDF). HDF works a lot like the file tree on your computer (Fig. 1): branches of folders stem from a single root, and each folder contains any number of files and/or folders. In 1987, researchers at the National Center for Supercomputing Applications decided this was such a good idea that they created a file format to store variables in a tree and called it HDF. After years of continued open-source development, the current version is called HDF5.


Figure 1. Parker’s file tree on his laptop. The key takeaway is that there is one root folder (Macintosh HD) and each subfolder contains any number of additional subfolders and documents. 

Why Use HDF5?

Many of us have used Excel to manage data, and I am still an advocate of the program for teaching basic data analysis. However, Excel falls short of many scripting languages when working with large sets of data, images, or text, or when comparing data sets of any size. Creating or reading a large Excel file puts a noticeable strain on your computer, and usability can suffer to a fatal degree. These issues arise because opening an Excel file loads all of its data into your computer’s memory*. HDF5 is useful because you specify which parts of a dataset you want to bring into memory, work with them, then save them back to your hard drive. HDF5 files are typically much smaller than the same data in xls format, and are correspondingly faster to work with. Additionally, most of the Ph.D. candidates I have talked with use some sort of external program, such as Python, Matlab, or R, to generate figures and perform additional analyses. HDF5 is supported by the majority of programming and scripting languages, and a file can store an unlimited number of variables of any type, each up to 16 exabytes in size (that’s an unlimited number of 16 billion GB variables!).
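To make the "only load what you need" point concrete, here is a minimal sketch of a partial read in Matlab. The file name, dataset path, and sizes are made up for illustration; the only assumption is the built-in h5create, h5write, and h5read functions.

```matlab
% Hypothetical demo file and dataset path, used only for illustration.
fname = fullfile(tempdir, 'partial_read_demo.h5');
if exist(fname, 'file'), delete(fname); end

% Write a 100,000-by-2 dataset to disk once.
data = [(1:1e5)', rand(1e5, 1)];
h5create(fname, '/demo/signal', size(data));
h5write(fname, '/demo/signal', data);

% Later, bring only rows 5001-5100 of both columns into memory.
start = [5001 1];   % first element to read (1-based: [row column])
count = [100 2];    % number of elements to read along each dimension
chunk = h5read(fname, '/demo/signal', start, count);

% chunk is a 100-by-2 array; the rest of the dataset stays on disk.
disp(size(chunk))
```

The start and count arguments are what keep memory use small: the other 99,900 rows are never loaded.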

How to Use HDF5?

It should go without saying, but I’m going to write it twice: if you put all of your raw data into a single file – BACK THIS FILE UP!!!

In the Integrated Systems Lab we use Matlab to control hardware, analyze data, and generate plots; any example code will be shown from Matlab R2018a for Mac. My research centers around electroactive devices: a working device may have hundreds of experiments attached to it, while a malfunctioning device may have no data associated with it at all. Additionally, not every piece of equipment can be controlled from Matlab, which means I often have to manage Excel files from different machines and devices. Fortunately, Excel files can easily be read into Matlab and saved into an HDF5 file (Fig. 2).

Figure 2. Lines from my script to select Excel files using a dialog box and to load the experiments into a hierarchical data structure that is HDF5 compatible. 
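A minimal sketch of the kind of script Fig. 2 describes, not the lab's actual code. It assumes each Excel file holds one experiment with column headers in the first row; the variable names (raw, expName) and the struct layout are illustrative choices.

```matlab
% Ask the user for one or more Excel files (uigetfile returns 0 on cancel).
[names, folder] = uigetfile({'*.xlsx;*.xls', 'Excel files'}, ...
                            'Select experiment files', 'MultiSelect', 'on');
if isequal(names, 0)
    error('No files selected.');
end
names = cellstr(names);   % a single selection comes back as char, not cell

raw = struct();           % hierarchical container: raw.<experiment>.<column>
for i = 1:numel(names)
    % Assumption: the first row of each sheet holds the column headers.
    T = readtable(fullfile(folder, names{i}));
    [~, base] = fileparts(names{i});
    % Struct field names must be valid identifiers, so sanitize the file name.
    expName = matlab.lang.makeValidName(base);
    cols = T.Properties.VariableNames;
    for j = 1:numel(cols)
        raw.(expName).(cols{j}) = T.(cols{j});
    end
end
```

Nesting the struct as experiment, then column, mirrors the group/dataset layout of an HDF5 file, which makes the later save step a simple pair of loops.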

I highly recommend using a regular expression to iterate through experiments by name rather than by number. To make the best use of HDF5, your data will need to be ordered hierarchically, as shown in Fig. 3.
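As a rough illustration of iterating by name, the sketch below filters a list of experiment names with a regular expression and pulls a parameter out of each match. The names and the naming pattern (CV_<rate>mVs) are hypothetical; substitute whatever convention your own experiments follow.

```matlab
% Hypothetical experiment names, e.g. the field names of a loaded struct.
expNames = {'CV_10mVs', 'CV_50mVs', 'EIS_0V', 'CV_100mVs', 'notes'};

% Keep names of the form CV_<number>mVs and capture the number as a token.
tokens = regexp(expNames, '^CV_(\d+)mVs$', 'tokens', 'once');
isCV   = ~cellfun(@isempty, tokens);   % non-matching names give an empty result

cvNames   = expNames(isCV);
scanRates = cellfun(@(t) str2double(t{1}), tokens(isCV));

% Iterate over the matching experiments by name.
for k = 1:numel(cvNames)
    fprintf('%s -> scan rate %g mV/s\n', cvNames{k}, scanRates(k));
end
```

Selecting by name means the loop keeps working when experiments are added, removed, or reordered, which a numeric index does not guarantee.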

Figure 3. Command used to save data in an unlimited-size h5 file titled RawData.h5. Data are organized by a unique device ID (red), date (blue), experiment name (orange), and finally by the column of data (green); i and j are iterators.
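The hierarchy in Fig. 3 could be written along the following lines. This is a sketch under stated assumptions rather than the exact command from the figure: the device ID, date string, experiment data, and chunk size are invented, and the file is a demo file in the temp folder rather than RawData.h5.

```matlab
fname = fullfile(tempdir, 'RawData_demo.h5');   % demo file, not the real RawData.h5
if exist(fname, 'file'), delete(fname); end

deviceID = 'dev042';       % hypothetical unique device ID
expDate  = '2018_08_06';   % hypothetical date string
raw.CV_10mVs.Voltage = linspace(0, 1, 50)';   % made-up experiment data
raw.CV_10mVs.Current = rand(50, 1);

expNames = fieldnames(raw);
for i = 1:numel(expNames)
    cols = fieldnames(raw.(expNames{i}));
    for j = 1:numel(cols)
        values = raw.(expNames{i}).(cols{j});
        % Path layout: /deviceID/date/experiment/column
        dset = sprintf('/%s/%s/%s/%s', deviceID, expDate, expNames{i}, cols{j});
        % Inf makes the first dimension unlimited; that requires a ChunkSize.
        h5create(fname, dset, [Inf 1], 'ChunkSize', [1024 1]);
        % Writing to an unlimited dataset needs an explicit start and count.
        h5write(fname, dset, values, [1 1], size(values));
    end
end

% Read one column back by its full path.
v = h5read(fname, '/dev042/2018_08_06/CV_10mVs/Voltage');
```

h5create builds any missing intermediate groups, so the device, date, and experiment levels do not have to be created separately.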

You will have to write your own scripts and functions to plot the data sets relevant to your current manuscript, but having all my data in the same place has significantly improved the process of generating figures from raw data. Again, if you put all of your raw data into a single file – BACK THIS FILE UP!!!
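As one possible starting point for such a plotting script, the sketch below lists every experiment stored under one device and date with h5info and overlays them on a single plot. It first builds a tiny demo file so that it runs on its own; all names and values are hypothetical.

```matlab
% Build a tiny demo file so the sketch is self-contained (names are hypothetical).
fname = fullfile(tempdir, 'plot_demo.h5');
if exist(fname, 'file'), delete(fname); end
base = '/dev042/2018_08_06';
for rate = [10 50]
    grp = sprintf('%s/CV_%dmVs', base, rate);
    h5create(fname, [grp '/Voltage'], [50 1]);
    h5write(fname,  [grp '/Voltage'], linspace(0, 1, 50)');
    h5create(fname, [grp '/Current'], [50 1]);
    h5write(fname,  [grp '/Current'], rate * linspace(0, 1, 50)');
end

% Discover the experiments stored under this device and date.
info = h5info(fname, base);
expGroups = {info.Groups.Name};   % group names are full paths within the file

figure; hold on
for k = 1:numel(expGroups)
    v = h5read(fname, [expGroups{k} '/Voltage']);
    c = h5read(fname, [expGroups{k} '/Current']);
    [~, label] = fileparts(expGroups{k});   % last path component = experiment name
    plot(v, c, 'DisplayName', label);
end
hold off
xlabel('Voltage'); ylabel('Current');
lgd = legend('show');
set(lgd, 'Interpreter', 'none');   % show underscores literally in the legend
```

Because the experiment list comes from the file itself, the same script picks up new experiments as they are added.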

Written by TPS Fellow Parker Evans

Resources:

*Storage lives on your hard drive, holds data that is not actively being worked on, and persists after the device is powered off. Memory lives on RAM chips, holds what is actively being worked on, and does not persist after the device is powered off.

  1. https://support.hdfgroup.org/HDF5/whatishdf5.html
  2. https://www.mathworks.com/help/matlab/ref/h5create.html
  3. https://www.pcmag.com/encyclopedia/term/63352/storage-vs-memory
  4. http://www.matlabtips.com/how-to-store-large-datasets/

 
