File

Choosing among JSON, YAML, and TOML.

To continue the discussion from the previous post, we want a folder-structure standard instead of HDF5 to store datasets, either temporarily for processing or permanently for sharing. To keep this folder-structure approach flexible, we impose only minimal requirements on the folder itself and leave the finer definitions to a metadata file. So what is the best format for that metadata? Basically, we want a hash table that maps keywords to values that are meaningful to the user or audience.
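To make the "hash table in a metadata file" idea concrete, here is a minimal sketch using JSON, chosen only because Python's standard library supports it without extra packages. The file name `metadata.json`, the helper functions, and every key in the example dictionary are hypothetical, not part of any standard proposed here.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(folder: Path, metadata: dict) -> Path:
    """Serialize a keyword-to-value mapping into the dataset folder."""
    path = folder / "metadata.json"  # hypothetical file name
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path


def read_metadata(folder: Path) -> dict:
    """Load the mapping back as a plain Python dict."""
    return json.loads((folder / "metadata.json").read_text(encoding="utf-8"))


# Hypothetical metadata: the keys are illustrative only.
example_metadata = {
    "name": "example-dataset",
    "version": 1,
    "files": {"train": "train.npy", "test": "test.npy"},
    "shape": [1000, 28, 28],
}

# Round-trip the metadata through a temporary dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_metadata(folder, example_metadata)
    loaded = read_metadata(folder)
```

The same dictionary could be written as YAML or TOML instead; the nested `files` table and the `shape` list are the kind of structure all three formats can express.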

Why did I move away from HDF5?

Recently, the increasing volume of data and the use of neural networks have both forced me to look at data formats again. Previously, I thought HDF5 was the best format for most of my applications. The nice APIs for HDF5, e.g. h5py and DeepDish, give me both flexibility and ease of use when storing and sharing my datasets. However, as my datasets started to grow substantially, loading them into memory put a significant burden on my I/O bus, especially since I only need part of a dataset each time.