HDF5 vs. TsFile: Efficient Time-Series Data Storage
In the era of big data, efficient data storage and management are critical to the success of both scientific research and industrial applications. HDF5, a hierarchical format for managing experimental data, and TsFile, a modern time-series data storage format, each offer unique strengths and design philosophies. This article takes a deep dive into the origins, use cases, and limitations of HDF5, and explores the similarities and differences between HDF5 and TsFile.
Origins of HDF5
HDF5, short for Hierarchical Data Format version 5, is more than just a file format. It encompasses a full data model, software libraries, and a binary file format designed for storing and managing complex data. The HDF effort originated in 1987 with the GFTF group at the National Center for Supercomputing Applications (NCSA) in the United States.
The original goal of HDF was to develop an architecture-independent file format capable of meeting the growing need to transfer scientific data across diverse computing platforms at NCSA.
Use Cases
HDF5 has found widespread application in fields such as scientific computing, engineering simulation, and weather forecasting—domains that all require efficient management of experimental data.
- Case 1: Scientific Data Storage
In scientific research, there is often a need to store and process complex, multidimensional datasets such as matrices or gridded meteorological data. These datasets typically contain rich metadata and require a storage solution that supports efficient data organization and access.
- Case 2: Sensor Data from Devices
In industrial monitoring and sensor networks, large volumes of data are generated by various sensors. For example, in the structural health monitoring systems of aerospace agencies, sensor data capturing vibrations, temperature, and other parameters play a crucial role in condition monitoring and fault prediction.
- Case 3: Particle Simulation Data Storage
In particle simulation scenarios, simulation programs can generate vast amounts of experimental data, such as particle trajectories and energy deposition metrics. These data are crucial for understanding physical processes and optimizing simulation parameters. Efficient storage and management systems are essential to handle such workloads.
Although HDF5 is widely used in scientific computing, a large portion of modern sensor and experimental data is fundamentally time-series in nature. This mismatch between data characteristics and storage models motivates the emergence of specialized time-series formats such as TsFile.
Introduction to TsFile
In many HDF5-based applications, a significant portion of the data actually exhibits time-series characteristics. TsFile is a columnar storage file format specifically designed for time-series data. It was originally developed by the School of Software at Tsinghua University and was donated to the Apache Software Foundation in 2023.
TsFile stands out with its high performance, high compression ratio, self-describing format, and support for flexible time-range queries.
Technical Comparison: TsFile vs. HDF5
The following table outlines a technical comparison of TsFile and HDF5 across several key dimensions:
| Dimension | TsFile | HDF5 |
| --- | --- | --- |
| Compression ratio | High | Low |
| Query & filtering | Strong (supports time-based filtering) | Weak (full scans required) |
| Data model | Lightweight, time-series oriented | Complex, multidimensional |
Let’s explore each of these dimensions in more detail:
Compression Ratio
- TsFile: TsFile leverages time-series-specific encodings (e.g., TS_2DIFF for timestamp delta encoding, GORILLA for floating-point compression) along with efficient general-purpose compression algorithms (e.g., Snappy, ZSTD, LZ4). These work in tandem to eliminate redundancy. For variable-length objects, TsFile allocates space dynamically, avoiding byte-padding waste. Its compact storage strategies make it especially suitable for sparse datasets and variable-length strings.
- HDF5: HDF5 provides general-purpose compression filters (e.g., gzip, LZF, SZIP) and a plugin mechanism, but lacks built-in time-series-aware encoding schemes such as delta or Gorilla-style compression. As a result, it cannot fully exploit temporal patterns, which typically leads to lower compression ratios on time-series data. Moreover, for variable-length or sparse sensor data, HDF5 often resorts to fixed-size compound records or many small datasets, incurring metadata overhead and padding inefficiencies.
Query & Filtering Capabilities
- TsFile: Offers powerful query capabilities, enabling precise reads based on series identifiers and time ranges. It avoids loading the entire dataset, which significantly boosts query efficiency for large-scale time-series workloads.
- HDF5: Supports partial reads through hyperslabs and chunked storage. However, it lacks native semantic indexing and predicate pushdown for time-based filtering, so efficient time-range queries require external indexing logic.
Data Model
- TsFile: Purpose-built for time-series workloads, TsFile adopts a streamlined timestamp-value model. Its schema is simple and well suited to the sequential, append-heavy nature of time-series data.
- HDF5: While highly flexible and capable of modeling complex data structures such as multidimensional arrays and compound types, HDF5's data model is comparatively heavy and not optimized for the unique characteristics of time-series data.
Practical Comparison via Code Examples
While the previous section presented a conceptual comparison between TsFile and HDF5 across core characteristics such as compression, query flexibility, and data modeling, this section examines the developer-facing differences through concrete code examples. We demonstrate how both formats handle data writing and querying, and discuss the design implications visible in their APIs. Both cases use the same dataset: time-series data generated by a device (machine1) in a factory (factory1), with the schema (time: long, s1: long).
Example: Writing Data
The TsFile examples use the native C++ writer API from the standalone TsFile SDK; the HDF5 examples use the HDF5 C API.
TsFile Write Snippet:
```cpp
// Create a new TsFile named "test.tsfile"
file.create("test.tsfile", O_WRONLY | O_CREAT | O_TRUNC, 0666);

// Define the schema for the time-series table
auto* schema = new storage::TableSchema(
    "factory1",
    {
        common::ColumnSchema("id", common::STRING, common::LZ4, common::PLAIN,
                             common::ColumnCategory::TAG),
        common::ColumnSchema("s1", common::INT64, common::LZ4, common::TS_2DIFF,
                             common::ColumnCategory::FIELD),
    });

// Create a writer using the schema
auto* writer = new storage::TsFileTableWriter(&file, schema);

// Prepare a tablet for batched insertion
storage::Tablet tablet("factory1",
                       {"id1", "s1"},
                       {common::STRING, common::INT64},
                       {common::ColumnCategory::TAG, common::ColumnCategory::FIELD},
                       10);

// Insert rows into the tablet
for (int row = 0; row < 5; row++) {
    long timestamp = row;
    tablet.add_timestamp(row, timestamp);
    tablet.add_value(row, "id1", "machine1");
    tablet.add_value(row, "s1", static_cast<int64_t>(row));
}

// Write to disk and finalize
writer->write_table(tablet);
writer->flush();
writer->close();
```
HDF5 Write Snippet:
```c
typedef struct {
    long time;
    long s1;
} Data;

const int rows = 5;  /* number of rows to write */

/* Create HDF5 file (overwrite if it exists) */
hid_t file_id = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

/* Create a group under the root */
hid_t group_id = H5Gcreate2(file_id, "factory1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

/* Create a 1D dataspace */
hsize_t dims[1] = { (hsize_t)rows };
hid_t dataspace_id = H5Screate_simple(1, dims, NULL);

/* Define a compound datatype */
hid_t datatype_id = H5Tcreate(H5T_COMPOUND, sizeof(Data));
H5Tinsert(datatype_id, "time", 0, H5T_NATIVE_LONG);
H5Tinsert(datatype_id, "s1", sizeof(long), H5T_NATIVE_LONG);

/* Enable chunking and GZIP compression */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_dims[1] = { (hsize_t)rows };
H5Pset_chunk(dcpl, 1, chunk_dims);
H5Pset_deflate(dcpl, 1);

/* Create the dataset under the group */
hid_t dataset_id = H5Dcreate2(group_id, "machine1", datatype_id, dataspace_id,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);

/* Allocate and fill the data buffer */
Data* dset = (Data*)malloc(rows * sizeof(Data));
for (int i = 0; i < rows; i++) {
    dset[i].time = i;
    dset[i].s1 = i;
}

/* Write and clean up */
H5Dwrite(dataset_id, datatype_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset);
free(dset);
H5Dclose(dataset_id);
H5Pclose(dcpl);
H5Tclose(datatype_id);
H5Gclose(group_id);
H5Sclose(dataspace_id);
H5Fclose(file_id);
```
Query Example
TsFile Query Snippet:
```cpp
storage::TsFileReader reader;
reader.open("test.tsfile");

std::vector<std::string> columns = {"id1", "s1"};
storage::ResultSet* result = nullptr;

// Query by table name, columns, and time range [0, 100]
reader.query("factory1", columns, 0, 100, result);

bool has_next = false;
while (result->next(has_next) == common::E_OK && has_next) {
    std::cout << result->get_value(1) << std::endl;
}
result->close();
reader.close();
```
HDF5 Read Snippet:
```c
hid_t file_id = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
hid_t group_id = H5Gopen2(file_id, "factory1", H5P_DEFAULT);
hid_t dataset_id = H5Dopen2(group_id, "machine1", H5P_DEFAULT);
hid_t datatype_id = H5Dget_type(dataset_id);
hid_t dataspace_id = H5Dget_space(dataset_id);

/* Determine the row count from the dataspace */
hsize_t dims[1];
H5Sget_simple_extent_dims(dataspace_id, dims, NULL);
int rows = (int)dims[0];

/* Read the full dataset into memory, then iterate */
Data* dset = (Data*)malloc(rows * sizeof(Data));
H5Dread(dataset_id, datatype_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset);
for (int i = 0; i < rows; i++) {
    printf("Row %d: time: %ld, s1: %ld\n", i, dset[i].time, dset[i].s1);
}

free(dset);
H5Dclose(dataset_id);
H5Tclose(datatype_id);
H5Gclose(group_id);
H5Sclose(dataspace_id);
H5Fclose(file_id);
```
Interface Behavior in Comparison
Writing: Metadata and Structure
- TsFile requires only the table name and column types. Its tablet model separates timestamps from values for better compression and performance.
- HDF5 demands manual definition of dimensionality and compound types. There is no time-series awareness in the structure.

Querying: Filtering and Performance
- TsFile supports built-in time-range queries and partial column reads. Filtering is pushed down into the reader, reducing unnecessary I/O.
- HDF5 offers full dataset reads or chunk-level access. It provides no built-in predicate pushdown; filtering logic typically runs after data is loaded into application memory.

Developer Experience
- TsFile provides intuitive abstractions for time-series data: schemas, tablets, and time-based queries.
- HDF5 exposes low-level controls for flexible data modeling, but with significant API complexity.
Real-World Case Study: Aerospace Sensor Data
In a real-world aerospace project, time-series data is primarily collected from aircraft-mounted sensors. Each year, there are thousands of flights, and each flight generates data from approximately 3,000 to 4,000 sensors. These sensors capture a wide range of parameters with varying sampling frequencies and data lengths.
Given the enormous data volume, efficient storage becomes critical. In the HDF5 format, data is organized hierarchically using Groups and Datasets, with metadata stored in Attributes. Each parameter is stored as a separate dataset—a 2D table consisting of a time column and a value column. Internally, HDF5 manages hierarchical objects using internal metadata trees and heaps to map groups to datasets and attributes.
In contrast, TsFile, purpose-built for time-series data, organizes data in a “device–measurement” tree structure. All measurements for a single device are stored contiguously in the file and benefit from columnar compression. Its indexing mechanism consists of a two-level B-tree, linking from the root to devices and then to individual time series. Unlike HDF5, TsFile integrates the time and value columns into a single structure and leverages built-in indexing for fast retrieval without requiring external metadata.
In actual usage, TsFile significantly outperformed HDF5 in both write and query performance. For the same dataset, the gzip-compressed HDF5 baseline occupied approximately 18 TB, whereas TsFile with its default encodings and compression (TS_2DIFF and GORILLA encoding with LZ4 compression) reduced this to only 2.2 TB. That is a storage reduction of over 85%: TsFile's files were roughly one eighth the size of their HDF5 equivalents.
Conclusion
With advantages in time-series modeling, compression and encoding schemes, and query filtering capabilities, TsFile proves to be highly optimized for large-scale time-series workloads. The API is also simpler and cleaner, which reduces the learning curve and improves developer efficiency.
Thanks to its performance, storage efficiency, and developer-friendly interface, TsFile stands out as a strong choice for systems that demand high-throughput, low-latency processing of time-series data. For applications requiring scalable and efficient time-series data handling, TsFile is undoubtedly a better fit than general-purpose formats like HDF5.