At the Board meeting on   we agreed to gather HDF5 strengths and weaknesses. Please add your thoughts to the table below.

 

Board member name | Strength | Weakness
Mark
Strengths:
  • Long-standing track record of both the software and the development team
  • Well supported
  • Flexible data model (but see below ;)
  • Rich set of features useful to HPC applications
    • Compression
    • Checksumming
    • Partial I/O
    • Run-time typing and type conversion
    • Number architecture conversion
    • Name spaces
    • Extendible datasets
    • Error handling
    • Version management
    • Core (in-memory) file driver fapl and file image support
  • Extensible with filter and VFD plugins
  • Commitment to backward compatibility
  • Runs everywhere
  • An excellent data-agnostic intermediate abstraction layer
Weaknesses:
  • Flexible data model (too easy for highly similar applications to wind up using the data model very differently)
  • API verbosity: hard for newcomers to get up to speed quickly
  • Impressions of an overly complex implementation
  • Sometimes seems hard to do simple things well
  • On-disk byte format is too closely tied to the one and only implementation
  • Lack of documented, common use case examples
  • Re-implements a lot of stuff file systems already do for files
  • Object overheads potentially prohibitive at small scale
  • Parallel interface is too inflexible
  • Not easy to mentally map HDF5 API calls to actual I/O activity

If we were starting from scratch today, would we arrive at something that looked vastly similar to or very different from what we have today?

Rob
Strengths:
  • Well-defined, multi-purpose data model
  • Variety of users, relatively good uptake in HPC / scientific computing
  • Widely respected software implementation (with exception perhaps of performance in some HPC contexts)
  • Own standard and implementation (easier to extend than in a community standard)
Weaknesses:
  • Mixed free/licensed release model is confusing application teams about availability on all their platforms (from laptop to supercomputer)
  • Hasn't adopted some modern HPC storage tactics (e.g., loosely-coupled log-structured writing as in PLFS, ADIOS BP)
  • Single (widely used) implementation (maybe?) (EP: There is a pure Java read-only implementation from Unidata; it is no longer supported.)

Re: Top 1-2 technical things, I would consider either the storage tactic topic above (e.g., matching ADIOS on performance) or defining what it means to put HDF data in a collection of S3/RADOS objects. But I think a collection of examples with different motifs is also a good idea (we have a couple in our tutorial IIRC).
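To make the "log-structured writing" tactic above concrete, here is a minimal toy sketch in Python. It is not how PLFS, ADIOS BP, or HDF5 is implemented; the class and method names (`LogStructuredFile`, `write`, `read`) are hypothetical and the "logs" are in-memory buffers. The idea it illustrates: each writer only ever appends to its own log, and a separate index maps logical file extents back to log positions at read time.

```python
class LogStructuredFile:
    """Toy log-structured container (illustrative assumption, not a real format)."""

    def __init__(self):
        self.logs = {}   # writer id -> bytearray holding that writer's append-only log
        self.index = []  # (logical_offset, length, writer, log_offset), in write order

    def write(self, writer, logical_offset, data):
        """Append data to the writer's own log and record where it logically belongs."""
        log = self.logs.setdefault(writer, bytearray())
        self.index.append((logical_offset, len(data), writer, len(log)))
        log.extend(data)  # writers never seek and never touch each other's logs

    def read(self, offset, length):
        """Rebuild a logical byte range by replaying the index in write order."""
        out = bytearray(length)  # ranges never written read back as zeros
        for lo, ln, writer, log_off in self.index:
            start = max(lo, offset)
            end = min(lo + ln, offset + length)
            if start < end:  # this extent overlaps the requested range
                src = log_off + (start - lo)
                out[start - offset:end - offset] = self.logs[writer][src:src + (end - start)]
        return bytes(out)


f = LogStructuredFile()
f.write("rank1", 4, b"BBBB")  # writers may arrive in any order
f.write("rank0", 0, b"AAAA")
f.write("rank0", 8, b"CCCC")
print(f.read(0, 12))
```

The trade-off this sketch exposes is the one under discussion: writes become cheap, uncoordinated appends, while reads pay for an index lookup and a gather across logs.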

Quincey
Strengths:
  • Future-proof I/O middleware for applications to use, probably well into the exascale era
  • High-performance I/O middleware
  • Well-engineered software base
  • Many, many features
  • A middle-ground of abstraction between “bytes” and “meshes” (although I’d like to extend toward the “mesh” end of things, with optional features)
  • Well-supported
  • Backward / forward compatible (i.e. API and data stability for application developers)
  • SWMR (and MWMR coming)
Weaknesses:
  • Large, sophisticated / complex code base
  • Parallel I/O tied to MPI (currently)
  • Opaque performance characteristics (difficult for applications to know “what’s happening” when it’s slow)
  • Many API routines need better user documentation / recipes for how to use them well
  • Needs more technical documentation about library internals
  • Several aspects that are being addressed with ECP funding (collective metadata modifications, no async I/O, etc.)

 

 

Elena
Strengths:
  • Diverse user base and areas of application for HDF5
  • A lot of "unique" features that make HDF5 very attractive (e.g., compression, custom datatypes, complex partial I/O, one set of APIs for sequential and parallel libraries); complexity grew organically
  • Committed core team that has worked together for the past 10 to 20 years
  • Strong commitment to providing user support, which helps grow our user base
Weaknesses:
  • Diverse user base and areas of application: hard to provide a "default tuned" version; hard to get good performance in some use cases that were not "common" when the original library was designed.
  • HDF5 lacks what many of our users consider "must-have" features, such as support for Unicode and multi-threading.
  • Several factors make it very difficult to bring in new talent and engage the community (compare with h5py):
    • type of software (I/O library and not video games)
    • HDF5 complexity and lack of documentation
    • THG vs. Microsoft :)
  • Very weak feedback from the users (both on strengths and weaknesses)
Gary
Strengths:

A support model – I was at the Smoky Mountains conference last week, and Salman Habib from Argonne, whom I have known since he came to LANL long ago, lamented that the ECP software technology (ST) products, and ECP in general, face a tough problem because of support models. He said you have roughly 3 different kinds of science uses:

o   Small community – like climate or HEP, which probably has a large enough mass to take the risk of adopting some of the ECP ST products when they are abandoned (not if, but when). He said ECP is a project and ends. He said ASCR has no history of long-term support and people are measured on h-index, so that won't work either.

o   Single code

§  Partnering a single code/team with a single ST product for life and publishing jointly works, but doesn't penetrate the market well (the ADIOS model, sort of)

§  Can't afford to take the risk and won't partner with any ST product

o   Weapons code – lasts 20 years, colossal in size and scope, behind guns and gates, and likely can't adopt things from ST land easily

The ECP folks said they were thinking about embedding ST people in apps. That was funny: some people applauded, others looked at one another in amazement. A very awkward moment.

I think this is interesting and speaks volumes about products much like HDF5

-  A support model – interesting story: back when Lee and I were trying to keep Peter Braam on some invisible road to produce Lustre, we said, "Peter, this could be a 10-man company, well funded for a long time." He of course wanted larger things, but the lesson, I think, is that these software niches are small and sustainable if you have a stable market.

-  Somewhat completely contained software – few dependencies

-  Linkable in user app

Weaknesses:

-          Somewhat completely contained software – less light on your feet

-          Every DOE event I go to seems to be all about mixing simulation and instruments, or simulation and data sources (IoT, etc.). It seems to me that these instruments/data sources (a foreign concept to me, having lived for years at a site that runs >1.2-machine-sized simulations) do very record-oriented I/O, implying entry sequence and/or key/value. Log-structured data was mentioned, but in general this seems like a growth market, and how HDF5 plays with it may be weaker than you might like.

-          Another piece of tech that you may want to adopt is the HECFSIO-nurtured beta-epsilon tree concept. It seems to me that metadata is becoming far more important than data; if that weren't true, we wouldn't see all the DOE researchers chasing the shiny machine learning craze. You are hierarchical, which is good for what it is good for, but it seems like beta-epsilon trees for metadata give you a way to extend where hierarchical is used.

-          Exploitation in other buzzword-bingo areas like composable, right-sized, etc. You are linkable, so you have a leg up on being composed into workflows, but it could be that with the container craze (see, lots of buzzwords from an old person) you may be able to think of yourself as a service. Your linkable heritage makes things like user-space file system concepts a natural extension. Maybe make it so that HDF could be composed into solutions like that much more easily. The delta-fs demo could find its way into lots of apps if put under an HDF5-like scheme.

-          Slow-moving apps like weapons apps have little reason to adopt newer features but have money; fast-moving apps can adopt new features more easily but aren't well funded

 

John
Strengths:
  • build system and regression test suite
  • user level documentation
  • code organization

 

Weaknesses:
  • lack of architectural documentation
  • poor code commenting practices
  • unnecessary complexity in some sections of the code
  • very slow to correct fundamental design errors. For example, repairing the poor design of the metadata cache took decades, and we are only now starting to address design weaknesses in our management of MPI I/O.
  • very steep learning curve for new developers.
  • tiny developer base
Suren Byna
Strengths:
  • large number of HPC and non-HPC users
  • a well-defined data model
  • Portability of file format
  • VOL to open up the API to connect with different ways of storing and accessing data
Weaknesses:
  • Single shared file approach has performance issues – subfiling or a log-structured data format could provide better performance for several use cases (mostly HPC use cases)
  • Lack of code coupling support without storing files to a persistent storage layer
  • Missing design / architectural documentation
  • Parallel I/O is tied to MPI-IO – VOL / VFD opens the door to trying middleware other than MPI-IO
  • Support for sparse and streaming data
  • API verbosity can intimidate some users – high-level interfaces could help (there are third-party efforts, such as h5py and h5part / h5hut, but working closely with those developers to evaluate further development of easy-to-use interfaces would help)
  • Tracking features after a release – a number of features are meant to improve performance, usability, or something else, but it is not clear how each of them is performing a few releases later
Prabhat
Strengths:
  • Top I/O library in use at NERSC over the past 10+ years
  • Highly respected staff
  • Street Cred (independent of funding cycles)
  • Broad set of capabilities
  • Flexibility
Weaknesses:
  • THG has not been nimble enough to respond to competitors in the DOE (ADIOS) or to lines of critique (e.g., a web article by a neuroscientist outlining problems).
  • Historically, THG has been somewhat idealistic (in its business model, pursuit of new funding opportunities, relationship building with PMs, web outreach/visibility, etc.).
  • Critical dependency on Quincey
  • Engaging a strong group of developers outside THG
  • The HPC community favors Performance; the data science community favors Productivity. I expect Productivity to win in the long run. THG and the HDF5 software should bite the bullet and make an explicit effort to become more broadly useful and interoperable with the modern Python-based data stack.

1 Comment

  1. Mr. Prabhat, can you provide a reference to the article you mention? Can you elaborate on what you mean by "python-based Data Stack" and by "Productivity"?