RobynsVeil opened this issue on May 30, 2008 · 267 posts
jfbeute posted Thu, 05 June 2008 at 7:51 AM
I knew I had another text about file structures.
All these texts were written by me for various proposals; none were ever published or used in actual products (in my work I spend quite a lot of time writing proposals which are never implemented, or which are realized with other solutions for practical or commercial reasons). None of the texts were ever checked against existing practices or patents.
File structure.
The XML file structure is well known and widely used. It has many advantages but also several disadvantages. As far as we are concerned, the most problematic areas are the wastefulness of storage and the absolute openness of structure. The XML file structure requires using meaningful texts as tags (where the same text is repeated as the end tag); this makes the file human readable in theory but creates a relatively large overhead. In practice most XML files tend to be so large that the argument of human readability is unrealistic anyway (a 10,000 line text with start and end tags often thousands of lines apart simply isn't human readable). The overhead becomes even more problematic when a relatively small number of tags are repeated very often (as is the case in many structured data files). The structure of XML files allows using new tags at any place in the file without any prior notification; this means files cannot really be validated for valid content, as there is no built-in facility for providing a list of valid tags.
In this proposed format (Open Structure Data) the tags are replaced by numbers and the end tag is a single-character standard token. The meaning of each tag number is maintained by a dictionary. This dictionary can be external but could also be contained within the file itself in special dictionary definition lines, where a dictionary definition is valid from its definition until the end of the scope of the element in which it is defined. Since a number is generally shorter than a meaningful text (especially when the tags used most often get short numbers) and the end tag is a single-character token, the overhead is much reduced. Since an external dictionary can be used and dictionary definitions can easily be recognized in the file, the validity of the file can be checked while still allowing for extensions and for older versions which may not recognize specific tags (and can ignore those without losing track of the file structure). This format can easily be converted to and from the XML structure.
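As a rough illustration of the idea above, here is a minimal Python sketch of converting XML to a numbered-tag form and back. The line syntax is purely my assumption, since the proposal does not fix one: `#n name` is a dictionary definition line, `<n` opens tag number n, `=text` carries text content and `>` is the single-character end token. Attributes, mixed content and scoped dictionary definitions are left out to keep it short.

```python
import xml.etree.ElementTree as ET

def xml_to_numbered(xml_text):
    """Convert XML to a hypothetical line-based numbered-tag form."""
    root = ET.fromstring(xml_text)
    numbers = {}  # tag name -> number, assigned in order of first appearance
    body = []

    def walk(elem):
        n = numbers.setdefault(elem.tag, len(numbers) + 1)
        body.append(f"<{n}")              # open tag, by number
        text = (elem.text or "").strip()
        if text:
            body.append("=" + text)       # text content line
        for child in elem:
            walk(child)
        body.append(">")                  # single-character end token

    walk(root)
    # Dictionary definition lines go first (assumed layout).
    header = [f"#{n} {name}" for name, n in numbers.items()]
    return "\n".join(header + body)

def numbered_to_xml(data):
    """Convert the numbered-tag form back to XML."""
    names = {}   # number -> tag name, filled from dictionary lines
    stack = []   # currently open elements
    root = None
    for line in data.split("\n"):
        if line.startswith("#"):
            n, name = line[1:].split(" ", 1)
            names[int(n)] = name
        elif line.startswith("<"):
            name = names[int(line[1:])]
            elem = ET.SubElement(stack[-1], name) if stack else ET.Element(name)
            if root is None:
                root = elem
            stack.append(elem)
        elif line.startswith("="):
            stack[-1].text = line[1:]
        elif line == ">":
            stack.pop()
    return ET.tostring(root, encoding="unicode")
```

With repetitive data (many `sample` records, say) each repeated tag costs two or three characters per line instead of the full name twice, which is where the saving comes from.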
In this proposed format (Native Structure Data) the tags (and other integers) are stored in the system's native format; otherwise the same features as described for the Open Structure Data format apply. This means files in this format can only be read directly on the kind of system on which they were written. This may look like a major limitation, but a specific marker at the start of the file allows the exact encoding to be derived, so the file can easily be converted to and from Open Structure Data format on any system. Additionally, all numeric fields in this format are stored in the internal high-precision numeric storage format (normalized, with 2 digits per byte). Overall, using this format increases the speed of reading and writing and will generally also reduce the file size.
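To make the marker idea concrete, here is a small Python sketch. Everything specific in it is an assumption on my part: the marker value 0x0102, the 16-bit marker width, the 32-bit tag width and the function names are made up for illustration. The writer stores integers in native byte order, and the reader works out the byte order from how the marker bytes appear. The last two functions show one plausible reading of "2 digits per byte" (packed decimal); the normalization step mentioned above is not modelled.

```python
import struct

MARKER = 0x0102  # hypothetical marker value; reads differently per byte order

def write_native(tags):
    """Write the marker and the tag numbers in the writer's native byte order."""
    return struct.pack("=H", MARKER) + struct.pack(f"={len(tags)}I", *tags)

def read_any(data):
    """Read tag numbers on any system, deriving byte order from the marker."""
    if data[:2] == b"\x01\x02":
        order = ">"   # file was written on a big-endian system
    elif data[:2] == b"\x02\x01":
        order = "<"   # file was written on a little-endian system
    else:
        raise ValueError("unrecognized marker")
    count = (len(data) - 2) // 4
    return list(struct.unpack(f"{order}{count}I", data[2:2 + 4 * count]))

def pack_digits(digits):
    """Pack a string of decimal digits, two digits per byte."""
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes(int(digits[i]) << 4 | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

def unpack_digits(packed):
    """Inverse of pack_digits (keeps any leading padding zero)."""
    return "".join(f"{b >> 4}{b & 0x0F}" for b in packed)
```

A program reading its own files never has to swap bytes; only the converter needs `read_any`.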
In this proposed format obsolete tags can be replaced by a single-character standard token; otherwise the same features as described for the Native Structure Data format apply. A tag is obsolete when, according to the dictionary, it is the tag that would be expected to follow the previous tag anyway. Any file in this format can easily be converted to and from Native Structure Data format. This format will significantly reduce the file size for most structured files.
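A minimal Python sketch of the obsolete-tag replacement, assuming the dictionary can be reduced to a simple "expected next tag" table. The `successor` mapping and the `~` token are hypothetical stand-ins, not part of the original proposal.

```python
IMPLIED = "~"  # hypothetical single-character token for an obsolete tag

def compress(tags, successor):
    """Replace each tag that the dictionary predicts from the previous tag."""
    out = []
    prev = None
    for tag in tags:
        if prev is not None and successor.get(prev) == tag:
            out.append(IMPLIED)   # tag is implied by its predecessor
        else:
            out.append(tag)       # tag must be written explicitly
        prev = tag
    return out

def expand(tokens, successor):
    """Restore the full tag sequence from the compressed one."""
    out = []
    prev = None
    for tok in tokens:
        tag = successor[prev] if tok == IMPLIED else tok
        out.append(tag)
        prev = tag
    return out
```

For records that always repeat the same fields in the same order, nearly every tag after the first collapses to the one-character token, which is why the gain is largest for highly regular files.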
Although the XML structure is a widely used standard, further study of possible file structures has revealed several alternative structures with advantages for highly structured files with repetitive content.
It is suggested that a program should only need to read and write one of the proposed structures and, upon recognizing a file in another structure, call an external converter (since each proposed file structure has its own unique marker at the start of the file, it can be recognized). This allows developers to choose the most convenient structure for their purpose without limiting the possible file structures.
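The recognize-then-convert step could look roughly like this in Python. The marker bytes, the format names and the `converters` table are all invented for the example; in a real setup each converter entry would invoke the external converter program rather than a Python function.

```python
# Hypothetical start-of-file markers, one per proposed structure.
MARKERS = {
    b"OSD1": "open",
    b"NSD1": "native",
    b"CSD1": "compact",
}

def detect_format(data):
    """Identify the file structure from the marker at the start of the file."""
    for marker, name in MARKERS.items():
        if data.startswith(marker):
            return name
    raise ValueError("unrecognized file marker")

def load(data, own_format, parse, converters):
    """Parse data, converting first if it is not in the program's own format.

    converters maps (from_format, to_format) to a callable standing in
    for an external converter.
    """
    found = detect_format(data)
    if found != own_format:
        data = converters[(found, own_format)](data)
    return parse(data)
```

The program itself only ever contains one parser; support for the other structures lives entirely in the converter table.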
This is the short version. I should be able to find a somewhat more complete design somewhere if you want a copy, although that probably wouldn't really apply anyway as the file wasn't intended for 3D data but for transporting logger data between diverse equipment. This was in the end implemented with a dedicated file structure.
Most proposals don't go much further than a short management summary and some technical notes, as they are killed early on in the design phase. These texts are considered unfinished and are therefore not considered company trade secrets. As I work in the telecommunication industry it would be frowned upon if I were to use this in any related field, but since this is open source work in a different field I am allowed to share any unfinished texts.
It is amazing that many design considerations are common regardless of the actual specialization.