I came across a collection of standards created by a single person in Germany and published on his website. I have become very interested in the problems that these standards aim to solve, and have since become intimately familiar with all that the website and his YouTube channel have to offer concerning WSV, SML and ReliableTXT.
I will start by thanking Mr Stefan John for his work and for all the food for thought.
While any new standard tends to end up as "just another standard", this is still a good opportunity for me to build some pet projects in Rust and improve my knowledge of serde. I will start by building a parser for the WSV format in Rust.
I would like to voice my thoughts, doubts and approvals about this project here on a more public forum, so that they can be more easily rebutted and corrected. This confers a responsibility onto me to be as transparent as possible in my methods and objectives. Bring it on.
First, I will describe what I understand as the problem which WSV aims to solve.
Motivation
The objective of the WSV format is to strike a balance between being editable by end users and being reasonably easy for computers to interpret. Its author argues that syntax-heavy formats like JSON and XML are too inaccessible for the non-programmer. In WSV, the only special characters are the double-quote " for strings containing whitespace, the line feed \n to split lines, the hash # for comments and the dash - for empty entries.
Notice that while the spec defines the line delimiter as just a line feed, a file with both line feed and carriage return as the delimiter would be parsed in exactly the same way, since leading and trailing whitespace get ignored.
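To make the rules above concrete, here is a minimal sketch of a line splitter in Rust. It is not the reference algorithm from the spec: the names `parse_line` and `parse_document` are my own, mapping the dash to `None` is my assumption about how an empty entry should be represented, and escaped quotes inside strings and all error reporting are left out.

```rust
/// Splits one line into values; `None` stands for the `-` placeholder.
/// Simplified sketch: no escaped quotes inside strings, no error reporting.
fn parse_line(line: &str) -> Vec<Option<String>> {
    let mut values = Vec::new();
    let mut chars = line.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Separators are skipped; this includes a stray '\r' from CRLF files.
            chars.next();
        } else if c == '#' {
            // The rest of the line is a comment.
            break;
        } else if c == '"' {
            chars.next(); // consume the opening quote
            let mut s = String::new();
            // Collect everything up to the closing quote (or the end of the line).
            for ch in chars.by_ref() {
                if ch == '"' {
                    break;
                }
                s.push(ch);
            }
            values.push(Some(s));
        } else {
            let mut s = String::new();
            // An unquoted value runs until whitespace or the start of a comment.
            while let Some(&ch) = chars.peek() {
                if ch.is_whitespace() || ch == '#' {
                    break;
                }
                s.push(ch);
                chars.next();
            }
            // Only an unquoted dash is treated as an empty entry.
            values.push(if s == "-" { None } else { Some(s) });
        }
    }
    values
}

/// Splits the text on line feeds only, as the spec describes.
fn parse_document(text: &str) -> Vec<Vec<Option<String>>> {
    text.split('\n').map(parse_line).collect()
}
```

Because `char::is_whitespace` counts the carriage return as whitespace, feeding this sketch the same text with `\n` or `\r\n` line endings yields identical rows, which is the behaviour described above.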
This spec is built on the ReliableTXT specification, which requires any UTF-8 encoded file to have the byte-order mark (BOM) at the beginning. This is considered by a sizeable portion of the developer community, and by the Unicode Consortium itself [1], to be a bad idea unless you are converting files from UTF-16 or UTF-32. However, I believe that the reason for this requirement in ReliableTXT is that the other primary use case for the BOM in UTF-8 is in interpreting CSVs in Microsoft Excel. CSVs are the replacement target for WSV, hence the exception. Recently, however, it seems that Microsoft have backed off their initial decision to require a BOM, and while it might take a while for that change to pervade their software, WSV-as-written therefore supports a deprecated feature.
I shall not be including the BOM in my implementation of WSV by default, although it will be behind a feature gate eventually.
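As a rough sketch of what that could look like, the helpers below read input with or without a UTF-8 BOM and only write one when explicitly asked. The names `strip_utf8_bom` and `encode` and the `with_bom` flag are placeholders of mine, standing in for whatever the eventual feature gate ends up controlling.

```rust
/// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Returns the input without a leading UTF-8 BOM, if one is present.
fn strip_utf8_bom(bytes: &[u8]) -> &[u8] {
    bytes.strip_prefix(&UTF8_BOM).unwrap_or(bytes)
}

/// Encodes text as UTF-8, prepending the BOM only when asked to.
/// `with_bom` is a hypothetical stand-in for the planned feature gate.
fn encode(text: &str, with_bom: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len() + UTF8_BOM.len());
    if with_bom {
        out.extend_from_slice(&UTF8_BOM);
    }
    out.extend_from_slice(text.as_bytes());
    out
}
```

Stripping on read regardless of the flag keeps the parser tolerant of files written by strict ReliableTXT tools, while the default output stays BOM-free.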
The extension of the WSV specification is the SML specification, which adds support for more complex data formats. This is out of scope for now, but my first impression is that the value of this format for non-programmer end users comes from editor formatting and highlighting more than the spec itself.
This crate will undergo many major iterations. I will increment the first version digit once all features are completed, the second whenever I add a feature, and the third for anything else I change. A provisional list of MVP features is as follows:
- serialize
- with improved error handling
- deserialize
- all supported encodings
- with serde
[1] Taken from the Unicode website: "Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order." [AF]