I have just completed 0.1.3 of my wsv library. This update returns errors instead of panicking, and introduces bom to exclude files with a byte-order mark (BOM) from parsing. The current implementation now explicitly handles only UTF-8 without a BOM. This work has also highlighted another fundamental difference between CSVs and WSVs. A CSV parser, assuming that the file has a coherent and known text encoding, will always succeed: there is no combination of characters which would invalidate a CSV. WSVs, however, require an even number of double quotes on every line, so it is important to report this error to the end user.
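A minimal sketch of those two checks, using only the standard library. The names `WsvCheckError` and `check_wsv` are hypothetical and are not the API of the wsv crate. It also assumes that quotes inside comments count towards the total, which a real parser would probably want to treat differently.

```rust
/// Hypothetical error type for this sketch; the real crate's errors may differ.
#[derive(Debug, PartialEq)]
enum WsvCheckError {
    /// The input starts with the UTF-8 byte-order mark EF BB BF.
    BomPresent,
    /// A line (1-indexed) contains an odd number of double quotes.
    OddDoubleQuotes { line: usize },
}

const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Rejects input with a UTF-8 BOM, then reports the first line whose
/// double quotes do not pair up.
fn check_wsv(bytes: &[u8]) -> Result<(), WsvCheckError> {
    if bytes.starts_with(&UTF8_BOM) {
        return Err(WsvCheckError::BomPresent);
    }
    // Lossy conversion keeps the sketch short; a real parser would
    // report invalid UTF-8 as its own error.
    let text = String::from_utf8_lossy(bytes);
    for (index, line) in text.lines().enumerate() {
        let quotes = line.matches('"').count();
        if quotes % 2 != 0 {
            return Err(WsvCheckError::OddDoubleQuotes { line: index + 1 });
        }
    }
    Ok(())
}
```

Carrying the line number in the error is the part that matters for the end user: it points straight at the row to fix.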
I also finished Rust for Rustaceans. Gjengset provides many references at the end for further reading, where I found another way to parse files. I was already aware of nom, which I've seen used a lot in Advent of Code challenges, and serde, which is also relevant for this particular use case of interpreting WSVs. Now I know of pest, which uses a DSL to describe the grammar of a file; macros then generate the parse function, which hands you an iterator of iterators to walk through the file contents so you can populate your own data structure.
Departing from the list of features from the previous post, the objective has evolved into the following:
- Build the same parsing function in all three of the example crates above, using thiserror instead of writing the errors myself.
- Inspect the generated code to learn more about how each of them works
- Benchmark them against a large file input using something like hyperfine
- Graph time spent for each implementation with flamegraph
- Build a CLI tool to create a text-based table representation of the data for debugging with comfy-table
- Offer a second function which reports errors for each row, rather than a single error for the whole file. This will make the file itself easier to debug. This will possibly evolve into a general purpose tool for visualising errors
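The difference between whole-file and per-row error reporting in the last item can be sketched roughly as follows. Everything here is hypothetical (`RowError`, `parse_row`, `parse_rows`, `parse_all`), and the tokenizer is a placeholder that just splits on whitespace; the point is only the shape of the return types.

```rust
/// Hypothetical per-row error for this sketch.
#[derive(Debug, PartialEq)]
struct RowError {
    line: usize,
    message: String,
}

/// Placeholder row parser: fails on unbalanced quotes, otherwise splits
/// on whitespace (it does not actually honour quoting).
fn parse_row(line_number: usize, line: &str) -> Result<Vec<String>, RowError> {
    if line.matches('"').count() % 2 != 0 {
        return Err(RowError {
            line: line_number,
            message: "odd number of double quotes".to_string(),
        });
    }
    Ok(line.split_whitespace().map(str::to_owned).collect())
}

/// Per-row variant: every row carries its own success or failure.
fn parse_rows(text: &str) -> Vec<Result<Vec<String>, RowError>> {
    text.lines()
        .enumerate()
        .map(|(index, line)| parse_row(index + 1, line))
        .collect()
}

/// Whole-file variant: collecting into a `Result` stops at the first error.
fn parse_all(text: &str) -> Result<Vec<Vec<String>>, RowError> {
    parse_rows(text).into_iter().collect()
}
```

With the per-row variant, a debugging tool can print the good rows and flag the bad ones in place, instead of stopping at the first failure.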
The CLI tool can parse an input path in debug mode or not. This tool will use clap to interpret the command line arguments and one of anyhow or eyre for help with error handling. I might also consider adding a flag to switch between the different parsing implementations I will write.
I also hope that there will be a visualisation tool somewhere which displays dependencies too. I am considering writing a library which strafes GitHub for all uses of the public API for a crate, and gives a percentage usage so new users can know what part to learn first. Furthermore, I would like to create a minimum standard for metrics and images of a crate to give the newcomer (or any non-technical interested party) a rough technical overview of a crate so comparisons can be made, and more importantly, ornaments printed reflecting the codebase.
Eventually, I will also introduce feature flags to hide all of the testing by default, exposing only one parse method which will be the fastest.
WSV 0.1.3
- Adds
I have reimplemented the parse function, which takes a file path and produces a vec of vecs of the contents of the file. There are also nine more tests demonstrating the boundaries of the function's behaviour, which has changed since V1 to return errors rather than panic.
I have also added a link to this blog to the README.md and can now link here to the codebase and crates.io location.
- Next
A reimplementation of the parser with pest.
- Design
V2 abstracts the WsvValue enum which helps to parse that part. There are now no repeated parts of code which must be changed simultaneously. I'm using a BufReader now. The primary logic still happens across a large match without peeking. There is no fractal symmetry yet. I watched two videos on other people implementing parsers, and there are clearly many more things to learn.
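To illustrate the general shape described here (a value enum, a BufReader, and one large match with no peeking), here is a rough sketch. It is not the code in the repository: the variants of `WsvValue`, the `WsvError` and `State` types, and the function names are all assumptions, it reads from any `Read` rather than a file path, and it deliberately ignores escaped quotes (`""`) and the `"/"` line-break sequence.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Assumed variants; the real WsvValue enum may look different.
#[derive(Debug, PartialEq)]
enum WsvValue {
    Value(String),
    Null,
}

/// Hypothetical error type for this sketch.
#[derive(Debug)]
enum WsvError {
    Io(io::Error),
    UnclosedString { line: usize },
    UnexpectedQuote { line: usize },
}

#[derive(Clone, Copy)]
enum State {
    Between,
    Bare,
    Quoted,
}

/// A bare `-` is treated as null; anything else is a plain value.
fn finish_bare(text: String) -> WsvValue {
    if text == "-" { WsvValue::Null } else { WsvValue::Value(text) }
}

fn parse_line(line: &str, line_number: usize) -> Result<Vec<WsvValue>, WsvError> {
    let mut values = Vec::new();
    let mut current = String::new();
    let mut state = State::Between;
    for c in line.chars() {
        match (state, c) {
            (State::Quoted, '"') => {
                values.push(WsvValue::Value(std::mem::take(&mut current)));
                state = State::Between;
            }
            (State::Quoted, _) => current.push(c),
            // Outside a string, `#` starts a comment that runs to end of line.
            (_, '#') => break,
            (State::Between, '"') => state = State::Quoted,
            (State::Bare, '"') => {
                return Err(WsvError::UnexpectedQuote { line: line_number })
            }
            (State::Between, ws) if ws.is_whitespace() => {}
            (State::Bare, ws) if ws.is_whitespace() => {
                values.push(finish_bare(std::mem::take(&mut current)));
                state = State::Between;
            }
            (State::Between, _) => {
                current.push(c);
                state = State::Bare;
            }
            (State::Bare, _) => current.push(c),
        }
    }
    match state {
        State::Quoted => return Err(WsvError::UnclosedString { line: line_number }),
        State::Bare => values.push(finish_bare(current)),
        State::Between => {}
    }
    Ok(values)
}

/// Reads line by line through a BufReader and stops at the first error.
fn parse<R: Read>(reader: R) -> Result<Vec<Vec<WsvValue>>, WsvError> {
    let mut rows = Vec::new();
    for (index, line) in BufReader::new(reader).lines().enumerate() {
        let line = line.map_err(WsvError::Io)?;
        rows.push(parse_line(&line, index + 1)?);
    }
    Ok(rows)
}
```

Taking `R: Read` means the same function accepts `File::open(path)?` for real files and `"a b".as_bytes()` in tests. Handling `""` without peeking would need an extra state for "just saw a closing quote", which is left out here.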
I have since shifted this commentary to the repository, and will write a post about its current status soon.