Skip to content

Latest commit

 

History

History
118 lines (76 loc) · 8.66 KB

DESIGN.md

File metadata and controls

118 lines (76 loc) · 8.66 KB

Some rambling thoughts that don't deserve a place in the README.

Cleanness

The API consists of a handful of clean simple powerful methods with no arguments and no configurability, plus some more junk to make them convenient to use.

Language features are to be preferred over library features as much as possible. That way the library can stay smaller and code that uses the library is hopefully easier to understand in detail for people who aren't familiar with the library.

I don't really like the ValueExt extension trait, but I can't think of a nicer way to parse values. In my ideal workflow you would call .into_string()?.parse()? to parse a value, all built-in methods. But I don't think it's possible to have an error type that can be transformed from both methods' error types, into_string returns OsString and there are annoying rules around overlapping trait implementations. The error messages would also suffer.

(Update: as of 0.3.0, ValueExt has a string method as an alternative to into_string with a cleaner return type. In theory this opens the way to removing From<OsString>, but I don't think lexopt::Error should be catch-all. There's anyhow for that.)

Keeping the core API clean and generic means this could perhaps be used as the basis of a more complete parser.

Possible enhancements

POSIX has a notion of subarguments, combining multiple values in a single option-argument by separating them with commas or spaces. This is easy enough to hand-roll for valid unicode (.into_string()?.split(...)) but we could provide a function that does it on OsStrings. I can't think of a case where values may not be valid unicode but definitely don't contain commas or spaces, though.

Language quirks

Sometimes Rust is a bother.

Arg::Long contains a borrowed string instead of an owned string because you can't match owned strings against string literals. That means Arg needs a lifetime, the iterator protocol cannot be used (it would also be a bad fit for other reasons), and some abstractions are hard or impossible to build. On the plus side, error messages can be slightly better.

(Deref patterns would fix this, and if/when they're released I'll probably do a breaking release with a huge MSRV bump. I consider them a prerequisite for 1.0.)

Arguments on Windows sometimes have to be transcoded three times: from UTF-16 to WTF-8 by args_os, then back to UTF-16 to parse them, then to WTF-8 again to be used. This ensures we see the original invalid code unit if there's a problem, but it's a bit sad. (Luckily it only happens very rarely.)

Errors

There's not always enough information for a good error message. A plain OsString doesn't remember what the parser knows, like what the last option was.

ValueExt::parse exists to include the original string in an error message and to wrap all errors inside a uniform type. It's unclear if it earns its upkeep.

Iterator backing

I see three ways to store Parser's internal iterator:

  1. As a generic field (Parser<I> where I: Iterator<Item = OsString>)
  2. As a trait object (source: Box<dyn Iterator<Item = OsString> + 'static>)
  3. As a particular known type (source: std::vec::IntoIter<OsString>)

lexopt originally used option 2 but switched to option 3.

Option 1 (generic field) is the most general and powerful but it's cumbersome and bloated. Benefits:

  • The parser inherits the iterator's properties. You can have a non-'static parser, or a parser that is or isn't thread-safe.
  • You can provide direct access to the original iterator.
  • In theory, better optimization.

Drawbacks:

  • Using a parser as an argument (or return value, or field) is difficult. You have to name the whole type (e.g. Parser<std::env::ArgsOs>), and you can't mix and match parers created from different iterators.
  • Code size and compile times are bloated, particularly if you use multiple iterator types.
  • The benefits are pretty weak or niche.

Option 2 (trait object) doesn't have the drawbacks of option 1, but it reduces everything to a lowest common denominator:

  • Either the input must be Send/Sync, or the parser can't be Send/Sync. (To complicate things, ArgsOs is !Send and !Sync out of caution.)
  • Clone can't be implemented. (Unless you exhaust the original iterator, which requires interior mutability and has bad edge cases.)
  • Debug can't be derived.

Option 3 (known type) would mean collecting the iterator into a Vec when the parser is constructed and then turning that into an iterator.

  • The biggest benefit is that vec::IntoIter is a well-behaved type and everything becomes easy. It's Sync and Send and Clone and Debug and Debug even shows the raw arguments.
  • We get unlimited lookahead through vec::IntoIter::as_slice().
  • FromIterator can be implemented.

There are also drawbacks:

  • It's likely to be less efficient. But not disastrously so: args_os() allocates a brand-new Vec full of brand-new OsStrings (each with their own allocation) before returning, and we only duplicate the Vec allocation.
  • Iterators can't be infinite or otherwise avoid loading all arguments into memory at once.
  • You can't use clever tricks to observe which argument is being processed.
    • as_slice() might provide an alternative, but if this is to be a proper API it has to be designed carefully.

Configuration

lexopt isn't configurable right now but maybe it should be.

There are requests to a) disable = for short options and b) make .value() ignore arguments that look like options.

Especially b) is context-sensitive. An option might need to take negative numbers as values, or arbitrary filenames. That means you might want to switch the option on/off just for the duration of parsing a single option. That rules out the builder pattern of cfg(self) -> Self, but not cfg(&mut self) -> &mut Self.

a) might also be context-sensitive if you e.g. want to allow it just for a particular option for backward compatibility. But this seems less likely.

A footgun of cfg(&mut self) is that you have to remember to revert the configuration once you're done.

Other possible APIs:

  • parser.value_disallow_dash()
  • parser.value_cfg(Config { allow_dash: false })
  • parser.allow_dash(false).value() (with allow_dash() -> SomeWrapper)

I'm not really happy with any of these.

Problems in other libraries

These are all defensible design choices, they're just a bad fit for some of the programs I want to write. All of them make some other kind of program easier to write.

pico-args

  • Results can be erratic in edge cases: option arguments may be interpreted as options, the order in which you request options matters, and arguments may get treated as if they're next to each other if the arguments inbetween get parsed first.
  • -- as a separator is not built in.
  • Arguments that are not valid unicode are not recognized as options, even if they start with a dash.
  • Left-over arguments are ignored by default. I prefer when the path of least resistance is strict.
  • It uses Vec::remove, so it's potentially slow if you pass many thousands of options. (This is a bit academic, there's no problem for realistic workloads.)

These make the library simpler and smaller, which is the whole point.

clap/structopt

  • structopt nudges the user toward needlessly panicking on invalid unicode: even if a field has type OsString or PathBuf it'll round-trip through a unicode string and panic unless from_os_str is used. (I don't know if this is fixable even in theory while keeping the API ergonomic.)
  • Invalid unicode can cause a panic instead of a soft error.
  • Options with a variable number of arguments are supported, even though they're ambiguous. In structopt you need to take care not to enable this if you want an option that can occur multiple times with a single argument each time.
  • They're large, both in API surface and in code size.

That said, it's still my first choice for complicated interfaces.

(I don't know how much of this applies to clap v3+ and clap-derive.)

Minimum Supported Rust Version

The current MSRV is 1.31, the first release of the 2018 edition.

The blocker for moving it even earlier is non-lexical lifetimes, there's some code that won't compile without it.

The Value(arg) if foo.is_none() => pattern doesn't actually work until 1.39 (bind_by_move_pattern_guards), so not all of the examples compile on the MSRV. (And one of them uses str::strip_prefix, which requires at least 1.45.)

Even Debian oldstable packages Rust 1.41 as of writing, so it's okay to relax that if there's a reason to.