JVM | Platform | Status |
---|---|---|
OpenJDK (Temurin) Current | Linux | |
OpenJDK (Temurin) LTS | Linux | |
OpenJDK (Temurin) Current | Windows | |
OpenJDK (Temurin) LTS | Windows | |
A specification for restricted dotted names.
Applications such as compilers and package systems often use so-called reverse DNS notation to identify packages and code artifacts. Unfortunately, because reverse DNS notation is underspecified, each implementation has its own idea of which names should be permitted and which should be rejected.
This specification attempts to define a restricted form of the notation with the following properties:
- Names can be validated with a simple regular expression that is defined in such a way as to avoid resource exhaustion attacks.
- Names are defined using a strict subset of ASCII in order to avoid Unicode-based spoofing and phishing attacks.
- Names are defined such that the maximum length of a name is bounded in order to provide for predictable storage use in database applications.
A dotted name is a string matching the following regular expression:
([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}
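As a minimal sketch of how this expression might be used from Java, the following compiles it with `java.util.regex.Pattern` and checks whole strings against it. The class name `DottedNameCheck` and the method `isValid` are hypothetical names chosen for illustration and are not part of this specification.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the specification.
final class DottedNameCheck {
  // The regular expression from the specification, escaped for a Java string literal.
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  // matches() requires the entire input to match, not just a prefix.
  static boolean isValid(final String text) {
    return DOTTED_NAME.matcher(text).matches();
  }

  public static void main(final String[] args) {
    System.out.println(isValid("com.io7m.example")); // true
    System.out.println(isValid("com.Example"));      // false: uppercase character
    System.out.println(isValid("io.7mind"));         // false: digit directly after a dot
    System.out.println(isValid(""));                 // false: empty string
  }
}
```

Note that `matches()` is used rather than `find()`, because `find()` would accept any string that merely contains a valid name.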
A dotted name can always be split into between 1 and 16 segments by splitting the name into separate parts at each dot, such that each part is a valid dotted name.
Let `x` be a dotted name. By the definition of the regular expression that defines a dotted name, `x` must be one of:
- A single character in the range `[a-z]`.
- A primary segment consisting of a single character in the range `[a-z]` followed by up to 63 characters from the set `[a-z0-9_-]`.
- A primary segment followed by between `1` and `15` secondary segments that each consist of a dot `.`, followed by a single character in the range `[a-z]`, followed by up to 62 characters from the set `[a-z0-9_-]`.
If `x` is a single character in the range `[a-z]`, then there is no splitting to be performed and `x` is already trivially matched by the regular expression and is therefore a valid dotted name.
If `x` consists of a single primary segment, then there is no splitting to be performed and `x` is already trivially matched by the regular expression and is therefore a valid dotted name.
If `x` consists of a primary segment followed by between `1` and `15` secondary segments, then for each segment `s` it is necessary to show that `s` matches the regular expression when the preceding dot (if `s` is a secondary segment) is removed.
- If `s` is a primary segment, then it already matches the regular expression.
- If `s` is a secondary segment, then it will effectively become the primary segment of the new dotted name. Because `s` is a secondary segment, it must, once its leading dot is removed, match the subexpression `[a-z][a-z0-9_-]{0,62}`. By the semantics of length ranges in regular expressions, any string matched by an expression `e{0,n}` will also be matched by an expression `e{0,n+1}`. As the subexpression for primary segments is `[a-z][a-z0-9_-]{0,63}`, `s` will match and is therefore a valid primary segment.
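The splitting property argued above can also be spot-checked in code. The following is a sketch only, and is no substitute for the proof: it splits a valid name at each dot and checks every part against the same expression. `DottedNameSplit`, `segments`, and `allSegmentsValid` are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the specification.
final class DottedNameSplit {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  // Split a valid dotted name at each dot. The -1 limit keeps empty
  // trailing parts, although a valid name never produces any.
  static String[] segments(final String name) {
    if (!DOTTED_NAME.matcher(name).matches()) {
      throw new IllegalArgumentException("Not a dotted name: " + name);
    }
    return name.split("\\.", -1);
  }

  // Check that every segment of a valid name is itself a valid dotted name.
  static boolean allSegmentsValid(final String name) {
    for (final String segment : segments(name)) {
      if (!DOTTED_NAME.matcher(segment).matches()) {
        return false;
      }
    }
    return true;
  }

  public static void main(final String[] args) {
    System.out.println(segments("com.io7m.example").length);  // 3
    System.out.println(allSegmentsValid("com.io7m.example")); // true
  }
}
```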
The number of characters in any dotted name is `<= 1024`.
By SPLIT, we know that a dotted name `x` consists of a primary segment followed by up to `15` secondary segments.
The regular (sub)expression that matches a primary segment is defined as `[a-z][a-z0-9_-]{0,63}`. The maximum length of a primary segment is therefore `1 + 63 = 64`.
The regular (sub)expression that matches a secondary segment is defined as `\.[a-z][a-z0-9_-]{0,62}`. The maximum length of a secondary segment, including its leading dot, is therefore `1 + 1 + 62 = 64`.
The regular (sub)expression that defines how many secondary segments may appear in a dotted name is defined as `e{0,15}`, so the maximum number of secondary segments is `15` and therefore the maximum number of characters that can be used for secondary segments is `15 * 64 = 960`.
We can therefore conclude that a string consisting of a maximum-length primary segment and the maximum number of maximum-length secondary segments has length `64 + (15 * 64) = 1024`, and that no dotted name can be longer.
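The arithmetic above can be illustrated by building the longest possible name and confirming that adding anything to it produces a string that no longer matches. This is a sketch assuming Java 11 or newer (for `String.repeat`). `DottedNameLength` and `longestName` are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the specification.
final class DottedNameLength {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  // Build a name with a maximum-length primary segment followed by
  // fifteen maximum-length secondary segments.
  static String longestName() {
    final StringBuilder text = new StringBuilder();
    text.append("a".repeat(64));                // primary: 1 + 63 characters
    for (int index = 0; index < 15; ++index) {
      text.append('.').append("a".repeat(63));  // secondary: dot + 1 + 62 characters
    }
    return text.toString();
  }

  public static void main(final String[] args) {
    final String longest = longestName();
    System.out.println(longest.length());                              // 1024
    System.out.println(DOTTED_NAME.matcher(longest).matches());        // true
    System.out.println(DOTTED_NAME.matcher(longest + "a").matches());  // false: last segment too long
    System.out.println(DOTTED_NAME.matcher(longest + ".a").matches()); // false: seventeen segments
  }
}
```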
Machine-checked proofs of the above propositions are provided in the Lanark.v file.
Why are names defined in terms of a regular expression rather than as a BNF grammar?
This specification is being written to support the development of various io7m software packages, and validation of dotted names is expected to occur in a wide range of different contexts such as XML schemas, runtime checks in Java code, definitions of SQL tables, and so on. These environments all feature regular expression validation, and not all of them support writing parsers for more advanced grammars. With the specification itself containing the canonical regular expression, this expression can literally be pasted into various locations without needing any changes.
Why are names restricted to a subset of ASCII?
One of the uses for dotted names is in the naming of software packages published onto the web. In systems that allow for the full use of Unicode to name packages, it's possible for malicious parties to spoof the appearance of packages by using carefully crafted names. For example:
com.io7m.example
com.iọ7m.example
The second package is a malicious package. It would be fairly trivial for someone to sneak in a reference to this package as a dependency in an open source project and have it go unnoticed. For those unable to tell the difference: The `o` in `io7m` in the second package is actually U+1ECD ("Latin Small Letter O with Dot Below"). The resulting name is almost indistinguishable from the first package, but could easily be used to fool people into thinking they're installing a package written by someone controlling the `com.io7m` namespace.
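To illustrate how the ASCII restriction defeats this kind of spoofing, the following sketch validates a genuine name and a lookalike that substitutes U+1ECD (Latin Small Letter O with Dot Below) for the letter `o`, and reports where the first non-ASCII character sits. `DottedNameSpoof`, `isValid`, and `firstNonAsciiIndex` are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the specification.
final class DottedNameSpoof {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  static boolean isValid(final String text) {
    return DOTTED_NAME.matcher(text).matches();
  }

  // Return the index of the first character outside of ASCII, or -1 if there is none.
  static int firstNonAsciiIndex(final String text) {
    for (int index = 0; index < text.length(); ++index) {
      if (text.charAt(index) > 0x7F) {
        return index;
      }
    }
    return -1;
  }

  public static void main(final String[] args) {
    final String genuine = "com.io7m.example";
    // The escape below is U+1ECD, written as an escape so that the
    // substitution is visible in the source code.
    final String spoofed = "com.i\u1ECD7m.example";

    System.out.println(isValid(genuine));            // true
    System.out.println(isValid(spoofed));            // false
    System.out.println(firstNonAsciiIndex(spoofed)); // 5
  }
}
```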
Won't ASCII cause problems for non-English developers?
Currently, Maven Central is the largest collection of open-source software on the planet. Artifacts published to Maven Central have a group name and an artifact name. It is conventional for group names to be in reverse DNS notation, and it is not uncommon for artifact names to also be in this same notation. By analyzing the largest collection of open-source software on the planet, we can probably get some idea as to how developers all over the world are naming their artifacts.
An index is published weekly consisting of a list of every single artifact published into the repository. By analyzing the names of artifacts and groups and checking to see if those names could be expressed using the restricted dotted name specification here, we observed the following:
- There are `69604` unique group names on Maven Central. Of these, `68690` have names that are expressible using the syntax defined here. This leaves `914` inexpressible groups, for a coverage of `98.69%`.
- There are `431423` unique artifact names on Maven Central. Of these, `343678` have names that are expressible using the syntax defined here. This leaves `87745` inexpressible names, for a coverage of `79.66%`.
However, we also analyzed the reasons that names failed to match the syntax defined here and determined:
- `618` and `24810` group and artifact names, respectively, failed to match because they contained uppercase characters. If all names are converted to lowercase, this removes a significant chunk of "bad" names.
- `62696` artifact names failed to match because they contained characters other than `[a-z]` after a dot. These were frequently artifacts whose publishers, for whatever reason, decided to encode version numbers within the name itself. A random sample of failing names is as follows: `com.github.javawithmarcus.wicket-cdi-1.1`, `org.floggy.3rd.org.eclipse.core`, `com.github.1137095129`, `io.github.2gis`, `com.github.9215095360`, `com.9isuper.eve`, `org.99soft`, `com.moz.kiji.delegation.kiji-delegation.3.0.0.com.moz.kiji.schema`, `org.floggy.3rd.org.eclipse.ui`, `io.7mind.izumi.sbt`, `opentelemetry-armeria-1.0`, `common-util_2.13`, `mongoauth_3.1_2.12`, `content-api-client_2.12`, `utils-test_2.12`, `kafkakit_2.13`, `ciris-refined_2.11`, `dynamo-test_2.13`, `case-service_2.12`, `jimcy-java-api_2.11`
- Only one single artifact name used a non-ASCII character on Maven Central: `com.github.marcioos:bgg-clienẗ`
- Only `69` artifacts contained name segments that were too long to be supported by the syntax defined here. A random sample of failing names is as follows: `rapidpm-proxybuilder-modules-dynamicobjectadapter-generator-processors`, `rapidpm-proxybuilder-modules-objectadapter-generator-usages-usinggenerated`, `spring-cloud-starter-stream-processor-tasklaunchrequest-transform`, `stormpath-sdk-examples-spring-security-spring-boot-webmvc-bare-bones`, `camel-quarkus-integration-tests-support-custom-type-converter-deployment`, `wildfly-microprofile-reactive-streams-operators-cdi-provider-legacy-namespace`, `camel-quarkus-integration-tests-support-custom-type-converter-parent`, `stormpath-sdk-tutorials-spring-boot-default-spring-security-refined`, `nav-virksomhet-tiltakOgAktiviteterForBrukere-v1-meldingsdefinisjon`, `camel-quarkus-integration-test-support-core-main-collector-ext-deployment`
  In the author's opinion, these names are somewhat excessive and could be supported with dotted notation instead of relentless hyphenation.
- Fewer than `20` artifacts had any other characters that do not appear in the regular expression defined here. A random sample is as follows:
  - These contain `:`: `libaums:storageprovider`, `com.foilen:database-tools`, `libaums:http`, `reactivex:rxjs`, `app.ubie:brave-kt`
  - These contain `+`: `bctsp-jdk15+`, `bcpg-jdk15+`, `sugar-tms_2.12at13+`, `mvp+android`, `bcprov-jdk15+`, `amiitool+android`, `bcmail-jdk15+`
  - These contain whitespace: `com.inkapplications.spondee.math-macosx64.0.0.3.com 2.inkapplications.spondee`, `com.inkapplications.spondee.math-macosx64.0.0.3.com 3.inkapplications.spondee`, `com.inkapplications.spondee.math-macosx64.0.0.3.com 4.inkapplications.spondee`, `utilex`
  - These contain quote characters: `"palsolayouts"`, `"android-sdk"`, `"rxbluetooth"`
  Many of these look like publication mistakes.
It is therefore the position of the author that if people are publishing packages with non-English names, they appear to be doing it using the ASCII character set.
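The categories of failure described above could be assigned mechanically along the following lines. This is a rough sketch and not the tool used to produce the figures in this document: the order in which the checks are applied, the `Reason` enum, and the class name `DottedNameDiagnosis` are all assumptions made for illustration. A name with several problems is reported under the first matching category only. The sketch assumes Java 11 or newer.

```java
import java.util.regex.Pattern;

// Hypothetical classifier for illustration; not the analysis tool used by the author.
final class DottedNameDiagnosis {
  enum Reason { VALID, NON_ASCII, UPPERCASE, BAD_SEGMENT_START, SEGMENT_TOO_LONG, OTHER }

  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  static Reason classify(final String name) {
    if (DOTTED_NAME.matcher(name).matches()) {
      return Reason.VALID;
    }
    if (name.chars().anyMatch(c -> c > 0x7F)) {
      return Reason.NON_ASCII;
    }
    if (name.chars().anyMatch(c -> c >= 'A' && c <= 'Z')) {
      return Reason.UPPERCASE;
    }
    final String[] segments = name.split("\\.", -1);
    // Every segment must begin with a character in [a-z].
    for (final String segment : segments) {
      if (segment.isEmpty() || segment.charAt(0) < 'a' || segment.charAt(0) > 'z') {
        return Reason.BAD_SEGMENT_START;
      }
    }
    // The first segment may hold 64 characters, later segments 63 (excluding the dot).
    for (int index = 0; index < segments.length; ++index) {
      final int limit = (index == 0) ? 64 : 63;
      if (segments[index].length() > limit) {
        return Reason.SEGMENT_TOO_LONG;
      }
    }
    // Anything else: characters such as ':' or '+', or too many segments.
    return Reason.OTHER;
  }

  public static void main(final String[] args) {
    System.out.println(classify("com.io7m.example"));        // VALID
    System.out.println(classify("io.7mind.izumi.sbt"));      // BAD_SEGMENT_START
    System.out.println(classify("bgg-clien\u1E97"));         // NON_ASCII
    System.out.println(classify("a".repeat(65)));            // SEGMENT_TOO_LONG
    System.out.println(classify("libaums:storageprovider")); // OTHER
  }
}
```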
Why is the length of names bounded?
For two reasons:
- Regular expressions can be subject to denial of service attacks, particularly when they contain unbounded quantifiers such as `*` and `+`.
- Adding an upper bound on length means more predictable storage use when names are used in relational databases.
The regular expression as it is defined here is expected to be somewhat less vulnerable to denial of service attacks in naive regex engine implementations than an unbounded version would be.
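Because the maximum length of a name is known in advance, an implementation may additionally choose to reject over-long input before the regular expression engine is invoked at all. The specification does not require this; the following is a sketch of one possible defensive measure, assuming Java 11 or newer. `DottedNameGuard`, `MAX_LENGTH`, and `isValid` are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the specification.
final class DottedNameGuard {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  // One 64 character primary segment plus fifteen 64 character secondary segments.
  static final int MAX_LENGTH = 64 + (15 * 64);

  static boolean isValid(final String text) {
    // A cheap length check that avoids running the regex on very large inputs.
    if (text.length() > MAX_LENGTH) {
      return false;
    }
    return DOTTED_NAME.matcher(text).matches();
  }

  public static void main(final String[] args) {
    System.out.println(MAX_LENGTH);                    // 1024
    System.out.println(isValid("com.io7m.example"));   // true
    System.out.println(isValid("a".repeat(100_000)));  // false, rejected by the length check
  }
}
```

The same bound can be used to size a database column, which is the storage argument made in the second point above.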