An Algebraic Structure For Path Schema (Take 2)

\(\newcommand{\abs}[1]{\lvert#1\rvert}\) \(\newcommand{\(}{\left(}\) \(\newcommand{\)}{\right)}\)

This is a second shot at expressing Path Schema as algebraic objects. See my first attempt. The definitions should be equivelent, and any places they are not indicates a deficency in one of the defintions. This should be a bit more elegant, than before. It is also a bit more extensive. Note that \(R\) and \(A\) are now defined differently, and \(A^\ast\) and \(R^\ast\) are what one should be focussing on instead, this is to use the free monoid convention.

In general a path can be described as a a hierachical index, onto a directed multigraph. Noting that “flat” sets, trees, and directed graphs are all particular types of directed multigraphs.

To repeat the introduction:

This post comes from a longish discussion with Fengyang Wang (@TotalVerb), on the JuliaLang Gitter. Its pretty cool stuff.

It is defined here independent of the object (filesystem, document etc) being indexed. The precise implementation of the algebric structure differs, depending on the Path types in question, eg Filesystem vs URL, vs XPATH.

This defintion is generally applicable to paths, such as:

The defintion whch follows provides all the the expected functionality on paths

Definition

To have the general functionality of a path we only need to define a few items on our index, from these the rest of the functionality flows. This sets a minimal amount of functionality for a path at the most abstract level. Particular path implementations can provide more. The defintion is made from 3 object \(A\), \(R\), \(\cdot\), with 3 rules (1), (2), (3) binding them.

Our path is defined by two sets, \(R\), and \(A\), and an operation \(\cdot\).

Number of roots

Depending on the system being indexed, there may be one or more abolute roots. In linux file systems, there is one root /. In windows, there is one root per hard drive C:, D: etc, plus a an additional root for named pipes \\.\pipe\. One also might wish to consider UNC paths to be part of the same system in windows – these to would have there on roots.

In theory we could also have zero absolute path roots – this would mean there are on absolute paths. However this is not generally useful, as the evaluation function given later, to resolve an index into the object is only defined for absolute paths.

Parentdir, and basename

As the path is a heirachical index, every point in the heirachy has a parent, or is a root.

We define a function \(p:\; A^\ast \cup R^\ast \to A^\ast \cup R^\ast\). This is called the parentdir function. Note the similarity to Lisp’s cdr, or the tail operation.

It takes a little to show that this is properly a function, but it comes from the faithfulness of \(R\). Note that \(p(A^\ast)=A\) and \(p(R^\ast)=R^\ast\), and that these as disjoint.

We define another a function \(b:\; A^\ast \cup R^\ast \to A \cup A\). This is called the basename function. Note the similarity to Lisp’s car, or the head operation.

Note that for nonroots, the resault is always (and covers every) path componant – that is to say \(b(A^\ast)=R\). And for roots, this is the identity. We note that \(b(x)=x \Leftrightarrow p(x)=x \Leftrightarrow x\; is\; \;a \; root\)

\(p\) and \(b\) are linked. In that: \(\forall x \in A^\ast \cup R^\ast \; p(x) \cdot b(x) = x\). This can probably be related to a kind of pseduo-inverse.

Parentdir and \(\varphi\) (..)

As a digression: In many implementations of paths, it may be tempting to define an element $\varphi \in R$ and state that \(p(x) = x\cdot \varphi\). Eg in filepaths this would be ... However this is not possible, under the above definition of a path, as then \(\left( R,\cdot \right)\) would no longer be a free monoid, or indeed a monoid at all. As \(I_R \cdot \varphi = \varphi\) by the fact that \(I_R\) is the identity; but \(p(I_R)= I_R\) by definition of \(p\). Depending on how this is resolved one might end up with for pathjoin(a, ./..)==a. A alternative is to define a normalisation function (or a equalience relation), that resolves this; as we will do later.

Derived Operations

the root function

We define a function to find the root of a path. \(root:\; A^\ast \cup R^\ast \to A\cup\{I_R\}\)

The root of an element of \(R^\ast\) is always \(I_R\); and of an element of \(A^\ast\) is always an element of \(R\).

To make finding roots fast (as it is useful for the relative_to operation) it seems efficent to always store an element of \(A^\ast\), as its root (an element of \(A\)), and a relative path from that root (an element of \(R^\ast\); which we will call the relative compent of the absolute path). We know that all absolute paths can be expressed this way, due to the faithfulness of \(R\).

the depth function

We define a function \(d:\; A^\ast \cup R^\ast \to \mathbb{N}_0)\\) , often call depth or nesting level.

the parts function

We define a funtion to find the parts of a path. This breaks a path down in to a sequence starting with it’s root, and followed by path componants

We can recombine the path components: for product over \(\cdot\) by \(x=\prod_{p\in parts(x)} p\).

Notice this option (like all the other really), can be implemented easily, by implementing it for the \(R^\ast\) and then apply it to the relative component of a absolute path, until it reach the root – and then subsituting the absolute root for \(I_R\).

the within function

\(within\), takes two paths, with one inside the other and finds the relative path from the second to the first. It is a weaker version of \(relative_to\)

Resolving a Path to the object

Finally we have an the operation that turns a absolute path (\(x\in A\)) into a entity, or a set of entities from the Domain being indexed \(D\). These operations are less clear, as at this level objects must be considered – the data store (the indexed set) must be accessed to resolve. And issues like symlinks, hardlinks, etc start to matter (To use examples from POSIX filepaths).

Depending on the type of path schema, this may be 1 enitity, or many

Either way we can make some statements about the cardinality of \(e(x)\)

\(\varphi\) the Pseudoparent Element

This was touched on before, that for many systems we would like to have a element \(\varphi\), that acts like a the parentdir function. But we can’t because it would break the monoid. Now that we have a understanding of evaluating a path to a domain object, it starts to make most sense to talk of this. This \(\varphi\) is not defined for all Path Schema, but it is for many (For file paths it is ..). Where it exists:

We define \(\varphi\in R\), as having the property that: \(\forall x \in A^\ast\;\) then \(e(x\cdot \varphi) = e(p(x))\).

Does \(\varphi\) even exist? (POSIX compliance)

Note that for POSIX filesystems using \(\varphi = \mathtt{..}\) this is not POSIX compliant, as POSIX requires that Symlink directories are substituted in to the path, before .. (our \(\varphi\)) is resolved. Our \(p\) function does not have that requirement. As far as I know there is no way to be POSIX complient on the behavour of .. without actually reading the filesystem; to know what is or is not a symlink.

See this Unix Stack Exchange question for more information.

The behavour here is the default behavour in Bash. ksh, zsh and ash, with the -L flag (which is on by default for cd and off by default for pwd). Also in python os.path, and Node.jl’s path.

Python3 pathlib, has the correct behavour – in that it does not process .. at all, except in the final resolve step. i.e. it does not offer any of the following functionality except a final resolve for absolute paths (Their \(relative\_to\) is the \(within\) defined ealier). The Haskell Path Package bans .. outright.

This gets particular hairy for Multipaths; which have path components that are more complex than simply directories. Consider for a Glob: \(\mathtt{a/**/b/..}\) finds all folders below a with a sibling that is named b. Where as \(p(\mathtt{a/**/b})=\mathtt{a/**}\) which finds all folders below a. And so .. is not \(\varphi\) for Glob paths, and indeed there is no such element for them.

So not supporting funtions involving \(\varphi\) may be a good idea for an implementation. Without it though you can not have \(norm\) nor \(relative_{to}\). Except by accessing the backing system.

Normalising, to remove \(\varphi\)

Using this, and we can now define a normalising function that removes the \(\varphi\) where possible.

The function we will define removes all instances of \(\varphi\) from absolute paths (elements of \(A^\ast\)), and all non-leading instances from relative paths (elements of \(R^\ast\)) – though some nonleading \(\varphi\)s will potentenitally become leading.

We consider this on parts the relative componant of the path. That is \(parts(within(x,root(x)))\) (which is just \(x\) if \(x \in R^\ast\)) In this we repeatedly replace: from the left the earliest occurances of \(\left[ x,\varphi, y\right]\) with \(\left[ y \right]\), where \(x \in R \setminus {\varphi}\), and \(y\in R\). If it were a relative path, then this all we have to do, we simply rejoin the rewritten paths, and it is done. If it is applied to a absolute path, then we then can strip all leading instances of \(\varphi\), since \(\forall a \in A\), \(e(a \cdot \varphi) = e(p(a)) = e(a)\).

The replacement rule method is quite easy to explain in words. But rather awkward to write mathematically.

To supplement it for clarity, see the implementation, of normparts in this Julia, Jupyter Notebook.

Proving that \(norm\) does have the features we import upon it, with regard to not changing what paths point to, is not done here. Indeed I am yet to do so at all. It seems like to be very involved.

Note that even if for some \(x,\; y \in A \cap R^\ast\) with \(norm(x) = norm(y)\), this still does not imply that \(e(x) \neq e(y)\), as other things, eg POSIX symlinks, can still give multiple paths, to the same domain object.

The relative_to function

Earlier we saw the $within$ function; which could find the relative path of one directory, within the other. But this was limitted, as we could not move up the heirachy. Now using \(\varphi\) we can.

\(relative\_to\), takes two paths, and finds the relative path from the second to the first.

And with this: \(norm(y \cdot relative\_to(x,y)) = norm(x)\). And \(e(y \cdot relative\_to(x,y)) = e(x)\). (proving that, also looks like it would be very fun, and is not done here).

As suggeted before, if \(\varphi\) is not defined to meet \(e(x \cdot \varphi) = e(p(x))\), and thus \(norm\) is not defined, we can still define the properties of \(relative_to\) If we do not have

Other extensions

Canonical Name

Some path schema do actually permit a canonical name. In these schema \(\forall x,y \in A^\ast\; e(x)=e(y) \Leftrightarrow x=y\) Swift Object Stores are one such path system which permit a canonical name.

Differenciating Directory Path from File Paths

We have differenciated Absolute Paths from Relative Paths, it is also possible to differenciate File paths from Directory Paths. This is done in the Haskell Path package. This places restictions on pathjoin (\(\cdot\)). For $R_F^\ast\subset R^\ast$ the relative file paths, and for $A_F^\ast\subset A^\ast$ the set absolute file paths. For $R_D^\ast=\subset R^\ast = R^\ast \setminus R_F^\ast$ the set of relative directory paths, and for \(A_F^\ast\subset A^\ast = A^\ast \setminus A_F^\ast\) then only the following joins are permitted, and they have the following results:

This stops \(R\) from being a monoid, but \(R_D\) is still a free monoid. The earlier definitions can be rewritten again, to take this into account, though it does add considerable complexity. We replace all uses of \(R^\ast\) with uses of $R_D^\ast$, paired with $R_F$.

Conclusion

So that is paths. This is a highly abstract set of features. We show that from the simple rules we can get complex functionality, that is applicable accross the wide range of pathing systems It is applicable on all kinds of things, from key-values stores to XPATH. Most importantly perhaps, it covers URLs and Filepaths.

In general one might think monopath schema with only as single absolute root, are isomorphic to strings. This is true, but it is important to note the size of the alphabet – which for most path schema in use is countable infinite.

We did not prove any of our assertions, here. And indeed I’ve not written out proof for many of them at all. I hope however this was illuminating.

See also: