In case you found my previous post unsatisfactory, and wanted to know how one would actually implement os.path.commonprefix in Python, here’s my answer.
(And yes, I started writing this before I found os.path.commonprefix in the documentation. ☺)
This is in response to wootest‘s post on finding the longest common prefix of an array of strings in Ruby.
files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’,
‘/foo/bar/xyzzy_three/four’]
No change here. For relative paths, use os.path.join(os.getcwd(), path)
to convert each path to an absolute path.
Zeroth, we don’t set up an array to hold guesses. We’ll extract exactly one prefix, so no array is necessary.
Let’s put this whole thing in a function for easy access. (It also makes the code clearer later.)
def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”
The *
means that this function is variadic: You use it with the syntax longest_common_prefix(foo, bar, baz)
. If you want to pass an array of paths, use the apply syntax: longest_common_prefix(*files)
(does that asterisk look familiar?).
The first thing our function should do is assert that we actually have some strings. There are two reasons for this:
- My algorithm assumes that there is at least one string.
- Trying to get the longest common prefix of no strings is definitely unusual, and should be suspected of being a bug.
assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’
Next is an if to cover the case of exactly one string. The algorithm would handle this just fine, but inefficiently: One string is its own longest common prefix, so we can simply return that. This if would also cover the case of no strings, but I like having an assertion for that (see #2 above). I’m paranoid that way.
if len(files) == 1:
return files[0]
Next we put the filenames in order by length. Having the shortest string first is necessary everywhere you see files[0]
mentioned below.
files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]
This is the “decorate-sort-undecorate” pattern, used in Python because built-in comparison is much faster than a pure-Python comparison function.
Next is the outer loop, in which we start iterating the characters of the first (shortest) string.
for i, comparison_ch in enumerate(files[0]):
The built-in enumerate yields a pair (idx, element) for every item in the sequence.
The inner loop iterates through all our other paths.
for fi in files[1:]:
Get character i of this path.
ch = fi[i]
Compare it to the same character in the shortest path.
if ch != comparison_ch:
If the equality test fails, then we have found the first character that is not in the longest common prefix. This means that i is the length of said prefix. We extract the prefix by slicing the shortest path.
return fi[:i]
The stop index i is not included in the slice, just as with ranges). This gets us the first i characters.
If the equality test succeeds, then we haven’t gone past the longest common prefix, and we continue iterating.
Outside the loops (notice that we’ve gone down several indentation levels):
return ”
If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.
return files[0]
If we’ve run out of loop, then our shortest string (which we sorted to first in the list) is also the longest common prefix. (Thanks to Lars for the correction.)
Now, despite the variable names, that function is suited for arbitrary sequences, and does not properly handle the case of a partial directory name. There are two solutions.
Ugly solution
Split the pathnames into components, pass these lists to longest_common_prefix, and join the output and prepend a ‘/’ to make it back into a path.
‘/’ + path.join(*longest_common_prefix(*(fi.split(‘/’) for fi in files)))
Robust solution
def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)
if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)
return lcp
Alternatively, I could use:
if lcp[-1:] != ‘/’:
which mirrors wootest’s solution. (It would extract characters -1 through the end — i.e. the last character, or the empty string if lcp is also the empty string.) But str.endswith is more Pythonic.
Here’s all the code at once, implemented as a complete combination Python module and self-test:
#!/usr/bin/env python
def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”
assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’
if len(files) == 1:
return files[0]
files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]
for i, comparison_ch in enumerate(files[0]):
for fi in files[1:]:
ch = fi[i]
if ch != comparison_ch:
return fi[:i]
return files[0]
def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)
if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)
return lcp
if __name__ == ‘__main__’:
files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’, ‘/foo/bar/xyzzy_three/four’]
print longest_common_prefix(*files)
print longest_common_pathname_prefix(*files)