Idle Time

Finding the longest common prefix of an array of strings in Python, part 2

2007-01-06 21:39:50 -08:00

In case you found my previous post unsatisfactory, and wanted to know how one would actually implement os.path.commonprefix in Python, here’s my answer.

(And yes, I started writing this before I found os.path.commonprefix in the documentation. ☺)

This is in response to wootest‘s post on finding the longest common prefix of an array of strings in Ruby.

files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’,
‘/foo/bar/xyzzy_three/four’]

No change here. For relative paths, use os.path.join(os.getcwd(), path) to convert each path to an absolute path.

Zeroth, we don’t set up an array to hold guesses. We’ll extract exactly one prefix, so no array is necessary.

Let’s put this whole thing in a function for easy access. (It also makes the code clearer later.)

def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”

The * means that this function is variadic: You use it with the syntax longest_common_prefix(foo, bar, baz). If you want to pass an array of paths, use the apply syntax: longest_common_prefix(*files) (does that asterisk look familiar?).

The first thing our function should do is assert that we actually have some strings. There are two reasons for this:

My algorithm assumes that there is at least one string.
Trying to get the longest common prefix of no strings is definitely unusual, and should be suspected of being a bug.

assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’

Next is an if to cover the case of exactly one string. The algorithm would handle this just fine, but inefficiently: One string is its own longest common prefix, so we can simply return that. This if would also cover the case of no strings, but I like having an assertion for that (see #2 above). I’m paranoid that way.

if len(files) == 1:
return files[0]

Next we put the filenames in order by length. Having the shortest string first is necessary everywhere you see files[0] mentioned below.

files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

This is the “decorate-sort-undecorate” pattern, used in Python because built-in comparison is much faster than a pure-Python comparison function.

Next is the outer loop, in which we start iterating the characters of the first (shortest) string.

for i, comparison_ch in enumerate(files[0]):

The built-in enumerate yields a pair (idx, element) for every item in the sequence.

The inner loop iterates through all our other paths.

for fi in files[1:]:

Get character i of this path.

ch = fi[i]

Compare it to the same character in the shortest path.

if ch != comparison_ch:

If the equality test fails, then we have found the first character that is not in the longest common prefix. This means that i is the length of said prefix. We extract the prefix by slicing the shortest path.

return fi[:i]

The stop index i is not included in the slice, just as with ranges). This gets us the first i characters.

If the equality test succeeds, then we haven’t gone past the longest common prefix, and we continue iterating.

Outside the loops (notice that we’ve gone down several indentation levels):

return ”

If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.

	return files[0]

If we’ve run out of loop, then our shortest string (which we sorted to first in the list) is also the longest common prefix. (Thanks to Lars for the correction.)

Now, despite the variable names, that function is suited for arbitrary sequences, and does not properly handle the case of a partial directory name. There are two solutions.

Ugly solution

Split the pathnames into components, pass these lists to longest_common_prefix, and join the output and prepend a ‘/’ to make it back into a path.

‘/’ + path.join(*longest_common_prefix(*(fi.split(‘/’) for fi in files)))

Robust solution

def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)

if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)

return lcp

Alternatively, I could use:

if lcp[-1:] != ‘/’:

which mirrors wootest’s solution. (It would extract characters -1 through the end — i.e. the last character, or the empty string if lcp is also the empty string.) But str.endswith is more Pythonic.

Here’s all the code at once, implemented as a complete combination Python module and self-test:

#!/usr/bin/env python

def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”
assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’

if len(files) == 1:
return files[0]

files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

for i, comparison_ch in enumerate(files[0]):
for fi in files[1:]:
ch = fi[i]

if ch != comparison_ch:
return fi[:i]

return files[0]

def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)

if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)

return lcp

if __name__ == ‘__main__’:
files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’, ‘/foo/bar/xyzzy_three/four’]
print longest_common_prefix(*files)
print longest_common_pathname_prefix(*files)

Categories: Programming; Python. | Comments: 9 (feed).

9 Responses to “Finding the longest common prefix of an array of strings in Python, part 2”

Jonathan Wight Says:
January 9th, 2007 at 06:41:31
If you take your full complete implementation and pass in:

files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’, ‘/foo/bar/xyzzy_three/four’, ‘/foo/bar/xyzzy_’]

You get no output.
Nicholas Riley Says:
January 21st, 2007 at 18:39:28
One minor style nit (which may reduce speed, but improve readability; then again, this is Python, so YMMV :-)

files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

I typically use tuple unpacking in this case:

files = [name for length, name in sorted((len(fi), fi) for fi in files))]

OTOH, Python has a function that does exactly what you want: os.path.commonprefix. It uses min and max in a really neat way I wouldn’t have thought of, which is most likely more efficient than sorting the list.

http://svn.python.org/view/python/trunk/Lib/genericpath.py?rev=51719&view=markup
Nicholas Riley Says:
January 21st, 2007 at 18:41:48
Or, I could learn to read. *sigh*

(In my defense, your website layout is really difficult to read.)
Peter Hosey Says:
January 21st, 2007 at 21:08:24
Indeed, genericpath.commonprefix has an awesome implementation. Thanks for the link!

What’s hard to read about my website layout? (Other than the fact that all the articles get shunted down below the sidebar half the time. That’s a bug that I still haven’t figured out; something to do with the wrapping nature of heading elements, I think.)
Nicholas Riley Says:
January 28th, 2007 at 19:25:25
It’s hard to tell posts apart from one another, and the text uses the entire page width. Sometimes whitespace is good.
Peter Hosey Says:
January 28th, 2007 at 19:53:10
I guess we’ll have to agree to disagree on the whitespace—I don’t like large gutters on either side of the page, as I consider the space wasted.

I’m not sure what you mean by it being hard to tell posts apart from one another. Are the HR and H2 on each post not doing the job? (No sarcasm meant—I’m asking earnestly.)
Lars Says:
September 11th, 2008 at 06:48:03
You wrote:

> return ”

If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.

Huh?
I don’t understand this…
It seems to me that if you run out of loop, then all the characters of all the non-shortest strings have agreed with those of the shortest string.
So your common prefix is the shortest string, files[0].

It seems unlikely that I would have noticed this while you and your other readers did not… but I can’t figure out why the way it’s written would be correct.
If you answer this, I’d appreciate an email. I don’t normally check this blog.
Peter Hosey Says:
September 11th, 2008 at 10:07:59
Lars: You’re absolutely right. It did, indeed, return '' in cases where that’s wrong. (Example: ['foo', 'foobarbaz', 'foobarqux'] should return 'foo'.)

Thanks.
lamby Says:
January 26th, 2009 at 14:34:16
A functional version might be:

from itertools import imap, izip
from operator import itemgetter
xs = (“Hello”, “Here”)

print “”.join(
imap(
itemgetter(0),
takewhile(
lambda x: len(set(x)) == 1,
izip(*xs)
)
)
)

Finding the longest common prefix of an array of strings in Python, part 2

Ugly solution

Robust solution

9 Responses to “Finding the longest common prefix of an array of strings in Python, part 2”

Leave a Reply