Finding the longest common prefix of an array of strings in Python, part 2

2007-01-06 21:39:50 -08:00

In case you found my previous post unsatisfactory, and wanted to know how one would actually implement os.path.commonprefix in Python, here’s my answer.

(And yes, I started writing this before I found os.path.commonprefix in the documentation. ☺)

This is in response to wootest‘s post on finding the longest common prefix of an array of strings in Ruby.

files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’,
‘/foo/bar/xyzzy_three/four’]

No change here. For relative paths, use os.path.join(os.getcwd(), path) to convert each path to an absolute path.

Zeroth, we don’t set up an array to hold guesses. We’ll extract exactly one prefix, so no array is necessary.

Let’s put this whole thing in a function for easy access. (It also makes the code clearer later.)

def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”

The * means that this function is variadic: You use it with the syntax longest_common_prefix(foo, bar, baz). If you want to pass an array of paths, use the apply syntax: longest_common_prefix(*files) (does that asterisk look familiar?).

The first thing our function should do is assert that we actually have some strings. There are two reasons for this:

  1. My algorithm assumes that there is at least one string.
  2. Trying to get the longest common prefix of no strings is definitely unusual, and should be suspected of being a bug.

assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’

Next is an if to cover the case of exactly one string. The algorithm would handle this just fine, but inefficiently: One string is its own longest common prefix, so we can simply return that. This if would also cover the case of no strings, but I like having an assertion for that (see #2 above). I’m paranoid that way.

if len(files) == 1:
return files[0]

Next we put the filenames in order by length. Having the shortest string first is necessary everywhere you see files[0] mentioned below.

files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

This is the “decorate-sort-undecorate” pattern, used in Python because built-in comparison is much faster than a pure-Python comparison function.

Next is the outer loop, in which we start iterating the characters of the first (shortest) string.

for i, comparison_ch in enumerate(files[0]):

The built-in enumerate yields a pair (idx, element) for every item in the sequence.

The inner loop iterates through all our other paths.

for fi in files[1:]:

Get character i of this path.

ch = fi[i]

Compare it to the same character in the shortest path.

if ch != comparison_ch:

If the equality test fails, then we have found the first character that is not in the longest common prefix. This means that i is the length of said prefix. We extract the prefix by slicing the shortest path.

return fi[:i]

The stop index i is not included in the slice, just as with ranges). This gets us the first i characters.

If the equality test succeeds, then we haven’t gone past the longest common prefix, and we continue iterating.

Outside the loops (notice that we’ve gone down several indentation levels):

return ”

If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.

	return files[0]

If we’ve run out of loop, then our shortest string (which we sorted to first in the list) is also the longest common prefix. (Thanks to Lars for the correction.)

Now, despite the variable names, that function is suited for arbitrary sequences, and does not properly handle the case of a partial directory name. There are two solutions.

Ugly solution

Split the pathnames into components, pass these lists to longest_common_prefix, and join the output and prepend a ‘/’ to make it back into a path.

‘/’ + path.join(*longest_common_prefix(*(fi.split(‘/’) for fi in files)))

Robust solution

def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)

if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)

return lcp

Alternatively, I could use:

if lcp[-1:] != ‘/’:

which mirrors wootest’s solution. (It would extract characters -1 through the end — i.e. the last character, or the empty string if lcp is also the empty string.) But str.endswith is more Pythonic.

Here’s all the code at once, implemented as a complete combination Python module and self-test:

#!/usr/bin/env python

def longest_common_prefix(*files):
“This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided.”
assert files, ‘Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.’

if len(files) == 1:
return files[0]

files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

for i, comparison_ch in enumerate(files[0]):
for fi in files[1:]:
ch = fi[i]

if ch != comparison_ch:
return fi[:i]

return files[0]

def longest_common_pathname_prefix(*files):
“This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided.”
lcp = longest_common_prefix(*files)

if not lcp.endswith(‘/’):
import os.path
lcp = os.path.dirname(lcp)

return lcp

if __name__ == ‘__main__’:
files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’, ‘/foo/bar/xyzzy_three/four’]
print longest_common_prefix(*files)
print longest_common_pathname_prefix(*files)

9 Responses to “Finding the longest common prefix of an array of strings in Python, part 2”

  1. Jonathan Wight Says:

    If you take your full complete implementation and pass in:

    files = [‘/foo/bar/xyzzy_one’, ‘/foo/bar/xyzzy_two’, ‘/foo/bar/xyzzy_three/four’, ‘/foo/bar/xyzzy_’]

    You get no output.

  2. Nicholas Riley Says:

    One minor style nit (which may reduce speed, but improve readability; then again, this is Python, so YMMV :-)

    files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

    I typically use tuple unpacking in this case:

    files = [name for length, name in sorted((len(fi), fi) for fi in files))]

    OTOH, Python has a function that does exactly what you want: os.path.commonprefix. It uses min and max in a really neat way I wouldn’t have thought of, which is most likely more efficient than sorting the list.

    http://svn.python.org/view/python/trunk/Lib/genericpath.py?rev=51719&view=markup

  3. Nicholas Riley Says:

    Or, I could learn to read. *sigh*

    (In my defense, your website layout is really difficult to read.)

  4. Peter Hosey Says:

    Indeed, genericpath.commonprefix has an awesome implementation. Thanks for the link!

    What’s hard to read about my website layout? (Other than the fact that all the articles get shunted down below the sidebar half the time. That’s a bug that I still haven’t figured out; something to do with the wrapping nature of heading elements, I think.)

  5. Nicholas Riley Says:

    It’s hard to tell posts apart from one another, and the text uses the entire page width. Sometimes whitespace is good.

  6. Peter Hosey Says:

    I guess we’ll have to agree to disagree on the whitespace—I don’t like large gutters on either side of the page, as I consider the space wasted.

    I’m not sure what you mean by it being hard to tell posts apart from one another. Are the HR and H2 on each post not doing the job? (No sarcasm meant—I’m asking earnestly.)

  7. Lars Says:

    You wrote:

    > return ”

    If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.

    Huh?
    I don’t understand this…
    It seems to me that if you run out of loop, then all the characters of all the non-shortest strings have agreed with those of the shortest string.
    So your common prefix is the shortest string, files[0].

    It seems unlikely that I would have noticed this while you and your other readers did not… but I can’t figure out why the way it’s written would be correct.
    If you answer this, I’d appreciate an email. I don’t normally check this blog.

  8. Peter Hosey Says:

    Lars: You’re absolutely right. It did, indeed, return '' in cases where that’s wrong. (Example: ['foo', 'foobarbaz', 'foobarqux'] should return 'foo'.)

    Thanks.

  9. lamby Says:

    A functional version might be:

    from itertools import imap, izip
    from operator import itemgetter
    xs = (“Hello”, “Here”)

    print “”.join(
    imap(
    itemgetter(0),
    takewhile(
    lambda x: len(set(x)) == 1,
    izip(*xs)
    )
    )
    )

Leave a Reply

Do not delete the second sentence.