Archive for the 'Python' Category

A language-contrast exercise

Sunday, March 31st, 2013

Python’s str type has a translate method that, given a second string representing a translation table, returns a new string in which each character of the original string is looked up at its ordinal position in the translation table and replaced with the character found at that position.

The identity translation table, performing no changes, is table[i] = i. For example, table['!']* is '!', so exclamation marks are not changed. If you made a table where table['!'] were '.', exclamation marks would be changed to periods (full stops).

I’d like to see implementations of a program that does that, with the input string encoded in UTF-16 and the translation table encoded in UTF-32 (a 0x110000-element-long array of UTF-32 characters), with the table initialized to its identity: table[i] = i.

And yes, you need to handle surrogate pairs correctly.

Some languages that I would particularly like to see this implemented in include:

  • C
  • Haskell
  • LISP
  • A state-machine language (I don’t know of any off-hand; this might be their time to shine)

I know how I would do this in C, and I’m sure I could bash something out in Python, but how would you do this in your favorite language?

As a test case, you could replace “ and ” (U+201C and U+201D) with « and » (U+00AB and U+00BB).
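In Python terms, and ignoring the explicit UTF-16/UTF-32 plumbing the exercise asks for, that test case would look roughly like this sketch, using unicode.translate with a dict-based table (unlisted code points pass through unchanged):

>>> table = {0x201C: 0x00AB, 0x201D: 0x00BB} #U+201C -> U+00AB, U+201D -> U+00BB
>>> print u'\u201cHello, world\u201d'.translate(table)
«Hello, world»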

If you want to post code in the comments, <pre>…</pre> should work. Alternatively, you can use Gist.

* I’m using the C sense of '!' here. In Python, this would be table[ord('!')], since characters in Python are just strings of length 1, and you can’t index into a string with another string; ord is a function that returns the ordinal (code-point) value of the character in such a string.

What to do if Python says “character mapping must return integer, None or unicode”

Monday, June 16th, 2008

So you’re using the unicode class’s translate method, and it says:

TypeError: character mapping must return integer, None or unicode

You may be wondering what causes this. After all, you’re duly using the string.maketrans function. Surely this should be valid?

Well, no: You can only use string.maketrans with str.translate. For unicode.translate, you must use a different kind of translation table:

… a mapping of Unicode ordinals to Unicode ordinals [int], Unicode strings [unicode] or None.

3.6.1 String Methods

In Localization Helper, I had to replace my string.maketrans call with this code:

dict(zip(map(ord, u'\a\b\t\n\v\f\r'), u'abtnvfr'))

Note that I need to call ord on each key character, because keys must be ints.
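A quick interactive check of what that table does (the sample string here is my own):

>>> table = dict(zip(map(ord, u'\a\b\t\n\v\f\r'), u'abtnvfr'))
>>> u'foo\tbar\n'.translate(table)
u'footbarn'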

Generating all combinations from an arbitrary number of sequences

Monday, October 22nd, 2007

In Python 2.5:

import itertools

def combinations(*seqs):
    def base_combinations(so_far, seq):
        for x in seq:
            yield so_far + [x]
    def nested_combinations(so_far, seqs):
        if len(seqs) == 1:
            return base_combinations(so_far, seqs[0])
        else:
            iterators = (nested_combinations(so_far + [x], seqs[1:]) for x in seqs[0])
            return itertools.chain(*iterators)
    return nested_combinations([], seqs)

Usage:

for x, y, z in combinations(xs, ys, zs):
    # Perform magic
    pass

If you have a better way, I’d like to read it.
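For what it’s worth, Python 2.6 later added itertools.product, which computes the same Cartesian product and yields tuples, so the equivalent usage would be:

import itertools

#Equivalent in Python 2.6 and later (xs, ys, zs as above):
for x, y, z in itertools.product(xs, ys, zs):
    # Perform magic
    pass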

(Thanks to Colin for inspiring me to write it, and renaming one of the nested functions.)

WWDC 2007 session videos are out

Monday, July 30th, 2007

If you attended WWDC, you can head over to ADC on iTunes and see what you missed.

The most efficient way to waste time

Wednesday, May 16th, 2007

In profiling CPU Usage, I need to get my CPUs busy so that the CPU-usage views have something to do. This means that I need a program to busy-wait.

Busy-waiting means running a tight loop that doesn’t actually do anything except run. In C, the most efficient such loop is:

for(;;);

That’s all well and good, but it only busy-waits a single processor. I have four, and I need 1 < n < 4 of them to be lit up so that CPU Usage has something to indicate (otherwise it will sit there showing 0-0-0-0, which doesn’t make good profiling—busy processors will jump around a bit, which gives CPU Usage something to do).

Now, my first approach was to write this in Python. That’s my go-to language for anything without a GUI. Here’s what came out:

#!/usr/bin/env python

def busy_wait():
    from itertools import repeat
    for x in repeat(None):
        pass

import thread, sys
try:
    num_threads = int(sys.argv[1])
except IndexError:
    num_threads = 100

for i in xrange(1, num_threads): #We'll do the first one ourselves after starting all the other threads.
    throwaway = thread.start_new_thread(busy_wait, ())
busy_wait()

Looks good, right?

What was weird is that I couldn’t seem to get it to max out all my processors, even with num_threads=5000. That seemed mighty suspicious.

It was then that I remembered the Global Interpreter Lock.

You see, in CPython, only one thread can be running Python code at a time. (Exceptions exist for things like I/O, of which my program contains none.) This means that my yummy multithreaded busy-wait program—being purely Python—was effectively running single-threaded.

I reimplemented the program in pure C. Not only does it run much more efficiently now (no interpreter overhead), but it also requires far fewer threads: Four threads will light up all four processors. Victory!

If you want a copy for yourself, here it is. It takes one argument, being the number of threads to spawn. It defaults to the number of logical CPUs in your machine (HW_NCPU in sysctl), so if you just run it with no arguments, it will spawn one thread per processor.
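If you would rather stay in Python, one later option (not what I did here) is the multiprocessing module, added in Python 2.6 after this post was written; processes are not subject to the GIL, so a sketch along these lines should keep every core busy:

#!/usr/bin/env python
#A sketch, not the linked program: multiprocessing (Python 2.6+) sidesteps the
#GIL by running one busy-waiting process per logical CPU.
import multiprocessing

def busy_wait():
    while True:
        pass

if __name__ == '__main__':
    #Spawn one child per remaining CPU, then busy-wait in this process too.
    for i in xrange(multiprocessing.cpu_count() - 1):
        multiprocessing.Process(target=busy_wait).start()
    busy_wait()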

Ever wonder which is the fastest way to concatenate strings in Python?

Sunday, May 6th, 2007

This is a response to Ever wonder which is the fastest way to concatenate strings in Ruby?.

I tested using Python 2.5 on Mac OS X 10.4.9 on my four-core Mac Pro. Here are the results:

Method            Time (seconds per million iterations)
+                 0.308359861374
str.join(list)    0.53214097023
str.join(tuple)   0.48233294487
%                 0.515310049057

That’s quite a surprise—the usual advice is to avoid the + operator because it is inefficient. But here we see that it wins quite handily. Google revealed Chris Siebenmann’s explanation: Python 2.4 fixed the + operator to be much more efficient by only allocating one string instead of n-1 strings.

So, clearly, the old advice no longer applies. Go forth and use +.

Oh, and in case you’re curious, here’s the code. (The times shown above are using the re-create-the-string-every-time version of the code, for comparison’s sake. Not doing that only saves about 1/100 of a second on each test case.)
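In case the link ever rots, here is a rough sketch of how such a benchmark can be written with the timeit module; the operand strings and the per-method statements are my guesses, not the original code:

import timeit

#Hypothetical reconstruction: one million concatenations per method,
#re-creating the result string every time.
setup = "a = 'spam'; b = 'eggs'"
methods = [
    ('+',               "a + ' ' + b"),
    ('str.join(list)',  "' '.join([a, b])"),
    ('str.join(tuple)', "' '.join((a, b))"),
    ('%',               "'%s %s' % (a, b)"),
]
for name, stmt in methods:
    print name, timeit.Timer(stmt, setup).timeit(number=1000000)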

So you need to get a count of all asterisks in a string

Monday, April 16th, 2007

You have a string, which I’ll call sample, and you need to count the number of asterisks in it. For comparison purposes:

>>> import re, time
>>> sample = ' *' * 5000000 #Just imagine that you got this from somewhere else

What do you do?

Solution A: filter

(ifilter won’t work here, because you can’t count it.)

>>> start = time.time(); len(filter(lambda ch: ch == '*', sample)); end = time.time()
5000000 #The correct result
>>> end - start
2.2621231079101562 #seconds

Solution B: List comprehension

(Generator comprehensions won’t work here, because you can’t count them.)

>>> start = time.time(); len([ch for ch in sample if ch == '*']); end = time.time()
5000000
>>> end - start
2.005012035369873

OK, so it looks like I’ll be going with list com—WAIT! What’s that!? It’s a regular expression!

Solution C: re.findall

>>> start = time.time(); len(re.findall(r'\*', sample)); end = time.time()
5000000
>>> end - start
0.40664911270141602

…Wow.

Incidentally, I didn’t find a statistically-significant speed-up in running re.compile over the expression first. Apparently, this expression isn’t complex enough for that to help any.

[Added 2007-04-19] So I guess that’s it then. re.findall is the winn—

What’s this? New comment from Chuck…

Solution D: str.count

>>> start = time.time(); sample.count('*'); end = time.time();
5000000
>>> end - start
0.038351058959960938

A factor-of-ten improvement! Wow—thanks, Chuck!

ROT13 in Python

Wednesday, January 31st, 2007

Patr1ck has posted instructions on performing ROT13 in Ruby. Colin has responded with the Perl version. Here’s the Python version.

import codecs
rot13ed_data = codecs.getencoder('rot13')(data_to_rot13)[0]
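A quick sanity check (in Python 2 the same codec is also reachable through str.encode, at least for ASCII input):

>>> 'Hello, world!'.encode('rot13')
'Uryyb, jbeyq!'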

That is all.

Finding the longest common prefix of an array of strings in Python, part 2

Saturday, January 6th, 2007

In case you found my previous post unsatisfactory, and wanted to know how one would actually implement os.path.commonprefix in Python, here’s my answer.

(And yes, I started writing this before I found os.path.commonprefix in the documentation. ☺)

This is in response to wootest’s post on finding the longest common prefix of an array of strings in Ruby.

files = ['/foo/bar/xyzzy_one', '/foo/bar/xyzzy_two',
         '/foo/bar/xyzzy_three/four']

No change here. For relative paths, use os.path.join(os.getcwd(), path) to convert each path to an absolute path.

Zeroth, we don’t set up an array to hold guesses. We’ll extract exactly one prefix, so no array is necessary.

Let’s put this whole thing in a function for easy access. (It also makes the code clearer later.)

def longest_common_prefix(*files):
    "This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided."

The * means that this function is variadic: You use it with the syntax longest_common_prefix(foo, bar, baz). If you want to pass an array of paths, use the apply syntax: longest_common_prefix(*files) (does that asterisk look familiar?).

The first thing our function should do is assert that we actually have some strings. There are two reasons for this:

  1. My algorithm assumes that there is at least one string.
  2. Trying to get the longest common prefix of no strings is definitely unusual, and should be suspected of being a bug.

    assert files, 'Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.'

Next is an if to cover the case of exactly one string. The algorithm would handle this just fine, but inefficiently: One string is its own longest common prefix, so we can simply return that. This if would also cover the case of no strings, but I like having an assertion for that (see #2 above). I’m paranoid that way.

    if len(files) == 1:
        return files[0]

Next we put the filenames in order by length. Having the shortest string first is necessary everywhere you see files[0] mentioned below.

    files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

This is the “decorate-sort-undecorate” pattern, used in Python because built-in comparison is much faster than a pure-Python comparison function.
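(For this purpose, sorting on length alone would also work, using the key argument that sorted has had since it was introduced in Python 2.4; any shortest string will do as files[0].)

    files = sorted(files, key=len)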

Next is the outer loop, in which we start iterating the characters of the first (shortest) string.

    for i, comparison_ch in enumerate(files[0]):

The built-in enumerate yields a pair (idx, element) for every item in the sequence.

The inner loop iterates through all our other paths.

        for fi in files[1:]:

Get character i of this path.

            ch = fi[i]

Compare it to the same character in the shortest path.

            if ch != comparison_ch:

If the equality test fails, then we have found the first character that is not in the longest common prefix. This means that i is the length of said prefix. We extract the prefix by slicing the shortest path.

                return fi[:i]

The stop index i is not included in the slice, just as with ranges. This gets us the first i characters.

If the equality test succeeds, then we haven’t gone past the longest common prefix, and we continue iterating.

Outside the loops (notice that we’ve gone down several indentation levels):

    return ''

If we’ve run out of loop, there must not be a common prefix; our common prefix, therefore, is the empty string. We return that as a literal.

At least, that was my original reasoning, and it’s wrong. The correct final line is:

    return files[0]

If we’ve run out of loop, then our shortest string (which we sorted to first in the list) is also the longest common prefix. (Thanks to Lars for the correction.)

Now, despite the variable names, that function is suited for arbitrary sequences, and does not properly handle the case of a partial directory name. There are two solutions.

Ugly solution

Split the pathnames into components, pass these lists to longest_common_prefix, then join the output and prepend a '/' to turn it back into a path.

import os.path
'/' + os.path.join(*longest_common_prefix(*(fi.split('/') for fi in files)))

Robust solution

def longest_common_pathname_prefix(*files):
    "This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided."
    lcp = longest_common_prefix(*files)

    if not lcp.endswith('/'):
        import os.path
        lcp = os.path.dirname(lcp)

    return lcp

Alternatively, I could use:

    if lcp[-1:] != '/':

which mirrors wootest’s solution. (It would extract characters -1 through the end — i.e. the last character, or the empty string if lcp is also the empty string.) But str.endswith is more Pythonic.

Here’s all the code at once, implemented as a complete combination Python module and self-test:

#!/usr/bin/env python

def longest_common_prefix(*files):
    "This function returns the longest common prefix of one or more sequences. Raises an exception when zero sequences are provided."
    assert files, 'Longest common prefix of no strings requested. Such behavior is highly irrational and is not tolerated by this program.'

    if len(files) == 1:
        return files[0]

    files = [pair[1] for pair in sorted((len(fi), fi) for fi in files)]

    for i, comparison_ch in enumerate(files[0]):
        for fi in files[1:]:
            ch = fi[i]

            if ch != comparison_ch:
                return fi[:i]

    return files[0]

def longest_common_pathname_prefix(*files):
    "This function returns the longest common prefix of one or more pathnames. Raises an exception when zero pathnames are provided."
    lcp = longest_common_prefix(*files)

    if not lcp.endswith('/'):
        import os.path
        lcp = os.path.dirname(lcp)

    return lcp

if __name__ == '__main__':
    files = ['/foo/bar/xyzzy_one', '/foo/bar/xyzzy_two', '/foo/bar/xyzzy_three/four']
    print longest_common_prefix(*files)
    print longest_common_pathname_prefix(*files)
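For the sample paths above, running the module as a script should print:

/foo/bar/xyzzy_
/foo/bar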

Finding the longest common prefix of an array of strings in Python

Saturday, January 6th, 2007

This is in response to wootest’s post on finding the longest common prefix of an array of strings in Ruby.

files = ['/foo/bar/xyzzy_one', '/foo/bar/xyzzy_two',
         '/foo/bar/xyzzy_three/four']

No change here. For relative paths, use os.path.join(os.getcwd(), path) to convert each path to an absolute path.

We follow this up with:

import os
lcp = os.path.commonprefix(files)

*bows*

*walks off*