Idle Time

A language-contrast exercise

2013-03-31 14:59:34 -08:00

Python’s str type has a translate method that takes a second string as a translation table and returns a new string: each character of the original string is looked up by its ordinal value in the table and replaced with the character found at that position.

The identity translation table, performing no changes, is table[i] = i. For example, table['!']* is '!', so exclamation marks are not changed. If you made a table where table['!'] were '.', exclamation marks would be changed to periods (full stops).
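A minimal sketch of that lookup for 8-bit characters, in Python. The helper names identity_table and translate are made up for illustration; this is not how str.translate is implemented, just the same idea spelled out.

```python
def identity_table():
    """Return a 256-entry table where table[i] is the character with ordinal i."""
    return [chr(i) for i in range(256)]


def translate(text, table):
    """Replace each character with the table entry at its ordinal position."""
    return "".join(table[ord(ch)] for ch in text)


table = identity_table()
unchanged = translate("Hello, world!", table)  # identity table: no changes

table[ord("!")] = "."  # the Python spelling of table['!'] = '.'
changed = translate("Hello, world!", table)    # exclamation marks become periods
```

With the identity table the text comes back untouched; after the one assignment, only the exclamation mark differs.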

I’d like to see implementations of a program that does that, with the input string encoded in UTF-16 and the translation table encoded in UTF-32 (a 0x110000-element array of UTF-32 characters, one for each code point from U+0000 through U+10FFFF), with the table initialized to its identity: table[i] = i.

And yes, you need to handle surrogate pairs correctly.
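For what it’s worth, here is roughly what a Python attempt might look like. It is only a sketch under a few assumptions of mine: the input is a list of UTF-16 code units (plain integers), the table has one entry per code point (0x110000 in all), and an unpaired surrogate is looked up as if it were an ordinary code unit. The names make_identity_table, utf16_units, and translate_utf16 are hypothetical.

```python
from array import array

TABLE_SIZE = 0x110000  # one entry per code point, U+0000 through U+10FFFF


def make_identity_table():
    """Build the identity table: table[i] == i. 'L' holds at least 32 bits."""
    return array("L", range(TABLE_SIZE))


def utf16_units(text):
    """Convert a Python string to a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]


def translate_utf16(units, table):
    """Translate UTF-16 code units through a table indexed by code point."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        i += 1
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        else:
            cp = unit  # BMP character, or an unpaired surrogate (assumption)
        cp = table[cp]
        if cp >= 0x10000:
            # Re-encode a supplementary-plane result as a surrogate pair.
            cp -= 0x10000
            out.append(0xD800 + (cp >> 10))
            out.append(0xDC00 + (cp & 0x3FF))
        else:
            out.append(cp)
    return out


table = make_identity_table()
poo = utf16_units("a\U0001F4A9b")          # [0x61, 0xD83D, 0xDCA9, 0x62]
same = translate_utf16(poo, table)         # identity: unchanged

table[0x1F4A9] = ord("!")                  # supplementary plane -> BMP
shrunk = translate_utf16(poo, table)

table[ord("!")] = 0x1F4A9                  # BMP -> supplementary plane
grown = translate_utf16(utf16_units("!"), table)
```

The interesting part is that the output can be shorter or longer than the input in code units, which is why the sketch builds a new list instead of translating in place.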

Some languages that I would particularly like to see this implemented in include:

C
Haskell
LISP
A state-machine language (I don’t know of any off-hand; this might be their time to shine)

I know how I would do this in C, and I’m sure I could bash something out in Python, but how would you do this in your favorite language?

As a test case, you could replace “ and ” (U+201C and U+201D) with « and » (U+00AB and U+00BB).
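If you want something to compare your output against, Python 3’s built-in str.translate takes a mapping keyed by code point (rather than the 256-character table string described above), so the expected result can be checked in a few lines. This does not exercise the UTF-16 handling; it is only a reference for the test case.

```python
# Map the curly double quotes to guillemets, keyed by code point.
quote_table = {
    0x201C: 0x00AB,  # “ -> «
    0x201D: 0x00BB,  # ” -> »
}

original = "\u201cHello\u201d"
expected = original.translate(quote_table)  # "\u00abHello\u00bb"
```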

If you want to post code in the comments, <pre>…</pre> should work. Alternatively, you can use Gist.

* I’m using the C sense of '!' here. In Python, this would be table[ord('!')], since characters in Python are just strings of length 1, and you can’t index into a string with another string; ord is a function that returns the ordinal (code-point) value of the character in such a string.

Categories: C; Programming; Python. | Comments: 6 (feed).

6 Responses to “A language-contrast exercise”

Karsten Says:
April 1st, 2013 at 00:50:29
In Smalltalk it would be something like:

String>>translate:aDictionary

^self collect:[:each | aDictionary at: each ifAbsent:[each]]

aDictionary would map characters to characters. A character object is an object containing the Unicode number of the character.

Likewise, Strings are not encoded in any way. If the string contains a character that’s bigger than 255, it’ll be a two-byte string; if the character doesn’t fit into two bytes, it’ll be a four-byte string. Encoding the strings in UTF-8 or UTF-16 is then done when you write them to a file.
Peter Hosey Says:
April 1st, 2013 at 00:59:18
@Karsten: I don’t understand. You’re talking about characters one minute and bytes the next. What are the elements of a String object representing U+1F4A9?
Karsten Says:
April 3rd, 2013 at 23:07:53
the elements of a String are Character objects.

The actual String class that is used for the string is either ByteString, TwoByteString or FourByteString depending on the Characters in the string, so that the characters don’t need to be converted anymore.
Peter Hosey Says:
April 3rd, 2013 at 23:27:04
So, then, your code only works on a FourByteString? Anything else will need to handle surrogate pairs or UTF-8 sequences, which your code doesn’t do as far as I can tell.
Karsten Says:
April 4th, 2013 at 00:02:09
no, the classes are changed automatically. If you have an empty string and add a euro sign, it’ll be converted into a TwoByteString. If you add a Chinese character or something with a Unicode value bigger than 0xffff, the string would automatically be converted into a FourByteString. Typically you don’t care about the classes that are used, like you don’t care about the Array class that Apple chooses for a certain instance of NSArray.
Carl Says:
April 11th, 2013 at 01:33:15
This is more or less built into Go: http://golang.org/pkg/strings/#Map

The one difference is that strings.Map takes a mapping function, not a dictionary object, so you’d have to write a simple function to pass into it, like

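// m lists only the characters to change; anything absent maps to itself.
var m = map[rune]rune{'!': '.'}

// mapping is the function to pass to strings.Map.
func mapping(in rune) rune {
	out, ok := m[in]
	if ok {
		return out
	}
	return in
}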
