A language-contrast exercise

2013-03-31 14:59:34 UTC

Python’s str type has a translate method. Given a second string representing a translation table, it returns a new string in which each character of the first string is looked up by its ordinal value in the table and replaced with the character found at that position.

The identity translation table, performing no changes, is table[i] = i. For example, table['!']* is '!', so exclamation marks are not changed. If you made a table where table['!'] were '.', exclamation marks would be changed to periods (full stops).
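As a minimal sketch of that lookup, assuming Python 3 (where translate indexes the table by code point), a plain list can stand in for the table. The 128-entry size and the variable names are choices made for this illustration only; characters whose ordinals fall outside the list are left unchanged.

```python
# Identity table for ASCII only: table[i] == i for every ordinal below 128.
table = list(range(128))

# Point the entry for '!' at '.', so exclamation marks become periods.
table[ord('!')] = ord('.')

result = "Hello, world!".translate(table)
print(result)  # Hello, world.
```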

I’d like to see implementations of a program that does that, with the input string encoded in UTF-16 and the translation table encoded in UTF-32 (a 0x110000-element array of UTF-32 characters, one per code point from U+0000 through U+10FFFF), with the table initialized to its identity: table[i] = i.

And yes, you need to handle surrogate pairs correctly.
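For reference, the surrogate-pair arithmetic that UTF-16 defines can be sketched in a few lines of Python. The function names decode_pair and encode_pair are made up for this illustration, and the sketch assumes its arguments are already known to be valid surrogates (or a code point above U+FFFF).

```python
def decode_pair(high, low):
    """Combine a high and a low surrogate into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+1F4A9 is stored in UTF-16 as the two code units D83D DCA9.
print(hex(decode_pair(0xD83D, 0xDCA9)))
print([hex(unit) for unit in encode_pair(0x1F4A9)])
```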

Some languages that I would particularly like to see this implemented in include:

  • C
  • Haskell
  • LISP
  • A state-machine language (I don’t know of any off-hand; this might be their time to shine)

I know how I would do this in C, and I’m sure I could bash something out in Python, but how would you do this in your favorite language?

As a test case, you could replace “ and ” (U+201C and U+201D) with « and » (U+00AB and U+00BB).
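Here is one possible Python sketch of the whole exercise, run against that test case. It is an illustration under stated assumptions, not a reference solution: the input is modeled as a list of 16-bit integers (UTF-16 code units), the table as an array of unsigned integers, unpaired surrogates are passed through the table like any other value, and every function name here is invented for the example.

```python
from array import array


def identity_table():
    """Return a UTF-32 table with one entry per code point: table[i] == i."""
    return array('L', range(0x110000))


def translate_utf16(units, table):
    """Translate a list of UTF-16 code units through a code-point table."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        i += 1
        # A high surrogate followed by a low surrogate encodes one code point.
        if (0xD800 <= unit <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        else:
            code_point = unit  # BMP character, or an unpaired surrogate
        code_point = table[code_point]
        # Re-encode the result, splitting it into a pair when it is above U+FFFF.
        if code_point >= 0x10000:
            offset = code_point - 0x10000
            out.append(0xD800 + (offset >> 10))
            out.append(0xDC00 + (offset & 0x3FF))
        else:
            out.append(code_point)
    return out


def to_units(text):
    """Helper for the demo: encode a Python string as UTF-16 code units."""
    data = text.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]


def from_units(units):
    """Helper for the demo: decode UTF-16 code units back into a Python string."""
    return b''.join(u.to_bytes(2, 'little') for u in units).decode('utf-16-le')


table = identity_table()
table[0x201C] = 0x00AB  # left curly quote  -> left guillemet
table[0x201D] = 0x00BB  # right curly quote -> right guillemet

source = "\u201cHello\u201d \U0001F4A9"
result = from_units(translate_utf16(to_units(source), table))
print(result)
```

The character outside the Basic Multilingual Plane is there to exercise the surrogate-pair path; with the identity entry left in place it should come out unchanged.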

If you want to post code in the comments, <pre>…</pre> should work. Alternatively, you can use Gist.

* I’m using the C sense of '!' here. In Python, this would be table[ord('!')], since characters in Python are just strings of length 1, and you can’t index into a string with another string; ord is a function that returns the ordinal (code-point) value of the character in such a string.

6 Responses to “A language-contrast exercise”

  1. Karsten Says:

    In Smalltalk it would be something like:

    String>>translate:aDictionary

    ^self collect:[:each | aDictionary at: each ifAbsent:[each]]

    aDictionary would map characters to characters. A character object is an object containing the Unicode number of the character.

    Likewise, Strings are not encoded in any way. If the string contains a character that’s bigger than 255, it’ll be a two-byte string; if the character doesn’t fit into two bytes, it’ll be a four-byte string. Encoding the strings in UTF-8 or UTF-16 is then done when you write them to a file.

  2. Peter Hosey Says:

    @Karsten: I don’t understand. You’re talking about characters one minute and bytes the next. What are the elements of a String object representing U+1F4A9?

  3. Karsten Says:

    The elements of a String are Character objects.

    The actual String class that is used for the string is either ByteString, TwoByteString or FourByteString depending on the Characters in the string, so that the characters don’t need to be converted anymore.

  4. Peter Hosey Says:

    So, then, your code only works on a FourByteString? Anything else will need to handle surrogate pairs or UTF-8 sequences, which your code doesn’t do as far as I can tell.

  5. Karsten Says:

    No, the classes are changed automatically. If you have an empty string and add a euro sign, it’ll be converted into a TwoByteString. If you add a Chinese character or something with a Unicode value bigger than 0xFFFF, the string would automatically be converted into a FourByteString. Typically you don’t care about the classes that are used, like you don’t care about the Array class that Apple chooses for a certain instance of NSArray.

  6. Carl Says:

    This is more or less built into Go: http://golang.org/pkg/strings/#Map

    The one difference is that strings.Map takes a mapping function, not a dictionary object, so you’d have to write a simple function to pass into it, like

    var m = map[rune]rune{'!': '.'}

    // mapping returns the replacement for in, or in itself if m has no entry.
    func mapping(in rune) rune {
        out, ok := m[in]
        if ok {
            return out
        }
        return in
    }
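The width selection Karsten describes in the comments above (one, two, or four bytes per character, depending on the largest character in the string) can be sketched in Python as follows. This only illustrates the thresholds he mentions; it is not how any Smalltalk implementation actually does it, and storage_width is a hypothetical name.

```python
def storage_width(text):
    """Return the bytes per character needed to store text without encoding."""
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1  # every character fits in one byte
    if largest <= 0xFFFF:
        return 2  # at least one character needs two bytes
    return 4      # at least one character is above U+FFFF


print(storage_width("abc"))         # 1
print(storage_width("abc\u20ac"))   # 2 (euro sign, U+20AC)
print(storage_width("\U0001F4A9"))  # 4
```

Because each element holds a whole code point, a string stored this way never contains surrogate pairs, whatever width it ends up with.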


Do not delete the second sentence.