calloc vs malloc

2006-11-26 12:08:02 -08:00

I wondered whether there’s a performance advantage to using calloc to allocate zeroed bytes rather than malloc. So, as I usually do in such situations, I wrote a test app.

Along the way, I found out that gettimeofday — the traditional high-resolution what-time-is-it function — is not high-enough resolution. On my machine, the differences are smaller than one microsecond, which is the resolution of gettimeofday. So I switched to a Mach clock, which provides nanosecond resolution.*

I’m running Mac OS X 10.4.8 on a 2×2×2.66-GHz Mac Pro. The first run looks like this:

First run

Method                                               Time
calloc, 100 MiB * 1                                  0.000006896000000 seconds
malloc, 100 MiB                                      0.000007790000000 seconds
calloc, 100 MiB * 1, followed by reading first byte  0.000012331000000 seconds
calloc, 10 MiB * 10                                  0.000024079000000 seconds
calloc, 10 MiB * 10, followed by reading first byte  0.000031266000000 seconds
malloc followed by bzero                             2.252493061000000 seconds

Ouch! Two-and-a-quarter seconds for bzero. A second run returned saner numbers:

Second run

Method                                               Time
calloc, 100 MiB * 1                                  0.000007140000000 seconds
malloc, 100 MiB                                      0.000007317000000 seconds
calloc, 10 MiB * 10                                  0.000008956000000 seconds
calloc, 100 MiB * 1, followed by reading first byte  0.000012812000000 seconds
calloc, 10 MiB * 10, followed by reading first byte  0.000031807000000 seconds
malloc followed by bzero                             0.138714770000000 seconds

bzero has greatly improved, but it still loses badly, taking more than a tenth of a second.

Lesson: Always use calloc when you need zeroed bytes. Don’t try to be lazy by zeroing them later — calloc is much better at that.

If you want, you can play with calloctiming yourself. If you want to post the results in the comments, use the included sed script to convert the output to an HTML table: ./calloctiming | sort -n -t $':\t' +1 | sed -f output_to_html.sed.

* Another way, which I didn’t think of until just now, is QA1398’s recommendation of mach_absolute_time** and AbsoluteToNanoseconds. I tried that too, and the output is no different from the Mach clock.

** And when I went looking for documentation of mach_absolute_time to link to, I found this instead: Kernel Programming Guide: Using Kernel Time Abstractions. This information is for those writing kernel extensions; it recommends using clock_get_system_nanotime. Won’t work in user space, unfortunately.

7 Responses to “calloc vs malloc”

  1. Blake C. Says:

Thanks for documenting this, I’ve often wondered the same thing. I wonder: what changed between runs to cause the saner results?

  2. Peter Hosey Says:

My guess is that it’s some form of caching that hadn’t caught up yet the first time. The first run of any time trial is always artificially slowed down, no matter what you’re timing. That’s why I always run any time trial at least twice.

  3. Blake C. Says:

    Maybe we should ask Peter Ammon. You may have seen the surprising results he got from his array trials:

    http://ridiculousfish.com/blog/archives/2005/12/23/array/

In the current context, I was reminded of this quote: “Say what? I was so surprised by this, I had to rerun my test, but it seems to be right.” Either this caching issue didn’t exist in his context, or he knew how to avoid it.

It seems even more curious to me that he was testing high level Apple data types (slow), while you’re talking about ANSI stuff (fast). The respective expected delays seem to be reversed…

    And thank you for the info- I’m currently scavenging my code for the malloc/bzero happenings and improving them :) Now I’m wondering how I can determine whether this cache thing can also be improved in my code…

  4. ridiculous_fish Says:

    Howdy – fish here.

    For large blocks (where “large” is surprisingly small, something like 14 KB) Mac OS X’s default malloc() will always go to the kernel for memory by calling vm_allocate(). vm_allocate() always returns zero-filled pages; otherwise, you might get back a chunk of physical RAM or swap space that had been written to by some root process, and you’d get privileged data. So for large blocks, we’d expect calloc() and malloc() to perform identically.

    Mach will reserve some memory but not physically allocate it until you read or write it. The pages can also be marked zero filled without having to write zeros to RAM. But the first time you read from the page, it has to allocate and then zero-fill it.

    Your conclusion that calloc is much better than bzero is dead-on.

    _fish

  5. Peter Hosey Says:

    Woo — thanks. ☺

    The two sets of numbers are from separate runs of the test app — that is, separate processes. (I’m not sure whether that was clear in the post.) Does that change anything? If so, how?

  6. Blake C. Says:

    I just wrote about something you made me remember-

    http://yamacdev.blogspot.com/2006/12/implicit-memcpy3-calls.html

    It’s good for a laugh while we’re waiting, I guess :)

  7. Jens Ayton Says:

    Just got here from StackOverflow.

    One important consequence of fish’s comment is that your timings are completely irrelevant to a program that will actually use (read or write) the memory, because as he says calloc() isn’t actually allocating, let alone clearing, any pages. To get more meaningful timings, you should try reading one byte from each page. I expect calloc() will still win, but by a less ridiculous margin.
