calloc vs malloc
I wondered whether there’s a performance advantage to using calloc to allocate zeroed bytes rather than malloc. So, as I usually do in such situations, I wrote a test app.
Along the way, I found out that gettimeofday — the traditional high-resolution what-time-is-it function — is not high-enough resolution. On my machine, the differences are smaller than one microsecond, which is the resolution of gettimeofday. So I switched to a Mach clock, which provides nanosecond resolution.*
I’m running Mac OS X 10.4.8 on a 2×2×2.66-GHz Mac Pro. The first run looks like this:
Method | Time |
---|---|
calloc, 100 MiB * 1 | 0.000006896000000 seconds |
malloc, 100 MiB | 0.000007790000000 seconds |
calloc, 100 MiB * 1, followed by reading first byte | 0.000012331000000 seconds |
calloc, 10 MiB * 10 | 0.000024079000000 seconds |
calloc, 10 MiB * 10, followed by reading first byte | 0.000031266000000 seconds |
malloc followed by bzero | 2.252493061000000 seconds |
Ouch! Two-and-a-quarter seconds for bzero. A second run returned saner numbers:
Method | Time |
---|---|
calloc, 100 MiB * 1 | 0.000007140000000 seconds |
malloc, 100 MiB | 0.000007317000000 seconds |
calloc, 10 MiB * 10 | 0.000008956000000 seconds |
calloc, 100 MiB * 1, followed by reading first byte | 0.000012812000000 seconds |
calloc, 10 MiB * 10, followed by reading first byte | 0.000031807000000 seconds |
malloc followed by bzero | 0.138714770000000 seconds |
bzero has greatly improved, but it still loses badly, taking more than a tenth of a second.
Lesson: Always use calloc when you need zeroed bytes. Don’t try to be lazy by zeroing them later — calloc is much better at that.
If you'd like, you can play with `calloctiming` yourself. To post the results in the comments, use the included sed script to convert the output to an HTML table: `./calloctiming | sort -n -t $':\t' +1 | sed -f output_to_html.sed`.
* Another way that I didn’t think of until just now is QA1398’s recommendation of mach_absolute_time** and AbsoluteToNanoseconds. I tried that too and the output is no different from the Mach clock. ↶
** And when I went looking for documentation of mach_absolute_time to link to, I found this instead: Kernel Programming Guide: Using Kernel Time Abstractions. This information is for those writing kernel extensions; it recommends using clock_get_system_nanotime. Won’t work in user space, unfortunately. ↶
December 13th, 2006 at 19:23:27
Thanks for documenting this; I’ve often wondered the same thing. I wonder: what changed between runs to cause the saner results?
December 13th, 2006 at 19:39:52
My guess is that it’s some form of caching that hadn’t caught up yet the first time. The first run of any time trial is always artificially slowed down, no matter what you’re timing. That’s why I always run any time trial at least twice.
December 13th, 2006 at 20:39:07
Maybe we should ask Peter Ammon. You may have seen the surprising results he got from his array trials:
http://ridiculousfish.com/blog/archives/2005/12/23/array/
In the current context, I was reminded of this quote: “Say what? I was so surprised by this, I had to rerun my test, but it seems to be right.” Either this caching issue didn’t exist in his context, or he knew how to avoid it.
It seems even more curious to me that he was testing high-level Apple data types (slow), while you’re talking about ANSI stuff (fast). The respective expected delays seem to be reversed…
And thank you for the info; I’m currently scavenging my code for the malloc/bzero happenings and improving them :) Now I’m wondering how I can determine whether this cache thing can also be improved in my code…
December 14th, 2006 at 17:36:08
Howdy – fish here.
For large blocks (where “large” is surprisingly small, something like 14 KB) Mac OS X’s default malloc() will always go to the kernel for memory by calling vm_allocate(). vm_allocate() always returns zero-filled pages; otherwise, you might get back a chunk of physical RAM or swap space that had been written to by some root process, and you’d get privileged data. So for large blocks, we’d expect calloc() and malloc() to perform identically.
Mach will reserve some memory but not physically allocate it until you read or write it. The pages can also be marked zero filled without having to write zeros to RAM. But the first time you read from the page, it has to allocate and then zero-fill it.
Your conclusion that calloc is much better than bzero is dead-on.
_fish
December 14th, 2006 at 18:16:35
Woo — thanks. ☺
The two sets of numbers are from separate runs of the test app — that is, separate processes. (I’m not sure whether that was clear in the post.) Does that change anything? If so, how?
December 17th, 2006 at 22:51:12
I just wrote about something you made me remember:
http://yamacdev.blogspot.com/2006/12/implicit-memcpy3-calls.html
It’s good for a laugh while we’re waiting, I guess :)
October 30th, 2009 at 00:36:37
Just got here from StackOverflow.
One important consequence of fish’s comment is that your timings are completely irrelevant to a program that will actually use (read or write) the memory, because as he says calloc() isn’t actually allocating, let alone clearing, any pages. To get more meaningful timings, you should try reading one byte from each page. I expect calloc() will still win, but by a less ridiculous margin.