Fixing an Adium bug: A timeline

2007-08-14 19:07:42 UTC

I’m fixing a hole, where the rain gets in/and stops my mind from wondering/where it will go-ooo…

All times are PDT.

  • 2007-08-12 14:04:38: Evan releases 1.1 in the appcast. 1.1 is our first version that requires Tiger.

  • 2007-08-12 14:08:09: Evan puts back 1.0.5 alongside 1.1, with a sparkle:minimumSystemVersion attribute on the enclosure element noting that 1.0.5 requires 10.3.0 or later.

  • 2007-08-12 14:25:12: Evan changes 10.3.0 to 10.3.9; David had pointed out that 1.0.5 won’t run on OS versions before 10.3.9.

  • 2007-08-12 17:32:02: First user report on the feedback address from a 10.3.9 user that she had updated to 1.1 and it didn’t run on her system.

  • 2007-08-13, about 06:00: I wake up, and find six such user reports sent to the feedback list, including the first one. I answer them all saying that 1.1 requires Tiger and you can get 1.0.5 from the front page. By the sixth one, I’m suspicious that we may have a Sparkle bug.

  • 06:31:48: I join the #adium IRC channel. I have a feeling that I’m going to need to field user reports there, too. I join the Adium development channel at the same time (my IRC client is set to join both automatically).

  • From the #adium IRC channel:

    06:32:46:              Mac-arena notes that the Adium 1.1 story is #1 in Apple, #4 in Technology, and #8 in News.
  • From the #adium IRC channel:

    06:51:55:   <proton> damn sparkle updating 10.3 users
    06:53:59: <Mac-arena> Evan said he fixed it, which makes it weird.
  • 07:57:58: I fix the appcast for Panther users. Sparkle actually looks for the minimum OS version in an element named sparkle:minimumSystemVersion, not in an attribute of the enclosure element.

  • 08:13:58: I’m sending emails to Panther users who got the update notification before I fixed the appcast, then emailed our feedback address complaining that it didn’t work. Here’s an excerpt from eight such emails:

    Adium 1.0.5 is available from the front page of http://adiumx.com/, at the bottom of the page. Unless some bug arises that we need to address quickly with a 1.0.6, this is the last version that will run on 10.3.9.

  • From the #adium IRC channel:

    08:34:56:    :Server: eventualbuddha (n=eventual@[deleted].sbcglobal.net) joined #adium
    08:35:16: <eventualbuddha> congrats on 1.1 guys! in trying to update Adium it crashes though
    ⋮
    08:41:57: <Mac-arena> eventualbuddha: What version of Adium were you running at the time?
    08:42:07: <eventualbuddha> 1.0.5
    ⋮
    08:55:39: <eventualbuddha> Mac-arena, The_Tick: i should mention that the svn build i had was doing the same thing, but i just updated to 1.1 and it doesn't have that problem

    So, this user has the problem with 1.0.5 and not 1.1.

  • From the #adium IRC channel:

    08:53:58:    :Server: gammah (n=gammah@[censored].swbell.net) joined #adium
    08:54:28:   <gammah> hey I was getting crashes in the previous version of adium, on what appeared to be the 'check for updates' function
    08:54:46:   <gammah> so I manually got adium, and it seems to suffer the same fate - crashing right at startup

    This user has the problem with both 1.0.5 and 1.1.

  • 09:01:35: First ticket filed about the crash on update.

  • 09:22:11: Second ticket filed. The other tickets that will be filed about this are #7544, #7546, #7547, #7549, #7550, #7552, #7555, #7557, #7559, #7564, #7565, #7566, #7568. (Ever wonder why we don’t do more actual programming work on Adium? That’s why. However, to be fair, #7543 was filed on request from me, because we initially thought it to be a second bug.)

  • From the #adium IRC channel:

    09:23:45: <Mac-arena> dang503: Interesting.
    09:23:59: <Mac-arena> dang503: There are a lot of people getting this same crash on update in 1.0.5.
    09:24:05: <Mac-arena> dang503: And you have it in 1.1.
  • From the #adium IRC channel:

    09:47:17:   <nphase> http://pastebin.com/m28849e83
    09:48:51: <Mac-arena> Hey, let's see if we can get this crash posted to every pastebin on the internet. :D
  • By 10:00 it becomes clear on #adium that this is mostly happening to 1.1 people.

  • 10:19:02: Received first email to the feedback address reporting the crash on update.

  • 10:42: Posted to the blog acknowledging the crash and inviting users with debugger-fu to help us track it down.

  • From the Adium development IRC channel:

    10:46:12: <The_Tick> zac: go into com.adiumX.adiumX.plist
    10:46:21: <The_Tick> and change the last update time
    10:46:23: <The_Tick> to today
    10:46:29: <The_Tick> you'll be able to launch then

    The other way to do this would have been defaults write com.adiumx.adiumx SULastCheckTime '2007-08-13 10:47:00 -0700'. The effect of this would have been to forestall the crash; it would not have actually fixed it. Zac says he did not perform the edit.

  • 11:50:39: First volunteer joins IRC to help with debugging.

  • 11:57:35: First debug build goes live.

  • 12:02:05: I create a page on our wiki to keep the debug build link and provide instructions on how to step through and get a stack trace.

  • As evidence of how inaccurate the page was initially, this excerpt from #adium:

    12:12:45:      <prb> I have the build and XCode, but it's not obvious to me how to attach the debugger.  What is the necessary dance to get the attach enabled?
    12:12:59: <Mac-arena> prb: Build & Debug, cmd-y
    12:13:45:      <prb> Mac-arena: Is there a project lurking around somewhere, or do I need to download a source snapshot?
    12:14:06: <Mac-arena> prb: Erm. Right. :D Probably easiest just to use gdb
  • From the #adium IRC channel:

    12:16:34:      <prb> I'm in -- anyone got a gdb cribsheet (catch foo, etc.)
    12:16:40: <Mac-arena> prb: Writing the instructions now
  • Our webmaster, Zac, who is able to reproduce the crash (although at the time, it seems like a slightly different crash and is tentatively considered unrelated), posts his Console output, containing some NSLogs that I’d accidentally left in while diagnosing the earlier problem with the appcast (wherein the minimumSystemVersion was in the wrong place in the XML). I don’t realize it at the time, but the NSLogs actually give a hint to the problem, as they show the two minimumSystemVersion objects each used once, and then the crash.

  • From the #adium IRC channel:

    12:25:32: <Mac-arena> Note for future reference
    12:25:39: <Mac-arena> When you use cd Release && make
    12:25:42: <Mac-arena> It uses Deployment [build configuration].
    12:25:48: <Mac-arena> OK, I get to do *THAT* over.
  • 13:24:46: I post the link to the new debug build (this time actually compiled with Development build config)…

  • 13:25:51: … and update the wiki page with it.

  • From the #adium IRC channel:

    13:31:50:     <efge> Mac-arena: Thread 0 Crashed:
    13:31:50:     <efge> 0   libobjc.A.dylib                	0x90a594c0 objc_msgSend + 16
    13:31:51:     <efge> 1   org.ironcoder.SparklePlus      	0x003eb552 -[SUUpdater newVersionAvailable] + 49 (SUUpdater.m:479)
    13:31:51:     <efge> 2   org.ironcoder.SparklePlus      	0x003eb96b -[SUUpdater appcastDidFinishLoading:] + 786 (SUUpdater.m:507)

    Line numbers! Woo-hoo!

    So here’s line 478 (gdb being generally off by one with its line numbers):

    	BOOL available = SUStandardVersionComparison([updateItem minimumSystemVersion], SUCurrentSystemVersionString()) == NSOrderedDescending;

    There’s only one message there, so that has to be what’s crashing.

    Current suspicion: The SUAppcastItem is a zombie.

  • About 13:36: I finally get the idea to try to reproduce the bug on my laptop.

  • From the #adium IRC channel:

    13:38:25: <Mac-arena> Hey, everybody!
    13:38:27: <Mac-arena> Good news
    13:38:31: <Mac-arena> I can reproduce this on my laptop!
    13:39:17:  <durin42> yay Mac-arena can has crash
  • From the #adium IRC channel:

    13:39:12: <Mac-arena> The appcast calls -appcastDidFailToLoad:, and we send it autorelease, and that breaks.
    13:40:52: <Mac-arena> Wait, that's wrong. The backtrace says -appcastDidFinishLoading: and -newVersionAvailable [accessor], so it's not in -DidFailToLoad:.
    13:40:59: <Mac-arena> I hate it when gdb lies to me.
  • From the #adium IRC channel:

    13:55:00:  <Mac-arena> (gdb) po minimumSystemVersion
    13:55:00:  <Mac-arena> 1, 0
    13:55:23:  <Mac-arena> That's on the third of two calls to that method.
  • 13:59:36: Found the problem! The minimumSystemVersion property in SUAppcastItem was not being retained (or copied, to be precise about the correct behavior) by -setMinimumSystemVersion:. So it wasn’t the SUAppcastItem that was zombied; it was the item’s minimumSystemVersion string.

    This also means that zac’s backtrace in #7543 is the same bug; the one in #7542 is missing a frame (for some reason).

    Note how this was 21 minutes after I reproduced the crash on my laptop. This is why you must provide steps that can consistently reproduce your problem, if this is at all possible. Finding this bug would have taken at least an hour, probably more, if I hadn’t been able to reproduce it myself.

  • 14:09: Posted second blog post, announcing that I found the bug and will release 1.1.1 “soon—if all goes well, within the hour”.

    So much for that.

  • From the #adium IRC channel:

    14:20:57: <Newtylicious> Did you really switch to geico?
    14:21:08: <Newtylicious> and did it really take 15 minutes or less?
    14:21:03: <Mac-arena> No.
    14:21:16: <Mac-arena> Never let factual accuracy stand in the way of a good commercial reference.
  • From the #adium IRC channel:

    14:21:53: <Mac-arena> I love flash memory
    14:21:58: <Mac-arena> aka SneakerNet 3.0
  • From the #adium IRC channel:

    14:23:42:     <efge> Mac-arena: was it the same bug in 1.0.5 ? will people have to update manually from that version ?
    14:24:02: <Mac-arena> efge: Well, that's a good question
    14:24:12: <Mac-arena> I haven't confirmed yet whether the Sparkle in 1.0.5 has the same bug.
    14:24:22: <Mac-arena> You raise a good point, and I thank you for it. We may be doing a 1.0.6 as well.
    14:24:26: <Catfish_Man> it is likely
    14:24:34: <Catfish_Man> iirc I wrote that code for a 1.0 beta
    14:24:38: <Catfish_Man> possibly even a 1.0 alpha
  • 14:42:02: Third and final debug build goes live. This one contains the fix for the crash, which was to make -setMinimumSystemVersion: copy its argument, then assign the copy to the instance variable.

  • From the #adium IRC channel:

    14:54:23:   <vIkSiT> ah. so is the problem with sparkle itself, or the adium-sparkle integration?
    14:54:31: <Mac-arena> Neither.
    14:54:45: <Mac-arena> The official Sparkle doesn't have the problem. Only Sparkle ChatKit Edition and Sparkle Plus.
    14:55:04: <Mac-arena> It was already fixed in the official Sparkle.
    14:55:16: <Mac-arena> But that fix did not get carried over to S+, for some reason.
    14:58:40:   <jchang> is there a reason we didn't catch this bug during the betas?
    14:58:49: <Catfish_Man> yes
    14:59:00: <Catfish_Man> minimumSystemVersion was not enabled in the appcast correctly
    15:01:02: <Mac-arena> Now it is, and we find out that it was broken in Sparkle+ :)
    15:02:17: <Mac-arena> http://sparkle.andymatuschak.org/changeset/13
    15:02:46: <Mac-arena> So, it was fixed 5 months ago and it never made its way to us :\
    15:02:54: <Mac-arena> (nor S+, which is how it would have gotten to us)
    
  • From the #adium IRC channel:

    15:06:52:      <zac> can we temporarily fix the appcast now? :P
    15:07:29:   <jchang> zac: what do you mean?
    15:07:51:      <zac> removing the minimum requirement from the .xml will fix the bug
    15:07:59:      <zac> temporarily at least

    I should have done this then.

  • 15:20:24: Sent email to our development mailing-list explaining the situation and proposing a plan:

    1. Take the minimumSystemVersion elements out of the appcast. Since this would lead to Panther users downloading 1.1 again, also take out 1.1, leaving only 1.0.5 (with no minimumSystemVersion element) in the appcast.
    2. Push 1.0.6 with the fixed Sparkle+, for the benefit of Panther users.
    3. 24 hours later, release 1.1.1, put it in the appcast, and restore the minimumSystemVersion elements.
  • 15:37:46: “Michael” replies on the list suggesting that we only hold the 1.1.1 announcement from the appcast; 1.1.1 would be released everywhere else.

  • 15:49:50: I reply to Michael agreeing with his plan, though honestly, I thought that that was what I’d originally said, and worded it accordingly. If you’re reading this, Michael, sorry about the mix-up. (The above list is taken directly from my original email to the development list. As you can see, I clearly wrote that we would hold the entire 1.1.1 release, not just the appcast. No, that doesn’t make sense to me either—Michael was patently right.)

  • From the #adium IRC channel:

    16:45:25:  <Maurits> hi, is the 1.1.1 version available from svn already?
  • 17:15:25: No dissents on the development list, so I IM Chris (who’s just gotten back online) and ask him about the plan.

  • 17:17:15: He asks basically the same question Michael did, and I (slightly puzzled at the time, but brushing it off) give basically the same answer I gave Michael. (To you, too, Chris, sorry about the mix-up.) He OKs the amended plan. I begin prepping two new releases.

  • 17:21:03: I finally uncrashify the appcast as zac suggested (and as mentioned in my email), deleting 1.1 and the minimumSystemVersion elements.

  • From the Adium development IRC channel:

    19:02:29:   <proton> you know peter what oyu really should do is make Adium 1.0.6 and Adium 1.1.1 point to a new sparkle feed that has minimum system requirements in it
    19:02:38:   <proton> and leave the current feed permanently pointing to 1.0.6
    19:02:53:   <proton> because there'll be people for quite a while with older versions that'll be going boom :)
    19:04:31: <Mac-arena> Eh. I think most people will update within 24 hours.
    19:04:56:   <proton> i think that's deluded
    19:05:19:      <zac> 12.7% of our users currently use 1.0.4 still
    19:05:24: <Mac-arena> Heh.
    19:06:32: <Mac-arena> OK, will do
    19:06:36:              Mac-arena goes to amend the blog post
    19:08:51: <Mac-arena> zac: Yo
    19:08:54:      <zac> yo
    19:09:13: <Mac-arena> We need a new pair of feed URLs so we can abandon the old pair.
  • From the Adium development IRC channel:

    19:19:34: <Mac-arena> What if we plan for future versions?
    19:19:56:      <zac> Mac-arena: hm?
    19:19:59: <Mac-arena> update-enhanced.php?generation=2
    19:20:07: <Mac-arena> This being the second generation of appcast feeds.
    19:20:16: <Mac-arena> If anything like this happens again, we go to generation=3.
    19:20:35: <Newtylicious> go forth and multiply, little appcasts
    19:20:42: <Mac-arena> Each generation can represent a directory full of feeds. The current generation is what's updated by [the appcast-refresh script].
    19:21:07:      <zac> Mac-arena: okay, do you want to add the argument "appCastGen=2" to the new builds? that's less hackish and cleaner
    19:21:34:      <zac> or just generation=2, whichever
    19:21:51: <Mac-arena> Hmmm, yeah. We can even keep the same [appcast] URL that way.
  • About 19:14: Posted the third blog post, wherein I explained the day’s circumstances so far, the nature of the bug, and what we’d be doing about it.

  • From the Adium development IRC channel:

    19:26:43: <Mac-arena> Gargh, this code is a mess
    19:27:15: <Mac-arena> Our CodingStyle prohibits multiple returns, too.
    19:27:37:   <proton> sometimes ignoring the CodingStyle will make nicer code (and most of the time not)
    19:28:06:   <proton> hell, I've actually written code with a goto that didn't suck :)
    19:28:13: <Mac-arena> proton: Same here.
    19:27:44: <Mac-arena> This is one of the latter cases.
    19:27:52: <Mac-arena> We're passed a mutable array
    19:27:57: <Mac-arena> Expected to modify it
    19:27:59: <Mac-arena> And guess what?
    19:28:03: <Mac-arena> Two out of three times, *we don't*.
    19:28:10: <Mac-arena> We return a new array instead.
    19:29:03:      <zac> Mac-arena: I'd say, let's just do the new appcast URL part now, and decide on new generational stuff later :)
    19:29:13:      <zac> It looks like it's more effort to do the latter now.
    19:29:14: <Mac-arena> zac: No; I don't wanna go through this again
    19:29:28:      <zac> I don't want to break sparkle in 1.1.1 :)
    19:29:31: <Mac-arena> It's hard to do *anything* to this code. This is no harder than anything else.
    19:30:18: <Mac-arena> zac: This will use generation=2
    19:30:46:      <zac> k
    19:34:59:      <zac> k, my end works
  • From the Adium development IRC channel:

    19:43:32:      <zac> I can tell yo ureally quick if your generation stuff works
    19:43:37:      <zac> I just realized generation=2 isn't processed yet :)
  • From the Adium development IRC channel:

    20:28:44: <Mac-arena> zac: Mind putting a bogus version in the gen=2 appcast? Like 1.42 or something?
    20:29:01:              Mac-arena is about to test his shiny new 1.0.6 and 1.1.1
    20:28:59:      <zac> sure
    20:29:41:      <zac> adium 1.9.7 is now out!
    20:29:54: <Mac-arena> Woohoo!
  • About 20:38: I begin uploading 1.1.1 to my iDisk so that Augie can upload it to our CacheFly hosting.

  • 20:52:59: 1.1.1 arrives on my iDisk. Immediately afterward, thanks to Transmit‘s Queue feature, 1.0.6 starts on the path to join it there.

  • 21:46:37: 1.0.6 arrives on my iDisk.

  • 22:12:08: Changelogs go up.

  • 22:18:16: Replaced 1.0.5, added 1.1.1, and restored minimumSystemVersion elements in the appcasts in r20586.

  • About 22:21: Appcasts updated on the website.

  • From the Adium development IRC channel:

    22:24:12:      <zac> Mac-arena: hmm, 1.1 isn't gonna see 1.1.1 under this plan
    22:24:20:      <zac> nor 1.1.2, 1.2, etc.
    22:24:28: <Mac-arena> Hm.
    22:24:51: <Mac-arena> It was a great plan, too.
    22:24:52:      <zac> 10.3.9 users are vastly the minority
    22:24:57:      <zac> should we inconvenience them [in favor of Tiger users, who are the majority]?
    22:25:06: <Mac-arena> I'm OK with that. So what do we do?
    22:25:08:      <zac> put a "10.3.9 users: get 1.0.6 <here>" at the top of the 1.1.1 update
    22:25:13: <Mac-arena> Sounds good.
    22:25:18: <Mac-arena> In the changelog, you mean?
    22:25:20:              zac does so
    22:26:24:      <zac> 10.3.9 users: Get <a href="http://www.adiumx.com">1.0.6</a> to fix update checking.
    22:26:26:      <zac> ^ sound good?
    22:26:35: <Mac-arena> Hm, no
    22:26:47: <Mac-arena> 10.3.9 users: 1.1.1 will not run on your system. You need to get 1.0.6 instead.
    22:26:49: <Mac-arena> And put it in h2.
    22:27:54:      <zac><p style="color: maroon; font-size: large;">10.3.9 users: Adium 1.1.1 will not run on your system. You need to get <a href="http://www.adiumx.com">Adium 1.0.6</a> instead.</p>
    22:28:11: <Mac-arena> OK. How big is large? :)
    22:28:25:      <zac> about the same size as h2
    22:28:35: <Mac-arena> Should work, then.
    22:30:57: <Mac-arena> Let's change "10.3.9 users:" to "<b>10.3.9 users!</b>"
    22:31:02: <Mac-arena> (note the exclamation mark, too)
  • About 22:26: Posted the release announcement for 1.0.6 and 1.1.1 to the blog.

  • From the Adium development IRC channel:

    22:38:09:  <Wengero> zac: (small little nitpick) #7542 in the changelog is linking to adiumx instead of trac.adiumx
    22:38:27:  <Wengero> so its getting a 404
    22:38:37: <Mac-arena> Ah, relative links
    22:38:41: <Mac-arena> That affects both changelogs, BTW.
    22:38:46: <Mac-arena> And only the #7542 link.
    22:39:13: <Mac-arena> <ul><li>Fixed <a class="closed ticket" href="/ticket/7542" title="&#34;Check For Updates...&#34; systematically crashes adium (closed)">#7542</a>, the Sparkle Plus bug that caused it to crash when passed a minimum OS version in an appcast (update feed).</li></ul>
    22:39:35: <Mac-arena> Should be s#"/ticket#"http://trac.adiumx.com/ticket#

Finally, we were done. The bug was fixed and the fix was released.

So now you know the full details of how, in one day, we fixed two problems:

  1. Panther users being notified of a Tiger-only version of Adium
  2. All users crashing on update once we fixed #1

What a day. Apologies to everybody who had to put up with the crashing while we got that sorted out.

And no, it does not usually have that much drama. ;)

UPDATE 2007-08-15 08:30: Chris pointed out that he had suggested to Zac that Zac change his Adium defaults (prefs) to fake out Sparkle until the crash was fixed. I’ve added this to the appropriate point in the timeline.

9 Responses to “Fixing an Adium bug: A timeline”

  1. Adam Dempsey Says:

    very interesting read :) Thanks for sharing the story and thanks to all involved for the quick fix :)

  2. Michael Davies Says:

    This is a great post, and should be required reading for all budding application developers! Thanks for taking the time to record it all.

    It’s also really good to see the amount of support you could call on from the user community.

    Cheers, Michael

  3. Blake C. Says:

    Excellent post!

    “This also means that zac’s backtrace in #7543 is the same bug; the one in #7542 is missing a frame (for some reason).”

    This is most likely caused by a tail call optimization, wherein a function A can call another function B without updating the return address, for example ‘b’ instead of ‘blr’ on ppc and simply not pushing %eip/%rip in i386/x86_64. While the performance win is nice, it does have the side effect of not establishing a new stack frame for function B. If you find that this is indeed the cause of the missing stack frame, there’s a handy way to force gcc to not generate the tail call- just add the following line as the last line in function A:

    __asm__ volatile(“”);

    Sprinkle that where needed and stack frames will abound :)

  4. Blake C. Says:

    Oops, I should clarify- not pushing %eip/%rip is a side effect of using some variant of ‘jmp’ rather than ‘call’. This is very easy to spot in otool or otx output.

  5. Peter Hosey Says:

    Blake C.: Aha. Interesting.

    Would -fno-optimize-sibling-calls or perhaps -fno-omit-frame-pointer (added to the Deployment-Debug build config) end that problem?

    If so, I’ll add it to my list of bugs to file that that option/those options should be included in the Debug config in the project template.

  6. Blake C. Says:

    After some more homework, it appears I didn’t have all the facts exactly correct- while the tail call optimization does result in a frame being omitted from a backtrace, it’s not because the frame was not established. It’s just that the frame is not visible because of the nature of what seems to be referred to as a ‘sibling’ call.

    Regarding -fno-omit-frame-pointer: A stack frame can be omitted from a backtrace for 2 reasons that I know of- a tail call can render the frame invisible, or the frame can actually be omitted from the object code as a different optimization. -fno-omit-frame-pointer will prevent the latter case from happening, but will have no effect on the former.

    Regarding -fno-optimize-sibling-calls: I’ve not come across this one before, and ‘man gcc’ is no help. It looks like this is an area I should research more and maybe blog about…

  7. Blake C. Says:

    Well, I was wrong again. -fomit-frame-pointer does not result in a missing stack frame in a backtrace, at least on Intel. I’m looking into it further, blog post coming soon.

  8. Blake C. Says:

    Just posted it:

    http://yamacdev.blogspot.com/2007/09/case-of-missing-stack-frames.html

  9. Peter Hosey Says:

    Already commented on it with my thanks. ☺

Leave a Reply

Do not delete the second sentence.