操作系统版本:
> ::showrev
Hostname: palace
Release: 5.11
Kernel architecture: i86pc
Application architecture: amd64
Kernel version: SunOS 5.11 i86pc snv_57
Platform: i86pc
我们知道,这台机器是amd64 CPU, 操作系统是Solaris 11 build 57.
因为bcme是硬件驱动,因此下面的信息有助于我们了解驱动的基本情况;
网卡信息(通过vendor id和device id: pciex14e4,167a, 可以查到网卡具体型号):
> ::prtconf ! grep bcme
fffffffec01f09b0 pciex14e4,167a, instance #0 (driver name: bcme)
网卡驱动版本(v10.0.3):
> ::modinfo ! grep bcme
140 fffffffff7fd2000 1d3d0 1 bcme (Broadcom GbE Driver v10.0.3)
系统panic的消息已经告诉我们这是一个bad trap:
> ::status
debugging crash dump vmcore.0 (64-bit) from palace
operating system: 5.11 snv_57 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0003cb9000 addr=86 occurred in module "genunix" due to a NULL pointer dereference
dump content: kernel pages only
这个bad trap实际上就是Page fault时,访问了一个NULL指针,而且addr=86,已经很明显不是一个合法的地址。
让我们验证一下,首先,查看调用栈的详细信息:
> ::stackregs
ffffff0003cb9100 ddi_dma_unbind_handle+0xf(66)
ffffff0003cb91a0 bcme_tx_sctgth+0x305()
ffffff0003cb91f0 UM_SendPacketIntel+0x56()
ffffff0003cb9230 UM_SendMBLKPacket+0xf2()
ffffff0003cb9280 UM_SendPacketsMP+0x91()
ffffff0003cb92a0 UM_ProcessSendPackets+0x66()
ffffff0003cb92d0 UM_SendPacket+0x9e()
ffffff0003cb92f0 t3TxPacket+0xbb()
ffffff0003cb9320 bcme_wput+0x77()
ffffff0003cb9390 putnext+0x22b(fffffffec1954638, fffffffee22bf300)
ffffff0003cb9470 tcp_send_data+0x72a(fffffffec82452c0, fffffffee21e3650, fffffffee22bf300)
ffffff0003cb95a0 tcp_send+0xa7b(fffffffee21e3650, fffffffec82452c0, 4ec, 28, 14, 0, ffffff0003cb965c, ffffff0003cb9660,
ffffff0003cb9664, ffffff0003cb9618, 13a00be, 7fffffff)
ffffff0003cb9680 tcp_wput_data+0x77f(fffffffec82452c0, 0, 0)
ffffff0003cb97e0 tcp_rput_data+0x2b99(fffffffec82450c0, fffffffee2240f80, fffffffec1563f00)
ffffff0003cb9820 tcp_input+0x4a(fffffffec82450c0, fffffffee2240f80, fffffffec1563f00)
ffffff0003cb98a0 squeue_enter_chain+0x11d(fffffffec1563f00, fffffffee2240f80, fffffffee2240f80, 1, 1)
ffffff0003cb9990 ip_input+0x96f(fffffffec1b9f428, 0, fffffffee2240f80, 0)
ffffff0003cb99e0 ip_rput+0x119(fffffffec1954540, fffffffee2240f80)
ffffff0003cb9a50 putnext+0x22b(fffffffec19547d0, fffffffee2240f80)
ffffff0003cb9ac0 t3SendUp+0x2dd()
ffffff0003cb9ae0 t3ProcessRxPacket+0x98()
ffffff0003cb9b10 bcme_recv+0x87()
ffffff0003cb9b40 LM_RxPackets_Service+0x56()
ffffff0003cb9b60 LM_ISR_ServiceSoftInt+0x29()
ffffff0003cb9bc0 bcme_intr+0x30e()
ffffff0003cb9c20 av_dispatch_autovect+0x78(10)
ffffff0003cb9c60 dispatch_hardint+0x2f(10, 0)
ffffff0003c05ac0 switch_sp_and_call+0x13()
ffffff0003c05b10 do_interrupt+0x9b(ffffff0003c05b20, 1)
ffffff0003c05b20 _interrupt+0xba()
ffffff0003c05c10 mach_cpu_idle+6()
ffffff0003c05c40 cpu_idle+0xc8()
ffffff0003c05c60 idle+0x10e()
ffffff0003c05c70 thread_start+8()
mdb不但打出了函数名,而且有些函数的参数都得到了,其中ddi_dma_unbind_handle的参数是0x66,panic时,系统正在执行的指令时ddi_dma_unbind_handle+0xf,我们可以反汇编这个函数看看该指令是什么?
> ddi_dma_unbind_handle::dis
ddi_dma_unbind_handle: pushq %rbp
ddi_dma_unbind_handle+1: movq %rsp,%rbp
ddi_dma_unbind_handle+4: subq $0x10,%rsp
ddi_dma_unbind_handle+8: movq %rdi,-0x8(%rbp)
ddi_dma_unbind_handle+0xc: movq %rdi,%rdx
ddi_dma_unbind_handle+0xf: movq 0x20(%rdx),%rsi
ddi_dma_unbind_handle+0x13: movq 0x98(%rsi),%rdi
ddi_dma_unbind_handle+0x1a: xorl %eax,%eax
ddi_dma_unbind_handle+0x1c: call *0xe8(%rsi)
ddi_dma_unbind_handle+0x22: leave
ddi_dma_unbind_handle+0x23: ret
反汇编的结果有时会比较难看懂,好在OpenSolaris已经开放源代码了,我们可以对照代码:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/sunddi.c#ddi_dma_unbind_handle
int
ddi_dma_unbind_handle(ddi_dma_handle_t h)
{
ddi_dma_impl_t *hp = (ddi_dma_impl_t *)h;
dev_info_t *hdip, *dip;
int (*funcp)(dev_info_t *, dev_info_t *, ddi_dma_handle_t);
dip = hp->dmai_rdip;
hdip = (dev_info_t *)DEVI(dip)->devi_bus_dma_unbindhdl;
funcp = DEVI(dip)->devi_bus_dma_unbindfunc;
return ((*funcp)(hdip, dip, h));
}
不难看出,ddi_dma_unbind_handle+0xf指令正是以下这行,访问dmai_rdip成员的语句:
dip = hp->dmai_rdip;
可以用mdb验证一下:
> ::offsetof ddi_dma_impl_t dmai_rdip
offsetof (ddi_dma_impl_t, dmai_rdip) = 0x20
ddi_dma_unbind_handle+0xf对应的指令是:
ddi_dma_unbind_handle+0xf: movq 0x20(%rdx),%rsi
%rdx寄存器的值可以从以下命令得到:
> ::regs
%rax = 0x0000000000000000 %r9 = 0xfffffffec02023c8
%rbx = 0x0000000000000018 %r10 = 0x0000000000002000
%rcx = 0x0000000000000009 %r11 = 0x0000000000000001
%rdx = 0x0000000000000066 %r12 = 0xfffffffec1dc8fb0
%rsi = 0x0000000000000009 %r13 = 0x0000000000000018
%rdi = 0x0000000000000066 %r14 = 0xfffffffec1dc9298
%r8 = 0x0000000000000180 %r15 = 0xfffffffec1dc9288
%rip = 0xfffffffffba189ff ddi_dma_unbind_handle+0xf
%rbp = 0xffffff0003cb9100
%rsp = 0xffffff0003cb90f0
%rflags = 0x00010286
id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
status=<of,df,IF,tf,SF,zf,af,PF,cf>
%cs = 0x0030 %ds = 0x004b %es = 0x004b
%trapno = 0xe %fs = 0x0000 %gs = 0x01c3
%err = 0x0
所以ddi_dma_unbind_handle+0xf指令实际上是去访问0x20(%rdx)对应的地址而发生的bad trap, 即addr=88:
> 0x66+0x20/X
mdb: failed to read data from target: no mapping for address
0x86:
再看一下ddi_dma_unbind_handle(9F)里的说明,这个函数几乎是所有的使用dma的驱动都需要用到的:
Kernel Functions for Drivers ddi_dma_unbind_handle(9F)
NAME
ddi_dma_unbind_handle - unbinds the address in a DMA handle
SYNOPSIS
#include <sys/ddi.h>
#include <sys/sunddi.h>
int ddi_dma_unbind_handle(ddi_dma_handle_t handle);
PARAMETERS
handle The DMA handle previously allocated by a call to
ddi_dma_alloc_handle(9F).
