Web Hosting Talk

View Full Version : possible kernel issue?


clocker1996
05-29-2002, 09:12 PM
Hi, I have an AMD Athlon XP 1900 server from RackShack. I realize this is not a RackShack forum; I'm not starting this thread to complain about RackShack, but simply because I'd like to get my issue resolved.

Today I was trying to copy a dir (roughly 300 MB, with many different files and subdirs) to /home/test. Usually I do it like this:

cp -Rf /path/300mbdir ./ &

(because I was already in /home/test)

I run it in the background because I don't really like to sit and wait around. Usually it finishes and everything is fine, but this time it segfaulted instead.
I noticed that not all the files were copied,
so I rm -rf'ed the dir
and re-did it:

cp -Rf /path/300mbdir ./ &

Then it worked.
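
A minimal sketch of checking a backgrounded copy after the fact, rather than trusting that it finished: compare the source and destination trees with diff -r. The helper name verify_copy is made up for illustration, and the paths in the usage note are the ones from the post.

```shell
#!/bin/sh
# verify_copy SRC DST: hypothetical helper, not part of cp.
# Prints "match" and returns 0 if both trees have identical contents,
# otherwise prints "differ" and returns 1.
verify_copy() {
    if diff -r "$1" "$2" >/dev/null 2>&1; then
        echo "match"
    else
        echo "differ"
        return 1
    fi
}
```

For example, once the copy job has ended: verify_copy /path/300mbdir /home/test/300mbdir. A "differ" result only says the copy is incomplete or corrupted; it does not say why.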

At first I thought: bad RAM! But then I thought, no wait, they already swapped the RAM sticks.

So then I went to untar a 117 MB .tar.gz file:

tar xzf filename.tar.gz

Segfault.
I had to run it a few times before it worked.
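
Since the same archive failed on some runs and worked on others, one hedged way to narrow things down is to test it repeatedly without extracting anything. The sketch below assumes gzip is installed; count_gzip_failures is a hypothetical helper, and filename.tar.gz in the usage note is the placeholder name from the post.

```shell
#!/bin/sh
# count_gzip_failures FILE RUNS: hypothetical helper.
# Runs `gzip -t` (integrity test only, nothing is extracted) RUNS times
# and prints how many of those runs failed.
count_gzip_failures() {
    file=$1
    runs=$2
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        gzip -t "$file" 2>/dev/null || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example: count_gzip_failures filename.tar.gz 10. Failures on every run suggest a damaged archive, while failures on only some runs of an unchanged file point more toward the machine (RAM, disk, or kernel) than toward the file.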

Then I opened /var/log/messages in pico, went to the very bottom, and saw things like this:
May 29 20:18:31 sun kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000134
May 29 20:18:31 sun kernel: printing eip:
May 29 20:18:31 sun kernel: c012768d
May 29 20:18:31 sun kernel: *pde = 00000000
May 29 20:18:31 sun kernel: Oops: 0002
May 29 20:18:31 sun kernel: Kernel 2.4.9-31
May 29 20:18:31 sun kernel: CPU: 0
May 29 20:18:31 sun kernel: EIP: 0010:[__remove_inode_page+93/128] Not tainted
May 29 20:18:31 sun kernel: EIP: 0010:[<c012768d>] Not tainted
May 29 20:18:31 sun kernel: EFLAGS: 00010206
May 29 20:18:31 sun kernel: EIP is at __remove_inode_page [kernel] 0x5d
May 29 20:18:31 sun kernel: eax: 00000100 ebx: cc6350f8 ecx: 000001fd edx: f7e708ac
May 29 20:18:31 sun kernel: esi: c136c404 edi: 00003979 ebp: c02b2ec0 esp: d68a9e88
May 29 20:18:32 sun kernel: ds: 0018 es: 0018 ss: 0018
May 29 20:18:32 sun kernel: Process cp (pid: 9591, stackpage=d68a9000)
May 29 20:18:32 sun kernel: Stack: c136c420 c136c404 c012e72f c136c404 c02b2ec0 c02b30a8 00000002 00000001
May 29 20:18:32 sun kernel: c01300d3 c02b2ec0 00000001 c02b30b0 00000000 000000d2 c01301cf c02b30a0
May 29 20:18:32 sun kernel: 00000000 00000002 00000001 00000001 000000d2 f7f752c4 00000000 00000000
May 29 20:18:32 sun kernel: Call Trace: [reclaim_page+671/944] reclaim_page [kernel] 0x29f
May 29 20:18:32 sun kernel: Call Trace: [<c012e72f>] reclaim_page [kernel] 0x29f
May 29 20:18:32 sun kernel: [__alloc_pages_limit+99/144] __alloc_pages_limit [kernel] 0x63
May 29 20:18:32 sun kernel: [<c01300d3>] __alloc_pages_limit [kernel] 0x63
May 29 20:18:32 sun kernel: [_wrapped_alloc_pages+175/608] _wrapped_alloc_pages [kernel] 0xaf
May 29 20:18:32 sun kernel: [<c01301cf>] _wrapped_alloc_pages [kernel] 0xaf
May 29 20:18:32 sun kernel: [__alloc_pages+15/160] __alloc_pages [kernel] 0xf
May 29 20:18:32 sun kernel: [<c013038f>] __alloc_pages [kernel] 0xf
May 29 20:18:32 sun kernel: [generic_file_write+860/1552] generic_file_write [kernel] 0x35c
May 29 20:18:32 sun kernel: [<c012ab1c>] generic_file_write [kernel] 0x35c
May 29 20:18:32 sun kernel: [do_generic_file_read+1300/1312] do_generic_file_read [kernel] 0x514
May 29 20:18:32 sun kernel: [<c0128c64>] do_generic_file_read [kernel] 0x514
May 29 20:18:32 sun kernel: [sis900:__insmod_sis900_O/lib/modules/2.4.9-31/kernel/drivers/net/s+-1263102/96] __insmod_ext3_S.text_L$
May 29 20:18:32 sun kernel: [<f880da02>] __insmod_ext3_S.text_L43056 [ext3] 0x19a2
May 29 20:18:32 sun kernel: [sys_write+150/256] sys_write [kernel] 0x96
May 29 20:18:32 sun kernel: [<c01368c6>] sys_write [kernel] 0x96
May 29 20:18:32 sun kernel: [system_call+51/56] system_call [kernel] 0x33
May 29 20:18:32 sun kernel: [<c0106f3b>] system_call [kernel] 0x33
May 29 20:18:32 sun kernel:
May 29 20:18:32 sun kernel:
May 29 20:18:32 sun kernel: Code: 89 50 34 89 02 c7 46 34 00 00 00 00 ff 0d 00 2b 2b c0 5b 5e
May 29 20:19:39 sun kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000834
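
A log like this is easier to reason about once you know how many oopses there were and which processes hit them. A minimal sketch, assuming the syslog format shown above; oops_count and oops_processes are hypothetical helper names, and the log path is passed in rather than hard-coded.

```shell
#!/bin/sh
# oops_count LOGFILE: print the number of oops header lines in LOGFILE.
oops_count() {
    grep -c 'Unable to handle kernel' "$1"
}

# oops_processes LOGFILE: print the distinct process names taken from
# "kernel: Process NAME (pid: ..." lines, one per line.
oops_processes() {
    sed -n 's/.*kernel: Process \([^ ]*\) (pid.*/\1/p' "$1" | sort -u
}
```

For example: oops_count /var/log/messages and oops_processes /var/log/messages. Oopses spread across unrelated processes and functions would fit flaky hardware better than a single kernel bug, though that is a rule of thumb, not proof.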

You see, when I first got this server, I was told it had 1 GB of RAM. I ran free -m and it only showed 896.

I ran dmesg | more

and it said that only 896 MB of memory would be used.
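
The 896 figure is not random. On 32-bit x86 with the default 3 GB/1 GB split, the kernel has 1024 MB of address space, of which 128 MB is reserved for vmalloc and other mappings, leaving 896 MB that can be mapped directly. A kernel built without high memory support ignores RAM above that. The sketch below is only that arithmetic, as a simplified model; usable_mb is a hypothetical helper, not a real tool.

```shell
#!/bin/sh
# Simplified model of 32-bit x86 low memory (default 3G/1G split).
kernel_space_mb=1024      # kernel's share of the 4 GB address space
vmalloc_reserve_mb=128    # reserved for vmalloc and other mappings
lowmem_mb=$((kernel_space_mb - vmalloc_reserve_mb))

# usable_mb INSTALLED_MB HIGHMEM: print how much RAM the kernel can use.
# HIGHMEM is "yes" if the kernel was built with high memory support.
usable_mb() {
    if [ "$2" = "yes" ] || [ "$1" -le "$lowmem_mb" ]; then
        echo "$1"
    else
        echo "$lowmem_mb"
    fi
}
```

Under this model, usable_mb 1024 no gives 896 and usable_mb 1024 yes gives 1024, which matches what free -m showed before and after the kernel was replaced.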

So RackShack re-installed the kernel (that's what they told me), and I guess they used the RPM version.

After they did that, I ran free -m and it was working:

1000 MB of RAM

So here I am today with this new problem.

Could this be the kernel?

What would I have to do to get rid of this problem?

Kernel is: 2.4.9-31

and the server is RH 7.2.

Would upgrading the kernel to a newer version (e.g. 2.4.18) fix this issue?

Shyne
05-29-2002, 09:35 PM
http://www.simplicity.net/ia/software/linux/mail/linux-kernel/msg00843.html

http://lists.canonical.org/pipermail/kragen-journal/1999-December/000288.html

Look in the Linux Kernel Mailing List archives. They have a LOT of information about this error. You can try recompiling the kernel, but make sure lilo still has an option to boot the "good" kernel in case the new one fails.

Tim Greer
05-29-2002, 09:37 PM
I'm confused: did you get the error above before or after RS compiled a new kernel? And, more importantly, how exactly did you convince RS, of all people, to actually compile a new kernel? Are you sure they didn't re-install the Red Hat image? In any case, I'd upgrade to 2.4.18, which could fix any bug that might exist in the kernel you're using now, but it could also be bad RAM. When you compile the new kernel, be sure to enable large memory support (among other things).

clocker1996
05-29-2002, 09:45 PM
Originally posted by Tim_Greer
I'm confused: did you get the error above before or after RS compiled a new kernel? And, more importantly, how exactly did you convince RS, of all people, to actually compile a new kernel? Are you sure they didn't re-install the Red Hat image? In any case, I'd upgrade to 2.4.18, which could fix any bug that might exist in the kernel you're using now, but it could also be bad RAM. When you compile the new kernel, be sure to enable large memory support (among other things).

This error was today, well after the kernel was re-installed and the RAM was swapped.

They didn't compile me a new kernel, dude, they just re-installed the RPM.

Tim Greer
05-29-2002, 09:56 PM
Originally posted by clocker1996


This error was today, well after the kernel was re-installed and the RAM was swapped.

They didn't compile me a new kernel, dude, they just re-installed the RPM.

Well, your post above said "so rackshack re-installed the kernel". Anyway, I'm not sure what you mean by re-installing any RPMs. What I'm wondering is: did they compile a new kernel or not? You're saying they re-installed the kernel via an RPM? And with a new kernel and new RAM it still has this error, or did this error only show up after the new (?) kernel and/or RAM? I assume it was before, since why else would you have replaced it.

clocker1996
05-29-2002, 10:35 PM
I don't see what is so hard to understand:
rpm -qa | grep kernel
rpm -qa | grep rpm
ps awux | grep rpm
rpm -qa | grep kernel
rpm -e kernel-2.4.9-31
ls
rpm -Uvh kernel-2.4.9-31.i686.rpm

That is all they did:

just re-INSTALLED the kernel so the RAM would work properly.

Tim Greer
05-29-2002, 10:49 PM
The "confusion" was that you said they installed a new one, then that they didn't "compile" one, but they did (after all) remove the previous RPM and then install (upgrade) a new one.

Upgrading or installing a kernel from an RPM is a very risky and bad idea. I wonder if they rebooted it after they did that? Did you make sure they rebooted it? It's wiser to use up2date or, better yet, compile from source. Oh well, not a big deal, I guess. I was just confused by what you said. I'd compile 2.4.18 anyway. It's a good, stable kernel and it won't hurt to upgrade. Just installing an RPM update is not going to ensure it was installed optimally.
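
Whether the box was rebooted into the re-installed kernel can be checked from the machine itself by comparing the running kernel with the installed packages. A minimal sketch, assuming the plain (non-SMP) package naming kernel-VERSION seen in the rpm commands above; is_running_kernel_installed is a hypothetical helper.

```shell
#!/bin/sh
# is_running_kernel_installed RUNNING PACKAGES
# RUNNING:  kernel release string, e.g. the output of `uname -r`
# PACKAGES: newline-separated package names, e.g. the output of `rpm -q kernel`
# Prints "yes" if a package named kernel-RUNNING is in the list, else "no".
is_running_kernel_installed() {
    running=$1
    packages=$2
    if printf '%s\n' "$packages" | grep -qxF "kernel-$running"; then
        echo "yes"
    else
        echo "no"
    fi
}
```

For example: is_running_kernel_installed "$(uname -r)" "$(rpm -q kernel)". A "no" would mean the running kernel no longer matches any installed package, which usually means no reboot has happened since the change. Because the same version (2.4.9-31) was removed and re-installed here, this check alone cannot tell the old build from the new one; the uptime would have to be compared with the time of the re-install.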

ckpeter
05-29-2002, 10:51 PM
Originally posted by Shyne
http://www.simplicity.net/ia/software/linux/mail/linux-kernel/msg00843.html

http://lists.canonical.org/pipermail/kragen-journal/1999-December/000288.html

Look in the Linux Kernel Mailing List archives. They have a LOT of information about this error. You can try recompiling the kernel, but make sure lilo still has an option to boot the "good" kernel in case the new one fails.
Sorry clocker, off-topic, but is there a way for lilo to skip a kernel after a bad boot? I'm fearful every time I boot a new kernel. Is there an option for lilo to skip bad kernels so that, for example, if I install a bad kernel and the first reboot fails, a second reboot brings up the old kernel? (That way only a reboot, and no tech support, is required.)

Thanks,

Peter

Shyne
05-29-2002, 11:30 PM
Not that I know of. I think LILO boots before the kernel, so it can't check if it's good or not.
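
LILO indeed cannot judge whether a kernel is good, but the lilo(8) man page documents a -R option that sets the boot command line for the next boot only, after which the normal default applies again. That gets close to what was asked, as long as something can still power-cycle a hung box. Before relying on it, it helps to confirm that both entries really exist in the config. A minimal sketch; the helper names and the labels linux-old and linux-new are made up for illustration.

```shell
#!/bin/sh
# lilo_labels CONF: print every label= value found in a lilo.conf-style file.
# Simplification: assumes unquoted labels with no trailing whitespace.
lilo_labels() {
    sed -n 's/^[[:space:]]*label[[:space:]]*=[[:space:]]*//p' "$1"
}

# lilo_has_label CONF LABEL: succeed (exit 0) if LABEL is defined in CONF.
lilo_has_label() {
    lilo_labels "$1" | grep -qxF "$2"
}
```

With the old kernel left as the default= entry, something like lilo_has_label /etc/lilo.conf linux-new && lilo -R linux-new followed by a reboot should try the new kernel once; if it fails, the next reboot falls back to the default. Treat this as a sketch to test on a machine you can reach, not a guarantee.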

Thomas.N11
05-30-2002, 05:41 AM
I'm really surprised RackShack would have just grabbed an RPM and upgraded the kernel that way. That's a serious no-no.

If you want optimal utilization of your system resources, your best bet after any initial Red Hat installation is to go to Red Hat's errata and grab their latest source RPM.

You won't require bigmem support with only 1GB.

I would highly recommend this path even if the errors are a result of something else. Make sure you have someone there who can reboot the machine for you if you have issues with the compiled kernel and it won't boot up. I don't know of any way off the top of my head to set the new kernel as default and then have it automatically switch back upon failure. I doubt that is possible, because if you're running lilo you have to execute lilo for any changes to take effect, and that isn't going to happen until you at least get the file system mounted.

Compile a new kernel. Have a RackShack engineer ready to reboot off of the other kernel if any issues arise. And have a few words with them if they did in fact upgrade the kernel.

ckpeter
05-30-2002, 09:15 AM
You must not be familiar with RackShack: they are famous for unmanaged support. If you mess up, you have to pay $29 to restore your machine. (They don't really have 'engineers'.)

Peter

clocker1996
05-31-2002, 06:31 PM
Remember how I was having those seg fault problems earlier?

Well, first they did a RAM check. The RAM was bad, so they swapped it. Then they took out my HD and put it in a new server (same type).
Now, a few hours after the server swap, I go to untar a .tar.gz file that I have untarred many times on other systems, and it is seg faulting yet again: www.drirc.net/seg/fault2.txt

Sigh, this sucks.

clocker1996
05-31-2002, 06:48 PM
In case anyone is confused, here is my story in detail; this should clarify everything.

[Day one of receiving this AMD Athlon XP 1900 server]
I discovered the RAM was only showing as 896 MB (using free -m),

so they re-installed the kernel via the RPM, and then it showed 1000 MB.

[Day two, I believe]
I discovered compile errors.
Things were crashing during compilation.
I sent in a trouble ticket and they did a RAM test. I guess it was bad RAM, so they swapped it.

So that was the first RAM swap.

On May 29th (going by the log timestamps) I discovered seg faults.
Below are the errors:
May 29 20:18:31 sun kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000134
May 29 20:18:31 sun kernel: printing eip:
May 29 20:18:31 sun kernel: c012768d
May 29 20:18:31 sun kernel: *pde = 00000000
May 29 20:18:31 sun kernel: Oops: 0002
May 29 20:18:31 sun kernel: Kernel 2.4.9-31
May 29 20:18:31 sun kernel: CPU: 0
May 29 20:18:31 sun kernel: EIP: 0010:[__remove_inode_page+93/128] Not tainted
May 29 20:18:31 sun kernel: EIP: 0010:[<c012768d>] Not tainted
May 29 20:18:31 sun kernel: EFLAGS: 00010206
May 29 20:18:31 sun kernel: EIP is at __remove_inode_page [kernel] 0x5d
May 29 20:18:31 sun kernel: eax: 00000100 ebx: cc6350f8 ecx: 000001fd edx: f7e708ac
May 29 20:18:31 sun kernel: esi: c136c404 edi: 00003979 ebp: c02b2ec0 esp: d68a9e88
May 29 20:18:32 sun kernel: ds: 0018 es: 0018 ss: 0018
May 29 20:18:32 sun kernel: Process cp (pid: 9591, stackpage=d68a9000)
May 29 20:18:32 sun kernel: Stack: c136c420 c136c404 c012e72f c136c404 c02b2ec0 c02b30a8 00000002 00000001
May 29 20:18:32 sun kernel: c01300d3 c02b2ec0 00000001 c02b30b0 00000000 000000d2 c01301cf c02b30a0
May 29 20:18:32 sun kernel: 00000000 00000002 00000001 00000001 000000d2 f7f752c4 00000000 00000000
May 29 20:18:32 sun kernel: Call Trace: [reclaim_page+671/944] reclaim_page [kernel] 0x29f
May 29 20:18:32 sun kernel: Call Trace: [<c012e72f>] reclaim_page [kernel] 0x29f
May 29 20:18:32 sun kernel: [__alloc_pages_limit+99/144] __alloc_pages_limit [kernel] 0x63
May 29 20:18:32 sun kernel: [<c01300d3>] __alloc_pages_limit [kernel] 0x63
May 29 20:18:32 sun kernel: [_wrapped_alloc_pages+175/608] _wrapped_alloc_pages [kernel] 0xaf
May 29 20:18:32 sun kernel: [<c01301cf>] _wrapped_alloc_pages [kernel] 0xaf
May 29 20:18:32 sun kernel: [__alloc_pages+15/160] __alloc_pages [kernel] 0xf
May 29 20:18:32 sun kernel: [<c013038f>] __alloc_pages [kernel] 0xf
May 29 20:18:32 sun kernel: [generic_file_write+860/1552] generic_file_write [kernel] 0x35c
May 29 20:18:32 sun kernel: [<c012ab1c>] generic_file_write [kernel] 0x35c
May 29 20:18:32 sun kernel: [do_generic_file_read+1300/1312] do_generic_file_read [kernel] 0x514
May 29 20:18:32 sun kernel: [<c0128c64>] do_generic_file_read [kernel] 0x514
May 29 20:18:32 sun kernel: [sis900:__insmod_sis900_O/lib/modules/2.4.9-31/kernel/drivers/net/s+-1263102/96] __insmod_ext3_S.text_L$
May 29 20:18:32 sun kernel: [<f880da02>] __insmod_ext3_S.text_L43056 [ext3] 0x19a2
May 29 20:18:32 sun kernel: [sys_write+150/256] sys_write [kernel] 0x96
May 29 20:18:32 sun kernel: [<c01368c6>] sys_write [kernel] 0x96
May 29 20:18:32 sun kernel: [system_call+51/56] system_call [kernel] 0x33
May 29 20:18:32 sun kernel: [<c0106f3b>] system_call [kernel] 0x33
May 29 20:18:32 sun kernel:
May 29 20:18:32 sun kernel:
May 29 20:18:32 sun kernel: Code: 89 50 34 89 02 c7 46 34 00 00 00 00 ff 0d 00 2b 2b c0 5b 5e
May 29 20:19:39 sun kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000834

Then I spoke with someone at RackShack, and they told me to send in a TT requesting a RAM check. They said that if the RAM turned out to be bad again, they would just give me a new server.

So after I sent in a TT requesting a RAM check, they replied saying:
"Bad data bits in RAM and wasn't operating at full speed.
Changed the RAM."

So, seeing as the RAM was bad again, they swapped servers (giving me a new server).

This was confirmed, and they "swapped" servers for me today (May 31st).

A server swap, according to RackShack, is where they take your HD and put it in another (new) box.
After they did that, I went in and everything seemed fine.

When I tried to untar a .tar.gz file, boom: seg fault.

See the errors here: www.drirc.net/seg/fault2.txt

That should clear up any confusion.

So is this the kernel? Or is this the hard drive messing up?

Does anyone know?
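
One hedged way to start separating "kernel" from "hard drive" is to check whether the log contains any disk errors at all. The patterns below are assumptions about what IDE and filesystem errors commonly look like on 2.4 kernels; none of them appear in the log excerpt posted in this thread, and disk_error_count is a hypothetical helper.

```shell
#!/bin/sh
# disk_error_count LOGFILE: print the number of lines in LOGFILE that look
# like disk or I/O errors. The patterns are assumed, not taken from this
# thread, and may need adjusting for a given driver.
disk_error_count() {
    grep -c -E 'I/O error|dma_intr|SeekComplete Error' "$1"
}
```

For example: disk_error_count /var/log/messages. A count of zero alongside repeated oopses would make the drive a less likely suspect than RAM, the kernel, or something else that moved with the disk, but it does not rule the drive out.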