Topic: How to determine broken WRT54G hardware

The content of this topic has been archived on 16 Apr 2018. There are no obvious gaps in this topic, but there may still be some posts missing at the end.

Page 1 of 1

Post #1

HansA

24 May 2007, 00:27

I own a WRT54G V1.1 which was already bricked when I bought it.
I've solved a couple of problems but now I'm stuck with the question how I can possibly determine if there's really a HW failure which would render any further efforts useless.

The following steps of trial and error and partial success might be useful for people facing similar problems (I'm only reporting the main line of events)

What did I do:
First act
- When I got the router none of the LEDs was working. Power supply was ok, 3.3 Volts at the regulator output. Someone had attached a JTAG connector which was removed. The holes were drilled out.
No ping reply.
- I detected an interrupted circuit path which I fixed. Result: The power LED started flashing. No ping replies.
- I tried the PIN 15/16 (5/6) trick without any success (this has obviously been tried before, since there were scratch marks at the pins)
- I wired a JTAG cable, directly soldered to the PCB, omitting the 100 Ohm resistors - there is a 4.7k resistor on the board for every signal line.
wrt54g V4.5 could not detect a valid processor ID although a value different from all FF was reported.
- I moved from my notebook to my desktop (with an honest parallel interface), used EPP mode with IRQ7 use forced: wrt54g could detect CPU and flash correctly
- I made a backup of the CFE and compared the image with a 'generated CFE'. There were two differences:
1. the MAC address in the CFE backup (matching the one on the label) was preceded by another MAC address different from the one reported on the label (??)
2. one day difference in a version date (??)
- I tried to flash the new CFE: transfer stops at ca. 4%
- I tried wrt54g V4.8 (source: www.skynet.com) and could fully flash a new CFE: no change, still no ping reply
- I erased the NVRAM: no change, still no ping reply
- I made another backup of the CFE and compared again: MANY differences! Obviously wrt54g v4.8 could read correctly, but its flashing was corrupt
- I added the missing 100 Ohm resistors and tried wrt54g v4.5 again: success. Flashing works correctly and a following backup matches the source.
- I flashed the new CFE and erased the NVRAM: response to ping!!
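
For anyone repeating the backup-and-compare steps above, here is a minimal sketch in Python of such a comparison. The function names are made up for illustration, and the images are treated as plain byte strings (for example read with open(path, "rb").read()); it is not part of the wrt54g tool.

```python
def first_differences(a: bytes, b: bytes, limit: int = 8) -> list[int]:
    """Return up to `limit` offsets at which the two images differ."""
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            diffs.append(offset)
            if len(diffs) == limit:
                break
    return diffs


def images_match(a: bytes, b: bytes) -> bool:
    """True only if both images have the same length and the same content."""
    return len(a) == len(b) and not first_differences(a, b, limit=1)


# Synthetic demonstration: a "source" image and a "backup" with two bad bytes.
source = bytes(range(256)) * 4
backup = bytearray(source)
backup[0x10] ^= 0xFF
backup[0x200] ^= 0x01
print([hex(o) for o in first_differences(source, bytes(backup))])
print(images_match(source, source))
```

Seeing only a handful of scattered offsets points more towards a flaky JTAG link than towards a dead flash chip, which is roughly what the missing resistors turned out to be here.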

Second act
I do not go through the steps here. I rather report the state and the observed effects
- tftp seems to work: number of bytes transferred is correct
- no boot, power LED keeps flashing although with different speeds depending on the situation
- the DMZ LED never comes to life; the LAN (and internet) LEDs do
- no ping after 'successful' tftp and a waiting phase of >10 minutes
- recycling the power brings the ping replies back but no boot
- any attempt to use the reset button in one of the known procedures leads to 'no ping'. To bring the ping back I have to at least erase the NVRAM, if not flash the CFE again
- comparing a backup of the kernel with the flashed image (starting at HDR0) shows differences starting at 0x4000 (I'm not sure but I think that was the offset)
- different images (original firmwares, openWrt, ...) don't make a difference
- tftp by the way comes back with a success message even after I waited far more than 5 seconds from powering on and this was even before I tried to switch BOOT_WAIT on
- I tried to set BOOT_WAIT=ON by editing the respective string in the CFE image before flashing it. After this I checked an NVRAM backup for this string and it was there with BOOT_WAIT=ON while it was OFF before. So this seemed to be successful
- Last thing I did was to strip the header off the openWrt bin (so it starts with HDR0 now) and flash it with JTAG
No boot, same as with tftp. I didn't have the nerve to make a backup again and compare...

Third act
So this is where I got stuck.
I hope for some ideas to complete the drama. Did I miss something which I could still try, or is there any clear indication that my WRT should simply go in the trash?

Hans

Post #2

mbm

24 May 2007, 05:02

The biggest mistake people seem to make with JTAG is the "wipe everything and reload CFE" approach; they either can't find the correct CFE version after wiping the device, or they reflash with a CFE which is incompatible with their device. You should always try to use the CFE version that came with the device rather than attempting to replace it with some random CFE you found on the internet.

Second mistake - embedded within CFE is a set of NVRAM defaults to be used if the NVRAM partition is missing. This means that in most cases you can just wipe everything but CFE and it'll happily boot, recreate NVRAM and start waiting for a firmware via TFTP. In some cases however, the embedded defaults (in the CFE shipped with the device) don't match the actual hardware and CFE will fail to boot. This is why we have the warnings not to wipe NVRAM. To recover from this situation you need either the original NVRAM contents, or a version of CFE with the correct defaults.

(it should also be noted that the bin versions are nothing more than the trx version with device specific padding added to the start, hence stripping the padding gave you exactly the same as just downloading the trx version in the first place)
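
As a rough illustration of the bin/trx relationship just described, the following Python sketch drops everything in front of the first "HDR0" marker. The function name is hypothetical, and it assumes the padding itself never contains the bytes "HDR0".

```python
TRX_MAGIC = b"HDR0"


def strip_to_trx(image: bytes) -> bytes:
    """Return the image starting at the first HDR0 marker."""
    offset = image.find(TRX_MAGIC)
    if offset < 0:
        raise ValueError("no HDR0 magic found; this is not a trx-based image")
    return image[offset:]


# Synthetic example: some device-specific padding followed by a fake trx.
fake_bin = b"PADDINGPADDING" + b"HDR0" + b"\x01\x02\x03"
print(strip_to_trx(fake_bin)[:4])
```

Run on a real bin file, the result should be byte-for-byte the same as the trx download, which can be checked by comparing the two files.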

Post #3

HansA

24 May 2007, 08:25

Thanks for the answer.
This is why I made a CFE backup in the first place - just to find those minor(?) differences to the one I found on the internet. However, what I cannot be sure about is whether anyone before me fiddled with the CFE.
The CFE's not on the CD which came with the device, is it?

The CFE appears to do SOMETHING anyway. It brings the switch to life making it reply to pings and tftp obviously works and a previously clear NVRAM is filled again. So why does tftp always appear to work? Because the kernel won't load and the bootloader goes back to waiting for a better one?

OK, I understood the bootloader could be intact but there might still be a wrong set of variables which prevents loading the kernel.
Is somebody out there to supply me with the correct values (WRT54G v1.1 CDF30D200399)?

BTW, I created the trx myself because there was only one trx in the download section which looked very unspecific to me while there was a bin for any of the different models. I'm wiser now having learned that the only difference is in the header describing the model.

So I'm having a correct OpenWrt bin and a correct trx, this being an academic fact as long as I can't even get the original fw to work.

Hans
(still bricked)

There seems to be another myth widely found on the net:
Rename the firmware image to code.bin (kernel.bin,... now what?) otherwise tftp wouldn't recognize it as a firmware. I don't believe this is true. Only wrt54g requires a file name specific to the section to flash.

Post #4

Kevin

24 May 2007, 08:27

a serial console would be rather useful in this case it sounds like. you would easily determine what's going on with tftp

i've never needed a specific filename to flash via tftp

Post #5

mbm

24 May 2007, 11:41

Inside pretty much any home router or access point you'll find the following
- flash chip (2M, 4M or somewhat rarely 8M)
- ram (typically 4x the amount of flash)
- cpu (mips; provided by a broadcom 47xx or 5352)
- 6 port vlan managed switch (adm6996l, or more commonly the broadcom "roboswitch")
- wifi (broadcom 43xx based)

Chances are that almost all of that functionality will come from one or two Broadcom chips. The ram and flash are the exception.

Depending on the device you could have as little as 2/8 (flash/ram) or as much as 8/32, but by far the most common combination is 4/16; probably an intel flash chip.

The flash chip can be represented as a large block of continuous space:

[ start of flash ...... end of flash ]

There is no ROM to boot from; at power up the CPU begins executing the code at the very start of flash. Luckily this isn't the firmware or we'd be in real danger every time we reflashed. Boot is actually handled by a section of code we tend to refer to as the boot loader. In Broadcom devices this is CFE -- "Common Firmware Environment"; think of it like the BIOS in your computer.

(note - in wrt54g v1.x hardware, it was actually another boot loader called "PMON", it wasn't until the wrt54g v2.0 that they switched to CFE; both provide the exact same functionality)

[ CFE ] [ firmware ....... ] [ NVRAM ]

(there's no actual partitions, just hard coded locations)

The job of the boot loader is to initialize the memory and other hardware and then begin booting the firmware. In most cases there's a recovery mechanism that allows you to reflash the firmware so that a bad flash doesn't render the device useless. CFE does this through the use of a TFTP server; this can be triggered by the firmware not matching the firmware checksum, the boot_wait variable or via CFE's serial console command line.

If you dig into the "firmware" section you'll find a trx. A trx is just an encapsulation, which looks something like this -

[ HDR0 ][ length ][ crc32 ][ flags ][ pointers ][ data ... ]

"HDR0" is a magic value to indicate a trx header, rest is 4 byte unsigned values followed by the actual contents. Here's a few diagrams to help you understand the flash layout; each line represents the flash, exact same but with increasing levels of detail as to the contents.

[ CFE ][ firmware .... ][ NVRAM ]
[ CFE ][ trx .... ] [ "unused" ][ NVRAM ]
[ CFE ][ trx ( kernel )( squashfs ) ] [ "unused" ] [ NVRAM ]
[ CFE ][ trx ( kernel )( squashfs ) ] [ JFFS2 ] [ NVRAM ]
[ CFE ][ trx ( lzma boot ( kernel ) ) ( squashfs ) ][ JFFS2 ][ NVRAM ]
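
To make the trx header layout above a little more concrete, here is a hedged Python sketch that unpacks it. The post only says the fields after "HDR0" are 4-byte unsigned values; the little-endian byte order and the count of three pointers are assumptions of this sketch, and the names are made up for illustration.

```python
import struct
from typing import NamedTuple

# Assumed layout: magic, length, crc32, flags, then three pointers,
# all little-endian 4-byte unsigned values (28 bytes in total).
HEADER_FORMAT = "<4sIII3I"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class TrxHeader(NamedTuple):
    magic: bytes
    length: int
    crc32: int
    flags: int
    pointers: tuple


def parse_trx_header(data: bytes) -> TrxHeader:
    """Unpack the header fields from the start of a trx image."""
    if len(data) < HEADER_SIZE:
        raise ValueError("image is shorter than a trx header")
    magic, length, crc, flags, p0, p1, p2 = struct.unpack_from(HEADER_FORMAT, data)
    if magic != b"HDR0":
        raise ValueError("missing HDR0 magic")
    return TrxHeader(magic, length, crc, flags, (p0, p1, p2))


# Synthetic header, only to show the round trip; the values are arbitrary.
example = struct.pack(HEADER_FORMAT, b"HDR0", 1000, 0xDEADBEEF, 0, 28, 500, 0)
print(parse_trx_header(example))
```

Comparing the length field against the size of what was actually read back is a cheap first sanity check; the sketch deliberately does not try to verify the crc32.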

As for the proper ways to recover -

boot_wait -
The single best thing you can do is have boot_wait set, meaning that all you have to do is TFTP a new firmware. At one time the reflashing instructions included an exploit for the Linksys firmware that set the boot_wait variable; as time progressed and Linksys eventually fixed the bug (after several failed attempts) we found that people were flashing to other firmwares for the sole purpose of setting boot_wait so they could reflash to OpenWrt. We figured this was somewhat pointless and altered the instructions to indicate that you could safely reflash to OpenWrt without setting boot_wait.

JTAG -
It's one of those amazingly useful things that allows you to recover from pretty much anything that doesn't involve a hardware failure. While the JTAG can technically be used to watch every instruction and register as the system boots, the recovery software only uses it for DMA access to the flash chip, making it somewhat a blind recovery mechanism.

Serial -
Serial consoles are great, there's just one problem - the routers run on 3.3v and a normal PC serial port puts out +/-12v, easily frying a router. This means that a level shifter such as a max233 is required, and adding the ICs and caps required is beyond the ability of most users -- luckily there's a shortcut. Most cellphones are either USB or 3.3v serial, so the data cable for a 3.3v cellphone can be used to make an easy and professional looking serial console connection. You only need to identify and connect 4 wires (vcc, rx, tx, gnd) -- and if your cable uses a pl2303 you can skip the vcc connection.

Serial console allows you to interact with the CFE command line, watch the kernel boot and get console access to linux. This is probably the only way you'll ever get any meaningful feedback about the device boot up.

LEDs -
Most people assume the LEDs on the front are deterministic, and that by telling you which LEDs are lit you can instantly tell if the hardware is working or where it crashed in bootup. This unfortunately isn't the slightest bit true.

- Power LED. The biggest mistake people make here is "my power led is blinking, what does that mean?". There's an assumption that if the LED is blinking there must be software turning the LED on and off, and that it must mean something. The blinking is actually done in hardware; software only has the ability to set the LED "on" or "blink" -- it defaults to blink on power up and isn't set to on until after the firmware boots. If the led is on then you know the firmware booted; blinking really doesn't tell you much.

- Switch LEDs. The second common mistake is "the switch still works". Of course the switch still works, it's a separate piece of hardware and the LEDs are wired directly to it. The only useful bit of information you can get is "all the switch LEDs are lit". When the switch chip is reset, all of the ports will light up (even if no devices are connected) for about a second; this happens at power up and again as the firmware boots and reprograms the switch. If they stay lit, you're either a moron for not noticing the ports are actually in use, or someone has broken/shorted the switch chip. You can also notice reboot loops by watching for the switch reset.

- Diag/DMZ LED. Controlled by OpenWrt (diag module) to indicate bootup.

- Wifi. Controlled by the wifi driver; trivia - the wifi driver can also reset the power led in certain situations.

....

Stupid things people do -

Pin shorting -

In the past we used to suggest that people shorted a few pins of the flash; when CFE booted and attempted to perform the CRC32 there would be a flash read error which would change the outcome of the CRC and the resulting failure would force CFE into recovery mode. It's a great trick, but over the years we've learned that people are idiots and will take that as an invitation to poke, mangle and short just about every pin on the device based on some irrational belief that if they find the right pin everything will magically work again. You do not want someone paranoid at the thought of breaking the device scraping up every single electrical connection on the device -- it never ends well, and generally results in the flash chip or the router being damaged in the process.

- frying a chip (worst case)
- lifting/breaking electrical connections
- permanently shorting (best case)

The best case is that they simply bent a pin and you can easily bend it back - providing you can find it.

Depending on which pins are shorted/broken, it may be possible to access CFE but not to access the rest of the flash. Meaning CFE boots fine but can't read or write the firmware. This can be confirmed by JTAG.

Wrong CFE version -
Loading the wrong CFE version can also lead to devices which boot into CFE but are unable to write to the flash, or are unable to initialize the networking.

And yes, there are actually a few obscure versions that require the firmware to be named "code.bin" or a specific port to be used. Unfortunately nobody can remember exactly which devices, leading to all sorts of superstition.

Post #6

HansA

24 May 2007, 14:02

I still suspect wrong or missing NVRAM settings. This is what I found in my NVRAM after it was erased and reinitiated by the CFE:

FLSH(<bh:02><bh:00><bh:00>J<bh:01><bh:19><bh:04><bh:00><bh:00>@<bh:80><bh:00><bh:00><bh:00><bh:00>
os_ram_addr=80001000<bh:00>
et0macaddr=00:0F:66:24:CE:07<bh:00>
boot_wait=on<bh:00>
et0mdcport=0<bh:00>
Intel_firmware_version=v1.41.8<bh:00>
pmon_ver=PMON 3.31.15.0<bh:00>
os_flash_addr=bfc40000<bh:00>
boardtype=bcm94710dev<bh:00>
et1macaddr=00:0F:66:24:CE:08<bh:00>
lan_netmask=255.255.255.0<bh:00>
et1mdcport=1<bh:00>
flash_type=Intel 28F320C3 2Mx16 BotB<bh:00>
lan_ipaddr=192.168.1.1<bh:00>
clkfreq=125<bh:00>
firmware_version=v1.42.2<bh:00>
sdram_config=0x0000<bh:00>
scratch=a0180000<bh:00>
sdram_refresh=0x8040<bh:00>
et0phyaddr=30<bh:00>
sdram_init=0x0419<bh:00>
dl_ram_addr=a0001000<bh:00>
boot_date=Fri Sep 26 00:37:28 2003<bh:00>
boot_ver=v1.5<bh:00>
et1phyaddr=30<bh:00>
boardnum=42
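
For reading a dump like the one above without a hex viewer, here is a small Python sketch. It assumes the variables are NUL-terminated name=value strings that start 20 bytes after the "FLSH" marker (4 bytes of magic plus the 16 header bytes visible in the dump); the function name is made up for illustration.

```python
def parse_nvram(raw: bytes, header_size: int = 20) -> dict[str, str]:
    """Extract name=value pairs from a raw NVRAM backup."""
    start = raw.find(b"FLSH")
    if start < 0:
        raise ValueError("no FLSH marker found")
    payload = raw[start + header_size:]
    variables = {}
    for entry in payload.split(b"\x00"):
        if not entry:
            break  # an empty string ends the variable list
        name, sep, value = entry.partition(b"=")
        if sep:  # skip anything that is not name=value (e.g. 0xFF padding)
            variables[name.decode("ascii", "replace")] = value.decode("ascii", "replace")
    return variables


# Synthetic image using two of the values from the dump above.
image = b"FLSH" + bytes(16) + b"boot_wait=on\x00boardnum=42\x00\x00"
print(parse_nvram(image))
```

Running this over two backups and comparing the resulting dictionaries shows quickly which variables were added, removed or changed between them.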

Any ideas are welcome.
If there was, let's say, a variable missing, could I simply patch it into the NVRAM image and reflash it using JTAG? I think I already got BOOT_WAIT=ON to work that way.

Hans

Post #7

noelbou

24 May 2007, 15:12

Great mbm !

This post is one of the most instructive I have read so far ...
I suggest to make it a "sticky" one ...

Thanks for your great explanation ... everything is now clearer ...

Noël

Post #8

HansA

23 Jul 2007, 21:00

--- Resolved ---
This post is for the benefit of those who have a bricked WRT54G V1.1 and don't get it to work.
Although I'm not fully aware of what the problem actually was, the following operations led to success:

- strip a cfe.txt from a generated CFE.BIN to get a list of NVRAM variables
- make the following changes to cfe.txt
et0macaddr=<actual mac-address>
et1macaddr=<actual mac-address + 1>
boot_wait=on
- use nvserial to create a new CFE.BIN
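
The "actual mac-address + 1" step above can be worked out by hand, but a short Python sketch avoids carry mistakes. The helper name is hypothetical; the example address is the one from the NVRAM dump earlier in this thread.

```python
def mac_plus(mac: str, increment: int = 1) -> str:
    """Add `increment` to a colon-separated MAC address, wrapping at 48 bits."""
    value = int(mac.replace(":", ""), 16)
    value = (value + increment) % (1 << 48)
    digits = f"{value:012X}"
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


et0 = "00:0F:66:24:CE:07"
print(f"et0macaddr={et0}")
print(f"et1macaddr={mac_plus(et0)}")
print("boot_wait=on")
```

The three printed lines are the ones to put into cfe.txt, with et0 replaced by the address on the label of your own unit.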

I don't know if this is really relevant, but the difference is the following: the MAC address in the 'generated' CFE.BIN is placed at offset 0x2000 and et0macaddr/et1macaddr have some generic or random values. The CFE created by nvserial has all 0x00 after the flash area until offset 0x2430 and it should have a valid checksum.
My theory is that placing the MAC at offset 0x2000 is not safe for the V1.1: the switch would work and show the correct MAC address but the unit won't boot.

Ok, next steps
I found the Linksys Autoupdater on http://www.linksysinfo.org/forums/downl … e&id=9 which is a tftp program with an integrated 4.30.5 firmware binary.

- erase:wholeflash with wrt54g
- flash:cfe
- start autoupdater, delete the default admin in the password field
- power cycle the router
- start update
Autoupdater found the router, did the transfer and after reboot the unit started with the Linksys firmware.

Finally, I upgraded to OpenWrt from the Linksys web interface without any problem.

This procedure was reproducible. I only repeated it once since I thought I made a mistake with the NVRAM settings. Since then the WRT54G V1.1 has been working flawlessly.

Hans

Post #9

ragtap

23 Aug 2007, 13:22

Hi to everyone...

I own a Linksys WRT54G V2.0 with a similar problem; I've tried it all but nothing seems to fix it.
When I flash the CFE I can ping it and upload a firmware, BUT when I switch it off and on again after flashing, it dies again...

These are some conclusions after some research:

I can flash everything (KERNEL CFE AND NVRAM) with the JTAG with no errors at all
After flashing the CFE I CAN ping and upload again
When I back up the CFE/KERNEL after flashing the kernel (to compare it with the one uploaded), I find that the CFE is now half empty and the KERNEL is completely empty!!!! (I think that is the reason why it is dead again)

What could be going on here?

Thanx a lot in advance

Post #10

ragtap

25 Aug 2007, 04:09

Please help!

Post #11

mbm

25 Aug 2007, 05:20

ragtap -

I get the feeling that you left out the key piece of information - that you previously attempted to recover by shorting the pins of the flash.

The proof of that is that you can write the CFE to the flash chip, but any attempt to write to other sections of the flash overwrites a portion of CFE, rendering the board unbootable. In other words, you've left the flash shorted such that the addresses now overlap.
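
To picture what overlapping addresses means, here is a toy model in Python. It assumes one address line is stuck low, so two different addresses select the same cell; the class, the sizes and the 0x40000 offset are illustrative assumptions and not measurements from any real board.

```python
def aliased(address: int, stuck_low_mask: int) -> int:
    """Address the chip actually sees when some address lines are stuck low."""
    return address & ~stuck_low_mask


class ToyFlash:
    """A byte array whose address lines can be forced low to mimic a short."""

    def __init__(self, size: int, stuck_low_mask: int = 0):
        self.cells = bytearray(b"\xff" * size)
        self.mask = stuck_low_mask

    def write(self, address: int, data: bytes) -> None:
        for i, byte in enumerate(data):
            self.cells[aliased(address + i, self.mask)] = byte

    def read(self, address: int, length: int) -> bytes:
        return bytes(self.cells[aliased(address + i, self.mask)] for i in range(length))


healthy = ToyFlash(0x80000)
damaged = ToyFlash(0x80000, stuck_low_mask=0x40000)
for flash in (healthy, damaged):
    flash.write(0x00000, b"CFE!")   # boot loader at the start of flash
    flash.write(0x40000, b"HDR0")   # firmware written further up
    print(flash.read(0x00000, 4))
```

On the healthy model the boot loader bytes survive; on the damaged one the firmware write lands on top of them, which is the same kind of symptom as a CFE that is fine until something else gets flashed.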

Post #12

ragtap

26 Aug 2007, 23:59

Unfortunately, you're right, I tried to short it...

The flash and the CFE don't become half blank when I flash them (flashing doesn't give me any errors at all); it happens when I reboot!!

Is there any way to fix this?

Thanx a lot in advance

(Last edited by ragtap on 27 Aug 2007, 00:02)

Post #13

mbm

27 Aug 2007, 05:01

ragtap -

You don't seem to understand, you've physically damaged the connection between the flash chip and the board; it's not a software problem and you can't fix it via jtag.

Post #14

ragtap

27 Aug 2007, 07:38

Thanx mbm 4 your reply. I understood the problem, but what I was wondering is whether there's any way to fix the connection, like resoldering pins or wiring something...

Thnx again!

Post #15

ragtap

4 Sep 2007, 00:52

maybe buying and soldering a new flash??

Post #16

HansA

12 Sep 2007, 09:33

ragtap,
there's still something not clear to me: you said CFE's 'half empty' and KERNEL's empty when you do a backup. Is this directly after you flashed both or after you tried a reboot? The bootloader can do anything...

To make sure if your flash memory's blown I suggest that you do the following tests:
Perform a wholeflash:erase then cfe:backup and nvram:backup - both should be empty (all 0xFF, which is how erased NOR flash reads back). You can also do a kernel backup, which should show the same erased value throughout.
Next step flash your CFE and make a cfe:backup instantly afterwards, without trying to reboot and compare the files (frhed is a nice freeware program to look into hex files, and it has a compare option).
Do the same with NVRAM.
You may also prepare files with all 0x00 in the length of CFE, NVRAM,... and flash and backup them (wholeflash or kernel are time consuming but you should also test kernel if the other tests were positive).
My point is that I wouldn't believe in a hardware failure before I made absolutely sure there is one.
Still, you could suffer from bad JTAG communication (I had to try a lot until I got it right).
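
The test files and checks described above can be prepared with a few lines of Python. This is only a sketch: the helper names and the region size in the example are assumptions, and the erased value is a parameter (NOR flash normally reads back 0xFF after an erase, so a pattern such as 0x00 is the more telling one to program).

```python
ERASED = 0xFF  # assumption: value an erased NOR flash region reads back as


def make_pattern(size: int, fill: int = 0x00) -> bytes:
    """Build a test image of `size` bytes, all set to `fill`."""
    if not 0 <= fill <= 0xFF:
        raise ValueError("fill must be a single byte value")
    return bytes([fill]) * size


def is_uniform(image: bytes, fill: int = ERASED) -> bool:
    """True if the (non-empty) image consists only of `fill` bytes."""
    return len(image) > 0 and image.count(fill) == len(image)


def wrong_bits(written: bytes, readback: bytes) -> int:
    """OR together the XOR of each byte pair: data bits that ever came back wrong."""
    wrong = 0
    for w, r in zip(written, readback):
        wrong |= w ^ r
    return wrong


# Example with a hypothetical 256 KB region; use the real section size instead.
pattern = make_pattern(0x40000, 0x00)
print(len(pattern), is_uniform(pattern, 0x00))
# A backup taken right after flashing `pattern` would then be checked with
# wrong_bits(pattern, backup); a result of 0 means every byte read back intact.
```

A non-zero result that always shows the same bit hints at a single bad data line or JTAG signal rather than random corruption.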

Before you consider buying a new flash ram chip: it's not much cheaper than a new WRT...
It's another story what you do with your spare time.

Hans

Post #17

napierzaza

12 Sep 2007, 13:33

The way to fix a short is to look at it, find the short, and stop it from shorting! If the chip itself is not damaged then you can just verify it's soldered correctly. If you actually have the facilities you _could_ desolder and then resolder the chip, but it's hopefully not necessary. Just make sure none of the pins are touching each other.

The discussion might have continued from here.
