Hacker News

Just to save you a somewhat pointless read: they didn't really debug anything; they just found the right forum to ask in.


They debugged the system, not the driver. The way they did that was to identify and confirm it was the driver that caused the problem and in what circumstances, so they could report it to the people responsible for actually dealing with that.

That's still a form of debugging; it's all a matter of perspective. Say your application talks directly to a hardware device, and you find that using the device in a specific way makes it crash. If you change how the application uses it so it no longer crashes, you've debugged the application, even though you haven't really debugged the hardware.


As a counterpoint, I found the journey interesting and learned a lot about various tools along the way. The only caveat is that they didn't end up pinpointing the error. That's understandable, given that they aren't paid to fix bugs in Intel code and that Intel had already fixed the bug in a newer version anyway.


But it seems Intel did end up pinpointing the error. The last link in the article ("Since then, Intel has removed the faulty driver from their website.") points to https://downloadmirror.intel.com/30190/eng/635390-TA-256.pdf which says "The driver instability was caused by an incomplete backport to i40e from the upstream kernel." Frustratingly, it doesn't give any more detail than that.


It looks like the Intel out-of-tree driver is carrying around some legacy HAVE_PAGE_COUNT_BULK_UPDATE option that is making their porting efforts difficult.

This commit in upstream ends up getting split in half:

https://github.com/torvalds/linux/commit/8ce29c679a6ecefb88d...

With only 3 lines of it getting pulled into i40e-2.13.10:

https://github.com/dmarion/i40e/blob/master/src/i40e_txrx.c#...

https://github.com/dmarion/i40e/blob/master/src/i40e_txrx.c#...

(I can't link to the diff line for 2.13.10 -> 2.14.13 because the diff is too big to display. Annoying!)

And the final line getting pulled into i40e-2.14.13:

https://github.com/dmarion/i40e/commit/135d6d885aa4704180e10...

  - if (unlikely(!pagecnt_bias)) {
  + if (unlikely(pagecnt_bias == 1)) {

That's the best example I can find in i40e_txrx.c of a single upstream Linux patch getting split across 2.13.10 and 2.14.13. Not exactly a smoking gun; there's still some exercise left for the reader.
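For anyone doing that exercise, here is a minimal, hypothetical model of the pagecnt_bias bookkeeping the diff touches, loosely following the upstream i40e RX path as I read it. The assumption: the driver takes a large batch of page references up front, decrements pagecnt_bias each time it hands a fragment to the stack, and later checks whether the bias has run low and restocks it. The struct, function names, and MODEL_BULK_REFS are made up, page_refcount stands in for the kernel's page count, and the "add just enough references to refill the bias" rule is this model's simplification. It only shows what the changed condition does to the bias; it does not reproduce the actual crash.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one RX page shared by many fragments.
 * Not driver code: page_refcount stands in for the kernel's page count. */
struct model_rx_buffer {
    unsigned int page_refcount; /* total references on the page */
    unsigned int pagecnt_bias;  /* references the driver still holds */
};

/* Batch size, chosen to mirror the USHRT_MAX used upstream. */
#define MODEL_BULK_REFS USHRT_MAX

static void model_init(struct model_rx_buffer *b)
{
    /* Take a whole batch of references up front instead of one per packet. */
    b->page_refcount = MODEL_BULK_REFS;
    b->pagecnt_bias = MODEL_BULK_REFS;
}

/* Restock check. restock_at_one picks between the two conditions in the
 * diff: (pagecnt_bias == 1) versus (!pagecnt_bias). Returns true if it
 * restocked. References held by the stack (refcount - bias) are unchanged. */
static bool model_maybe_restock(struct model_rx_buffer *b, bool restock_at_one)
{
    bool drained = restock_at_one ? (b->pagecnt_bias == 1)
                                  : (b->pagecnt_bias == 0);
    if (!drained)
        return false;
    /* Add exactly enough references to bring the bias back to the batch size. */
    b->page_refcount += MODEL_BULK_REFS - b->pagecnt_bias;
    b->pagecnt_bias = MODEL_BULK_REFS;
    return true;
}

/* Hand `fragments` fragments to a stack that frees each one right away and
 * return the lowest pagecnt_bias seen just before each restock check. */
static unsigned int model_lowest_bias(unsigned int fragments, bool restock_at_one,
                                      struct model_rx_buffer *b)
{
    unsigned int lowest = UINT_MAX;

    model_init(b);
    for (unsigned int i = 0; i < fragments; i++) {
        assert(b->pagecnt_bias > 0); /* never hand off with nothing left */
        b->pagecnt_bias--;           /* driver gives one reference to the stack */
        if (b->pagecnt_bias < lowest)
            lowest = b->pagecnt_bias;
        model_maybe_restock(b, restock_at_one);
        b->page_refcount--;          /* stack frees its fragment */
    }
    return lowest;
}
```

Running both variants over more fragments than one batch shows the difference in this model: with restock_at_one set, the lowest bias observed is 1, so the driver never drops to zero references of its own; with the `!pagecnt_bias` check, the bias touches 0 before being refilled. In both cases page_refcount ends equal to pagecnt_bias, because the model's stack holds nothing once every fragment is freed.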


Not entirely pointless: they did provide some useful tips (I wasn't aware of BCC). But yeah, the story ends with them not resolving the issue and just switching to a different version of the driver that doesn't have the bug.

