Deep Learning Papers Are Kinda Bullsh-T (dagshub.com)
96 points by codeinassembly on July 19, 2022 | 64 comments


Academics are judged by their publications, not their implementations, so the system favours over-sold manuscripts and it-ran-once implementations. Until funding is conditional on (and provided for) robust, well-maintained code, it will remain challenging to achieve reproducibility.

Frequently the PIs (bosses) will not even glance at the repositories written by junior members, probably can't read code anyway, and certainly won't allocate time for their maintenance. Even worse, most academics who do publish code have never been exposed to real-world software engineers, their techniques, or their tools.


The basic issue is the labor. We don't pay for good science.

Suppose I told you to develop good software that's novel enough to publish about, but only gave you enough budget to pay your SWEs a maximum of $30K/yr. That's one zero, for those reading quickly: thirty thousand, not three hundred thousand. Additional non-benefits:

1. Unlike literally every other job in the country, you don't have the budget to pay FICA taxes for your employees, and the tax code allows this. This means your employees don't even have the USA's paltry social safety net to fall back on if they are hit by a bus or graduate into a massive recession, and their years working for you do not count toward Social Security or Medicare retirement benefits.

2. Obviously, there is no budget for 401K retirement benefits

3. No CoL raises

4. Healthcare benefits will be paltry.

5. Your SWEs need to serve as teaching assistants every once in a while. This likely means grading homework and a few late evenings of grading exams. No overtime for those late nights, obviously.

6. All travel, which is mandatory and often international, must be paid by the employee up front and reimbursement can take 1-3 months. We don't trust $30K/yr drones with corporate cards. Good luck making rent after a conference :)

Just to reiterate: You need to hire SWEs. You pay $30K/yr (less than some Amazon warehouses!), the benefits package is literally worse than a part-time gig at a supermarket or fast food joint, and your employee is expected to give you $2K-$4K loans a few times a year while living paycheck to paycheck.

I just roll my eyes hard when I see complaints about garbage research code. Almost everyone in my PhD cohort had FAANG or finance offers; we were all taking 5x-10x pay cuts to work on interesting problems and do science. If you want productizable research prototypes, hire PhDs to do science for you.

(And I say this, for the record, as a rare PhD who, during their PhD, wrote code that is well-documented, well-maintained, and still used by dozens of companies for business-critical processes many years later.)


I agree. Plus, most of the code is written by a single person, and while most first authors are relatively responsive on GitHub, they soon get overwhelmed with other projects, manuscript responsibilities, and job hunts. Code and its maintenance are only a small part of an overworked and underpaid academic's responsibilities, so frankly it's understandable. There's also a good chance that they are no longer employed at the same place a few years out.


We the taxpayers do pay; the money just doesn't reach the researchers.


Not really.

You could maybe squeeze another $20K-$30K in gross salary out of what NSF/NIH allocate for PhD students. But even that is still WAY below starting rates for the quality of folks you need to run a research lab.


I get where you're coming from, but this is unnecessarily reductive.

By this logic all companies maliciously sell broken software in order to charge for updates.

But obviously not all companies do that: those that do get called out for it, those that produce good products earn a good reputation for it, and so on. Similar things apply in academia.

"It ran once" papers run the risk of not getting cited as much compared to good papers with robust implementations, so the maligned incentive you describe isn't as clear cut, even in the corner case where the novelty is considered to be in the algorithm rather than the particular implementation. Worse, if the algorithm fails to reproduce, a researcher runs the risk of being retracted or shamed in subsequent publications when their work fails to reproduce. And reproducibility is a key aspect of journal publications in reputable journals, meaning less reproducible work will end up in lower quality outlets which often hurt one's career more than they help.


> By this logic all companies maliciously sell broken software in order to charge for updates.

Your analogy here is not great since the parent's claim is that academics have no incentive to produce good, reproducible research, not that they are maliciously creating bad research.

A more apt analogy would be:

"By this logic all companies would be driven by quarterly metrics and rush out broken software and then charge for updates/support"

... which is pretty much the exact state of the industry right now.


Totally - I remember hearing from Sean Eddy how hard it was to get continued funding to work on hmmer - making it faster, more sensitive, more general, more robust etc.

Well-crafted academic software is a rarity - the stuff that does exist tends to come out of institutes where the software is necessary to their wider mission, like the Broad or Sanger Institutes.


Regarding replicability, I disagree this is a problem at all. Writing shit code is not going to prevent someone highly capable from replicating your results. If anything, I empathize with researchers writing sloppy code. It’s a creative field and they already have to do enough editing and documentation. Omitting code or fabricating/manipulating evaluation results is what prevents replication.

Frankly, unlike the author, I think there are too many people in the field. They produce a handful of papers worth reading every year, along with thousands upon thousands of models that may or may not slightly improve performance on a specific task and have no general value beyond that. And I don't believe this will change much: ML is likely the most monetizable PhD path by a safe margin, so there is too much profit incentive to churn out crap at any cost.


There really is a lot of cherry picking, etc. going on in this area. Papers released without code and weights or even data make reproduction and validation nearly impossible.


Yeah, it's stunning to me that people can apparently run experiments with code, produce results with code, and write a paper about that code, then release only poorly written, garbled prose without also releasing that code (or at the very least a video demonstrating the results).


...And get a PhD for it!


I was under the impression this is just how academia worked nowadays.


There are really just two situations if the solution generalizes well - and if it doesn't, it might be worth just mentioning that and moving on:

1. If similar open data exists, great: just publish a sample implementation.

2. If not, the first task is to generate such an open data set.

Edit: formatting


There's also the problem that most complex neural networks are highly sensitive to their initial weights. My friends and I have frequently tried to reproduce famous papers, and it's remarkable how often getting the initial settings almost exactly right is the key to hitting the targeted benchmark.

This is a problem because cherry-picking is essentially built into the framework.

If I were building a ranking algorithm and just kept picking random seeds to arbitrarily sort a list of numbers until it came out correct, most people would consider that obviously cheating. But if I did the same thing with 3 dense matrices stuck between the seed and the list to be ranked, it would be considered AI.
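To make that concrete, here's a toy sketch (purely illustrative; every name and number in it is made up) of what that seed fishing looks like: search over seeds, keep whichever one happens to score best on a fixed toy task, and report only that number.

    import numpy as np

    def eval_with_seed(seed: int) -> float:
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((100, 8))   # fake "validation" data
        W = rng.standard_normal((8, 1))     # "learned" weights that depend only on the seed
        y = X.sum(axis=1) > 0               # ground-truth signal for the toy task
        preds = (X @ W).ravel() > 0
        return float((preds == y).mean())   # accuracy on the toy task

    best_seed, best_acc = max(((s, eval_with_seed(s)) for s in range(1000)),
                              key=lambda t: t[1])
    print(f"reported result: seed={best_seed}, accuracy={best_acc:.3f}")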


The first hurdle is simply getting the (more often than not, Python) dependencies to work. I've worked on reproducing some relatively simple DL programs - written by academics I know - and I've literally spent days to weeks just getting all the dependencies right. And I had direct contact with the authors, which may absolutely not be the case for other people.

I don't know why DL libraries are so afflicted by this, maybe things just move so fast. But it is such a pain in the ass.


IMO it's not DL libraries, it's Python. Python sucks at managing dependencies. It's a hilarious mess of pipenv, poetry, conda, vex, pex, shmex, and god knows what else is hot now. It seems that every time I want to write a simple Python utility, there is a new way to install and track dependencies.


I always see this and wonder what other people are doing wrong. pip and a requirements.txt have worked for over a decade (along with virtualenv). Yes, you may run into issues with larger projects and need something more complex. However, if you just want a list of dependencies and a way to install them for a "simple" standalone utility like you describe, pip is the way to go.
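If it helps, the whole workflow fits in a few lines of standard-library Python - a rough sketch only (the .venv directory name and the requirements.txt path are assumptions, and on Windows the pip binary lives under Scripts instead of bin):

    import subprocess
    import venv

    ENV_DIR = ".venv"                            # hypothetical environment directory
    venv.create(ENV_DIR, with_pip=True)          # isolated env; global site-packages stays untouched

    pip = f"{ENV_DIR}/bin/pip"                   # on Windows: .venv\Scripts\pip.exe
    subprocess.run([pip, "install", "-r", "requirements.txt"], check=True)  # install the listed deps
    subprocess.run([pip, "freeze"], check=True)  # print exact versions to pin in the repo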


It's Python and the people using it - in theory pip + requirements.txt + virtualenv/conda should work, but here are over a million issues of people asking for requirements files or for fixes to requirements files: https://github.com/search?q=requirements+file&type=issues

PS: For ML/DL enthusiasts, there is a small Ruby DL scene with some nice ported libraries by the amazing @ankane - still in its infancy right now, but if more folks use it, maybe we can bring Ruby into the DL mainstream? https://ankane.org/new-ml-gems


> I always see this and wonder what other people are doing wrong. pip and a requirements.txt have worked for over a decade (along with virtualenv).

Because Python is just the tip of the iceberg. Python in machine learning is the glue and API that ties together multiple high-performance numerical libraries and frameworks for CPU and GPU. Now you are fighting with your CUDA libraries, PyTorch and its toolkit versions, BLAS, MKL, LAPACK, the versions of your NVIDIA cards, support for non-standard floating point types (fp16, fp8), you name it. It is a zoo out there; Python is just the obvious fall guy.
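One partial mitigation is at least recording that zoo next to the results. A rough sketch (it assumes PyTorch is installed and simply reports whatever it finds; CPU-only builds will show None for the CUDA fields):

    import json
    import platform

    import torch

    env = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,       # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else None,
        "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
    }
    print(json.dumps(env, indent=2))      # commit this alongside the experiment's results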


I can believe that. No Python (or other language) dependency solution will be able to solve that. You need OS level configuration management. The average dev just installs and upgrades things willy-nilly.


The open question here is whether a central repository at the OS level solves anything. Logically, modulo reliable networking and fat pipes, a dependency could come from anywhere, so centralization doesn't appear to be the issue. The issue is the degree of version explosion for a nominal dependency D across applications, and figuring out acceptable transitive substitutions (alternative versions) for imprecise matches. (And of course D will have its own set of dependencies.) If you can 'can' that - a systemic way to declare (exact), find (best-attempt), and use (compatible) dependencies - you can serve those dependencies from the network, a canonical source, or a local cache. Only then have you solved the problem.


Probably not. You need someone actually vetting dependencies, developing setup / onboarding scripts that install actual, approved, verified-working dependencies. OS packages are held, frozen at a specific version. Third-party installs are the same. You don't upgrade random stuff.

Obviously this is a lot less "agile" than most of us are used to.


Once you learn the basics of pip/venv that should mostly work for everything.

Make a new venv for everything and don’t pollute the global environment and it should be fine.


> Make a new venv for everything and don’t pollute the global environment and it should be fine.

This just proves the point that Python sucks at managing dependencies, which exacerbates -- perhaps even encourages -- the reproducibility issues being discussed.


How's that? Similar approaches are used elsewhere. Python just makes the creation of the virtualenv more explicit. npm doesn't pollute the global environment either, by default.


> maybe things just move so fast

This is definitely the case in DL (and I'm assuming elsewhere too but I wouldn't know).

Honestly, I've lost count of the 1-2 year old paper GitHub repos with some detail missing (like the Python version!) that makes them non-trivial to run as-is. Libraries make undocumented breaking changes, the pickle format is wrong, the authors used a nightly build that never made it into a tagged release, and so on.

This perhaps also says something about the CS (versus software engineering) background that most people publishing in DL have.


> This perhaps also says something about the CS (versus software engineering) background that most people publishing in DL have.

Are those things enjoyable? Or is hacking and playing with ideas enjoyable?

A huge portion of PhD students spent time as software engineers before starting their programs. It's not about know-how; it's about not being paid to engineer systems in addition to doing research.

Fewer than 1 in 100 labs have dedicated software engineers, and PhD students are paid $30K/yr. There's no way in hell most of them are going to spend their time doing dependency management or setting up CI/CD pipelines for that salary. If they wanted to spend their time doing software engineering, they could (and would) move to an industry SWE job at 10x the total comp.


You hit the nail on the head. DL library maintainers basically have no respect for backwards compatibility or for making sure everything keeps working. New versions that break existing APIs are pushed out on a weekly basis, and no one really cares, because dependency management has been abstracted so far away that maintainers don't even see the repercussions of this 'move fast, break things' mindset (namely, lots of broken software).


IMHO, it's very important for papers to lay out their idea lineage and contextualize their incremental progress (analogous to a codebase's commit history). For whatever reason, ML seems to prefer framing each paper as a shiny nugget uniquely disconnected from the ecosystem of ideas, with its own fancy name (as if it were born a fully developed rockstar).

Imagine isolated code dumps without the shared history leading to nightmare merge conflicts…

I think this makes it really hard for anyone not steeped in experience to parse the output of the spraying firehose and organize their thinking rigorously, thereby fragilizing the field's intellectual output in a vicious cycle.


One detail some might not realize is that research code is often a heaping pile of garbage written by a single graduate student. Some are ashamed of their code and simply don't want to publish shit code. Also, strategically, it is probably better to _not_ publish code than to risk being rejected by a future job interviewer because your research code is shitty and you didn't prioritize refactoring it.

With that being said, this is not an excuse for refusing to share a paper's code or for failing to make sure the experiments are reproducible.


Other causes include the pressure to publish quickly in ML (while your approach is en vogue), with small teams, before your funding runs out, while hitting conference deadlines.

In these situations, I have suggested releasing anonymous implementations after the paper is accepted just to get the code out there. I am not certain this is the right thing to do!


My trust in most academic papers is very low. Exceptions are made for certain well-known groups and authors, but generally I'm not going to slog through a difficult-to-understand paper for results that can't be reproduced - let alone recreated in a production setting.


There does need to be some kind of reckoning here; pseudoscience is ballooning and dwarfing the actual practice of science. We're long past the point where researchers "should know better" --- we're now at the point where Nature is publishing this BS.


One possible explanation: if you open-source all your data and code, you likely reveal the secrets that aren't in your paper and that give you a competitive research edge. Writing papers so that you tell enough to get a good publication while obfuscating the details that give you an edge in your research fiefdom is a bit of a dark art in academia. I think it should be discussed more.


The YOLOv7 GitHub has a good number of spelling and capitalization errors. I guess they only looked once.


Why are you blaming an implementation project here? YOLOv7 is based on multiple papers?


You seem to have missed the joke here. I guess you only looked once.


Yeah, and don't just dump shitty, impossible-to-run research code on GitHub with a half-assed README.

Give me a huggingface or torchhub one-liner, or a working Google Colab. Otherwise I'm probably nexting your work and trying the second- and third-best models instead.


Can you even control which version of Python and which external dependencies you get in a Colab? Whatever you publish and works now will not necessarily work in a year or two (which is often how long it takes to get a journal paper published).
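You can at least make the notebook fail fast when the runtime drifts. A sketch of a guard cell for the top of the notebook (the pinned versions below are placeholders, not recommendations):

    import sys
    import importlib.metadata as md

    assert sys.version_info[:2] == (3, 10), f"expected Python 3.10, got {sys.version}"

    pinned = {"torch": "2.1.0", "numpy": "1.26.0"}   # hypothetical pins from the original run
    for pkg, want in pinned.items():
        have = md.version(pkg)
        assert have == want, f"{pkg}: expected {want}, runtime has {have}"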


>However, this might not be the case. Let’s take for example the Fall 2021 Reproducibility Challenge - an event designed to encourage reproducing recently published research from top conferences. Only 43/102 (~42.16%!) of the papers entered into the double-blind peer-review process were accepted – which means that more than half the papers, despite being written with reproduction as a priority, couldn’t actually be reproduced.

I don't think this is what a rejection means. Papers are accepted and rejected from the challenge depending on whether they do a good and thorough job of attempting to replicate the original work, not depending on whether or not they succeed.


Collecting the data is costly. I don't believe we will see many private datasets openly released.


>Science, since time immemorial, has relied on the systemic replication of any presented result or finding. Reproducing experiments and their reported results remains a cornerstone of the validation of any scientific theory.

No, no it really hasn't. It has relied on the ability to make predictions based on published theories, methods, laws, etc. Even for hard-science experiments, it's not clear how you could record all the knowledge required to replicate an experiment: every configuration, every machine, every particle in the air, every bit of software.

I really wish people engaged more with the actual history of science instead of what they believe it to be.

Edit: to give a little more meat to my rant, here's a good read (https://plato.stanford.edu/entries/scientific-reproducibilit...):

"If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) (...) claims that the role and value of replication in experimental replication is 'much more complex than easy textbook accounts make us believe' (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of acceptance of research claims in some cases, but not in others."

The history of replications is extremely nuanced. Empirical results and by extension replications are one line of argument in scientific discourse, but by no means the only one.

I personally hold that valid predictions in the context of interesting problems are where it's really at. In "The Structure of Scientific Revolutions", Kuhn argues that at some point paradigms cannot make THESE kinds of predictions anymore. Revolutions do not happen because of failed or missing replications.

Therefore, stating that science "has relied on replication" is historically and epistemologically false. It's also misleading, because the replication crisis is happening due to a lack of theory and misguided incentives, not because some discipline has left the holy path of finding truth.


> It has relied on the ability to make predictions based on published theories, methods, laws, etc.

This might be a semantic argument, but what you describe is replication. Imagine if every scientist said, "In my experiment, I perfectly predicted this and that, oh, but no one else would ever be able to run that experiment again, so just trust me, ok?"

Replication/reproducibility isn't about logging every configuration, machine or particle. It's about being able to run the same method and get the same result. If that isn't the case, how do we know the predictions are correct?


> Imagine if every scientist said, "In my experiment, I perfectly predicted this and that, oh, but no one else would ever be able to run that experiment again, so just trust me, ok?"

I don't know much about this topic, but isn't this kind of what's happening with, for instance, the Large Hadron Collider? We seem to collectively trust the results from LHC experiments, even though no one can replicate them, because there is only one LHC and AFAIK no one is building another particle accelerator at similar or greater energy levels. There's no guarantee that we'll see another particle accelerator of that scale in our lifetime.

It seems to me that your point implies that LHC experiments shouldn't be considered to be science, since they cannot practically be replicated any time soon (except at the LHC itself, which somewhat defeats the purpose of replication). But I (not a physicist) find myself quite trusting of their results tbh. I'm not sure I can fully articulate why, and I'm not sure I have a good reason for it.


>Science, since time immemorial, has relied on the systemic replication of any presented result or finding

>>No, no it really hasn't.

What do you mean no it hasn't? Reproducibility of scientific findings is certainly a cornerstone of science, at least according to all definitions of "scientific method" I have ever found.

>Crucially, experimental and theoretical results __must be reproduced__ by others within the scientific community.

https://en.wikipedia.org/wiki/Scientific_method

>Reproducibility, also known as replicability and repeatability, __is a major principle underpinning the scientific method.__

https://en.wikipedia.org/wiki/Reproducibility


At the state of the art, reproducibility is less important for competitors than simply knowing that your competitor - whom you trust to be nearly as competent as you - did achieve what they claim.

I see reproducibility as an aspirational goal, at least in the subset of "discovery" science where the competition is to be first to identify a new scientific principle. Reproducibility is more important in another area- people building tools for others to use for their own research.


see my edit


>Empirical results and by extension replications are one line of argument in scientific discourse, but by no means the only one.

Someone said science relies on reproducibility and you outright dismissed that with a "no, no...", now it sounds like you're saying "well, yes but it's nuanced".

If you just said "yes, but it's nuanced" from the start without being so dismissive you probably wouldn't have been met with disagreement.


The full quote in question: "Science, since time immemorial, has relied on the systemic replication of any presented result or finding."

This is wrong, plain and simple. It paints a picture of necessary and sufficient conditions for scientific progress which are incorrect.

- The vast majority of "results and findings" is never looked at by anybody other than the researchers directly involved. If you have 5 people on a paper, rest assured that at best 3 of them have seen the actual data or were even involved in the experiment.

- Where is the systemic replication, exactly? Replication rates vary considerably across fields. And, of course, only selected results are replicated.

- If there was a systemic replication of "any result and finding", how is it possible that there is a replication crisis at all? Should the bad apples not have been found long ago?

If science did, in fact, rely on such a system, doing replications would be a normal part of everyday scientific work. It is not, not by a long shot.

So you can conclude that either science is not happening at all (not sure when it did though), or that the quoted premise is incorrect.


Computer science IMO gets about the highest level of scrutiny when it comes to reproducibility, simply because so many people can, in theory, reproduce an experiment without any special expertise. That cannot be said for many other sciences, which require costly equipment, expertise, and time, and even then may fail to correctly reproduce an experiment. Imagine trying to reproduce a CERN finding in your backyard, or a James Webb observation, a social science survey, or a medical study.

DL isn't bullsh-T; it's one of the few sciences democratized enough (similar to math) to have a reasonably low barrier to entry. Unfortunately, that may be its undoing, because it means the masses want the off-the-shelf Docker container that just does the toy thing the experimenters spent potentially years working out. Could documentation be better? Of course. But before declaring a field BS, maybe compare it to some other scientific fields.


Even then the predictions need to be confirmed and proven not to be a fluke.

While technical reproducibility isn't always required (e.g., when observing a rare event, provided there is enough expertise to evaluate the fidelity of the experiment/observation), it's also a bit of a strawman to attack this point specifically. In any case, scientific advancement needs a body of evidence appropriate to the theory being tested, and replicability is crucial for a field like DL, where it should be relatively easy and where the basic premise axiomatically requires reproducibility (that a reapplication of the same techniques should yield comparable results).



The most basic and implicit prediction you can make is "if I do this again I'll get the same result."


My last job was centered around trying to reproduce the results of deep learning academic papers, and maybe porting them to other frameworks or platforms.

Even WITH code supplied by the authors, this was always a struggle. It would usually take about a week just to get their GitHub project out of dependency hell and running at all.

And if it needed to be reproduced in another framework, I'd really really want some kind of demo code just to clarify what exactly the authors were trying to describe. Especially if their descriptions had holes or discrepancies that only became clear when trying to fit the pieces together.

I remember trying to reproduce a couple of object tracking papers from the same authors: one with an overly complex and poorly defined feature set, the other with a glaring mistake/omission that forced my team to redesign the model, because they described using a certain layer type in a way that made no sense.

There were a few good exceptions that provided nice code, but difficult reproduction seemed to be the norm.


Deep learning is one of the most reproducible areas of science. That might seem like an insane statement until you spend time in a biology (wet) lab. Experimental protocols are often poorly documented, and access to the materials needed to reproduce experiments is highly uneven. There is even more cherry-picking when it comes to biology, especially since biologists often don't have the statistical knowledge to know better.

I'm not writing this to defend deep learning. Reproducibility is an incentives problem across ALL of science. We value novelty and prestigious publications over everything. Nobody wants to fund "boring" research that reproduces existing results. To fix academic science, we need to reward reproducible research and fund groups such that they're capable of performing it.


This post is kinda Bullsh-T. Basically all of its claims are unfounded, prior work is missing, and it is filled with filler content (i.e., boilerplate? I'm not sure what to call it) instead of providing value.


A large portion of this is due to the corporate enclosure of deep learning and machine learning that has occurred over the past 10 years. This, combined with the scale required for a lot of deep learning research, means that neither the code nor the resources required for reproduction are accessible outside of corporate labs.


Even if deep learning papers were more reproducible, there's still little confidence that the flashy new technique will work well for _my problem_. I'd rather see machine learners work on techniques that work robustly than on ones that are 1% better on a very narrow problem.


Huh? Most of this article gripes about the intricacies of publishing a paper. Use https://docs.aiqc.io for reproducible protocols.


You can't do that with titles here. Please see the site guidelines.


No demo, no read, no cite


Am I the only person who read that title as `Bullsh[T]`?


If I ever write my own shell I'm gonna call it `Bullsh`.



