Why are AI researchers constantly handicapping everything they do under the guise of "safety"? It's a bag of data and some math algorithms that generate text....
> It's a bag of data and some math algorithms that generate text....
I agree with the general premise of too much "safety", but this argument is invalid. Humans are bags of meat and they can do some pretty terrible things.
But what we're doing to these models is literally censoring what they're saying - not doing.
I don't think anyone has a problem with stopping random AIs when they're committing crimes (or, more realistically, the humans making them do it) - but if you're going to make the comparison to humans in good faith, it'd be like a person standing behind you, punishing you whenever you say something offensive.
> Why are AI researchers constantly handicapping everything
Career and business self-preservation in a social media neurotic world. It doesn't take much to trigger the outrage machine and cancel every future prospect you might have, especially in a very competitive field flush with other "clean" applicants.
Just look at the whole "AI racism" fustercluck for a small taste.
Let's reverse this - why wouldn't they do that? I agree with you, but LLMs tend to be massively expensive and thus innately tied to ROI. A lot of companies fret about advertising even near some types of content. The idea of spending millions to put a racist bot on your home page is, no surprise, not very appetizing.
So of course if this is where the money and interest flows then the research follows.
Besides, it's a generally useful area anyway. The ability to tweak behavior even if not done for "safety" still seems pretty useful.
Yeah. It will start its instructions with a recommendation to buy some high-tech biolab for $100,000,000.
Seriously. The reason we don't have mass killings everywhere is not that information on how to make explosive drones or poisons is impossible to find or access. It's also not hard to buy a car or a knife.
Hell, you can even find YouTube videos explaining step by step how uranium enrichment works - even though some content creators got police-raided for that. Yet we don't see tons of random kids making dirty bombs.
You can't compare making nuclear weapons to modifying viruses to be more lethal. Modifying viruses is vastly cheaper and the knowledge is the bottleneck, whereas with nukes the knowledge of how to make them is widespread but getting the materials is very hard.
Another example would be an LLM that could tell you exactly how to build a tabletop laser device that could enrich uranium for a few hundred thousand dollars.
LLMs are not AGIs. An LLM can only ever tell you how to build a device to enrich uranium for a few hundred thousand dollars if that information was already public knowledge and the LLM was trained on it. The situation is the same for building biolab tech for a few hundred thousand dollars. Also, an actor who already has a few million wouldn't have any problem getting their hands on an LLM - or on a scientist able to build one for them.
The only "danger" LLM "safety" can prevent is the generation of racist porn stories.
With the vast amounts of data LLMs are trained on, they make it much easier for people to find harmful and dangerous information if they aren't filtered.
Society's basic entry barrier: hard enough that the dumb person who hasn't achieved anything in life can't get past it, but irrelevant to anyone smart enough to make it in society, who can circumvent it if they want.
> It's a bag of data and some math algorithms that generate text....
That describes almost every web server.
To the extent that this particular maths produces text that causes political, financial, or legal harms to their interests, this kind of testing is just like any other acceptance testing.
To the extent that the maths is "like a human", even in the vaguest and most general sense of "like", then it is also good to make sure that the human it's like isn't a sadistic psychopath — we don't know how far we are from "like" by any standard, because we don't know what we're doing, so this is playing it safe even if we're as far from this issue as cargo-cults were from functioning radios.
Can't read the full article due to the paywall, but ostensibly it's due to bias rules on race and not visa rules? Visas being abused and then backstopped by unrelated rules doesn't mean the visa rules shouldn't be fixed.
Not really. Do you think this is trivial at AWS scale? What do you do when people hit their hard spend limits - start shutting down their EC2 instances and deleting their data? I can see the argument that just because it's "hard" doesn't mean they shouldn't do it, but it's disingenuous to say they're shady because they don't.
At AWS engineering scale they can absolutely figure it out if they have the slightest interest in doing so. I've heard all the excuses — they all suck.
Businesses with lawyers and stuff can afford to negotiate with AWS etc. when things go wrong. Individuals who want to upskill on AWS to improve their job prospects have to roll the dice on AWS maybe bankrupting them. AWS actively encourages developers to put themselves in this position.
I don't know if AWS should be regulated into providing spending controls. But if they don't choose to provide spending controls of their own accord, I'll continue to call them out for being grossly irresponsible, because they are.
People have kept bringing up this argument since the very beginning, when people first asked for this feature. It used to be the most upvoted request on the AWS forums, with AWS officially acknowledging (back in 2007, IIRC), "We know it's important for you and are working on it". But they made a decision not to implement it.
The details don't matter, really. For those who decide to set up a hard cap and agree to its terms, there could be a grace period or not. In the end, all instances would be shut down and all data lost, just like with traditional services: when you haven't paid your bill, you are no longer entitled to them, pure and simple.
They haven't implemented it and never will, because Amazon is a company that is obsessed with optimization. There is negative motivation to implement anything related to that.
They've had two decades to figure it out. For EC2, they could shut down the instance but keep storage and public IPs. It shouldn't be too hard to estimate when the instance has to be stopped to end up with charges below the hard limit.
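A back-of-the-envelope sketch of that estimate, in Python - the rates and the cap here are hypothetical numbers, not real AWS prices or APIs:

```python
# Hypothetical numbers - a sketch of the estimate, not an AWS API.
HOURLY_INSTANCE_RATE = 0.0416   # on-demand $/hour for some instance type
MONTHLY_STORAGE_COST = 8.00     # storage + public IP, $/month, kept alive
HARD_CAP = 50.00                # user-chosen monthly spend limit, $

def hours_until_stop(spent_so_far: float) -> float:
    """Hours of compute left before total monthly charges would exceed
    the cap. Storage is budgeted for the whole month, so data and IPs
    are never deleted - only the instance gets stopped."""
    headroom = HARD_CAP - spent_so_far - MONTHLY_STORAGE_COST
    return max(headroom, 0.0) / HOURLY_INSTANCE_RATE

print(f"Stop compute in {hours_until_stop(spent_so_far=30.0):.0f} hours")  # ~288
```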
What a ridiculous point. AWS achieves non-trivial things at scale all the time, and brags about it too.
So many smart engineers with high salaries and they can't figure out a solution like "shut down instances so costs don't continue to grow, but keep the data so nothing critical is lost, at least for a limited time"?
Disingenuous is what you are writing - "oh no, it's a hard problem, they can't be expected to even try to solve it."
> What a ridiculous point. AWS achieves non-trivial things at scale all the time, and brags about it too.
Many companies achieve non-trivial things at scale. Pretty much every good engineer I speak to will list out all the incredibly challenging things they did. And follow it up with "however, this component in Billing is 100x more difficult than that!"
I've worked in Billing and I'd say a huge number of issues come from the business logic. When you add a feature after the fact, you'll find a lot of technical and business blockers that prevent you from taking the most obvious path. I strongly suspect AWS realised they passed this point of no return some time ago, and now the effort to implement it vastly outweighs any return they'd ever hope to see.
And, let's be honest, there will be no possible implementation of this that will satisfy even a significant minority of the people demanding this feature. Everyone thinks they're saying the same thing, but the second you dig into the detail and the use-case, everyone will expect something slightly (but critically) different.
> "however, this component in Billing is 100x more difficult than that!"
Simply claiming this does not make it true. Anyway, the original claim was simply that it is not trivial. This is what is known as moving the goalposts - look it up.
> let's be honest, there will be no possible implementation of this
Prefixing some assertion with "let's be honest" does not prove it or even support it in any way. If you don't have any actual supporting arguments, there's nothing to discuss, to be honest.
> Disingenuous is what you are writing - "oh no, it's a hard problem, they can't be expected to even try to solve it."
I find it funny that people bring this pseudo-argument up whenever this issue is discussed. Customers: "We want A, it's crucial for us". People on the Internet: "Do you have any idea how difficult A is to implement? How would it work?" And the discussion diverges into technical details, obscuring the main point: AWS is bent on never implementing this feature, even though in the past (that is, more than a decade ago) they promised they would.
Any number of reasons: language barriers, existing American firms anti-competing, smaller domestic markets, less centralisation, and, yes, in some cases, regulation. But when it comes down to it, it's better to have smaller firms that don't damage society (or do so less frequently) than larger firms that do, even just from the perspective of wealth distribution.
I think he's saying that, yes, this regulation means that your own companies are more ethical, but European consumers end up using those less-regulated American companies anyway. This is true, but the EU has started to solve this problem too, for example with the Digital Markets and Services Acts.
The best way to do this is message passing. My current way of doing it is using Aeron[0] + SBE[1] to pass messages very efficiently between "services" - you can then configure it either to use local shared memory (/dev/shm) or to replicate the log buffer over the network to another machine.
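A minimal sketch of that configuration switch. Aeron itself is a Java/C++ library, so this Python snippet only builds the channel URI a publication/subscription would use; the URI formats are real Aeron channel strings, but everything else here is hypothetical:

```python
def aeron_channel(local: bool, peer: str = "10.0.0.2", port: int = 40123) -> str:
    """Return an Aeron channel URI: shared-memory IPC for services on one
    box, or UDP to replicate the log buffer to another machine."""
    if local:
        return "aeron:ipc"  # backed by /dev/shm on Linux
    return f"aeron:udp?endpoint={peer}:{port}"

print(aeron_channel(local=True))    # aeron:ipc
print(aeron_channel(local=False))   # aeron:udp?endpoint=10.0.0.2:40123
```

The producer/consumer code stays the same in both deployments; only the channel string changes.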
Part of test-driven design is using the tests to drive out a sensible and easy to use interface for the system under test, and to make it testable from the get-go (not too much non-determinism, threading issues, whatever it is). It's well known that you should likely _delete these tests_ once you've written higher level ones that are more testing behaviour than implementation! But the best and quickest way to get to having high quality _behaviour_ tests is to start by using "implementation tests" to make sure you have an easily testable system, and then go from there.
>It's well known that you should likely _delete these tests_ once you've written higher level ones that are more testing behaviour than implementation!
Building tests only to throw them away is the design equivalent of burning stacks of $10 notes to stay warm.
As a process it works. It's just 2x easier to write behavioral tests first and thrash out a good design later under its harness.
It mystifies me that doubling the SLOC of your code by adding low level tests only to trash them later became seen as a best practice. It's so incredibly wasteful.
> As a process it works. It's just 2x easier to write behavioral tests first and thrash out a good design later under its harness.
I think this “2x easier” only applies to developers who deeply understand how to design software. A very poorly designed implementation can still pass the high level tests, while also being hard to reason about (typically poor data structures) and debug, having excessive requirements for test setup and tear down due to lots of assumed state, and be hard to change, and might have no modularity at all, meaning that the tests cover tens of thousands of lines (but only the happy path, really).
Code like this can still be valuable of course, since it satisfies the requirements and produces business value, however I’d say that it runs a high risk of being marked for a complete rewrite, likely by someone who also doesn’t really know how to design software. (Organizations that don’t know what well designed software looks like tend not to hire people who are good at it.)
"Test driven design" in the wrong hands will also lead to a poorly designed non modular implementation in less skilled hands.
I've seen plenty of horrible unit-test-driven code with a mess of unnecessary mocks.
So no, this isn't about skill.
"Test driven design" doesnt provide effective safety rails to prevent bad design from happening. It just causes more pain to those who use it as such. Experience is what is supposed to tell you how to react to that pain.
In the hands of junior developers, test driven design is more like test driven self-flagellation in that respect: an exercise in unnecessary shame and humiliation.
Moreover, since it prevents those tests with a clusterfuck of mocks from operating as a reliable safety harness (because they fail when implementation code changes, not in the presence of bugs), it actively inhibits iterative exploration towards good design.
These tests have the effect of locking in bad design because keeping tightly coupled low level tests green and refactoring is twice as much work as just refactoring without this type of test.
> I've seen plenty of horrible unit-test-driven code with a mess of unnecessary mocks.
Mocks are an anti-pattern. They are a tool that either by design or unfortunate happenstance allows and encourages poor separation of concerns, thereby eliminating the single largest benefit of TDD: clean designs.
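To make that concrete, a hypothetical pytest-style sketch (all names invented): the first test is welded to the implementation's collaborators, the second only to observable behaviour, so it survives refactoring.

```python
from unittest.mock import MagicMock

class PriceService:
    def __init__(self, repo):
        self.repo = repo

    def total(self, item_ids):
        return sum(self.repo.price_of(i) for i in item_ids)

# Implementation-coupled: breaks if price_of is renamed or the lookups are
# batched, even though the observable behaviour is unchanged.
def test_total_with_mock():
    repo = MagicMock()
    repo.price_of.side_effect = [3.0, 4.0]
    assert PriceService(repo).total(["a", "b"]) == 7.0
    repo.price_of.assert_any_call("a")

# Behaviour-coupled: a trivial in-memory fake, no interaction assertions.
class FakeRepo:
    def price_of(self, item_id):
        return {"a": 3.0, "b": 4.0}[item_id]

def test_total_with_fake():
    assert PriceService(FakeRepo()).total(["a", "b"]) == 7.0
```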
> … TDD is a "design practice" but I find it to be completely wrongheaded.
> The principle that tests that couple to low level code give you feedback about tightly coupled code is true, but it does that because low level/unit tests couple too tightly to your code - i.e. because they too are bad code!
But now you’re asserting:
> "Test driven design" in the wrong hands will also lead to a poorly designed non modular implementation in less skilled hands.
Which feels like it contradicts your earlier assertion that TDD produces low-level unit tests. In other words, for there to be a “unit test” there must be a boundary around the “unit”, and if the code created by following TDD doesn’t even have module-sized units, then is that really TDD anymore?
Edit: Or are you asserting that TDD doesn’t provide any direction at all about what kind of testing to do? If so, then what does it direct us to do?
>"Test driven design" in the wrong hands will also lead to a poorly designed non modular implementation in less skilled hands.
>Which feels like it contradicts your earlier assertion that TDD produces low-level unit tests.
No, it doesn't contradict that at all. Test driven design, whether done optimally or suboptimally, produces low level unit tests.
Whether the "feedback" from those tests is taken into account determines whether you get bad design or not.
Either way, I do not consider it a good practice. The person I was replying to was suggesting that it was a practice better suited to people with a lack of experience. I don't think that is true.
>Or are you asserting that TDD doesn’t provide any direction at all about what kind of testing to do?
I'm saying that test driven design provides only weak direction about design, and it is not uncommon for it to still produce bad designs because that weak direction is not followed by less experienced people.
Thus I don't think it's a practice whose effectiveness is moderated by experience level. It's just a bad idea either way.
> Whether the "feedback" from those tests is taken into account determines whether you get bad design or not.
Which to me was kind of the whole point of TDD in the first place: to let the ease and/or difficulty of testing become feedback that informs the design overall, leading to code that requires less setup to test, fewer dependencies to mock, etc.
I also agree that a lot of devs ignore that feedback, and that just telling someone to "do TDD" is pointless unless you first make sure they know to strive for little to no test setup, few or no mocks, and so on.
Overall I get the sense that a sizable number of programmers accept a mentality of “I’m told programming is hard, this feels hard so I must be doing it right”. It’s a mentality of helplessness, of lack of agency, as if there is nothing more they can do to make things easier. Thus they churn out overly complex, difficult code.
>Which to me was kind of the whole point of TDD in the first place: to let the ease and/or difficulty of testing become feedback that informs the design overall
Yes and that is precisely what I was arguing against throughout this thread.
For me, (integration) test driven development is about creating:
* A signal to let me know if my feature is working and easy access to debugging information if it is not.
* A body of high quality tests.
It is 0% about design, except insofar as the tests give me a safety harness for refactoring or experimenting with design changes.
Don't agree, though I think it's more subtle than "throw away the tests" - more "evolve them to a larger scope".
I find this particularly with web services, especially when the services are some form of stateless calculators. I'll usually start with tests that focus on the function at the native programming language level. Those help me get the function(s) working correctly. The code and tests co-evolve.
Once I get the logic working, I'll add on the HTTP handling. There's no domain logic in there, but there is still logic (e.g. mapping from JSON to native types, authentication, ...). Things can go wrong there too. At this point I'll migrate the original tests to use the web service. Doing so means I get more reassurance for each test run: not only that the domain logic works, but that the translation in and out works correctly too (a sketch of that migration follows).
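A minimal sketch of those two stages, assuming the FastAPI + pytest stack mentioned downthread; the endpoint and domain function are invented for illustration:

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

app = FastAPI()

def quote_premium(age: int) -> float:
    """Domain logic: the 'stateless calculator' under test (invented example)."""
    return 100.0 if age < 25 else 60.0

@app.get("/premium")
def premium(age: int) -> dict:
    # Thin HTTP layer: query-param parsing and JSON mapping only.
    return {"premium": quote_premium(age)}

# Stage 1: native-language test, used while the domain logic stabilises.
def test_premium_native():
    assert quote_premium(21) == 100.0

# Stage 2: the same case, migrated to exercise the web service, so each run
# now also covers the JSON/type translation in and out.
def test_premium_http():
    response = TestClient(app).get("/premium", params={"age": 21})
    assert response.json() == {"premium": 100.0}
```

Once test_premium_http exists, test_premium_native is covering a strict subset of it and can be dropped.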
At that point there's no point leaving the original tests in place. They're just covering a subset of the E2E tests so provide no extra assurance.
I'm therefore with TFA in leaning towards E2E testing because I get more bang for the buck. There are still places where I'll keep native language tests, for example if there's particularly gnarly logic that I want extra reassurance on, or E2E testing is too slow. But they tend to be the exception, not the rule.
> At that point there's no point leaving the original tests in place. They're just covering a subset of the E2E tests so provide no extra assurance.
They give you feedback when something fails, by better localising where it failed. I agree that E2E tests provide better assurance, but tests are not only there to provide assurance, they are also there to assist you in development.
Starting low level and evolving to a larger scope is still unnecessary work.
It's still cheaper to start off building a Playwright/calls-a-REST-API test against your web app than to build a low level unit test and "evolve" it into a Playwright test.
I agree that low level unit tests are faster and more appropriate if you are surrounding complex logic with a simple and stable API (e.g. testing a parser), but it's better to work your way down to that level when it makes sense, not start there and work your way up.
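For illustration, a minimal Playwright (Python) sketch of that kind of test - the URL and selectors are hypothetical, not from any real app:

```python
from playwright.sync_api import sync_playwright

def test_search_returns_results():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:8000")   # hypothetical app under test
        page.fill("#query", "blue widgets")  # hypothetical form field
        page.click("text=Search")
        assert page.locator(".result").count() > 0  # hypothetical result rows
        browser.close()
```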
That's not my experience. In the early stages, it's often not clear what the interface or logic should be - even at the external behaviour level. Hence tests and code evolve together. Doing that at the native-code level means I can focus on one thing: the domain logic. I use FastAPI plus pytest for most of these projects. The net cost of migrating a domain-only test to use the web API is small. Doing that once the underlying API has stabilised is less effort than starting with a web test.
I don't think I've ever worked on a project where they hadn't yet decided whether they wanted a command line app, a website, or an Android app before I started. That part is usually set in stone.
Sometimes lower level requirements are decided before higher level requirements.
I find that this often causes pretty bad requirements churn - when you actually get the customer to think about the UI, or get them to look at one, inevitably the domain model gets adjusted in response. This is the essence of why BDD/example-driven specification works.
What exactly is it wasting? Is your screen going to run out of ink? Even in the physical construction world, people often build as much or more scaffolding than the thing they're actually building, and that takes time and effort to put up and take down, but it's worthwhile.
Sure, maybe you can do everything you would do via TDD in your head instead. But it's likely to be slower and more error-prone. You've got a computer there, you might as well use it; "thinking aloud" by writing out your possible API designs and playing with them in code tends to be quicker and more effective.
Time. Writing and maintaining low level unit tests takes time. That time is an investment. That investment does not pay off.
Doing test driven development with high level integration tests also takes time. That investment pays dividends though. Those tests provide safety.
>Sure, maybe you can do everything you would do via TDD in your head instead. But it's likely to be slower and more error-prone.
It's actually much quicker and safer if you can change designs under the hood and you don't have to change any of the tests, because they validate all the behavior.
Quicker and safer = you can do more iterations on the design in the available time = a better design in the end.
The refactoring step of red, green, refactor is where the design magic happens. If the refactoring turns tests red again that inhibits refactoring.
> It's well known that you should likely _delete these tests_ once you've written higher level ones that are more testing behaviour than implementation!
Is it? I don't think I've ever seen that mentioned.
This year I did some research into this, in the context of maybe trying to start a business to solve it - and this was the conclusion I came to. There are lots of threads here on HN about it too. It's a structural, market-wide issue where the primary service Ticketmaster provides is reputation laundering, and in return, large agents and promoters agree to continue to use Ticketmaster despite their reputation.
Like, don't go to overpriced concerts? How much more obvious a solution could there be? No one in that chain creates value, so there's no need for you to feed their greed.
> There are many instances I've encountered where two pieces of code coincided to look similar at a certain point in time. As the codebase evolved, so did the two pieces of code, their usage and their dependencies, until the similarity was almost gone