I've been programming for over a quarter century, and I feel pretty confident that I've never written anything that could confuse 'null' with NULL. I can't even think of a language that would let you easily do this.
If web forms aren't accepting NULL, then somebody probably specifically programmed the word 'null' into a filter of disallowed entries. Probably to stop clerks from entering the word 'null' to mean empty string. This has nothing to do with null being a reserved word in many languages, I'll bet the forms that aren't accepting 'null' aren't accepting 'none' or 'empty' either.
Object a = null;
String s = "foo bar " + a; // s == "foo bar null"
So, it's possible to get "null" as a string downstream, when some variable that should have been non-nullable was null. If you find such a bug and incompetently fix it by checking for "null" downstream instead of checking for null before turning variables into strings, you have the error from the article. Especially if you check ignoring case.
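Here's a minimal sketch of the difference (method names are mine, not from the article):

    // The bug: a null flows into concatenation and becomes the string "null".
    static String greet(String name) {
        return "Hello, " + name;              // name == null -> "Hello, null"
    }

    // The incompetent fix: match the symptom downstream, case-insensitively -
    // which also swallows a real customer named Null.
    static boolean isMissingBad(String rendered) {
        return rendered.endsWith("null") || rendered.endsWith("NULL");
    }

    // The proper fix: reject the null before it ever becomes a string.
    static String greetChecked(String name) {
        if (name == null) throw new IllegalArgumentException("name is required");
        return "Hello, " + name;
    }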
I've known Java to be a bit verbose, but I thought it was mostly a reasonable language aside from that. I might have expected something like that from JavaScript or PHP. Never in my wildest dreams would I have thought that boring old Java would render a null as the literal string "null" when concatenating strings. Even Ruby and Python don't do that - they throw type mismatch errors instead. C# treats it as an empty string.
Ehh, I don't really think this is a huge deal personally. Keep in mind that it is not that `(String)null` gives you the "null" string, or that `((String)null).toString()` gives you the "null" string (that throws an NPE). You have to use actual string concatenation to get it (or, as others have pointed out, `String.valueOf(null)`). Point being, it isn't/shouldn't be messing up your regular comparisons (unless you turn it into a string beforehand), and you really shouldn't be checking for stuff like `null` against a string literal anyway - it's meant for displaying a human-readable string, not an unambiguous string that could be converted back at a later time. If you wanted that, then you should use proper serialization techniques rather than just concatenating a bunch of values.
It's not that it's really that huge of a deal, just that it's so out of character for Java and everything it is seen as.
It's like a guy you've been friends with since college, someone you always knew to be solid and reliable but a little boring. Many years later, he's happily married with kids, watching sports, working in a large hierarchical organization, doing pretty average stuff, nothing unusual or exciting. If you suddenly found out that guy had secretly been a furry the whole time, that would be really shocking.
Not that there's anything wrong with being boring and reliable, or being a furry. But the sudden change in how you saw somebody or something is stunning.
I half expect anything written in Java to be littered with AbstractFactoryFactories and giant frameworks using 20 different design patterns to write Hello World. I never would have expected Java to silently convert an actual null to the string "null". I always thought the one thing you can count on Java for was to be strictly strongly typed, and never silently do any weird random conversions that nobody would have expected. Guess I was wrong.
Though poking around in a few other languages, JavaScript does indeed do that also, though I kinda expect JavaScript to do things like that. In Ruby, nil.to_s gives empty string, and adding a nil to a string gives you a type error. In Python, trying to add None to a string also gives you a type error, but str(None) does give you 'None'. That's a bit disappointing, but not shocking to me.
A lot of languages overload + to mean concatenate. I think that's a mistake, however...
In Python:
"abc" + None
TypeError: cannot concatenate 'str' and 'NoneType' objects
"abc".join(None)
TypeError: can only join an iterable
In Ruby
"abc" + nil
TypeError: no implicit conversion of nil into String
In Lua
= "abc" .. Nil
attempt to concatenate global 'Nil' (a nil value)
But not Java. Java implicitly tries to coerce the provided value to a string, which seems out of place in a language that values type safety. That every reference type might also be null seems out of place in a language with strong, static typing too, but that's been discussed to death. The problem is compounded by the fact that null is actually converted to "null" instead of an empty string.
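For contrast with the snippets above, the equivalent Java compiles and runs without complaint:

    Object a = null;
    String s = "abc" + a;    // no error: s == "abcnull"
    String t = "abc" + null; // even a literal null is fine: "abcnull"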
That's a NameError in Python: undefined variables don't have a value, not even None. It won't be caught until runtime since there's no AOT compiler in Python, but it's the same kind of error it is in Java.
My claim isn't that Python is safer overall than Java. Instead, it's that Java, a language that is mostly type safe, most of the time should not have these two potentially surprising behaviors:
1. The standard string concatenation operator does implicit coercion rather than rejecting an input that isn't a string. There should be a builtin to make a string from any value no matter what for logging and debugging, but that shouldn't be the standard concatenation operator.
2. This is more controversial, but strongly statically typed languages should not allow arbitrary values to be null. That sabotages one of the major strengths of strong static typing. Instead, there should be an option type to make it explicit (a Java-flavored sketch follows below). For something familiar to most programmers, SQL does this.
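For the Java-specific flavor of point 2, here's a minimal sketch using java.util.Optional (added in Java 8). It's only a library convention, not an enforced option type, but it makes absence explicit in the signature:

    import java.util.Optional;

    // The return type says "may be absent"; a plain String wouldn't.
    static Optional<String> findNickname(String user) {
        return Optional.empty(); // stand-in for a real lookup
    }

    // Callers must handle absence instead of discovering it via an NPE:
    String nick = findNickname("bob").orElse("(none)");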
The reason is that the string concatenation operator does a String.valueOf() on the argument, and that delegates to toString(), which promises a human-readable, relevant string.
For some reason someone decided to check for null instead of the stricter option of throwing a NullPointerException. Maybe to help debugging, or to follow the gist of toString().
I guess it was too late to change even in Java 0.9... Autoboxing, iterators and other newer features are more strict, so I suppose it's just one of the few irregularities left from prehistoric times...
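Concretely, String.valueOf(Object) is essentially a two-liner in the JDK, and (pre-Java 9, at least) javac desugars concatenation into StringBuilder calls that bottom out in it:

    // java.lang.String, paraphrased from the JDK source:
    public static String valueOf(Object obj) {
        return (obj == null) ? "null" : obj.toString();
    }

    // So "foo" + a compiles to roughly:
    String s = new StringBuilder().append("foo").append(a).toString();
    // StringBuilder.append(Object) calls String.valueOf(a), hence "null".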
> it's possible to get "null" as a string downstream
I don't see the problem with that. Strings that contain formatted variables should be used for display only. Besides, when someone inputs his name, the input is already a string, and I don't see how you would dereference string contents.
Sure – that's just casting a null type as a string.
So I guess if you have a comparison that somehow casts null to a string before comparing, you could run into this issue, but that's still bad programming.
I used to own null@myundergrad.edu as an email alias (with my real university, not that generic .edu, of course) and I got all sorts of interesting things... but that was very intentional on my part.
Funnily enough - it's not the null -> String conversion, it's the "+" operator. That was at one point the subject of heated debate at my previous job :)
See:
"null".equals((String)null) //false
"null".equals((String)null+"") //true
System.out.print((String)null) // prints "null" as well - PrintStream.print(String) substitutes "null" for a null argument
System.out.println((char[])null) // throws NPE - the char[] overload dereferences the array
Right, but that operator is implicitly casting the null object to a string type – it has to, in a strict sense... some other languages would raise an error (and many would do the same as Java).
It doesn't have to. ((Object)null).toString() throws an NPE, as it should. I would expect "foo" + null to throw an NPE as well.
BTW, there's another "fun" gotcha when you interface Java code with an Oracle database, which considers the empty string and NULL to be the same thing. Depending on how you handle data from the database, you end up with null, "", or "null" :)
String.valueOf((Object)null) returns "null" though, and that's effectively what "foo" + null uses. (Calling String.valueOf(null) with no cast actually picks the more specific char[] overload and throws an NPE.) I can understand why it would be surprising if you thought it used .toString(), but it's clearly documented in the standard.
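A quick snippet showing the overload subtlety (variable names are mine):

    String a = String.valueOf((Object)null); // "null" - the Object overload
    String b = "foo" + null;                 // "foonull" - same code path
    // String c = String.valueOf(null);      // resolves to valueOf(char[]) -> NPE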
One place where that can easily happen is loading comma-separated value files into a database. Some CSV files are formatted with the convention that fields only appear in quotes if they contain special characters. Thus, a data value of NULL is not quoted. This loses the distinction made in SQL databases between NULL, the null value, and "NULL", the string.
(I just received some files like this. One is a file that has names. If someone had a name of NULL, it would go into the database as a null.)
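A sketch of where the loss happens, assuming the quotes-only-when-needed convention described above (the helper is hypothetical):

    // By the time a CSV parser has stripped optional quotes, the SQL null
    // and the literal name NULL both arrive as the same four characters.
    static Object fieldToSqlValue(String rawField) {
        return rawField.equals("NULL") ? null : rawField; // Mr. Null loses
    }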
I just checked one of the projects where I work and a search for "null" returns 32 matches. It's a Java project, and I believe the checks came about because stuff was being converted to strings and then sent back and forth between HTML forms and the backend.
It's one of the reasons I wrote a pared down 'dumb' YAML parser that assumes scalar values are strings unless directed otherwise: https://github.com/crdoconnor/strictyaml
JavaScript lets you do anything wrong if you use == and not === though. I'm pretty sure the machine apocalypse is going to happen because someone types == instead of === at this point.
Blackhat conference talk we might see in the future: "How I achieved remote code execution on the T-1000 and singlehandedly averted the extinction of the human race"
Can you post exactly what you did in PHP that shows "NULL" is equal to NULL? There's quite a few approaches like ==, ===, and is_null(). I can't get any to think "NULL" is NULL, though I imagine I'm missing something.
When using Sequel Pro with a MySQL database, typing NULL into a string field will make that field null rather than 'null', which always seemed like an odd design choice to me.
It's a good compromise, I think. Otherwise, updating a field to null would need a button, or right-clicking and selecting a null option. I would expect the number of people who actually want to enter a 'null' string to be minimal...
Eval used to be used a lot, because web forms send everything as strings and you need to convert them to whatever datatypes on the back end. Eval probably permeates a lot of older legacy systems.
Reminds me of the time we ordered our high school football jerseys. We filled out a form listing the requested size and the spelling of our last name to be printed on the back of our jersey. A couple of weeks later, all the jerseys were delivered and we excitedly opened up the packing boxes to hand them out. Imagine the surprise and ensuing hilarity when our good friend Marshall Blank's jersey came out of the box with no name printed on it whatsoever.
My old manager's last name is Blank. The sad thing, given that 'blank' isn't a reserved word at all, is that he has the same problem. I think it's Eventbrite or Ticketmaster, I forget, but one of those sites won't accept "Blank" as his last name, literally with the message "Last name can not be blank"...
We're Dutch, and the é is part of our language, and even part of the legacy character encoding standard everyone used before Unicode's widespread adoption. This is just a matter of code that works perfectly as long as all characters are part of the ASCII set, but fails on the characters that don't conveniently match between UTF-8 and ISO-8859-15.
I doubt these issues will go away within even, say, twenty years.
It's getting much better. Almost everything new is in UTF-8.
I've been writing code to clean up a 2013 database dump. The database stored everything in LATIN-1 fields. Not because the data is in LATIN-1, but because LATIN-1 will accept any byte value. This makes error messages during input go away. See this bad advice on Stack Overflow.[1]
Some of the data is ASCII. Some is UTF-8. Some is Windows-1252. Some data is none of those, but is mostly ASCII except that there's a 0x9d once in a while. (Still haven't figured out what character set that is. From context, the ™ or ® symbol is intended.) So I have recognizers for these cases, and convert everything to UTF-8, testing every field value individually.
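A minimal sketch of one such recognizer in Java, assuming strict UTF-8 validation with a Windows-1252 fallback (the real heuristics would need more cases than this):

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    static String decodeField(byte[] raw) {
        try {
            // Strict decode: REPORT makes invalid byte sequences throw.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; fall back to Windows-1252.
            return new String(raw, Charset.forName("windows-1252"));
        }
    }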
One column has garbaged non-English names. Someone had tried to "normalize" UTF-8 to lower case by using an ASCII lowercasing function on UTF-8 stored in a LATIN-1 field:
KACMAZLAR MEKANİK -> kacmazlar mekanä°k
Anita Calçados -> anita calã§ados
Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta
Even if something new is UTF-8, you'd basically have to guarantee that it never interfaces with anything old and/or broken to ensure the data survives intact. I've recently received a package where the ö in my name was written as ̦ which I've never seen before. Things go wrong in the most unexpected places and the sad thing is that even if you use UTF-8 for text storage, you still have to know what you can do and how to do it to not mangle it.
There are a lot of Unicode-hostile environments out there. Java is old enough to always require an explicit encoding declaration for pretty much any tool... compiler, documentation generator, etc. Forget it at any one point and you get garbage. Reading or writing text files should always make the encoding explicit, but rarely does so. C#'s string methods all support, but don't require, a Culture parameter, without which you're practically guaranteed to do things like case conversion or substring searches wrong in the general case. There was an awesome and long answer by tchrist on SO once about what the Perl boilerplate is to properly support Unicode for many or most circumstances (it's complicated and long and I doubt many people are going to those lengths).
Point being, even when using something that supports Unicode well, the programmer still has to care, simply because text and language are messy things and it simply isn't possible to have a magic bullet that does everything right.
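As a small Java example of what "caring" looks like (assuming Java 11+ for Files.readString): read text with an explicit charset instead of the platform default, which is exactly where the garbage sneaks in:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    static String readNotes() throws java.io.IOException {
        // Bad (pre-Java 18): new FileReader("notes.txt") silently uses the
        // platform default charset and decodes differently per machine.
        // Good: an explicit charset decodes the same bytes the same everywhere.
        return Files.readString(Path.of("notes.txt"), StandardCharsets.UTF_8);
    }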
Dutch programmers probably have little incentive to get it right, because Dutch makes relatively light use of accents. I wouldn't be surprised if Czech programmers, for example, were much more meticulous at converting between UTF-8 and legacy encodings.
Here in CJK territory, using the wrong encoding makes the output so obviously broken [1] that mistakes are almost always caught before hitting production.
I'm French and I see this all the time. Even Outlook, given the right conditions, will give you this in the default folders; Inbox in French is "Boîte de réception".
Don't get me started on Outlook folders. The actual names of the folders are localised, not just the way they're presented to the user. The folders are created when you first start Outlook (and not when your account is created).
If you happen to be using Windows configured in a foreign language the first time you start Outlook, your inbox, sent mail, etc, folders are named according to that language, and will never change, and you'll have to live with non-standard names for the folders.
At least it teaches you how to configure folders manually in most email clients.
That's why I take care that my OS doesn't know I'm French. But it's a luxury that most French people can't afford, and they have to live with bugs. E.g. having both a "Download" and a "Téléchargement" folder, with "Download" sometimes being translated.
This is something Apple got right. The underlying folder has a standard name, and the translated name is just a different presentation. If you change language, the names of the translated folders change.
For anyone who doesn't know OSX, this translation happens on the UI level. Typing ls in a terminal gives you the real directory name.
> I doubt these issues will go away within even, say, twenty years.
Much like the printing press, I'm 100% certain that computing (and the internet specifically) is altering human written language across the world.
It is just so much easier to avoid anything outside ASCII, because you can be certain ASCII will always work - even through some awful MS Access -> CSV -> SQL -> SQL -> Excel -> SQL -> COBOL ETL pipeline. No matter what version of any software is being used.
Technology has always shaped written language and we should fight to do better but at this point it seems inevitable.
(To be clear: I'm not saying this is a good or desirable state of affairs)
I really doubt that there is any transliteration into ASCII that's acceptable to European speakers.
E.g. A, Ą, Å, Æ and Ä are all different letters in most languages, not simply a pronunciation guide. I think most European languages use at least two from that list.
I feel your pain. My wife has a French first name with an acute accent over the "e" as well. Many forms will outright reject the accent, so I've taken to just not typing it most of the time. Being American, it doesn't really bother her, though. I know in some languages missing accents and other similar markings changes the pronunciation and meaning of words and can be very irritating to some.
> accents and other similar markings changes the pronunciation and meaning of words
Yes. In some languages those are actually not "markings" but denote proper letters, like ä, ö, ü and ß in German. But even where they aren't, as in French, they can alter the meaning of words, e.g. la != là. Therefore, for most Europeans and speakers of other languages that depend on more letters than ASCII provides, it is very annoying when that is not supported properly.
However, I have found in a few cases that Americans in particular have a hard time understanding this. The remark about your wife not caring seems to be in this vein, too. Recently, I decided to convert our MySQL DB tables from latin1 to UTF8. (I wasn't even aware that we didn't have some form of unicode, as our DB is only a few years old, and I thought some unicode was the default everywhere nowadays. But then, MySQL...)
Anyway, my CEO (also an American incidentally) was trying to keep me from it because he thought it's not high priority. However, we're about to go live in a French-speaking region, but which also has other indigenous languages (and therefore names), with their own "special" characters (I put "special" in quotes because for those languages, they're not "special" at all -- but I guess you get my gist by now).
Also, in previous jobs I have converted legacy systems to unicode and know what a pain it is down the road. Not to mention all the hard-to-find bugs if you don't do it, because some strings don't compare as they should, or people are just annoyed because their name is not shown correctly.
So I went ahead with the conversion anyway. We may never know for sure, but I'm convinced that I saved us some major customer frustrations, days of bug hunting and weeks of converting everything later, when existing data would need to be migrated.
So please everyone, just use UTF8 or some other unicode variant from the get-go. The few bits you might save otherwise are just not worth it.
It's not always a computer problem.
I have an O' name. I've had to explain to many college-educated adults the difference between a quote ("), an apostrophe ('), and whatever the hell you call the character under the tilde (`). I kid, it's the grave.
If you really want some fun, try explaining to people the difference between modern Greek tonos (e.g. ή, Unicode 03ae) and ancient Greek oxia (e.g. ή, Unicode 1f75). Or in UTF-8: tonos ή is 0xCE 0xAE, while oxia ή is 0xE1 0xBD 0xB5.
It's super fun when you are the only tech person in a Classics grad program and everyone else is turning in papers that look like ransom notes, because every fourth vowel has a different x-height from all the other letters. :-)
You're not saying the classicists confuse the smooth breathing with the acute accent, right? Only they can't figure out how to typeset them?
How do you enter them nowadays by the way? In the early days of the internet before unicode there were special fonts from SIL for example that first of all were using Latin characters (you'd type W and it would look like Ω) and secondly I think had the diacritics as separate characters, so the font would combine them. It was messy but at least you could type it on a normal keyboard. You could even read it in its ASCII form, most of the hacks were pretty reasonable.
Not smooth breathing and acute, but Unicode has two codepoints for acute accents (sort of). One is the modern Greek tonos, which often looks just like an acute accent but sometimes appears as a dot or straight vertical line, and the other codepoint is the ancient Greek oxia (the ancient term for the acute accent), which appears with all the other marks and their combinations (grave, circumflex, smooth/rough breathing, iota subscript).
Personally I type Greek using a vim keymap file, usually when writing LaTeX. I believe my keymap file is influenced by those older non-Unicode fonts, because I type w for ω and ;h for ή and >~h| for ᾖ. But my fellow students would mostly use Word. I don't know how they would type the letters, but commonly they would mix the tonos letters with everything else. I think what was happening is that Word would automatically substitute fonts that offered those codepoints, so it would end up showing Times for the letters with tonos and Palatino for the others (or something like that). Hence the ransom note effect.
My kingdom for a good monospaced font and a supporting editor to handle these and RTL languages. I've found setups that work with one or the other, but none seems to do it all just yet.
"Programmers use the grave accent symbol as a separate character (i.e., not combined with any letter) for a number of tasks. In this role, it is known as a backquote or backtick."
That was very illuminating, thanks. Interesting that the idea was there originally but was later deprecated when they realized video terminals couldn't do it. Today it would be trivial to convert the sequence letter/backspace/accent to its UTF-8 equivalent.
Heck, in some fonts ` even takes the form of ‛ or ‘. Which is probably where typing quotes like ``this'' comes from (which looks absolutely awful nowadays since it mixes an uncombined diacritic with a kludge of a punctuation character).
Manpages aren't written in TeX though (apparently they use something called "roff"), but they also contain things like `read' instead of 'read'... perhaps manpage writers tended to like TeX too?
Probably the other way round: roff/troff/nroff is older than TeX. I guess that on old computer typesetters, ` and ' turned into nicely balanced quotes.
In old fonts, ` often displayed as ‘ and ' displayed as ’. When they wrote ``foo'', that was meant to show up as ‘‘foo’’, which was a workaround for not having characters to properly spell “foo”.
Even more annoying is when you realise that those quotes are a hack because keyboards don't have separate left and right single and double quotation marks. The rabbit hole goes deeper…
I get that with periods in a street name. Like when I write "123 Main St.", and the web site checks against the USPS's database, it will either reject it outright for having "special" characters, or will say, "We didn't find that, but we found this similar address - '123 Main St', which should we use?" It's so fucking dumb.
I had a fun issue signing up for a bank here in Japan.
Japanese people can't have middle names (the citizen registry doesn't allow it), but foreigners can, so many systems will reject spaces in the name field. Meanwhile they're meticulous about making sure your name matches your ID exactly, leading to the situation I had.
The bank's web signup form disallowed spaces in your name, so I wrote my name FIRSTMIDDLE. Then when they processed my application they sent me an email "Your name doesn't match your ID card! Please approve the change to 'FIRST MIDDLE'."
I sympathize, but it is a hard problem to solve. On the seller side, many orders come in with addresses that are just wrong, or missing suite numbers, business names, etc.
UPS and FedEx charge shippers ~$15 for each instance where they have to correct an address.
So, yeah, the period thing is dumb, but automated correction is hard. Even experts, like SmartyStreets, get it wrong often.
Dutch first names confuse systems as well. Gmail doesn't understand the first name of my colleague is "Jan Willem" and not just "Jan", when showing the list of recipients.
Many years ago, computer systems analyzed my last name, which is two separate words, as meaning I was married. I got quite a lot of snail mail calling me Mrs. [2nd part of last name] for some weird reason.
One of my LLCs is named "Null Ventures LLC". I sometimes get things via snail mail that are addressed in interesting ways. The most common is simply "<space>Ventures LLC". Unlike the author, I've never had issues (AFAIK, anyways) receiving e-mail to the domain.
Reminds me of running into this bug [1] at work that caused a few minutes of head scratching. Couldn't make any changes to a user's profile as their surname contained eval.
I don't get what's happening at the low-level for this to be a problem.
It seems like you'd have to do something pretty stupid at the coding level to introduce a problem with "Null" by mistake. I'm sure it happens, but it can't be more than an occasional issue.
My best guess is that there are common old databases that did not have a first class null type where it was common practice to use the string "NULL" for that purpose. And that companies that have these old systems are proactively filtering user input to prevent causing these old system to choke... It sounds like the filter is case-insensitive, though, which would be too aggressive for the case I'm thinking of. Maybe they are (mis)using a bad word filter for this, which would tend to be aggressive.
> My best guess is that there are common old databases that did not have a first class null type where it was common practice to use the string "NULL" for that purpose.
It isn't a DB issue. It's more of a front-end or middle tier issue.
In the SQL standard, NULL is a special marker. NULL and "NULL" are two separate things: one is the null marker and the other is a string value.
Or more specifically, it's an "interface" problem between RDBMSs and the front end, since languages handle null differently. Many languages didn't have nullable types, and null in certain languages means something different than "lack of information".
For example, if a database column was a nullable int column and you wanted to bring it out into Java or .NET space, you would have issues, since int in Java and .NET is a value type, not a reference type, so you couldn't assign null to it. Whereas with a string/text/varchar column you could, since strings in Java and .NET are reference types and can be null.
In some languages, checking for null means you have to convert null into a string and then compare "null" == "null".
It's a legacy of lack of Nullable types in many programming languages. With the introduction of Nullable types many of these problems went away.
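A small sketch of the JDBC edge of this, using the standard ResultSet API: a SQL NULL read into a primitive int silently becomes 0, so wasNull() (or a boxed type) is needed to keep the distinction:

    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Read a nullable INT column without losing NULL.
    static Integer readNullableInt(ResultSet rs, String column) throws SQLException {
        int n = rs.getInt(column);      // SQL NULL comes back as 0 here...
        return rs.wasNull() ? null : n; // ...so wasNull() disambiguates it
    }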
I believe the problem might be more common when exporting data from one system to another. I've written code to migrate (and merge) large datasets from old legacy systems. I needed quite a lot of heuristics and "best-effort" transformations in order to deal with data inconsistencies... mostly string manipulations where it's not hard to imagine bugs like this happening.
I'm trying to think how this would be a problem. It wouldn't directly be an issue created by our company, but it could be an issue with the companies we deal with, passing around large databases of different kinds of data on people in America. We commonly have to deal with large pipe-delimited data sets, and after loading one I'm sure my boss could see a last name of null in MySQL and go "let's delete all the 'null' names".
I don't understand. Are you agreeing with me or suggesting a string with the contents "curl" will somehow cause the server to execute the curl command?
This happened. I was working tech support for a company with a very old billing database. Customer was attempting to update his billing information with a new credit card and was immediately getting an error - Forbidden.
Turned out it was his last name, Curl, that was causing the issue. The system was throwing an exception because it interpreted the customer's entry as attempting to execute the curl command.
We ended up having him put "JR" at the end of his last name to prevent it.
Maybe this was back in the days of CGI? It's perfectly feasible that the right combination of characters might break out of an ill-advised pipe to get to the shell.