> Latin 1 standard is still in widespread use inside some systems (such as browsers)
That doesn't seem to be correct. UTF-8 is used by 98% of all websites (https://w3techs.com/technologies/details/en-utf8). I am not sure it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.
One place I know where latin1 is still used is as an internal optimization in javascript engines. JS strings are composed of 16-bit values, but the vast majority of strings are ascii. So there's a motivation to store simpler strings using 1 byte per char.
However, once that optimization has been decided, there's no point in leaving the high bit unused, so the engines keep optimized "1-byte char" strings as Latin1.
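Conceptually the optimization looks something like this (a sketch at the JS level with made-up names; real engines do this in native code):

    // Pick 1-byte storage when every 16-bit code unit fits in a byte,
    // which is exactly the Latin-1 range U+0000..U+00FF.
    function compactStore(str) {
      const units = new Uint16Array(str.length);
      let fitsInOneByte = true;
      for (let i = 0; i < str.length; i++) {
        const u = str.charCodeAt(i);   // 16-bit code unit
        units[i] = u;
        if (u > 0xFF) fitsInOneByte = false;
      }
      return fitsInOneByte
        ? { width: 1, data: Uint8Array.from(units) }  // "Latin-1" string
        : { width: 2, data: units };                  // regular 2-byte string
    }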
> What advantage would this have over UTF-7, especially since the upper 128 characters wouldn't match their Unicode values?
(I'm going to assume you mean UTF-8 here rather than UTF-7 since UTF-7 is not really useful for anything; it's just a way to pack Unicode into only 7-bit ascii characters.)
Fixed width string encodings like Latin-1 let you directly index to a particular character (code point) within a string without having to iterate from the beginning of the string.
JavaScript was originally specified in terms of UCS-2, a 16-bit fixed-width encoding, as this was commonly used at the time in both Windows and Java. However, there are more than 64k characters in all the world's languages, so it eventually evolved into UTF-16, which allows for wide characters.
However, because of this history, indexing into a JavaScript string gives you the 16-bit code unit, which may be only part of a wide character. A string's length is defined in terms of 16-bit code units, but iterating over a string gives you full characters.
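For example:

    const s = 'a😀';      // U+1F600 is outside the 16-bit range
    s.length;             // 3 — length counts 16-bit code units
    s[1];                 // '\uD83D' — a lone surrogate, half the emoji
    [...s];               // ['a', '😀'] — iteration yields whole characters
    s.codePointAt(1);     // 128512 (0x1F600)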
Using Latin-1 as an optimisation allows JavaScript to preserve the same semantics around indexing and length. While it does require translating 8 bit Latin-1 character codes to 16 bit code points, this can be done very quickly through a lookup table. This would not be possible with UTF-8 since it is not fixed width.
EDIT: A lookup table may not be required. I was confused by new TextDecoder('latin1') actually using windows-1252.
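To make the "no lookup table needed" point concrete (just a sketch, not engine code): the first 256 Unicode code points are exactly the Latin-1 characters, so widening is plain zero-extension.

    function latin1ToString(bytes) {
      let out = '';
      for (const b of bytes) out += String.fromCharCode(b);  // byte value == code point
      return out;
    }
    latin1ToString(new Uint8Array([0x48, 0xCB, 0xA5]));      // "HË¥"
    // Note this is not what new TextDecoder('latin1') does: per the WHATWG
    // Encoding spec that label is an alias for windows-1252, so byte 0x80
    // decodes to '€' (U+20AC) rather than U+0080.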
More modern languages just use UTF-8 everywhere because it uses less space on average and UTF-16 doesn't save you from having to deal with wide characters.
And yet HTTP/1.1 headers should be sent in Latin1 (is this fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has special handling for Latin1 strings in JS, for performance reasons I assume.
> Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets.
In practice and by spec, HTTP headers should be ASCII encoded.
> Newly defined header fields SHOULD limit their field values to US-ASCII octets
ASCII octets! That means you SHOULD NOT send Latin1 encoded headers. The opposite of what pzmarzly was saying. I don't disagree that Latin-1 is a superset of ASCII or that it was designed with backward compatibility in mind, but that's not relevant to my response.
SHOULD is a recommendation, not a requirement, and it refers only to newly-defined header fields, not existing ones. The text implies that 8-bit characters in existing fields are to be interpreted as ISO-8859-1.
There is an RFC (2119) that specifies what SHOULD means in RFCs:
> SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
Web servers need to be able to receive and decode latin1 into utf-8 regardless of what the RFC recommends people send. The fact that it's going to become rarer over time to have the 8th bit set in headers means you can write a simpler algorithm than what Lemire did that assumes an ASCII average case. https://github.com/jart/cosmopolitan/blob/755ae64e73ef5ef7d1... That goes 23 GB/s on my machine using just SSE2 (rather than AVX512). However, it goes much slower if the text is full of European diacritics. Lemire's algorithm is better at decoding those.
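For reference, the scalar version of that transform is tiny (a JS sketch of the idea, nothing like the SIMD code being discussed):

    // Latin-1 -> UTF-8: every input byte is one code point; bytes >= 0x80
    // become two UTF-8 bytes (0xC2/0xC3 lead + continuation).
    function latin1ToUtf8(bytes) {
      const out = new Uint8Array(bytes.length * 2);  // worst case: all bytes double
      let n = 0;
      for (const b of bytes) {
        if (b < 0x80) {
          out[n++] = b;                  // ASCII passes through unchanged
        } else {
          out[n++] = 0xC0 | (b >> 6);    // lead byte
          out[n++] = 0x80 | (b & 0x3F);  // continuation byte (10xxxxxx)
        }
      }
      return out.subarray(0, n);
    }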
Otherwise known as "Making other people's incompetence and inability to implement a specification your problem." Just because it's a widely quoted maxim doesn't make it good advice.
The spec may disagree, but webservers do sometimes send bytes outside the ASCII range, and the most sensible way to deal with that on the receiving side is still by treating them as latin1 to match (last I checked) what browsers do with it.
I do agree that latin1 headers shouldn't be _sent_ out though.
Only because those websites include `<meta charset="utf-8">`. Browsers don't use utf-8 unless you tell them to, so we tell them to. But there's an entire internet archive's worth of pages that don't tell them to.
Not including charset="utf-8" doesn't mean that the website is not UTF-8. Do you have a source on a significant percentage of websites being Latin-1 while omitting the charset declaration? I don't believe that's the case.
> Browsers don't use utf-8 unless you tell them to
This is wrong. You can prove this very easily by creating an HTML file with UTF-8 text while omitting the charset. It will render correctly.
Answering your "do you have a source" question, yeah: "the entire history of the web prior to HTML5's release", which the internet has already forgotten is a rather recent thing (2008). And even then, it took a while for HTML5 to become the de facto format, because it took the majority of the web years before they'd changed over their tooling from HTML 4.01 to HTML5.
> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text
No, but I will create an HTML file with latin-1 text, because that's what we're discussing: HTML files that don't use UTF-8 (and so by definition don't contain UTF-8 either).
While modern browsers will guess the encoding by examining the content, if you make an html file that just has plain text, then it won't magically convert it to UTF-8: create a file with `<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>` in it. Load it in your browser through an http server (e.g. `python -m http.server`), and then hit up the dev tools console and look at `document.characterSet`.
Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."
That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.
<html>ä
<!DOCTYPE html><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä
Type data:text/html,<html> into your url bar and inspect that. It avoids the server messing with encoding values. And yes, here on my linux machine in firefox it is windows-1252 too.
(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)
But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.
Well, I'm on Linux - system encoding set to UTF-8 which is pretty much standard there.
But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.
The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.
I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252
edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.
Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today are using UTF-8 encoding, and Latin-1 is being phased out.
Btw I appreciate your edited response, but still you were factually incorrect about:
> Browsers don't use utf-8 unless you tell them to
Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.
> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8
You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252
> You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252
I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.
To begin, we have three '.html' files, one with ASCII only characters, a second file with two separate characters that are specifically latin1 encoded, and a third with those same latin1 characters but encoded using UTF-8. Those two characters are:
Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB - UTF-8 encoding: 0xC3 0x8B - https://www.compart.com/en/unicode/U+00CB
¥ - "Yen Sign" - Latin1 encoding: 0xA5 - UTF-8 encoding: 0xC2 0xA5 - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.
The full contents of my current folder are as follows:
$ ls -a .
. .. ascii.html latinone.html utf8.html
Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that all responses from the HTTP server do not contain a header implying the content type; we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check to make sure it's not being well intentioned and guessing the content type:
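One way to spot-check this is to fetch one of the files and look at the response headers (a sketch of the check, e.g. from a Node 18+ REPL; any similar check works):

    const res = await fetch('http://127.0.0.1:8000/latinone.html');
    console.log(res.status, res.headers.get('content-type'));
    // If this prints a charset parameter, the server is already hinting an
    // encoding and the browser's own guessing would never come into play.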
Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.
Firefox (v116.0.3):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
Chromium (v115.0.5790.170):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "macintosh"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
So in my testing, neither browser EVER guesses that any of these pages are UTF-8. Both seem to default to assuming that if no content type is set in the document or in the headers, then the encoding is "windows-1252" (bar Chromium and the Latin1 characters, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".
> Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
While technically latin-1/iso-8859-1 is a different encoding than windows-1252, html5 spec says browsers are supposed to treat latin1 as windows-1252.
The following .html file encoded in UTF-8, when loaded from disk in Google Chrome (so no server headers hinting anything), yields document.characterSet == "UTF-8". If you make it "a" instead of "ä" it becomes "windows-1252".
<html>ä
This renders correctly in Chrome and does not show mojibake as you might have expected from old browsers. Explicitly specifying a character set just ensures you're not relying on the browser's heuristics.
There may be a difference here between local and network, as well as if the multi-byte utf-8 character appears in the first 1024 bytes or how much network delay there is before that character appears.
The original claim was that browsers don't ever use UTF-8 unless you specify it. Then ko27 provided a counterexample that clearly shows that a browser can choose UTF-8 without you specifying it. You then said "I'm pretty sure this is incorrect"--which part? ko27's counterexample is correct; I tried it and it renders correctly as ko27 said. If you do it, the browser does choose UTF-8. I'm not sure where you're going with this now. This was a minimal counterexample for a narrow claim.
I think when most people say "web browsers do x" they mean when browsing the world wide web.
My (intended) claim is that in practice the statement is almost always untrue. There may be weird edge cases when loading from local disk where it is true sometimes, but not in a way that web developers will usually ever encounter, since you don't put websites on local disk.
This part of the html5 spec isn't binding, so who knows what different browsers do, but it is a recommendation of the spec that browsers should handle the charset of documents differently depending on whether they are on local disk or from the internet.
To quote: "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content." https://html.spec.whatwg.org/multipage/parsing.html#determin...
Fair enough. I intended only to test the specific narrow claim OP made that you had quoted, which seemed to be about a local file test. This shows it is technically true that browsers are capable of detecting UTF-8, but only in one narrow situation and not the one that's most interesting.
Be careful, since at least Chrome may choose a different charset if loading a file from disk versus from an HTTP URL (yes, this has tripped me up more than once).
I've observed Chrome to usually default to windows-1252 (latin1) for UTF-8 documents loaded from the network.
Be aware of the WHATWG Encoding specification [1], which says that latin1, ISO-8859-1, etc. are aliases of the windows-1252 encoding, not the proper latin1 encoding. As a result, browsers and operating systems will display those files differently! It also aliases the ASCII encoding to windows-1252.
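You can see the aliasing directly in a browser console:

    // Per the WHATWG Encoding spec, all of these labels select windows-1252:
    new TextDecoder('latin1').encoding;      // "windows-1252"
    new TextDecoder('iso-8859-1').encoding;  // "windows-1252"
    new TextDecoder('ascii').encoding;       // "windows-1252"
    // Byte 0x80 is U+0080 in real ISO-8859-1, but windows-1252 maps it to '€':
    new TextDecoder('latin1').decode(new Uint8Array([0x80]));  // "€"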
That's not what your linked spec says. You can try it yourself, in any browser. If you omit the encoding the browser uses heuristics to guess, but it will always work if you write UTF-8 even without meta charset or encoding header.
I don't doubt browsers use heuristics. But spec-wise I think it's your turn to provide a reference in favour of a utf-8-is-default interpretation :)
The WHATWG HTML spec [1] has various heuristics it uses/specifies for detecting the character encoding.
In point 8, it says an implementation may use heuristics to detect the encoding. It has a note which states:
> The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective.
In point 9, the implementation can return an implementation or user-defined encoding. Here, it suggests a locale-based default encoding, including windows-1252 for "en".
As such, implementations may be capable of detecting/defaulting to UTF-8, but are equally likely to default to windows-1252, Shift_JIS, or other encoding.
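To make the "highly detectable bit pattern" concrete, here's a minimal sketch of a UTF-8 well-formedness check (deliberately simplified; it ignores overlong forms and surrogates):

    function looksLikeUtf8(bytes) {
      let i = 0;
      while (i < bytes.length) {
        const b = bytes[i];
        let extra;
        if (b < 0x80) extra = 0;                   // ASCII
        else if ((b & 0xE0) === 0xC0) extra = 1;   // 110xxxxx
        else if ((b & 0xF0) === 0xE0) extra = 2;   // 1110xxxx
        else if ((b & 0xF8) === 0xF0) extra = 3;   // 11110xxx
        else return false;                         // invalid lead byte
        for (let j = 1; j <= extra; j++) {
          if (((bytes[i + j] ?? 0) & 0xC0) !== 0x80) return false;  // need 10xxxxxx
        }
        i += extra + 1;
      }
      return true;
    }
    // Random Latin-1 text with accented characters almost never satisfies this,
    // which is why sniffing UTF-8 works so well when the whole file is available.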
No it isn't. My original point is that Latin-1 is used very rarely on Internet and is being phased out. Now it's your turn to provide some references that a significant percentage of websites are omitting encoding (which is required by spec!) and using Latin-1.
> UTF-8 is the default character encoding for HTML5. However, it was used to be different. ASCII was the character set before it. And the ISO-8859-1 was the default character set from HTML 2.0 till HTML 4.01.
> My original point is that Latin-1 is used very rarely on Internet and is being phased out.
Nobody disagrees with this, but this is a very different statement from what you said originally in regards to what the default is. Things can be phased out but still have the old default with no plan to change the default.
Re other sources - how about citing the actual spec instead of sketchy websites that seem likely to have incorrect information.
In countries communicating in non-English languages which are written in the latin script, there is a very large use of Latin-1. Even when Latin-1 is "phased out", there are tons and tons of documents and databases encoded in Latin-1, not to mention millions of ill-configured terminals.