
> Latin 1 standard is still in widespread use inside some systems (such as browsers)

That doesn't seem to be correct. UTF-8 is used by 98% of all websites. I am not sure if it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.

https://w3techs.com/technologies/details/en-utf8



One place I know where latin1 is still used is as an internal optimization in javascript engines. JS strings are composed of 16-bit values, but the vast majority of strings are ascii. So there's a motivation to store simpler strings using 1 byte per char.

However, once that optimization has been decided, there's no point in leaving the high bit unused, so the engines keep optimized "1-byte char" strings as Latin1.


>So there's a motivation to store simpler strings using 1 byte per char.

What advantage would this have over UTF-7, especially since the upper 128 characters wouldn't match their Unicode values?


> What advantage would this have over UTF-7, especially since the upper 128 characters wouldn't match their Unicode values?

(I'm going to assume you mean UTF-8 here rather than UTF-7, since UTF-7 is not really useful for anything; it's just a way to pack Unicode into only 7-bit ASCII characters.)

Fixed width string encodings like Latin-1 let you directly index to a particular character (code point) within a string without having to iterate from the beginning of the string.

JavaScript was originally specified in terms of UCS-2, a 16-bit fixed-width encoding that was commonly used at the time in both Windows and Java. However, there are more than 64k characters across all the world's languages, so it eventually evolved to UTF-16, which allows for wide characters.

However, because of this history, indexing into a JavaScript string gives you a 16-bit code unit, which may be only part of a wide character. A string's length is defined in terms of 16-bit code units, but iterating over a string gives you full characters.
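
For example, here's a quick console illustration using a character outside the Basic Multilingual Plane:

    const s = "𝄞";      // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
    s.length;           // 2 -- length counts 16-bit code units
    s.charCodeAt(0);    // 0xD834 -- a lone high surrogate, not a full character
    s.codePointAt(0);   // 0x1D11E -- the full code point
    [...s].length;      // 1 -- iteration yields whole code points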

Using Latin-1 as an optimisation allows JavaScript to preserve the same semantics around indexing and length. While it does require translating 8 bit Latin-1 character codes to 16 bit code points, this can be done very quickly through a lookup table. This would not be possible with UTF-8 since it is not fixed width.

EDIT: A lookup table may not be required. I was confused by new TextDecoder('latin1') actually using windows-1252.

More modern languages just use UTF-8 everywhere because it uses less space on average and UTF-16 doesn't save you from having to deal with wide characters.


Latin1 does match the Unicode values (0-255).


Java nowadays does the same.


perl5 also does the same


And yet HTTP/1.1 headers should be sent in Latin1 (is this fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has special handling for Latin1 strings in JS, for performance reasons I assume.


> should be sent in Latin1

Do you have a source on that "should" part? Because the spec disagrees https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4:

> Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets.

In practice and by spec, HTTP headers should be ASCII encoded.


ISO-8859-1 (aka. Latin-1) is a superset of ASCII, so all ASCII strings are also valid Latin-1 strings.

The section you quoted actually suggests that implementations should support ISO-8859-1 to ensure compatibility with systems that use it.


You should read it again

> Newly defined header fields SHOULD limit their field values to US-ASCII octets

ASCII octets! That means you SHOULD NOT send Latin1-encoded headers, which is the opposite of what pzmarzly was saying. I don't disagree that Latin-1 is a superset of ASCII or that backward compatibility is worth keeping in mind, but that's not relevant to my response.


SHOULD is a recommendation, not a requirement, and it refers only to newly-defined header fields, not existing ones. The text implies that 8-bit characters in existing fields are to be interpreted as ISO-8859-1.


There is an RFC (2119) that specifies what SHOULD means in RFCs:

> SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

https://datatracker.ietf.org/doc/html/rfc2119


Haven't you heard of Postel's Maxim?

Web servers need to be able to receive and decode latin1 into utf-8 regardless of what the RFC recommends people send. The fact that it's going to become rarer over time to have the 8th bit set in headers means you can write a simpler algorithm than what Lemire did that assumes an ASCII average case. https://github.com/jart/cosmopolitan/blob/755ae64e73ef5ef7d1... That goes 23 GB/s on my machine using just SSE2 (rather than AVX512). However, it goes much slower if the text is full of European diacritics. Lemire's algorithm is better at decoding those.
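
For anyone curious, the underlying transform itself is tiny; here's a naive scalar sketch in JavaScript (just an illustration, not the SIMD code linked above). Bytes below 0x80 pass through unchanged, and everything else becomes a two-byte UTF-8 sequence, which works because Latin-1 byte values are equal to the corresponding Unicode code points.

    // Naive scalar Latin-1 -> UTF-8 transcode (illustrative sketch only).
    function latin1ToUtf8(bytes) {
      const out = [];
      for (const b of bytes) {
        if (b < 0x80) {
          out.push(b);                                   // ASCII fast path: copy as-is
        } else {
          out.push(0xC0 | (b >> 6), 0x80 | (b & 0x3F));  // two-byte UTF-8 sequence
        }
      }
      return Uint8Array.from(out);
    }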


>Haven't you heard of Postel's Maxim?

Otherwise known as "Making other people's incompetence and inability to implement a specification your problem." Just because it's a widely quoted maxim doesn't make it good advice.


The spec may disagree, but webservers do sometimes send bytes outside the ASCII range, and the most sensible way to deal with that on the receiving side is still by treating them as latin1 to match (last I checked) what browsers do with it.

I do agree that latin1 headers shouldn't be _sent_ out though.


Only because those websites include `<meta charset="utf-8">`. Browsers don't use utf-8 unless you tell them to, so we tell them to. But there's an entire internet archive's worth of pages that don't tell them to.


Not including charset="utf-8" doesn't mean that the website is not UTF-8. Do you have a source on a significant percentage of websites being Latin-1 while omitting the charset declaration? I don't believe that's the case.

> Browsers don't use utf-8 unless you tell them to

This is wrong. You can prove this very easily by creating an HTML file with UTF-8 text while omitting the charset. It will render correctly.


Answering your "do you have a source" question, yeah: "the entire history of the web prior to HTML5's release", which the internet has already forgotten is a rather recent thing (2008). And even then, it took a while for HTML5 to become the de facto format, because it took the majority of the web years before they'd changed over their tooling from HTML 4.01 to HTML5.

> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text

No, but I will create an HTML file with latin-1 text, because that's what we're discussing: HTML files that don't use UTF-8 (and so by definition don't contain UTF-8 either).

While modern browsers will guess the encoding by examining the content, if you make an html file that just has plain text, then it won't magically convert it to UTF-8: create a file with `<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>` in it. Load it in your browser through an http server (e.g. `python -m http.server`), and then hit up the dev tools console and look at `document.characterSet`.

Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.


Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."

That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.

    <html>ä

    <!DOCTYPE html><html>ä

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä


A simpler test, FWIW... type:

   data:text/html,<html>
into your URL bar and inspect that. This avoids the server messing with encoding values. And yes, here on my Linux machine in Firefox it is windows-1252 too.

(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)

But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.


>data:text/html,<html contenteditable>

thank you, I learned a nice trick today.

re windows-1252 - this could be driven by system encoding settings; for most people it is windows-1252, but for Eastern Europe it is windows-1251.

when viewed from an IBM Z mainframe, the encoding will be something like IBM EBCDIC


Well, I'm on Linux - system encoding set to UTF-8 which is pretty much standard there. But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.


The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.


I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252

edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.


Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today are using UTF-8 encoding, and Latin-1 is being phased out.

Btw I appreciate your edited response, but still you were factually incorrect about:

> Browsers don't use utf-8 unless you tell them to

Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.

> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8

You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252


> You are incorrect here as well, try using some latin-1 special character like "ä" and you will see that browsers default to document.characterSet UTF-8 not windows-1252

I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.

To begin, we have three '.html' files, one with ASCII only characters, a second file with two separate characters that are specifically latin1 encoded, and a third with those same latin1 characters but encoded using UTF-8. Those two characters are:

    Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB  - UTF-8 encoding: 0xC3 0x8B   - https://www.compart.com/en/unicode/U+00CB
    ¥ - "Yen Sign"                              - Latin1 encoding: 0xA5  - UTF-8 encoding: 0xC2 0xA5   - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.

    $ cat ascii.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b2041534349493c2f7469746c653e3c2f686561643e3c626f64793e
    3c68313e4e6f74206d75636820686572652c206a75737420706c61696e20
    746578743c2f68313e3c703e4d6f7265207465787420746861742773206e
    6f74207370656369616c3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat latinone.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b206c6174696e313c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e546869732069732061206c6174696e31206368617261637465
    7220307841353a20a53c2f68313e3c703e54686973206973206368617220
    307843423a20cb3c2f703e3c2f626f64793e3c2f68746d6c3e0a
    $ cat utf8.html | xxd -p
    3c68746d6c3e3c686561643e3c7469746c653e656e636f64696e67206368
    65636b207574663820203c2f7469746c653e3c2f686561643e3c626f6479
    3e3c68313e54686973206973206120757466382020206368617261637465
    7220307841353a20c2a53c2f68313e3c703e546869732069732063686172
    203078433338423a20c38b3c2f703e3c2f626f64793e3c2f68746d6c3e0a
The full contents of my current folder is as such:

    $ ls -a .
    .  ..  ascii.html  latinone.html  utf8.html
Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that all responses from the HTTP server do not contain a header implying the content type; we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check to make sure it's not being well intentioned and guessing the content type:

    $ curl -s -vvv 'http://127.0.0.1:8000/ascii.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /ascii.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/latinone.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /latinone.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html

    $ curl -s -vvv 'http://127.0.0.1:8000/utf8.html' 2>&1 | egrep -v -e 'Last|Length|^\*|^<html|^{|Date:|Agent|Host'
    > GET /utf8.html HTTP/1.1
    > Accept: */*
    >
    < HTTP/1.0 200 OK
    < Server: SimpleHTTP/0.6 Python/3.10.7
    < Content-type: text/html
Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.

    Firefox (v116.0.3):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

    Chromium (v115.0.5790.170):
        http://127.0.0.1:8000/ascii.html     result of `document.characterSet`: "windows-1252"
        http://127.0.0.1:8000/latinone.html  result of `document.characterSet`: "macintosh"
        http://127.0.0.1:8000/utf8.html      result of `document.characterSet`: "windows-1252"

So in my testing, neither browser EVER guesses that any of these pages are UTF-8; both seem to mostly default to assuming that if no content type is set in the document or in the headers, then the encoding is "windows-1252" (bar Chromium and the Latin1 characters, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".


> Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.

While latin-1/iso-8859-1 is technically a different encoding than windows-1252, the html5 spec says browsers are supposed to treat latin1 as windows-1252.


> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text while omitting the charset. It will render correctly.

I'm pretty sure this is incorrect.


The following .html file encoded in UTF-8, when loaded from disk in Google Chrome (so no server headers hinting anything), yields document.characterSet == "UTF-8". If you make it "a" instead of "ä" it becomes "windows-1252".

    <html>ä
This renders correctly in Chrome and does not show mojibake as you might have expected from old browsers. Explicitly specifying a character set just ensures you're not relying on the browser's heuristics.


There may be a difference here between local and network, as well as if the multi-byte utf-8 character appears in the first 1024 bytes or how much network delay there is before that character appears.


The original claim was that browsers don't ever use UTF-8 unless you specify it. Then ko27 provided a counterexample that clearly shows that a browser can choose UTF-8 without you specifying it. You then said "I'm pretty sure this is incorrect"--which part? ko27's counterexample is correct; I tried it and it renders correctly as ko27 said. If you do it, the browser does choose UTF-8. I'm not sure where you're going with this now. This was a minimal counterexample for a narrow claim.


I think when most people say "web browsers do x" they mean when browsing the world wide web.

My (intended) claim is that in practice the statement is almost always untrue. There may be weird edge cases when loading from local disk where it is true sometimes, but not in a way that web developers will usually ever encounter, since you don't put websites on local disk.

This part of the html5 spec isn't binding, so who knows what different browsers do, but it is a recommendation of the spec that browsers should handle the charset of documents differently depending on whether they are on local disk or from the internet.

To quote: "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content." https://html.spec.whatwg.org/multipage/parsing.html#determin...


Fair enough. I intended only to test the specific narrow claim OP made that you had quoted, which seemed to be about a local file test. This shows it is technically true that browsers are capable of detecting UTF-8, but only in one narrow situation and not the one that's most interesting.

Indeed, in the Chromium source code we can see a special case for local files with some comment explanation. https://github.com/chromium/chromium/blob/dea8b2608dd5d95e38...


Be careful, since at least Chrome may choose a different charset when loading a file from disk versus from an HTTP URL (yes, this has tripped me up more than once).

I've observed Chrome to usually default to windows-1252 (latin1) for UTF-8 documents loaded from the network.


It's the default HTTP character set. It's not clear whether the above stat page is about what charsets are explicitly specified.

Also, HTTP headers (mostly relevant for the header values) are, I think, ISO-8859-1.


Be aware that the WHATWG Encoding specification [1] says that latin1, ISO-8859-1, etc. are aliases of the windows-1252 encoding, not the proper latin1 encoding. As a result, browsers and operating systems will display those files differently! It also aliases the ASCII encoding to windows-1252.

[1] https://encoding.spec.whatwg.org/#names-and-labels
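
A quick way to see the aliasing from JavaScript (my own illustration; byte 0x80 is a control character in real ISO-8859-1 but the euro sign in windows-1252):

    // Both labels map to the windows-1252 decoder, so byte 0x80
    // comes back as U+20AC (euro sign) rather than U+0080.
    new TextDecoder('latin1').decode(Uint8Array.of(0x80));      // "€"
    new TextDecoder('iso-8859-1').decode(Uint8Array.of(0x80));  // "€"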


Since HTML5, UTF-8 is the default charset. And for headers, they are parsed as ASCII-encoded in almost all cases, although ISO-8859-1 is supported.


I tried to find confirmation of this but found only: https://html.spec.whatwg.org/multipage/semantics.html#charse...

> The Encoding standard requires use of the UTF-8 character encoding and requires use of the "utf-8" encoding label to identify it. Those

Sounds to me like it tells you that you have to explicitly declare the charset as UTF-8, so you don't get the HTTP default of Latin-1.

(But that's just one "living standard" not exactly synonymous with HTML5, and it might change, or might have been different last week..)


> so you don't get the HTTP default of Latin-1.

That's not what your linked spec says. You can try it yourself, in any browser. If you omit the encoding the browser uses heuristics to guess, but it will always work if you write UTF-8 even without meta charset or encoding header.


I don't doubt browsers use heuristics. But spec-wise I think it's your turn to provide a reference in favour of a utf-8-is-default interpretation :)


The WHATWG HTML spec [1] has various heuristics it uses/specifies for detecting the character encoding.

In point 8, it says an implementation may use heuristics to detect the encoding. It has a note which states:

> The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective.

In point 9, the implementation can return an implementation or user-defined encoding. Here, it suggests a locale-based default encoding, including windows-1252 for "en".

As such, implementations may be capable of detecting/defaulting to UTF-8, but are equally likely to default to windows-1252, Shift_JIS, or another encoding.

[1] https://html.spec.whatwg.org/#determining-the-character-enco...
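
To play with that "highly detectable bit pattern" point yourself, one crude check (sketched here with TextDecoder, not what browser engines actually do internally) is to attempt a strict UTF-8 decode and see whether it throws:

    // Crude UTF-8 sniff: a fatal decode throws on any byte sequence
    // that doesn't follow the UTF-8 pattern.
    function looksLikeUtf8(bytes) {
      try {
        new TextDecoder('utf-8', { fatal: true }).decode(bytes);
        return true;
      } catch {
        return false;
      }
    }

    looksLikeUtf8(Uint8Array.of(0xC3, 0xA4)); // true  ("ä" encoded as UTF-8)
    looksLikeUtf8(Uint8Array.of(0xE4));       // false ("ä" encoded as Latin-1 / windows-1252)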


No it isn't. My original point is that Latin-1 is used very rarely on Internet and is being phased out. Now it's your turn to provide some references that a significant percentage of websites are omitting encoding (which is required by spec!) and using Latin-1.

But if you insist, here is this quote:

https://www.w3docs.com/learn-html/html-character-sets.html

> UTF-8 is the default character encoding for HTML5. However, it was used to be different. ASCII was the character set before it. And the ISO-8859-1 was the default character set from HTML 2.0 till HTML 4.01.

or another:

https://www.dofactory.com/html/charset

> If a web page starts with <!DOCTYPE html> (which indicates HTML5), then the above meta tag is optional, because the default for HTML5 is UTF-8.


> My original point is that Latin-1 is used very rarely on Internet and is being phased out.

Nobody disagrees with this, but this is a very different statement from what you said originally in regards to what the default is. Things can be phased out but still have the old default with no plan to change the default.

Re other sources - how about citing the actual spec instead of sketchy websites that seem likely to have incorrect information.


In countries communicating in non-English languages written in the Latin script, there is still very large use of Latin-1. Even when Latin-1 is "phased out", there are tons and tons of documents and databases encoded in Latin-1, not to mention millions of ill-configured terminals.

I think it makes total sense to implement this.



