Well that definitely takes the 𝕡𝕣𝕚𝕫𝕖 for most noticeable Hacker News submission.
Suggestion (if you are the author): there are a lot of chars that look like other chars often used on the web, so I think there are more advanced versions to be made. I think I read that a lot of Thai signs and Cyrillic look like Latin chars.
Russian government officials are obliged to put all their purchases on the online tender platform.
So they are using this trick (but in the opposite direction, latin `a` instead of cyrillic `а`) to keep undesired competitors from entering those biddings and lowering the purchase prices (and not paying kickbacks, obviously).
https://navalny-en.livejournal.com/52565.html
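To see why a plain-text search misses it: the two strings are different code point sequences even though they render identically. A toy sketch in Python (the word is my own example, not from the tenders):

    real = "машина"   # all Cyrillic
    fake = "мaшина"   # Latin 'a' smuggled in as the second letter
    print(real == fake)                          # False: searching for one misses the other
    print(hex(ord(real[1])), hex(ord(fake[1])))  # 0x430 vs 0x61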
or having your variable names in _𝕱𝖗𝖆𝖐𝖙𝖚𝖗, which might be more apparent but nonetheless annoying. That'd make a nice useless language though.
I've always found attempts at germanization of subjects where English is the lingua franca incredibly amusing. Further germanization of German words, such as the conversion of "Nase" to "𝕲𝖊𝖘𝖎𝖈𝖍𝖙𝖘𝖊𝖗𝖐𝖊𝖗", is also at least worth a chuckle despite the solemn background that spawned the movement.
Google translate doesn't seem to do well with those characters ... could someone please help with "𝕭𝖊𝖌𝖗ü𝖘𝖘𝖚𝖓𝖌𝖘𝖆𝖓𝖟𝖊𝖎𝖌𝖊𝖇𝖊𝖉𝖎𝖊𝖓𝖒𝖊𝖈𝖍𝖆𝖓𝖎𝖘𝖒𝖚𝖘".
I remember my German teacher struggling to get the class to remember Schwarzwรคlder Kirschtorte (admittedly two words). So she taught us Vierwaldstรคtterseedampfschiffgesellschaftskapitรคnsmรผtzensternlein instead. After that Schwarzwรคlder Kirschtorte was easy.
This is now my favorite code snippet. I didn't have one before. Love "Begrรผssungsanzeigebedienmechanismus" and the hopelessly verbose way it was implemented.
Too bad the source code of that beautiful toy is nowhere to be found - I'd gladly provide a patch that teaches it about the umlauts which it unfortunately left alone in this piece of art you created here <3
It's trivial to dump the tables at least. Just enter all printable ascii characters :). The umlauts would be handled by first fully decomposing the string down to letters + combining characters, right?
I have a tool to make this text, though I'll admit I never even thought about decomposing inputs like รผ and then recomposing them after Fraktur-izing.
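Roughly like this, presumably - a minimal Python sketch (not the site's actual code), mapping ASCII onto the Mathematical Bold Fraktur block and decomposing first so umlauts survive as combining marks:

    import unicodedata

    CAP, SMALL = 0x1D56C, 0x1D586  # 𝕬 and 𝖆, the bold Fraktur blocks

    def frakturize(text):
        out = []
        for ch in unicodedata.normalize("NFD", text):  # ü becomes u + U+0308
            if "A" <= ch <= "Z":
                out.append(chr(CAP + ord(ch) - ord("A")))
            elif "a" <= ch <= "z":
                out.append(chr(SMALL + ord(ch) - ord("a")))
            else:
                out.append(ch)  # combining marks and punctuation pass through
        return "".join(out)

    print(frakturize("Begrüssungsanzeigebedienmechanismus"))

There is no precomposed Fraktur-u-with-diaeresis, so the U+0308 simply rides on the 𝖚, which is exactly the "teach it about umlauts" behavior.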
Sad thing is, Unicode still doesn't seem to properly support titlos and (not so sad, since personally I think Unicode shouldn't really do anything with fonts unless absolutely necessary) has no separate characters for Ustav and Poluustav scripts.
Oh it can get much much worse... have a look at the Greek question mark: "[...] canonically decomposes to U+003B ; semicolon making the marks identical in practice." [1]
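Python's unicodedata shows it (U+037E is the Greek question mark):

    import unicodedata

    print("\u037e" == ";")                                # False: distinct code points
    print(unicodedata.normalize("NFD", "\u037e") == ";")  # True: decomposes to a semicolon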
If you happen to use cyrillic in your source code (for comments or even strings) and constantly switch between latin and cyrillic, then this actually happens with the "c" letter, because both the latin and cyrillic "c" occupy the same key. And that's not fun, btw.
Depends on which keyboard layout you use, of course.
Russian is my first language, but English is my primary language, and I never had a chance to practice typing on the standard Russian keyboard layout, so I almost always use the "Phonetic" layout - where the latin c is the cyrillic с. (Also, w is ш, and who the hell remembers what []\-= map to - it's always trial and error for me to find ж and the rest.)
Cyrillic, sure. But Thai? Their alphabet is credited to one พ่อขุนรามคำแหงมหาราช. I've never thought there was any resemblance between Thai symbols and Latin ones, but... judge for yourself, I guess?
Would you really mind if I said that the Greek alphabet was credited to one Κάδμος? We know that's not true, but it doesn't change the legend (and indeed, the legend of Cadmus explicitly states that the Greek alphabet was derived from the Phoenician one...).
Both Thai and Khmer are Indic abugida scripts that derive (just like Burmese, Lao, Sinhalese, Balinese, etc.) from Brahmi. Claiming any of these scripts is one person's work is displaying abject ignorance of one of the most significant families of writing in human history.
These are called homoglyphs, right? I remember reading an article about phishing that used these characters to register almost perfect-looking domain names.
Great multiline stuff.
Could be improved by using the actual U+2212 minus sign −, not - (U+002D HYPHEN-MINUS), when getting super pedantic.
Did something like this last week making extensive use of unicode block 1D400 and different space widths. http://math.typeit.org/ helped as well.
│if │> (+ a b)│ │case x │ │cond ││
│ │ (- c d)│ │ (1 'foo)│ │ ((> y 2) 'quux) ││
│ ││ (2 'bar)│ │ (t 'error)│ │
│ │ │ (3 'baz)│ │ │ │
...(hmmm. For some reason that looks better in my editor than on the webpage. Apparently a fixed width font isn't necessarily fixed when it comes to unicode).
Funny how it triggered a bug in Firefox. When the tab is unfocused, its title in the handle is "𝑼𝒏…", but when it gets the focus it becomes "𝑼<D835>…" (in a square box). The next codepoint is U+1D48F whose UTF-16 BE encoding is d8 35 dc 8f.
I'd say that the truncation algorithm operates on bytes and that it can't make sense of d8 35, but I'm not too sure how to fix that since graphemes can have arbitrary length (right?). Do you have to compute the width in advance?
>I'd say that the truncation algorithm operates on bytes
This seems likely, as another notable weirdness is that even with full width tabs, where there's plenty of space for at least "𝑼𝒏𝒊𝒄𝒐𝒅𝒆 𝑻𝒆𝒙𝒕...", it still only shows "𝑼𝒏𝒊𝒄𝒐...".
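If the bytes/UTF-16-units hypothesis is right, the fix is to truncate at code point (or better, grapheme) boundaries instead. A small Python illustration of the failure mode, assuming the title starts with 𝑼𝒏 as above:

    title = "\U0001D47C\U0001D48F"             # 𝑼𝒏: one code point each, two UTF-16 units each
    units = title.encode("utf-16-be")
    cut = units[:6]                            # three UTF-16 units: slices 𝒏's surrogate pair in half
    print(cut.decode("utf-16-be", "replace"))  # '𝑼' plus a replacement char, i.e. the lone d835
    print(title[:1])                           # slicing by code points keeps characters intact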
This is similar to pseudolocalization (þšéûðöļöçåļîžåţîöñ), which adds accents to English words to test the localization capabilities of a program without requiring knowledge of another language.
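A minimal sketch of the idea (the mapping table is mine, not from any particular tool): the substitutions are systematic, so the text stays readable while untranslated strings stand out, and the padding flushes out truncation bugs:

    ACCENTED = str.maketrans("aceinostuyACEINOSTUY",
                             "åçéîñöšţûýÅÇÉÎÑÖŠŢÛÝ")

    def pseudolocalize(s):
        return "[!! " + s.translate(ACCENTED) + " !!]"

    print(pseudolocalize("pseudolocalization"))  # [!! pšéûdölöçålîzåţîöñ !!]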
Hey! I was just thinking about this site, and visited it for the first time in years, after mentioning the old San Francisco ransom-font in another thread.
By randomly mixing these Unicode letter and letterlike characters, you can simulate a cut-and-paste ransom-note. For example, an acquired company could announce changes to its privacy policy:
wE ℏåve yøuR ρrIvᴀçy ⅈn a ᴡiNdøwleSs ℝoøm,
& ℙℓaℕ τø ∂o µnSπεaKᴀble †hiℕℊs tₒ ⅈt
I saw a thing recently where a unicode encoding trick was used in an oauth phishing scam -- using unicode characters, a scammer was able to make an oauth connector that looked like the real company but passed through the company's "if (oauthConnector.name.toLowerCase().contains('our name')) { throw new DenyError(); }" check.
Now, it's up for debate whether any (pseudo?) financial institution should offer full oauth access (at least without having a human review possible oauth connectors), but the point is, decorative hackernews submissions are the least malicious use of this trick.
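The failure mode is easy to reproduce; the names and the check here are hypothetical, in Python for consistency:

    import unicodedata

    def naive_deny(name):
        return "acme" in name.lower()  # the vendor's idea of a blocklist

    spoof = "Аcmе Connect"             # Cyrillic А (U+0410) and е (U+0435)
    print(naive_deny(spoof))           # False: sails right past the check

    folded = unicodedata.normalize("NFKC", spoof).casefold()
    print("acme" in folded)            # still False! NFKC does not touch
    # cross-script homoglyphs; for that you need a confusables table
    # along the lines of Unicode TR39.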
Go to your browser's menu bar, click 'View', go to 'Character Encoding', and select 'Western (ISO-8859-1)'. Now it's just garbage characters. (It's not reversed, but at least it's not bold?)
> if you are on a high DPI display, chrome just looks awful
I'm fairly sure this is no longer the case. Chrome is high-DPI aware on Windows now, and it uses DirectWrite for font rendering, the same as IE. It just can't display these characters for some reason.
Nope, the UI got an update too. It renders at high-DPI on Windows. Chrome on a high-DPI machine looks exactly the same as on a low-DPI machine, except sharper. It used to be plagued with issues, but I'm fairly sure they're all gone now. DirectWrite isn't perfect. It still has weird hinting and kerning at high-DPI with some fonts, but it's better than GDI.
I find Chrome better than IE, actually. IE ignores my DPI settings and scales pages to 250%, so everything looks too large. Chrome renders correctly at 200%.
On my Fedora box with Chrome, negative circled, squared, and negative squared don't show up, but everything else does. Firefox and Konqueror are the same, so I imagine it is a font issue.
This surprises me: what exactly is the point of encoding what are essentially different fonts in Unicode? Isn't that the job of the presentation layer?
(the Fraktur variant is awesome btw, and is apparently in the valid unicode range for Java...)
Personally I find it annoying how mathematical notation seems so intractable today. Things that are easily understood in code for me are a mystery in math notation. But I guess there will never be an overhaul with a more intuitive typography...
The book Structure and Interpretation of Classical Mechanics redefines some of the trickier parts of the standard mathematical notation, and does all of the actual computation in Scheme. They extended the standard Scheme interpreter/compiler to support algebraic manipulation of Scheme programs, which lets them do all of the higher-order computations in Scheme as well (things like transforming between coordinate systems, finding the derivative of a function, computing the Lagrange equations from partial derivatives, etc). Usually the proofs/derivations are shown in the modified standard notation, and then the resulting implementation is shown in Scheme.
I haven't finished the book (turns out I know less calculus than I thought), but the result is pretty effective. You're much less likely to get confused about which things are numbers and which are functions, and which of those functions operate on numbers and which ones operate on other functions, once you see the Scheme implementation of something.
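A taste of the style in a toy Python sketch (SICM's real system is Scheme and manipulates expressions symbolically; this one is merely numeric): the derivative is just a function that eats a function and returns one:

    def D(f, h=1e-6):
        return lambda x: (f(x + h) - f(x - h)) / (2 * h)

    square = lambda x: x * x
    print(D(square)(3.0))     # about 6.0: D(square) is itself a function
    print(D(D(square))(3.0))  # about 2.0: operators compose like any function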
In some cases, you might be reading poor-quality mathematical writing.
According to my generalization of some advice from Knuth:[1] in a good math text, definitions of terms are presented as they go along, and they are explicit about what means what. Furthermore, one of the factors that determines the quality of mathematical writing is
- Did you use words, especially for logical connectives, wherever you could have used them (instead of symbols) to express something?
and
> Try to state things twice, in complementary ways, especially when giving a definition. This reinforces the readerโs understanding. [...] All variables must be defined, at least informally, when they are first introduced.
This is repeated:
> Be careful to define symbols before you use them (or at least to define them very near where you use them).
There are some cases where "the general mathematical community is expected to know what you mean," like when publishing papers in some specialized field, but if you're writing a book, these rules hold quite true. Books certainly should explain their notation, especially since the general consensus for certain notations is expected to change over the decades ...
Keep in mind it is also true the other way around. Something can be mathematically clear to someone and a total mystery in code form. Everyone has their own strengths and weaknesses.
For some concepts that can be expressed in both code and math, I prefer the code notation because I can run it, and also make small tweaks and see what happens. For example, I got a better understanding of Lรถb's theorem [1] by translating the proof into Haskell [2].
If it can be coded, I prefer having both, or implementing the code; it helps in understanding the algorithm behind it. But maths is much larger than what can be coded, or than what is useful in code, so the only thing left is playing with toy examples ("coding" when working with really weird stuff).
I'd love to see more of APL (and a "larger" set of APL functions, actually) in use. The idea of a notation we could run directly is/was awesome.
Probably true, and I guess if you're a mathematician, you quickly get used to the symbols. And I'm not arguing against having those symbols in the first place; it's just that some of them have a 19th-century feel to them, and do not seem intuitive.
The art of typography and signage really only matured in the 20th century, and I'm certain some of the symbols would look very different if they were designed today. Anything that helps with teaching math and making it appear friendlier is a plus, imho.
I'm not sure which symbols you are hinting at. At first I thought of the Fraktur kind of letters, but that shouldn't be the case, as you point to "teaching" as a plus of redesigning them, and Fraktur symbols are used "traditionally" in relatively high-level algebra (for some reason some symbols are used more in some realms; for me, Fraktur started appearing in complicated material about ideals). Once you get used to them, it's like a second language, and that's it. I remember reading that Feynman used his own symbols for sin, cos and other basic functions (turning them into one-stroke symbols) but he had to give up once he had to talk with other people.
Math symbols are more or less a universal language. Once you know how a symbol appeared, or get used to "reading it right", they are totally natural. I don't see ∂ as a "weird d," I read it as "partial." It wasn't natural at first, but I got used to it, just like I got used to English.
It's like three-letter names in assembly. It's good when you're doing it, but step away from it for a while and you can't remember what the signs mean anymore.
For an enlightening read, buy a copy of the Unicode standard. An amazing book, containing what I think is the single greatest achievement in anthropology. And read about the history and the imperfect process that has produced a system with duplicates, inconsistencies, but a system nonetheless.
Since it wasn't mentioned here earlier, it's worth taking a look at Shapecatcher to see which glyphs might resemble latin letters.
Scribbling something resembling the latin capital letter A returns for example any of these codepoints: A𝐀Α𝛢А𝐴𝖺∀ДᎪᴬÅℵ4𝘼𝕬ΛĀ𝔄∧△Ѧᗩᗅ
One of my friends, moving to China for a semester to teach, was thinking of using a proper Chinese name to make it easier for students to address him. He had a good idea, even, which he shared on Facebook.
I proposed that we should name him after the lack of unicode support in our browsers, and we ended up calling him "Box Boxbox" for a couple of months.
Does anyone know why there are separate Unicode code points for letters in bold, bold italic and Fraktur? Normally this sort of thing should be handled by different fonts / font variants. Is it for compatibility with some legacy encoding?
I couldn't help but notice that this converter was copyrighted by Eli the Bearded. Google "Eli the Bearded", but not from work. You'll get some very interesting results.
I was once bilked into buying some scraped content as original work by this method. It passed copyscape, and my test of Googling a random sentence in quotes didn't bring anything up. I let it go because I had already accepted the work, and the lesson was worth more than the article anyway.
Don't be fooled as I was! Had I manually transcribed a sentence into Google instead of copying + pasting the Unicode chars, I would have found hundreds of copies of the same article.
In JavaScript, many unicode characters are allowed [0], so háčḱéřńéẅś is a valid variable name [1].
Note: The number of ѕіllуЅtуlєVаrіаьlєНамєѕ [2] used in your production code is inversely proportional to the number of friends you'll make in the maintenance team.
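Amusingly, Python went the other way (PEP 3131): identifiers are NFKC-normalized at parse time, so the mathematical-alphabet letters collapse back into plain ones instead of becoming distinct variables:

    ℌ = 42    # U+210C, black-letter capital H
    print(H)  # 42: 'ℌ' and 'H' are literally the same identifier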
What I need is something that takes all the extended characters (think Spanish or Swedish) and turns them into alternative safe versions.
For instance, á into a, ñ into n, å into a, etc.
Had my hopes up when I saw the title.
Does anyone have any ideas or links to working scripts that I can turn into something useful? I need to "sanitize" a database of foreign documentaries before uploading to YouTube (their metadata input system chokes on extended chars). Thanks!
When you say safe alternatives, you mean ASCII, right? You should look into something which also understands the characters a bit better. For example å, æ, ø can mostly be turned into aa, ae, oe for Danish and Norwegian. Just turning them into a, ?, o would change the meaning.
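A sketch combining both steps, with an intentionally tiny table (extend it per language; the digraph replacements must run before the generic strip):

    import unicodedata

    DIGRAPHS = {"å": "aa", "æ": "ae", "ø": "oe", "Å": "Aa", "Æ": "Ae", "Ø": "Oe",
                "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}  # da/no plus German rules

    def to_ascii(s):
        s = "".join(DIGRAPHS.get(ch, ch) for ch in s)
        s = unicodedata.normalize("NFKD", s)         # é -> e + combining accent
        return s.encode("ascii", "ignore").decode()  # drop what's left

    print(to_ascii("Vierwaldstättersee på Blu-ray"))  # Vierwaldstaettersee paa Blu-ray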
First is phonetic similarity. This is mostly just to allow users to be able to understand each other and to help automatically catch alternate latinizations so you find out "Hey, he already registered under a latinized-spelling name".
The second is glyph similarity. This is the security concern where you have two glyphs that are graphically similar but phonetically completely different, but can easily be mistaken for each other. These glyphs are used to trick and confuse users. The first kind of check won't catch these, but they're the reason we don't have unicode in domain names.
Probably a correct system would have a very liberal interpretation of glyph similarity and would treat strings as matched when they contain similar glyphs.
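That is essentially what Unicode TR39 calls a "skeleton": map every glyph to a canonical lookalike, then compare skeletons. A toy version with a hand-rolled table (the real confusables data is much bigger):

    CONFUSABLE = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c",
                  "х": "x", "0": "o", "1": "l"}  # tiny sample, Cyrillic -> Latin etc.

    def skeleton(s):
        return "".join(CONFUSABLE.get(ch, ch) for ch in s.casefold())

    print(skeleton("раypal") == skeleton("paypal"))  # True: flag as a likely match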
I made an iPhone app that does kind of the same thing, but converts letters to their upside-down unicode equivalent. It's fun for sending upside-down texts.
Would it be possible to use the new third party keyboard API in iOS8 to have a regular styled keyboard that types in an upside down fashion? This would allow the user to continue having the same input experience, but translate the output experience? Once confirmed this is possible, you could take OP's idea and apply as well.
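The underlying trick in the parent's app is presumably just a lookup table plus a reversal; a sketch (the app's actual table is its own):

    FLIP = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                         "ɐqɔpǝɟƃɥıɾʞlɯuodbɹsʇnʌʍxʎz")

    def upside_down(s):
        return s.lower().translate(FLIP)[::-1]

    print(upside_down("hello world"))  # plɹoʍ ollǝɥ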
Just a PSA for discoverability: since the replacement characters use different code points than their more standard equivalents, the default HN search (https://hn.algolia.com) at least doesn't find this submission when searching for "unicode."
Great, now we'll have to rely on IDEs with clickable drop-down lists of variables and function names because simple text input just got a lot harder for languages where Unicode is allowed for symbols!
Presumably, we are now in a situation where it is actually more difficult to learn computer programming if you happen to have had the misfortune to be born into a 'non-western' language and, to some extent, even a non-English one. That is an absurd situation, and it means that, as a collective species, we are wasting a huge amount of resources and potential. Definitely something we should look to resolve.
Having a drop-down for variables certainly isn't a solution, granted. Hopefully, there are some more sensible compromises - e.g. being able to specify a locale-dependent subset of unicode in your personal environment, appropriate use of metadata to describe the language of a file, etc.
Interesting; the title displayed OK minutes ago, on the main page, in Firefox/OSX. But now it's showing as unsupported-glyph boxes inside the page... but still looks OK in the titlebar of the item (comments) page.
Did some automated or administrative process mutate the characters? Or is this just Firefox drifting, in choice of font?
Strangely, for me on Firefox 33.1 on OS X, the title shows up fine on the main page. But when I click through to the comment, I get boxes only, and from then on, the main page also doesn't work anymore until I restart Firefox. I suspect an extension, but I'm not sure.
Also, strike-through. That's the one I find genuinely useful, because I like the suggestive way to say s̶o̶m̶e̶t̶h̶i̶n̶g̶ and then visibly correct to something else.
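(That one is just U+0336 COMBINING LONG STROKE OVERLAY after every character, e.g.:)

    def strike(s):
        return "".join(ch + "\u0336" for ch in s)

    print(strike("something"))  # s̶o̶m̶e̶t̶h̶i̶n̶g̶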
I just noticed that Chrome shows the title correctly in the tabs; I guess it's because it just uses the Windows unicode support there. But everywhere else it's not showing.
There's this great quote that anything that was fun when you were five is still fun when you're thirty five, and playing around with funky letters was certainly fun at the age of 5.
Different problem, but someone who knows about unicode will probably know this -
When I paste from Microsoft documents into PuTTY, characters will often be transformed into weird versions. Example - the emdash is a different character to '-'. It comes through as a weird tilde character instead of a dash. Mmm. Frustrating.
Is there a robust program you can run on the PuTTY side to catch such type and flatten it to ascii?
I use Linux but there are similar problems, I usually will paste text like that into sublime to remove all the special formatting, then re-copy paste it. I also found this stack overflow post, which mentions a program (puretext) that maps win+v to do a text only paste: http://stackoverflow.com/questions/122404/how-to-copy-and-pa...
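If you'd rather fix the text than the clipboard, flattening Word's "smart" punctuation is a small mapping job; a starting-point sketch:

    SMART = {"\u2014": "--", "\u2013": "-",  # em and en dash
             "\u2018": "'", "\u2019": "'",   # curly single quotes
             "\u201c": '"', "\u201d": '"',   # curly double quotes
             "\u2026": "...", "\u00a0": " "} # ellipsis, no-break space

    def flatten(s):
        return "".join(SMART.get(ch, ch) for ch in s)

    print(flatten("em\u2014dash and \u201cquotes\u201d"))  # em--dash and "quotes"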
Except when the site in question is completely broken wrt astral codepoints.
Which is unexpectedly common, as MySQL's "utf8" can't handle codepoints outside the BMP and will just truncate text at the first astral codepoint[0]. You need MySQL 5.5.3 (because adding a whole new encoding in a minor version makes perfect sense) and "utf8mb4" (because why would a codec called "utf8" actually do UTF-8?). And then the regexes are probably broken, because it's PHP and developers use neither UNICODE mode nor properties (PCRE's "\w" will not match all unicode letters, you need "\p{L}" for that; also note that e.g. "𝛁" is a symbol, not a letter, although "𝚹" is a letter).
MySQL is horrible for all the same reasons PHP is horrible, and this applies to Unicode too, except PHP is actually trying to fix its Unicode problems (UTF8 is the default now, moves towards adding a UString class), while MySQL isn't fixing them.
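The letter-versus-symbol point above is easy to check from a Python prompt (same two characters):

    import unicodedata

    print(unicodedata.category("\U0001D6B9"))  # Lu: 𝚹 is a letter
    print(unicodedata.category("\U0001D6C1"))  # Sm: 𝛁 is a math symbol
    # so \p{L} matches the former but not the latter (in PCRE, or via
    # Python's third-party 'regex' module; the stdlib 're' has no \p).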
I've never been a fan of this sort of thing. The Unicode characters in these font blocks are not letters for making words; at least the double-struck, fraktur, bold, italic, and bold italics are semantically for use in mathematical equations.
This can have some strange effects if you try to use them like letters. Example: What's the lowercase transform of 𝑼? 𝑼! Not 𝒖.
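You can watch the missing case mapping from Python, assuming 𝑼 is U+1D47C:

    print("\U0001D47C".lower())                  # 𝑼: unchanged, no case mapping defined
    print("\U0001D47C".lower() == "\U0001D496")  # False: 𝒖 is never produced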
Someone submitting a patch to an open-source program (in Ruby) with an NBSP somewhere that changes the program logic or something. (a<NBSP>or<NBSP>b, where earlier you did a<NBSP>or<NBSP>b=x, or something similar, is the first example that comes to mind.)
Well it does / should make people rethink allowing UTF-8 by default in user-generated content. I wonder if the stuff generated by http://www.eeemo.net/ works here:
This comment has a strange behavior in Firefox, which is not surprising, but it's probably a bug:
When I scroll to this comment, no characters stick out of the comment box, but when I switch back to this page from another tab, the characters overflow the comment box.
I don't really speak/read Russian, but I have a passable understanding of Cyrillic, and those always look dumb. It doesn't look like "the" to me, it looks like "guh-buh-yeh" or something.
Finally a way to express myself on facebook properly ;) I wonder if bold text would lead to better conversion for ads using this trick. And I wonder when facebook is going to ban this, because obviously it works :)
Which makes sense, as fullwidth is likely to be accidentally typed when using a Chinese/Japanese/Korean IME, and is entirely equivalent to normal characters, it just fits in with CJK text layouts better.
This has been a ⓣⓗⓘⓝⓖ [thing] for quite some time - guess it might be making a comeback. I've seen zalgo (http://knowyourmeme.com/memes/zalgo NSFW; Z̪̰A̶̬̯̥L̻G̢̣O [Zalgo] generator http://www.eeemo.net/) and flip and reverse text live on my Facebook in the past at least.
I've used this page for a long time. Ｗｒｉｔｉｎｇ ｔｈｉｎｇｓ ｉｎ ｆｕｌｌｗｉｄｔｈ ｕｎｉｃｏｄｅ ｉｓ ｇｒｅａｔ ｆｕｎ
The question I have is, what's the easiest way to strip this 𝙹𝚄𝙽𝙺 out of unicode strings submitted by web users? With a nod to Cunningham's Law, surely the right answer is a regular expression?
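Arguably the answer isn't a regex at all: the math-alphabet letters are compatibility characters, so NFKC normalization folds them (and fullwidth forms) back to plain ASCII. A sketch:

    import unicodedata

    def unfancy(s):
        return unicodedata.normalize("NFKC", s)

    print(unfancy("\U0001D679\U0001D684\U0001D67D\U0001D67A"))  # JUNK
    print(unfancy("Ｗｈｙ"))                                    # Why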
Unicode generally includes these things because an older encoding did, in the name of roundtrip compatibility. I expect some older font encoding did it to cater to people who need more than 26 symbols in their maths papers. Let 𝔄 be the...
And yet the Unicode consortium went with Han unification, which is still blocking adoption for a significant potential userbase (pretty much any software that needs to display Japanese names).
I went to a unicode meeting about a decade ago, and asked one of the luminaries over beer one night. He told me that they did some practical research, including reading newspapers and talking to editors. In Japan they would ask questions like "I see that you mention Shanghai in today's paper, and you use Japanese glyphs for the city's name, not the same as Chinese newspaper use. Why?". The answer was generally "that's how we write Shanghai here" and out of that came Han unification.
I suspect that if you could find a couple of mainstream publishers in Taiwan or Japan that prefer to print the names of mainland Chinese using the same glyphs as are used on mainland China, instead of the glyphs used in Taiwan or in Japan, you might be able to reopen the discussion of Han unification.
Or even better: a directive from someone's ministry of education decreeing de-unified Han in school books, so at least one country's pupils would actually learn to read de-unified Han.
Now wouldn't that be fun: "When history textbooks cover the civil war of 1927-50, they shall use traditional Chinese for the names of then KMT-held cities and simplified Chinese for the names of then communist-held cities."
Well, the original reasoning behind Han unification was the (horrendously impractical) idea of storing all of Unicode in 16 bits. Most of these characters were added later; you can tell because their codepoints are greater than U+FFFF.
They're not encoding different fonts. They're encoding distinct character forms, often necessary for historical texts and such. Some of these are actually symbols, too.