Hidden Messages in JavaScript Property Names


Translator's note: for testing, select and copy the code directly from the tweets.

I recently came across this tweet from <a href="https://twitter.com/FakeUnicode/">@FakeUnicode</a>. It contained a JavaScript snippet that looked fairly innocuous but displayed a hidden message. It took me a while to understand what was going on, and I think a record of the steps of my investigation may be of interest to others.

Here is the snippet:

[Image: the JavaScript snippet from the tweet]

What would you expect it to do?

Here a for...in loop iterates over the enumerable properties of the object. Since only the property A is specified, you might assume that a message with the letter A will be shown. Well... I was wrong. :D

[Image: the hidden message displayed by the snippet]

This surprised me, so I started debugging through the Chrome console.
Discovering hidden code points
The first thing I did was simplify the snippet.

for(A in {A:0}){console.log(A)};
// A
Hmm... okay, nothing unusual here. Let's go further.

for(A in {A:0}){console.log(escape(A))};
// A%uDB40%uDD6C%uDB40%uDD77%uDB40%uDD61%uDB40%uDD79%uDB40%uDD73%uDB40%uDD20%uDB40%uDD62%uDB40%uDD65%uDB40%uDD20%uDB40%uDD77%uDB40%uDD61%uDB40%uDD72%uDB40%uDD79%uDB40%uDD20%uDB40%uDD6F%uDB40%uDD66%uDB40%uDD20%uDB40%uDD4A%uDB40%uDD61%uDB40%uDD76%uDB40%uDD61%uDB40%uDD73%uDB40%uDD63%uDB40%uDD72%uDB40%uDD69%uDB40%uDD70%uDB40%uDD74%uDB40%uDD20%uDB40%uDD63%uDB40%uDD6F%uDB40%uDD6E%uDB40%uDD74%uDB40%uDD61%uDB40%uDD69%uDB40%uDD6E%uDB40%uDD69%uDB40%uDD6E%uDB40%uDD67%uDB40%uDD20%uDB40%uDD71%uDB40%uDD75%uDB40%uDD6F%uDB40%uDD74%uDB40%uDD65%uDB40%uDD73%uDB40%uDD2E%uDB40%uDD20%uDB40%uDD4E%uDB40%uDD6F%uDB40%uDD20%uDB40%uDD71%uDB40%uDD75%uDB40%uDD6F%uDB40%uDD74%uDB40%uDD65%uDB40%uDD73%uDB40%uDD20%uDB40%uDD3D%uDB40%uDD20%uDB40%uDD73%uDB40%uDD61%uDB40%uDD66%uDB40%uDD65%uDB40%uDD21
Mother of God! Where did this come from?

I had to take a step back and look at the length of the string.

[Image: the length of the string in the Chrome console]

Interesting. I then copied the A from the object and immediately noticed that the Chrome console was dealing with something hidden: the cursor got "stuck" and did not move for several left/right key presses.

But let's see what is inside and get the values of all 129 code units.

[Image: the values of all 129 code units in the console]

Here we see the letter A with the code unit value 65, followed by many code units around 56 thousand, which console.log displays as the familiar question mark symbol. This means that the system does not know how to render these code units on their own.
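
Since the screenshot is not reproduced here, the following is a minimal sketch of how such a list of code units can be produced. The string shortKey is a hypothetical, shortened stand-in for the real property name (the letter A plus only the first two hidden characters), written with escape sequences so that nothing invisible hides in the code.

```javascript
// Hypothetical stand-in for the property name: "A" plus two hidden characters.
const shortKey = 'A\u{E016C}\u{E0177}';

// Collect the numeric value of every UTF-16 code unit in the string.
const codeUnits = [];
for (let i = 0; i < shortKey.length; i++) {
  codeUnits.push(shortKey.charCodeAt(i));
}

console.log(shortKey.length); // 5
console.log(codeUnits);       // [ 65, 56128, 56684, 56128, 56695 ]
```

Each hidden character contributes two code units, which is why the full property name is so much longer than it looks.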
Surrogate pairs in JavaScript
These values are the halves of so-called surrogate pairs, which are used to represent code points whose values do not fit into 16 bits (i.e., code points greater than 65,535). This is necessary because Unicode itself defines 1,114,112 different code points, while JavaScript strings use UTF-16. That is, only the first 65,536 code points of Unicode can be represented by a single JavaScript code unit.

Higher values are calculated by applying a crazy formula to the pair, which results in a value greater than 65,535.
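
The formula is the standard UTF-16 decoding rule. As a small sketch (the function name surrogatePairToCodePoint is made up for this illustration), applied to the first pair from the escaped output above:

```javascript
// Combine a high and a low surrogate (two UTF-16 code units) into one code point.
function surrogatePairToCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const firstHidden = surrogatePairToCodePoint(0xDB40, 0xDD6C);
console.log(firstHidden);              // 917868
console.log(firstHidden.toString(16)); // e016c
```

In practice there is no need to do this by hand, because codePointAt performs the same calculation.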

Side note: I gave a talk on this topic, which may help you understand the concepts of code points, emoji, and surrogate pairs.

So, we found 129 code units, 128 of which form surrogate pairs representing 64 code points. So what are these code points?

To get the code point values from a string, there is the very convenient for...of loop, which iterates over the code points of a string (not over the code units, as the first for loop did), and also the spread operator ..., which uses for...of under the hood.

[Image: iterating over the code points of the string]
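
As a rough sketch of the difference between counting code units and iterating code points, here is the same idea on a shortened, hypothetical stand-in string (A plus two of the hidden characters, written as escape sequences):

```javascript
const sample = 'A\u{E016C}\u{E0177}';

// for...of visits one code point per iteration, even if it needs two code units.
const symbols = [];
for (const symbol of sample) {
  symbols.push(symbol);
}

console.log(sample.length);      // 5 code units
console.log(symbols.length);     // 3 code points
console.log([...sample].length); // 3, spread iterates the same way
```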

Because console.log does not know how to display these code points either, we need to find out what we are dealing with.

[Image: the code point values of the string]

Note: JavaScript has two methods for this, charCodeAt for code units and codePointAt for code points. They behave slightly differently, so take a look at both.
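
A short sketch of how the two methods differ, using the first hidden character on its own (the variable names are only for illustration):

```javascript
const hidden = '\u{E016C}'; // one code point, stored as two code units

console.log(hidden.charCodeAt(0));  // 56128, the high surrogate only
console.log(hidden.codePointAt(0)); // 917868, the full code point
console.log(hidden.codePointAt(1)); // 56684, index 1 is the lone low surrogate

// Reading whole code points from a string:
const codePoints = [...('A' + hidden)].map((symbol) => symbol.codePointAt(0));
console.log(codePoints); // [ 65, 917868 ]
```

Note that codePointAt still takes a code unit index, so combining it with for...of or spread avoids landing in the middle of a pair.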
Identifier names in JavaScript objects
The code points 917868, 917879 and the ones that follow are part of the Variation Selectors Supplement in Unicode. Variation selectors in Unicode are used to specify standardized variation sequences for mathematical symbols, emoji, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs. They are usually not meant to be used on their own.

Great, but why does this matter?

If you look at the ECMAScript specification, you will find that property identifier names can contain more than just "normal characters".

Identifier ::
  IdentifierName but not ReservedWord
IdentifierName ::
  IdentifierStart
  IdentifierName IdentifierPart
IdentifierStart ::
  UnicodeLetter
  $
  _
  \ UnicodeEscapeSequence
IdentifierPart ::
  IdentifierStart
  UnicodeCombiningMark
  UnicodeDigit
  UnicodeConnectorPunctuation
  <ZWNJ>
  <ZWJ>
As you can see, an IdentifierName consists of an IdentifierStart followed by any number of IdentifierPart characters. The important part is the definition of IdentifierPart. As long as they are not the first character of the identifier, all of the following are completely valid in identifier names:

const examples = {
  // UnicodeCombiningMark example
  somethingî: 'LATIN SMALL LETTER I WITH CIRCUMFLEX',
  somethingi\u0302: 'I + COMBINING CIRCUMFLEX ACCENT',

  // UnicodeDigit example
  something١: 'ARABIC-INDIC DIGIT ONE',
  something\u0661: 'ARABIC-INDIC DIGIT ONE',

  // UnicodeConnectorPunctuation example
  something﹍: 'DASHED LOW LINE',
  something\ufe4d: 'DASHED LOW LINE',

  // ZWJ and ZWNJ example
  something\u200c: 'ZERO WIDTH NON JOINER',
  something\u200d: 'ZERO WIDTH JOINER'
}
Evaluating this expression gives the following result:

{
  somethingî: "LATIN SMALL LETTER I WITH CIRCUMFLEX",
  somethingî: "I + COMBINING CIRCUMFLEX ACCENT",
  something١: "ARABIC-INDIC DIGIT ONE",
  something﹍: "DASHED LOW LINE",
  something‌: "ZERO WIDTH NON JOINER",
  something‍: "ZERO WIDTH JOINER"
}
This led me to my main discovery of the day.

According to the ECMAScript specification:

Two IdentifierName that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code units.

This means that two object keys can look exactly the same but consist of different code units, which means that both will be included in the object. That is the case with the character î, which is the single code unit 00ee, and the character i followed by a COMBINING CIRCUMFLEX ACCENT. They are not the same, so the object appears to contain duplicate properties. The same goes for keys with a trailing zero-width joiner or zero-width non-joiner. They look the same, but they are not!
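
To make this concrete, here is a small sketch with two such lookalike keys, built from escape sequences so that the difference is visible in the code. The variable names are only for illustration.

```javascript
const precomposed = 'something\u00ee'; // î as one code unit
const decomposed = 'somethingi\u0302'; // i + COMBINING CIRCUMFLEX ACCENT

const lookalikes = { [precomposed]: 'first', [decomposed]: 'second' };

console.log(Object.keys(lookalikes).length); // 2
console.log(precomposed === decomposed);     // false

// After Unicode normalization the two strings compare equal:
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true
```

Normalizing keys before using them is one way to avoid such accidental duplicates, although the engine itself never does this for you.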

But back to the topic: the variation selector values we found belong to the UnicodeCombiningMark category, which makes them valid in identifier names (even though they are invisible). They are invisible because the system will most likely only show a result when they are used in a valid combination.
The escape function and some string replacement
What the escape function does is go over all the code units and escape each one. That is, it takes the initial letter A and all the halves of the surrogate pairs and simply turns them into strings again. The invisible values are "stringified". This produces the long sequence that you saw at the beginning of the article.
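
UnicodeCombiningMark is the terminology of the older grammar quoted above; current editions of the specification express the same idea through the Unicode properties ID_Start and ID_Continue. Assuming an engine that supports Unicode property escapes in regular expressions (ES2018 or later), a quick check might look like this:

```javascript
const selector = '\u{E016C}'; // first hidden character of the property name

console.log(/^\p{ID_Continue}$/u.test(selector)); // true
console.log(/^\p{ID_Start}$/u.test(selector));    // false
console.log(/^\p{Mn}$/u.test(selector));          // true (nonspacing mark)

// The selector is therefore accepted inside a property name, just not as its first character.
const parsed = new Function('return {A\u{E016C}: 0}')();
console.log(Object.keys(parsed)[0].length); // 3
```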

A%uDB40%uDD6C%uDB40%uDD77%uDB40%uDD61%uDB40%uDD79%uDB40%uDD73%uDB40%uDD20%uDB40%uDD62%uDB40%uDD65%uDB40%uDD20%uDB40%uDD77%uDB40%uDD61%uDB40%uDD72%uDB40%uDD79%uDB40%uDD20%uDB40%uDD6F%uDB40%uDD66%uDB40%uDD20%uDB40%uDD4A%uDB40%uDD61%uDB40%uDD76%uDB40%uDD61%uDB40%uDD73%uDB40%uDD63%uDB40%uDD72%uDB40%uDD69%uDB40%uDD70%uDB40%uDD74%uDB40%uDD20%uDB40%uDD63%uDB40%uDD6F%uDB40%uDD6E%uDB40%uDD74%uDB40%uDD61%uDB40%uDD69%uDB40%uDD6E%uDB40%uDD69%uDB40%uDD6E%uDB40%uDD67%uDB40%uDD20%uDB40%uDD71%uDB40%uDD75%uDB40%uDD6F%uDB40%uDD74%uDB40%uDD65%uDB40%uDD73%uDB40%uDD2E%uDB40%uDD20%uDB40%uDD4E%uDB40%uDD6F%uDB40%uDD20%uDB40%uDD71%uDB40%uDD75%uDB40%uDD6F%uDB40%uDD74%uDB40%uDD65%uDB40%uDD73%uDB40%uDD20%uDB40%uDD3D%uDB40%uDD20%uDB40%uDD73%uDB40%uDD61%uDB40%uDD66%uDB40%uDD65%uDB40%uDD21

The trick is that @FakeUnicode picked specific variation selectors, namely those whose second code unit ends in a number that maps back to an actual character. Let's look at an example.

// the escaped form of a valid surrogate pair
'%uDB40%uDD6C'.replace(/u.{8}/g,[]);
// '%6C' (6C hex === 108 dec, LATIN SMALL LETTER L)
unescape('%6C')
// 'l'
The only slightly confusing thing in this example is the use of an empty array [] as the replacement value. It is converted via toString(), which means it evaluates to ''.

An empty string does the trick as well. The point of using [] is that it lets you bypass a quote filter or something similar.
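
A quick sketch to confirm that both replacement values behave the same (the variable names are only for illustration):

```javascript
// An empty array stringifies to an empty string.
console.log(String([]) === ''); // true

const viaArray = '%uDB40%uDD6C'.replace(/u.{8}/g, []);
const viaString = '%uDB40%uDD6C'.replace(/u.{8}/g, '');

console.log(viaArray);  // %6C
console.log(viaString); // %6C
```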

In this way, you can encode a whole message with invisible characters.
General functionality
So if we look at the example again:
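
The tweet only shows the decoding side. As a sketch of how such a message could be produced, the hypothetical helper hide below appends each character of a secret as the variation selector U+E0100 plus its character code. This is my reconstruction, not necessarily how @FakeUnicode built the original. It assumes that every secret character has a code of at most 0xEF (the size of that Unicode block) and that the visible part is a plain ASCII string without a lowercase u, since the regular expression would otherwise eat part of it.

```javascript
// Hypothetical helper: append each character of `secret` as an invisible
// variation selector (U+E0100 + character code) to the visible text.
function hide(visible, secret) {
  let result = visible;
  for (let i = 0; i < secret.length; i++) {
    const code = secret.charCodeAt(i);
    if (code > 0xEF) {
      throw new RangeError('only character codes up to 0xEF fit into this block');
    }
    result += String.fromCodePoint(0xE0100 + code);
  }
  return result;
}

// The decoding step used in the tweet.
function reveal(text) {
  return unescape(escape(text).replace(/u.{8}/g, ''));
}

const disguised = hide('A', 'lways be wary');
console.log(disguised.length);  // 27 (1 visible + 13 * 2 hidden code units)
console.log(reveal(disguised)); // Always be wary
```

Keep in mind that escape and unescape are legacy functions; they are used here only because the original snippet relies on them.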

[Image: the JavaScript snippet from the tweet]

The following happens:

A:0 - the A here includes many "hidden" code units
these characters become visible with the help of escape
a mapping is performed using replace
the result of the mapping is unescaped again and displayed in the alert window
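
The steps above can be followed one at a time. This sketch uses a shortened, hypothetical stand-in for the property name (A plus the first two hidden characters) and logs the intermediate values instead of showing an alert:

```javascript
const propertyName = 'A\u{E016C}\u{E0177}'; // shortened stand-in for the real key

const escaped = escape(propertyName);         // make the hidden code units visible
const mapped = escaped.replace(/u.{8}/g, []); // keep only the last two hex digits of each pair
const message = unescape(mapped);             // turn %XX back into characters

console.log(escaped); // A%uDB40%uDD6C%uDB40%uDD77
console.log(mapped);  // A%6C%77
console.log(message); // Alw
```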

I think it's pretty cool!
Additional Resources
This small example touches on many Unicode topics. If you want to learn more, I highly recommend reading Mathias Bynens' articles on Unicode and JavaScript:


JavaScript has a Unicode problem
JavaScript character escape sequences
KlauS 23 september 2017, 8:40