t1ha = Fast Positive Hash
Slightly less than the fastest, portable, 64-bit hash function, with decent quality.
Yes, in the air and in the king, about like that. Read on?
Instead of Disclaimer We drop the definition of hash functions along with a detailed listing of properties and requirements for their cryptographic application, assuming that the reader either owns the necessary knowledge minimums, or will make up for them . Also we agree that here and further we mean non-cryptographic (cryptographically non-persistent) hash functions, unless otherwise specified.
Banality Hashing is used in a lot of algorithms, and almost always requires the most efficient (fast) data processing, while maintaining some minimum level of hashing quality. And by "quality", first of all, we mean "conditional randomness" (stochasticity) of the result relative to the initial data. Somewhat more rarely are additional requirements: resistance to deliberate generation of collisions or irreversibility.
A bit more tediousness For the sake of consistency, it is necessary to define the notions of "quality" of the hash function and other requirements in a little more detail:
Basic quality: Changing one or more arbitrary bits in an arbitrary set of raw data results in a change in each bit of the result with a probability of ½.
Irreversibility (resistance to restore the prototype): The impossibility of obtaining the source data or individual bits by the result of hashing.
Resilience to the selection under the result (resistance to collisions of the first kind): The complexity of finding / selecting the initial data set in order to obtain a given result or even its individual bits, including when the initial data set is known.
Resistant to the selection of messages (resistance to collisions of the second kind): The complexity of finding / matching two different sets of data that would give the same result or the coincidence of individual bits.
Omitting the citation of evidence and other calculations, we can state:
The proper implementation of all the points, while ensuring performance, is a fairly difficult task, the solution of which gives a good cryptographic hash function. But we will not deal with this yet.
Providing basic quality requires a fairly large number of ALU operations. Simply put, quality always conflicts with speed.
Obtaining a qualitative result with a bit capacity greater than the number of operations of the ALU requires more than a multiple increase in the number of interlaces, and hence the basic operations of the ALU.
In general, the creation of a fast hash function involves achieving a weighted compromise between speed, quality and bit depth of the result.
So, we can say that t1ha appeared as a result of searching for a compromise between quality and speed, while taking into account the capabilities of modern processors and the methods (arithmetic and logical combinations) already found for mixing and spreading the dependencies (avalanche effect).
The basic version of t1ha is the fastest portable hash function for constructing hash tables and other related applications. Therefore, the base version of t1ha is oriented to 64-bit little-endian architectures, takes a 64-bit seed value and produces a 64-bit result that includes key-length and seed-length gain. It's worth noting that t1ha is intentionally constructed to return 0 with zero input data (zero-size key and zero seed).
I decided to make 64-bit operations out of the answers to the comments on : Perhaps it is worth explaining that it is 64-bit operations that give speed and quality without taking away portability. Actually, the wider the bit depth of arithmetic operations, the more they give an avalanche effect and better mix the data. Well, to process data, all other things being equal, of course faster by 8 bytes than by 4. On the other hand, it is 64-bit operations that are natively available on many modern processors, and can be more or less tolerantly translated into 32-bit ones. All other options, including SIMD -operations, are forced to strongly sacrifice portability and / or speed on "non-native" platforms.
64-bit result: To build hash tables, in many cases, there really is a result of a smaller width. Even 32 bits can be more than enough. However, when using 64-bit operations, the 64-bit result is actually obtained by itself. In this case, a sufficiently high-quality 64-bit hash result allows you to quickly perform a comparison for non-equality, and with good reliability compare to equality.
This "magic" of replacing comparisons can be incomprehensible and not demanded, but can increase the speed of work by an order of magnitude only due to locality, i. less flushing CPU cache. Simply put, you can build a hash table structure so that the calculated hash values lie side by side (packed in a cache line). And for real data, the processor had to go only if the hash values match. And here in this case 64-bits from t1ha allow to get the ultimate result. And 128 bits already will not give a prize, and to take from 64 bits less - it is possible always.
Comparison with HighwayHash: I have a dual attitude to this unofficial Google project .
First: On the one hand, good code and excellent technical implementation. On the other hand, HighwayHash is positioned as possibly cryptographic resistant (at least equal to SipHash ). Although the proof of "durability" is reduced to the results of statistical tests, but without the possibility of reproducing them (somehow I even allowed myself an extra ).
Secondly: HighwayHash is really fast only on x86_64 with AVX2 or SSE41. So is not it easier to use AES-NI or SHA acceleration?
If everything goes well, then in the t1ha set there will be additional options (primarily the width of the result) and optimized for E2K. On this topic of comparison with HighwayHash, I would like to close.
QualityIt is quite difficult to assess the quality of the hash function in all aspects. You can go analytically, or conduct various statistical tests. Unfortunately, the analytical approach is ineffective for estimating hash functions with a compromise between quality and speed. Moreover, the comparative analytical evaluation of such functions tends to subjective.
On the contrary, it is easy to obtain transparent quantitative estimates for statistical tests. In this case there are well-proven test packages, for example SMHasher . For t1ha, the results are simple - all t1ha variants pass all tests without any remarks. On the other hand, we should not assume that t1ha has any properties beyond what is needed for the target application (hash-table construction).
Readers would certainly appreciate a thorough and in-depth analysis of the quality and / or durability of t1ha. However, based on the target applications of t1ha, this seems unnecessary. Simply put, speed was more important to us, including for short keys. Therefore, many-round mixing was not considered. The presented variant t1ha saves on matches cycles and produces a 64-bit result - it is practically pointless to measure the found compromise differently, as statistically, and these results are simply good.
In fact I'm just taking an example in the statistical prove with Google colleagues ;)
BenchmarksIt is worth explaining the presence in the title of the phrase "the fastest". Indeed, it is highly unlikely that there is a hash function that will be useful and at the same time the fastest on all platforms / architectures. Different sets of instructions are available on different processors, and similar instructions are performed with different efficiency. Obviously, the "most common fastest" function, most likely, can not be created. However, it seems permissible to use the "fastest" for a function that is portable and at the same time fastest, at least on the most common platform (x86_64), while having little chance of losing on any modern processor with a decent optimizing compiler.
The source code of the project includes a test that checks both the correctness of the result and measures the speed of each implemented version. At the same time on x86, depending on the capabilities of the processor (and the compiler), additional options for functions can be checked, and measurements are made in processor cycles.
In addition, the project site contains tables with the results of performance measurements through a modified version of SMHasher from Reini Urban . Accordingly, all the numbers can be rechecked and / or obtained results on a specific processor using a specific compiler.
Here you can compare with some of the closest competitors t1ha.
Hashing of short keys (average for 1.31 bytes).
Look at the right column "Cycles / Hash" (the lower the value, the faster):
Look at the middle column "MiB / Second" (the higher the value, the faster):
Variants t1haDevelopment t1ha pursued purely practical purposes. The first such goal was to obtain a fast, portable and sufficiently high-quality function for constructing hash tables.
Then, the fastest version of the hash function was needed, which would give a comparable result in quality, but it was maximally adapted to the target platform. For example, the base version of t1ha works with little-endian byte order, which means that large-endian architectures require conversion with the inevitable loss of performance. So why not get rid of unnecessary operations on a specific target platform? In the same way, a few more options were added:
A simplified version for 32-bit platforms, both little and big-endian.
Option using instructions AES-NI for processors without AVX.
Two options using AES-NI instructions using AVX.
A little later it became clear that more options would be required, designed for various applications, including different bit depth of the result, requirements for quality and durability. Such a variety required the establishment of order. What was expressed in changing the naming scheme, in which the digital suffix denotes the "level" of the function:
t1ha0 ()is the fastest option for the current processor.
t1ha1 ()is the basic portable 64-bit version of t1ha.
t1ha2 ()is a portable 64-bit version with a little more care for quality.
t1ha3 ()is a fast, portable 128-bit version for obtaining prints.
In this scheme, it is assumed that t1ha0 () is a dispatcher that implements redirection depending on the platform and the capabilities of the current processor. In addition, the use of the suffixes "_le" and "_be" is not excluded for an explicit choice between the little-endian and big-endian variants. Thus, under the "sign" t1ha now there are several hash functions and this family will be replenished, including with an eye on the domestic E2K "Elbrus".
The idea of the current set of functions and their properties can be obtained from the output of the test. It should be noted that all functions pass all SMHasher tests, and the performance of AES-NI variants varies greatly depending on the processor model:
Simple bench for x86 (large keys, 262144 bytes):
t1ha1_64le: 47151 ticks, 0.1799 clk/byte, 16.679 Gb/s @3GHz
t1ha1_64be: 61602 ticks, 0.2350 clk/byte, 12.766 Gb/s @3GHz
t1ha0_32le: 94101 ticks, 0.3590 clk/byte, 8.357 Gb/s @3GHz
t1ha0_32be: 99804 ticks, 0.3807 clk/byte, 7.880 Gb/s @3GHz
Simple bench for x86 (small keys, 31 bytes):
t1ha1_64le: 39 ticks, 1.2581 clk/byte, 2.385 Gb/s @3GHz
t1ha1_64be: 42 ticks, 1.3548 clk/byte, 2.214 Gb/s @3GHz
t1ha0_32le: 51 ticks, 1.6452 clk/byte, 1.824 Gb/s @3GHz
t1ha0_32be: 54 ticks, 1.7419 clk/byte, 1.722 Gb/s @3GHz
Simple bench for AES-NI (medium keys, 127 bytes):
t1ha0_ia32aes_noavx: 72 ticks, 0.5669 clk/byte, 5.292 Gb/s @3GHz
t1ha0_ia32aes_avx: 78 ticks, 0.6142 clk/byte, 4.885 Gb/s @3GHz
t1ha0_ia32aes_avx2: 78 ticks, 0.6142 clk/byte, 4.885 Gb/s @3GHz
Simple bench for AES-NI (large keys, 262144 bytes):
t1ha0_ia32aes_noavx: 38607 ticks, 0.1473 clk/byte, 20.370 Gb/s @3GHz
t1ha0_ia32aes_avx: 38595 ticks, 0.1472 clk/byte, 20.377 Gb/s @3GHz
t1ha0_ia32aes_avx2: 19881 ticks, 0.0758 clk/byte, 39.557 Gb/s @3GHz
A bit about the internal device In a little more detail, t1ha is built according to the Merkla-Damgard scheme (in the "wipe-pipe" version) with hardening from the data size and the additive value. Within the main compressive cycle, a 256-bit state is used, with the same size of the input block. And for each data operand, two injection points with cross pollination are realized. At the end of the compression cycle, 256-bit state is compressed to 128 bits.
When performing the described operations, 64-bit operations are used, which are combined into mixers ARX (Add-Rotate-Xor) and MUX / MRX (Mul-Rotate-Xor). It is important that all these calculations are arranged in such a way as to allow parallel operation of most operations and dense laying of u-ops both in the pipeline and in the x86_64 executing devices. Due to this, a fairly good quality is achieved with an almost limiting hashing speed of long keys.
It should be noted that the compressing cycle is only started for blocks of sufficient size. If the data is less, then the intermediate 128-bit state will consist only of the size of the key and the salting value.
Further, the remaining tail of the data in batches of 64 bits is mixed alternately to the halves of the 128-bit state. In conclusion, the state is mixed simultaneously with compression to a 64-bit result. An important feature of t1ha here is the use of a mixer based on a wide multiplication (128-bit product of two 64-bit multipliers). This allows to implement qualitatively mixing with a good avalanche effect for fewer operations. Despite the fact that wide multiplication is relatively expensive, fewer operations allow t1ha to process short keys in a record low number of processor cycles.
It should be noted that the mixer used based on wide multiplication and exclusive OR is not ideal. Despite the fact that t1ha passes all SMHasher tests, the author has an idea about the consequences of non-injectivity. Nevertheless, the resulting quality appears to be rationally sufficient, and in the plans for the development of the t1ha line, the intention to provide a slightly better option is already reflected.
The rest is .
Thank you for attention. All good.
|Vote for this post
Bring it to the Main Page