
How does the JVM allocate objects?


How does the JVM create new objects? What exactly happens when you write new Object()?
At conferences, speakers periodically mention that object allocation uses thread-local allocation buffers (TLABs): memory regions handed out exclusively to each thread, in which creating objects is very fast thanks to the absence of synchronization.
But how do you choose the right TLAB size? What do you do if you need to allocate 10% of the TLAB's size, but only 9% is free? Can an object be allocated outside a TLAB? When (if ever) is the allocated memory zeroed?
Having asked myself these questions and not found all the answers, I decided to write this article to remedy the situation.
Before reading it, it is worth recalling how a garbage collector works.
Introduction
What steps are needed to create a new object?
First, you need to find a free memory region of the right size; then the object must be initialized: the memory zeroed, some internal structures set up (the information used when calling getClass() and when synchronizing on the object, etc.); and finally the constructor must be called.
The article is structured like this: first we will try to understand what should happen in theory, then we will climb into the JVM's insides and see how things really are, and at the end we will write a few benchmarks to make sure.
Disclaimer: some parts are deliberately simplified without loss of generality. Speaking of garbage collection, I mean any compacting collector, and speaking of the address space, I mean the eden of the young generation. For other [standard or well-known] garbage collectors the details may vary, but not by much.
TLAB 101

The first step is to find free memory for our object.
In the general case, efficient memory allocation is a non-trivial task, full of pain, suffering, and dragons. For example, free lists may be maintained for power-of-two sizes, searched, and, when necessary, memory regions split and moved from one list to another (aka the buddy allocator). Fortunately, the Java machine has a garbage collector that takes the hard part upon itself: during a young-generation collection, all live objects are moved to the survivor space, leaving one large contiguous region of free memory in eden.
Since memory in the JVM is freed by the GC, the allocator only needs to know where to find that free memory; in effect, it controls access to a single pointer into it. So allocation should be very simple and consist of ponies and rainbows: add the object's size to the pointer to free eden, and the memory is ours (this technique is called bump-the-pointer).
Several threads can allocate memory at the same time, so some form of synchronization is needed. If it is done in the simplest way (a lock on the heap region or an atomic increment of the pointer), memory allocation can easily become a bottleneck, so the JVM developers extended the bump-the-pointer idea: each thread is handed a large chunk of memory that belongs to it alone. Allocations inside such a buffer still proceed by incrementing a pointer (but a local one now, without synchronization) for as long as possible, and a new region is requested each time the current one runs out. This region is called the thread-local allocation buffer. The result is a kind of hierarchical bump-the-pointer, where the first level is the heap region and the second is the TLAB of the current thread. Some allocators stop there, while others go even further, hierarchically placing buffers inside buffers.
It turns out that in most cases allocation should be very fast, take just a couple of instructions, and look something like this:
start = currentThread.tlabTop;        // current fill pointer of the TLAB
end = start + sizeof(Object.class);   // bump it by the object's size

if (end > currentThread.tlabEnd) {
    goto slow_path;                   // TLAB exhausted: take the slow path
}

currentThread.setTlabTop(end);        // publish the new fill pointer
callConstructor(start, end);
This looks too good to be true, so let's use PrintAssembly and see what a method that creates a java.lang.Object compiles into:
; Hotspot machinery skipped
mov 0x60(%r15),%rax ; start = tlabTop
lea 0x10(%rax),%rdi ; end = start + sizeof(Object)
cmp 0x70(%r15),%rdi ; if (end > tlabEnd)
ja 0x00000001032b22b5 ; goto slow_path
mov %rdi,0x60(%r15) ; tlabTop = end
; Object initialization skipped
Armed with the secret knowledge that register %r15 always holds a pointer to the VM thread (a lyrical digression: thanks to this invariant, thread-locals and Thread.currentThread() work very fast), we see that this is exactly the code we expected. Note, too, that the JIT compiler inlined the allocation directly into the calling method.
So the JVM creates new objects almost for free (leaving garbage collection aside) in a dozen instructions, shifting the responsibility for freeing memory and defragmentation onto the GC. A nice bonus is the locality of consecutively allocated data, which classic allocators cannot guarantee. There is an entire study on the impact of such locality on the performance of typical applications. Spoiler: it makes everything a bit faster, even despite the increased load on the GC.
The effect of the TLAB size
What should the TLAB size be? To a first approximation, it is reasonable to assume that the smaller the buffer, the more often allocation will take the slow path, so the TLAB should be made bigger: then we go to the relatively slow shared heap for memory less often and create new objects faster.
But there is another problem: internal fragmentation.
Consider a situation where the TLAB is 2 megabytes, the eden region (from which TLABs are allocated) is 500 megabytes, and the application has 50 threads. As soon as the heap runs out of room for new TLABs, the first thread whose TLAB fills up will trigger garbage collection. Assuming TLABs fill up roughly uniformly (in real applications this may not be the case), the remaining TLABs will on average be about half full. That is, with 0.5 * 50 * 2 == 50 megabytes of unoccupied memory still left (a whole 10%), garbage collection begins. Not great: a significant part of memory is still free, yet the GC is invoked anyway.
If you keep increasing the TLAB size or the number of threads, the memory loss grows linearly, and it turns out that the TLAB speeds up allocations but slows down the application as a whole by straining the garbage collector yet again.
And what if there is still room in the TLAB, but the new object is too big? If you throw out the old buffer and allocate a new one, fragmentation will only grow; if in such situations you always create the object directly in eden, the application may start running slower than it could.
In general, it is not at all clear what to do. You can hard-code a mystical constant (as is done for the inlining heuristics), you can leave the size to the developer to tune for each application individually (incredibly convenient), or you can teach the JVM to somehow guess the right answer.
What should I do?
Picking a constant is a thankless job, but the Sun engineers did not despair and went another way: instead of a size, a fragmentation percentage is specified - the share of the heap we are willing to sacrifice for fast allocations - and the JVM figures out the rest. The TLABWasteTargetPercent parameter is responsible for this, and its default value is 1%.
Using the same hypothesis about threads allocating memory uniformly, we get a simple equation: tlab_size * threads_count * 1/2 = eden_size * waste_percent.
If we are willing to sacrifice 10% of eden, there are 50 threads, and eden is 500 megabytes, then at the start of garbage collection 50 megabytes may be free in half-empty TLABs; that is, in our example the TLAB size comes out to 2 megabytes.
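As a sanity check, here is the same arithmetic in Java (a hypothetical helper of my own, not JVM code; sizes in megabytes):

// Rearranging tlab_size * threads_count * 1/2 = eden_size * waste_percent:
static long tlabSizeMb(long edenSizeMb, double wastePercent, int threadCount) {
    return (long) (2 * edenSizeMb * wastePercent / threadCount);
}

// tlabSizeMb(500, 0.10, 50) == 2, matching the example above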
This approach has a serious flaw: it assumes all threads allocate at the same rate, which is almost never true. It is undesirable to tune the number to the allocation rate of the most intensive threads, and it is just as undesirable to shortchange their slower colleagues (for example, scheduled workers). Moreover, a typical application has hundreds of threads (say, in the thread pools of your favorite app server), while only a few create new objects under serious load; this also has to be taken into account somehow. And if you recall the question "What to do if you need to allocate 10% of the TLAB size, but only 9% is free?", things become completely non-obvious.
There are too many details to just guess or peek in a blog, so it is time to find out how things are actually arranged and look at the HotSpot sources.
I used the jdk9 master branch; here is a CMakeLists.txt with which CLion starts working, if you want to repeat the journey.
Tumbling down the rabbit hole
The file we are interested in turns up right away and is called threadLocalAllocBuffer.cpp; it describes the buffer's structure. Although the class describes a buffer, it is created once per thread and reused as new TLABs are allocated, and it also holds various TLAB usage statistics.
To understand the JIT compiler, you need to think like a JIT compiler. So let's skip the initial setup - creating the buffer for a new thread and computing the default sizes - and look straight at the resize method, which is called for all threads at the end of each collection:
void ThreadLocalAllocBuffer::resize() {
// ...
size_t alloc = _allocation_fraction.average() *
(Universe::heap()->tlab_capacity(myThread()) / HeapWordSize);
size_t new_size = alloc / _target_refills;
// ...
}
Aha! For each thread, its allocation intensity is tracked, and from it, together with the constant _target_refills (helpfully annotated as "the number of TLABs the thread would like to request between two collections"), a new size is computed.
_target_refills is initialized once:
// Assuming each thread's active tlab is, on average, 1/2 full at a GC
_target_refills = 100 / (2 * TLABWasteTargetPercent);
This is exactly the hypothesis we assumed above, only instead of the TLAB size, the number of TLAB requests per thread is computed. To ensure that at collection time each thread has no more than x% of its memory unused, each thread's TLAB should be 2x% of the total memory it typically allocates between collections; dividing 1 by 2x gives the desired number of refills. With the default 1%, that works out to 100 / (2 * 1) = 50 refills.
A thread's share of allocations has to be updated at some point. At the beginning of each garbage collection, the statistics of all threads are refreshed in the accumulate_statistics method:

We check whether the thread refilled its TLAB at least once: there is no need to recalculate the size for a thread that does nothing (or at least allocates nothing).
We check whether at least half of eden was used, to keep full GCs and pathological cases (for example, an explicit System.gc() call) from skewing the calculations.
Finally, we compute what share of eden the thread consumed and update its allocation fraction.
We also update the statistics on how the thread used its TLABs: how much it allocated and how much memory was wasted.

To avoid unstable effects caused by collection frequency and changing allocation patterns - the garbage collector's inconstancy and the thread's whims - the allocation share is not just a number but an exponentially weighted moving average, which maintains the mean over the last N collections. The JVM has a flag for everything, and this place is no exception: the TLABAllocationWeight flag controls how quickly the average "forgets" old values (not that anyone ever wanted to change it).
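The idea of such an average fits in a few lines. Here is a minimal sketch of my own (an illustration, not the HotSpot source, where an AdaptiveWeightedAverage class plays this role):

// A minimal exponentially weighted moving average; weightPercent plays
// the role of TLABAllocationWeight (a percentage, default 35).
class ExponentialAverage {
    private final double weight;   // share of each new sample, 0..1
    private double average;

    ExponentialAverage(int weightPercent) {
        this.weight = weightPercent / 100.0;
    }

    void sample(double value) {
        // the higher the weight, the faster old values are "forgotten"
        average = (1 - weight) * average + weight * value;
    }

    double average() {
        return average;
    }
}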
Result
The information obtained is enough to answer the question that interests us about the TLAB size:

The JVM knows how much memory it can afford to lose to fragmentation. From this value, the number of TLABs a thread should request between garbage collections is derived.
The JVM keeps track of how much memory each thread allocates and smooths these values.
Each thread gets a TLAB sized in proportion to the memory it allocates. This solves the problem of uneven allocation between threads: on average, everyone allocates quickly and little memory is wasted.

If an application has 100 threads, 3 of which serve user requests, 2 do some auxiliary work on a timer, and all the rest are idle, then the first group of threads will receive large TLABs, the second quite small ones, and all the others the default values. And, most pleasantly, the number of "slow" allocations (TLAB requests) will be the same for all threads.
Allocation in C1
The TLAB sizes are sorted out. Without going far, let's dig further into the sources and see how TLABs are allocated when it is fast, when it is slow, and when it is very slow.
Here one class is no longer enough, and we need to look into what the new operator is compiled into. To avoid craniocerebral trauma, we will look at the client compiler (C1) code: it is much simpler and clearer than the server compiler, it describes the overall picture of the world well, and since new is quite popular in Java, it has enough optimizations.
We are interested in two methods: C1_MacroAssembler::allocate_object, which describes object allocation in a TLAB and its initialization, and Runtime1::generate_code_for, which is executed when fast allocation fails.
It is interesting to see whether an object can always be created quickly, and the "find usages" chain leads us to this comment in instanceKlass.hpp:
// This bit is initialized in classFileParser.cpp.
// It is false under any of the following conditions:
// - the class is abstract (including any interface)
// - the class has a finalizer (if !RegisterFinalizersAtInit)
// - the class size is larger than FastAllocateSizeLimit
// - the class is java/lang/Class, which cannot be allocated directly
bool can_be_fastpath_allocated() const {
return !layout_helper_needs_slow_path(layout_helper());
}
It becomes clear that very large objects (more than 128 kilobytes by default) and finalizable classes always go through a slow call into the JVM. (A riddle: where did the abstract classes go?)
Let's make a note of this and get back to the allocation process:

tlab_allocate is the attempt to allocate the object quickly - exactly the code we saw with PrintAssembly. If it succeeds, allocation is over and we move on to initializing the object.

tlab_refill is the attempt to allocate a new TLAB. Using an interesting check, the method decides whether to allocate a new TLAB (throwing out the old one) or to allocate the object directly in eden, keeping the old TLAB:
// Retain tlab and allocate object in shared space if
// the amount free in the tlab is too large to discard.
cmpptr(t1, Address(thread_reg, in_bytes(JavaThread::tlab_refill_waste_limit_offset())));
jcc(Assembler::lessEqual, discard_tlab);
tlab_refill_waste_limit is the share of the TLAB we are not willing to sacrifice for the sake of allocating one object. By default it is 1.5% of the current TLAB size (of course, there is a -XX:TLABRefillWasteFraction parameter, which unexpectedly has the value 64, and the limit is the current TLAB size divided by this parameter). The limit is raised on every slow allocation to avoid degradation in unfortunate cases, and is reset at the end of each GC cycle. One question fewer.
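In pseudocode (with hypothetical names; the real check is the cmpptr/jcc pair above, operating on JavaThread fields), the whole decision looks something like this:

tlabFree = tlabEnd - tlabTop;
if (tlabFree > refillWasteLimit) {
    // too much free space to throw away: keep the TLAB
    // and allocate this one object directly in eden
    refillWasteLimit += refillWasteIncrement;  // raised on each slow allocation
    allocateInEden(objectSize);
} else {
    // the leftover tail is small enough to discard: request a new TLAB
    discardTlab();
    allocateNewTlab();
}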

eden_allocate is the attempt to allocate memory (an object or a TLAB) in eden. This place is very similar to allocation in a TLAB: we check whether there is room, and if so, we grab the memory atomically using lock cmpxchg; if not, we go to the slow path. Allocation in eden is not wait-free: if two threads try to allocate in eden simultaneously, with some probability one of them will fail and have to repeat everything all over again.
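To make "atomically, using lock cmpxchg" concrete, here is a toy model of the same loop in Java (names are mine; the real code works with raw heap addresses, not an AtomicLong):

import java.util.concurrent.atomic.AtomicLong;

class EdenModel {
    private final AtomicLong top;  // pointer to the first free byte
    private final long end;

    EdenModel(long start, long size) {
        this.top = new AtomicLong(start);
        this.end = start + size;
    }

    // Returns the address of the allocated block, or -1 to signal the slow path.
    long allocate(long objectSize) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + objectSize;
            if (newTop > end) {
                return -1;  // out of space: time for garbage collection
            }
            // compiles to lock cmpxchg on x86; on failure another thread
            // has moved the pointer and we retry - hence "not wait-free"
            if (top.compareAndSet(oldTop, newTop)) {
                return oldTop;
            }
        }
    }
}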
JVM upcall
If memory cannot be allocated in eden, a call into the JVM is made, which brings us to the InstanceKlass::allocate_instance method. Before the call itself, a lot of auxiliary work happens - special structures for the GC are set up and the frames required by the calling conventions are created - so the operation is not fast.
There is a lot of code there, and a superficial description will have to do; so as not to tire anyone, here is only an approximate scheme of its work:

First, the JVM tries to allocate memory through an interface specific to the current garbage collector. The same chain of calls occurs there as above: first an attempt to allocate from the TLAB, then an attempt to allocate a new TLAB from the heap and create the object in it.
On failure, garbage collection is invoked. Somewhere in here the GC overhead limit exceeded error comes into play, along with all kinds of GC notifications, logs, and other checks unrelated to allocation.
If garbage collection does not help, an attempt is made to allocate directly in the Old Generation (the behavior depends on the chosen GC algorithm); on failure, another collection and another attempt to create the object follow, and if even that does not work, an OutOfMemoryError is finally thrown.
Once the object is successfully created, we check whether it is finalizable, and if so, it gets registered, which consists in calling the Finalizer#register method (have you also always wondered why this class is in the standard library but never used explicitly?). The method itself was written very long ago: a Finalizer object is created and, under a global (sic!) lock, added to a linked list (through which the objects will later be finalized and collected). This fully justifies the unconditional upcall into the JVM and (partially) the advice "do not use the finalize method, even if you really want to."
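Schematically, the registration looks roughly like this (a heavily simplified sketch of java.lang.ref.Finalizer, not the JDK source; every finalizable object costs an extra allocation plus a global lock on creation):

final class FinalizerSketch {
    private static FinalizerSketch unfinalized;       // head of the global list
    private static final Object lock = new Object();

    private FinalizerSketch next, prev;
    private final Object finalizee;

    private FinalizerSketch(Object finalizee) {
        this.finalizee = finalizee;
        add();
    }

    static void register(Object finalizee) {          // called by the JVM
        new FinalizerSketch(finalizee);
    }

    private void add() {
        synchronized (lock) {                         // the global (sic!) lock
            if (unfinalized != null) {
                next = unfinalized;
                unfinalized.prev = this;
            }
            unfinalized = this;
        }
    }
}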

As a result, we now know almost everything about allocations: objects are allocated quickly, TLABs are refilled quickly, in some cases objects are allocated directly in eden, and in some they go through a leisurely call into the JVM.
Monitoring slow allocations
We have found out how memory is allocated, but not yet what to do with this information.
Somewhere above I wrote that all the statistics (slow allocations, the average number of refills, the number of allocating threads, losses to internal fragmentation) are recorded somewhere.
That somewhere is perf data, which ends up in the hsperfdata file, and you can look at it using jcmd or programmatically via the sun.jvmstat.monitor API.
There is no other way to get at least part of this information, although if you use the Oracle JDK, JFR can show it (using a private API unavailable in OpenJDK), broken down by stack traces at that.
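Reading the TLAB counters programmatically might look roughly like this - a sketch assuming the sun.jvmstat.monitor API is on the classpath (tools.jar on JDK 8) and the usual sun.gc.tlab.* counter names that jcmd <pid> PerfCounter.print shows:

import sun.jvmstat.monitor.*;

public class TlabStats {
    public static void main(String[] args) throws Exception {
        MonitoredHost host = MonitoredHost.getMonitoredHost("localhost");
        MonitoredVm vm = host.getMonitoredVm(new VmIdentifier(args[0])); // pid

        // counters maintained by the JVM: slow allocations, refills, waste
        for (String name : new String[]{"sun.gc.tlab.slowAlloc",
                                        "sun.gc.tlab.fills",
                                        "sun.gc.tlab.gcWaste"}) {
            Monitor counter = vm.findByName(name);
            System.out.println(name + " = "
                    + (counter == null ? "n/a" : counter.getValue()));
        }
    }
}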
Is this important? In most cases, probably not. But, for example, there is an excellent talk by the Twitter JVM team in which, by monitoring slow allocations and twisting the right parameters, they managed to cut their service's response time by several percent.
Prefetch
While we were walking through the code, some alignment and additional prefetch-related checks periodically popped up, which I deliberately ignored.
Prefetching is a performance technique in which data we will probably access soon (but not right now) is loaded into the processor's cache ahead of time. Prefetching is hardware when the processor itself guesses that memory is being traversed sequentially and starts loading it, and software when the programmer (or the compiler, or the virtual machine) emits special instructions hinting to the processor that it would be nice to start pulling the memory at a given address into the cache.
In HotSpot, prefetching is a C2-specific optimization, which is why we saw no mention of it in the C1 code. The optimization works as follows: during a TLAB allocation, an instruction is emitted that loads into the cache the memory located just past the object being allocated. On average, Java applications allocate either a lot or a great deal, so loading memory for subsequent allocations in advance looks like a very good idea: the next time we create an object, we will not have to wait, because the memory will already be in the cache.
Prefetching has several modes controlled by the AllocatePrefetchStyle flag: you can prefetch after every allocation, only sometimes, or several times after each allocation. In addition, the AllocatePrefetchInstr flag lets you change the instruction used: you can load data only into the L1 cache (say, when you allocate something and throw it away immediately), only into L3, or into all of them at once. The set of options depends on the processor architecture, and the mapping between flag values and instructions can be found in the .ad file for the architecture in question.
In production, it is almost never advisable to touch these flags - unless you happen to be a JVM engineer trying to outrun the competition on the SPECjbb benchmark, or you write something extremely high-performance in Java and all your changes are confirmed by reproducible measurements (in which case you probably did not read this far, because you already know everything).
Initialization
With memory allocation all cleared up, it remains to find out what object initialization consists of before the constructor is called. We will look at the same C1 compiler, but this time on ARM: the code there is simpler, and there are interesting moments.

The method we need is called C1_MacroAssembler::initialize_object and is not very complicated:

First, the object gets its header. The header consists of two parts: the mark word, which holds information about locks, the identity hash code (or biased locking), and garbage collection, and the klass pointer, which points to the object's class - the native representation of the class that lives in the metaspace and from which java.lang.Class can be obtained.
The class pointer is usually compressed and takes 32 bits instead of 64. As a result, the minimum possible object size is 12 bytes (plus mandatory alignment, which raises the number to 16).
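You do not have to take this on faith: the OpenJDK JOL tool prints the layout. A sketch, assuming the org.openjdk.jol:jol-core dependency:

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    public static void main(String[] args) {
        // prints the mark word (8 bytes), the compressed class pointer
        // (4 bytes) and 4 bytes of alignment padding: 16 bytes in total
        System.out.println(ClassLayout.parseClass(Object.class).toPrintable());
    }
}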

Then all the memory is zeroed, unless the ZeroTLAB flag is enabled (by default it is always off): zeroing a large memory region washes out the caches, and it is more efficient to zero memory in small chunks that are about to be overwritten anyway. Besides, the clever C2 compiler can skip the unnecessary work and avoid zeroing memory that the constructor's arguments will immediately overwrite. That answers another of our questions.

Finally, a StoreStore barrier is placed (for more on barriers, see gvsmirnov's article), which forbids (well, almost) the processor from performing further stores until the current ones have completed:
// StoreStore barrier required after complete initialization
// (headers + content zeroing), before the object may escape.
membar(MacroAssembler::StoreStore, tmp1);
This is needed for unsafe publication of the object: if the code has a bug and objects are published via a data race, you still expect to see in the object's fields either the default values or what the constructor wrote (and the language specification guarantees you this), but never arbitrary out-of-thin-air values; likewise, the virtual machine expects to see a correct header. x86 has a stronger memory model, where this instruction is not needed, which is why we looked at ARM.
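In code, the situation the barrier protects looks like this (a deliberately racy example of my own):

class Holder {
    int value;
    Holder() { value = 42; }
}

class RacyPublisher {
    static Holder instance;  // not volatile, so publication is a data race

    static void writer() {
        instance = new Holder();
    }

    static void reader() {
        Holder h = instance;
        if (h != null) {
            // may legitimately print 0 (the default) or 42,
            // but never garbage - and h's header is always valid
            System.out.println(h.value);
        }
    }
}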



Check in practice
Beware of bugs in the above code; I have only proved it correct, not tried it.
So far everything looks fine: we have poked through the source code and found a couple of funny moments, but it is not yet certain that the compiler actually does all this - maybe we looked in the wrong places and it was all in vain. Let's go back to PrintAssembly and look at the full generated code for the call new Long(1023):
0x0000000105eb7b3e: mov  0x60(%r15),%rax
0x0000000105eb7b42: mov  %rax,%r10
0x0000000105eb7b45: add  $0x18,%r10            ; allocate 24 bytes: an 8-byte header,
                                               ; a 4-byte pointer to the class,
                                               ; 4 bytes of alignment,
                                               ; 8 bytes for the long field
0x0000000105eb7b49: cmp  0x70(%r15),%r10
0x0000000105eb7b4d: jae  0x0000000105eb7bb5
0x0000000105eb7b4f: mov  %r10,0x60(%r15)
0x0000000105eb7b53: prefetchnta 0xc0(%r10)     ; prefetch
0x0000000105eb7b5b: movq $0x1,(%rax)           ; set the header
0x0000000105eb7b62: movl $0xf80022ab,0x8(%rax) ; set the pointer to the Long class
0x0000000105eb7b69: mov  %r12d,0xc(%rax)       ; zero the alignment gap
0x0000000105eb7b6d: movq $0x3ff,0x10(%rax)     ; put 1023 into the object's field
It all looks exactly as we expected, which is quite nice.
Summing up, the process of creating a new object goes like this:

An attempt is made to allocate the object in the TLAB.
If there is no room in the TLAB, either a new TLAB is allocated from eden or the object is created directly in eden, this time with atomic instructions.
If there is no room in eden, garbage collection takes place.
If there is still not enough space after that, an attempt is made to allocate in the old generation.
If that fails too, an OutOfMemoryError is thrown.
The object gets its header, and the constructor is called.

Here the theoretical part can end and we can move on to practice: does it really get much faster, is prefetching needed, and does the TLAB size affect anything?
Experiments
Now that we know how objects are created and which flags control the process, it is time to check it in practice. Let's write a trivial benchmark that simply creates java.lang.Object instances in several threads, and twiddle the JVM options.
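Something like this minimal JMH sketch (not the exact code behind the numbers below; the thread count is varied with JMH's -t option):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class NewObjectBenchmark {
    @Benchmark
    public Object allocate() {
        // returning the object hands it to JMH's Blackhole,
        // so the allocation cannot be eliminated as dead code
        return new Object();
    }
}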
The experiments were run on Java 1.8.0_121, Debian 3.16, Intel Xeon X5675. The x axis is the number of threads, the y axis is the number of allocations per microsecond.
[figure: allocations per microsecond vs. number of threads]
The results are quite as expected:

With default settings, the allocation rate grows almost linearly with the number of threads, which is exactly what we expect from new. It gets a little worse as the thread count rises, but that is not surprising: if at least some useful work is done between allocations (for example, with Blackhole#consumeCPU), the contention between allocations decreases and growth returns to linear.
Turning off prefetching makes allocations a bit slower. In our benchmark we simply overload the JVM with allocations, and in real applications everything may be quite different, so let's not draw conclusions about the usefulness of this optimization. There are no recipes; in the end, you can always turn the optimization off and measure your particular application.
With TLAB allocation turned off, everything becomes very bad: the two-and-a-half-times difference for a single thread is the cost of the JIT -> JVM call, and as threads are added, contention for the one shared pointer only grows, so there can be no talk of any scalability.

And finally, on the benefits of finalize: let's compare allocation in eden with allocation of finalizable objects:
[figure: allocation throughput, eden allocation vs. finalizable objects]
Performance drops by an order of magnitude, and by two orders of magnitude compared to fast allocation!
Conclusion
The JVM does a lot to create new objects as quickly and painlessly as possible, and TLABs are the main mechanism it uses for this. TLABs themselves are possible only through close cooperation with the garbage collector: by shifting the responsibility for freeing memory onto it, allocations became almost free.
Is this knowledge applicable? Maybe; in any case, it is always useful to understand how [your] tool is arranged inside and what ideas it uses.

Special thanks to apangin and gvsmirnov for the review, without which you would have died of boredom before reaching the middle of an article filled with vague wording, code listings, and typos.