Java Virtual Machine

From Wiki Notes @ WuJiewen.com, by Jiewen Wu

Dynamic Compilation

The compilation process for a Java application is different from that of statically compiled languages like C or C++. A static compiler converts source code directly to machine code that can be directly executed on the target platform, and different hardware platforms require different compilers. The Java compiler converts Java source code into portable JVM bytecodes, which are "virtual machine instructions" for the JVM. Unlike static compilers, javac does very little optimization -- the optimizations that would be done by the compiler in a statically compiled language are performed instead by the runtime when the program is executed.

HotSpot dynamic compilation

The HotSpot execution process combines interpretation, profiling, and dynamic compilation. Rather than convert all bytecodes into machine code before they are executed, HotSpot first runs as an interpreter and compiles only the "hot" code -- the code executed most frequently. As it executes, it gathers profiling data, which is used to decide which code sections are being executed frequently enough to merit compilation.

HotSpot comes with two compilers: the client compiler and the server compiler. The default is to use the client compiler; you can select the server compiler by specifying the -server switch when starting the JVM. The server compiler has been optimized to maximize peak operating speed, and is intended for long-running server applications. The client compiler has been optimized to reduce application startup time and memory footprint, employing fewer complex optimizations than the server compiler, and accordingly requiring less time for compilation.

After a code path has been interpreted a certain number of times, it is compiled into machine code. But the JVM continues profiling, and may recompile the code later with a higher level of optimization if it decides the code path is particularly hot or if further profiling data suggests opportunities for additional optimization. The JVM may recompile the same bytecodes many times in a single application execution. Try invoking the JVM with the -XX:+PrintCompilation flag, which causes the compiler (client or server) to print a short message every time it runs.
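
For example, an invocation along these lines (MyBenchmark is a hypothetical class name) prints one line per compilation event as HotSpot decides which methods are hot:

java -XX:+PrintCompilation MyBenchmark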

Compilation and Writing Benchmarks

One of the challenges of writing good benchmarks is that optimizing compilers are adept at spotting dead code -- code that has no effect on the outcome of the program execution. But benchmark programs often don't produce any output, which means some, or all, of your code can be optimized away without you realizing it, at which point you're measuring less execution than you think you are. In particular, many microbenchmarks perform much "better" when run with -server than with -client, not because the server compiler is faster (though it often is) but because the server compiler is more adept at optimizing away blocks of dead code.

The following benchmark is intended to measure concurrent thread performance, but instead measures something completely different.

public class StupidThreadTest {
   public static void doSomeStuff() {
       double uselessSum = 0;
       for (int i=0; i<1000; i++) {
           for (int j=0;j<1000; j++) {
               uselessSum += (double) i + (double) j;
           }
       }
   }
   public static void main(String[] args) throws InterruptedException {
       doSomeStuff();
       
       int nThreads = Integer.parseInt(args[0]);
       Thread[] threads = new Thread[nThreads];
       for (int i=0; i<nThreads; i++)
           threads[i] = new Thread(new Runnable() {
               public void run() { doSomeStuff(); }
           });
       long start = System.currentTimeMillis();
       for (int i = 0; i < threads.length; i++)
           threads[i].start();
       for (int i = 0; i < threads.length; i++)
           threads[i].join();
       long end = System.currentTimeMillis();
       System.out.println("Time: " + (end-start) + "ms");
   }
}


The doSomeStuff() method is supposed to give the threads something to do. However, the compiler can determine that all the code in doSomeStuff is dead, and optimize it all away because uselessSum is never used. Once the code inside the loop goes away, the loops can go away, too, leaving doSomeStuff entirely empty. The server compiler does more optimization and can detect that the entirety of doSomeStuff is dead code. While many programs do see a speedup with the server JVM, the speedup you see here is simply a measure of a badly written benchmark, not the blazing performance of the server JVM.
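
One common defense is to make the computed value observable, so the compiler cannot prove the code dead. The sketch below (a hypothetical revision of the benchmark, using a volatile sink field) is one conventional way to do that:

public class LessStupidThreadTest {
    // Publishing the result through a volatile field makes it observable,
    // so the compiler cannot eliminate the loops as dead code.
    static volatile double sink;

    public static void doSomeStuff() {
        double sum = 0;
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                sum += (double) i + (double) j;
        sink = sum;   // consume the result
    }
}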

Tips

If you're looking to measure the performance of idiom X, you generally want to measure its compiled performance, not its interpreted performance. (You want to know how fast X will be in the field.) To do so requires "warming up" the JVM -- executing your target operation enough times that the compiler will have had time to run and replace the interpreted code with compiled code before starting to time the execution.
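
A minimal warmup harness might look like the sketch below; the class name, workload, and iteration counts are illustrative guesses, and -XX:+PrintCompilation (described above) is how you confirm compilation has actually settled before trusting the numbers:

public class WarmupHarness {
    static volatile double sink;   // keeps the workload from being optimized away

    static void targetOperation() { sink = Math.sqrt(sink + 12345.678); }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++)    // warmup: give HotSpot time to compile
            targetOperation();
        long start = System.nanoTime();    // time only the (hopefully) compiled code
        for (int i = 0; i < 20000; i++)
            targetOperation();
        long elapsed = System.nanoTime() - start;
        System.out.println("ns/op: " + elapsed / 20000);
    }
}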

The compiler runs at unpredictable times, the JVM switches from interpreted to compiled code at will, and the same code path may be compiled and recompiled more than once during a run. If you don't account for the timing of these events, they can seriously distort your timing results.

How much warmup is enough? You don't know. The best you can do is run your benchmarks with -XX:+PrintCompilation, observe what causes the compiler to kick in, then restructure your benchmark program to ensure that all of this compilation occurs before you start timing and that no further compilation occurs in the middle of your timing loops.

If you run your benchmarks with -verbose:gc, you can see how much time was spent in garbage collection and adjust your timing data accordingly. Even better, you can run your program for a long, long time, ensuring that you trigger many garbage collections, more accurately amortizing the allocation and garbage collection cost.

Garbage Collection

The benefits of garbage collection are indisputable -- increased reliability, decoupling of memory management from class interface design, and less developer time spent chasing memory management errors. The well-known problems of dangling pointers and memory leaks simply do not occur in Java programs. (Java programs can exhibit a form of memory leak, more accurately called unintentional object retention.) However, garbage collection is not without its costs -- among them performance impact, pauses, configuration complexity, and nondeterministic finalization.

The Basic Algorithms

Memory blocks can be reached in one of two ways -- if the user program holds a reference to that block in a root (a reference held in a static variable or in a local variable on an active stack frame), or if there is a reference to that block held in another reachable block. In a Java program, the set of reachable objects is the transitive closure of the root set under the points-to relation. The problem faced by all garbage collection algorithms is the same -- identify blocks of memory that have been dispensed by the allocator but are unreachable by the user program.

Reference counting

The most straightforward garbage collection strategy is reference counting. Reference counting is simple, but it requires significant assistance from the compiler and imposes overhead on the mutator (the term for the user program, from the perspective of the garbage collector). Each object has an associated reference count -- the number of active references to that object. If an object's reference count is zero, it is garbage (unreachable from the user program) and can be recycled. Every time a pointer reference is modified, such as through an assignment statement, or when a reference goes out of scope, the compiler must generate code to update the referenced object's reference count. If an object's reference count goes to zero, the runtime can reclaim the block immediately (and decrement the reference counts of any blocks that the reclaimed block references), or place it on a queue for deferred collection. Some C++ library implementations of classes such as string use reference counting internally to provide the appearance of garbage collection. Reference counting is rarely used in production garbage collectors for a number of reasons, chief among them its inability to reclaim unreachable cyclic structures (objects that reference each other directly or indirectly, like a circularly linked list or a tree that contains back-pointers to the parent node).
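
The cycle problem is easy to see in Java terms. In this hypothetical sketch, once makeGarbageCycle() returns, both nodes are unreachable from the program, yet each still holds a reference to the other, so neither count would ever reach zero:

class Node {
    Node next;
}

class CycleDemo {
    static void makeGarbageCycle() {
        Node a = new Node();
        Node b = new Node();
        a.next = b;   // a -> b
        b.next = a;   // b -> a completes the cycle
        // After return, a and b are unreachable, but under pure reference
        // counting each would still have a count of 1 and never be reclaimed.
    }
}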

Tracing collectors

None of the standard garbage collectors in the JDK uses reference counting; instead, they all use some form of tracing collector. A tracing collector stops the world (although not necessarily for the entire duration of the collection) and starts tracing objects, starting at the root set and following references until all reachable objects have been examined. Roots can be found in program registers, in local (stack-based) variables in each thread's stack, and in static variables.

Mark-sweep collectors

The most basic form of tracing collector, first proposed by Lisp inventor John McCarthy in 1960, is the mark-sweep collector, in which the world is stopped and the collector visits each live node, starting from the roots, marking each node it visits. When there are no more references to follow, the mark phase is complete; the heap is then swept (that is, every object in the heap is examined), and any object not marked is reclaimed as garbage and returned to the free list.
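
A toy version of the algorithm can be sketched in Java. GcObject and the explicit heap list below are hypothetical stand-ins for the collector's internal structures; a real collector operates on raw memory blocks, not Java objects:

import java.util.*;

// Toy object model: every object carries a mark bit and its outgoing references.
class GcObject {
    boolean marked;
    List<GcObject> refs = new ArrayList<>();
}

class MarkSweep {
    static void collect(Collection<GcObject> roots, List<GcObject> heap) {
        // Mark phase: trace from the roots, visiting only live objects.
        Deque<GcObject> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            GcObject o = pending.pop();
            if (!o.marked) {
                o.marked = true;
                pending.addAll(o.refs);
            }
        }
        // Sweep phase: every object in the heap is examined.
        for (Iterator<GcObject> it = heap.iterator(); it.hasNext(); ) {
            GcObject o = it.next();
            if (o.marked) o.marked = false;   // clear the mark for the next cycle
            else it.remove();                 // unmarked -> garbage, reclaim
        }
    }
}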

Mark-sweep is simple to implement, can reclaim cyclic structures easily, and doesn't place any burden on the compiler or mutator like reference counting does. But it has deficiencies -- collection pauses can be long, and the entire heap is visited in the sweep phase, which can have very negative performance consequences on virtual memory systems where the heap may be paged.

The big problem with mark-sweep is that every active (allocated) object, whether reachable or not, is visited during the sweep phase. Because a significant percentage of objects are likely to be garbage, this means that the collector is spending considerable effort examining and handling garbage. Mark-sweep collectors also tend to leave the heap fragmented, which can cause locality issues and can also cause allocation failures even when sufficient free memory appears to be available.

Copying collectors

In a copying collector, another form of tracing collector, the heap is divided into two equally sized semi-spaces; one contains active data and the other is unused. When the active space fills up, the world is stopped and live objects are copied from the active space into the inactive space. The roles of the spaces are then flipped, with the old inactive space becoming the new active space.

Copying collection has the advantage of only visiting live objects, which means garbage objects will not be examined, nor will they need to be paged into memory or brought into the cache. The duration of collection cycles in a copying collector is driven by the number of live objects. However, copying collectors have the added cost of copying the data from one space to another, adjusting all references to point to the new copy. In particular, long-lived objects will be copied back and forth on every collection.
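
Reusing the toy GcObject type from the mark-sweep sketch above, the heart of a copying collector looks like this; note that only live objects are touched, and everything left behind in the old space is reclaimed wholesale when the spaces flip:

import java.util.*;

class CopyingCollector {
    // Trace from the roots, moving each live object into the inactive space.
    // The 'copied' set stands in for the forwarding pointers a real collector
    // would store in the old copies of objects.
    static List<GcObject> collect(Collection<GcObject> roots) {
        List<GcObject> toSpace = new ArrayList<>();
        Set<GcObject> copied = new HashSet<>();
        Deque<GcObject> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            GcObject o = pending.pop();
            if (copied.add(o)) {
                toSpace.add(o);           // "copy" the live object
                pending.addAll(o.refs);
            }
        }
        return toSpace;                   // this becomes the new active space
    }
}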

Heap compaction

Copying collectors have another benefit, which is that the set of live objects are compacted into the bottom of the heap. This not only improves locality of reference of the user program and eliminates heap fragmentation, but also greatly reduces the cost of object allocation -- object allocation becomes a simple pointer addition on the top-of-heap pointer. There is no need to maintain free lists or look-aside lists, or perform best-fit or first-fit algorithms -- allocating N bytes is as simple as adding N to the top-of-heap pointer and returning its previous value.

/* Inexpensive memory allocation in a copying collector */
#include <stddef.h>

extern char *heapStart;   /* next free byte in the active semi-space */
extern char *heapTop;     /* end of the active semi-space */
void doGarbageCollection(void);

void *gcAlloc(size_t n) {
    if (heapTop - heapStart < (ptrdiff_t) n)
        doGarbageCollection();   /* copy live objects, reset heapStart */
    char *wasStart = heapStart;
    heapStart += n;              /* allocation is a simple pointer addition */
    return wasStart;
}


Developers who have implemented sophisticated memory management schemes for non-garbage-collected languages may be surprised at how inexpensive allocation is -- a simple pointer addition -- in a copying collector. This may be one of the reasons for the pervasive belief that object allocation is expensive -- earlier JVM implementations did not use copying collectors. Allocation may be significantly cheaper in the Java runtime than in C. Not only is the cost of allocation smaller, but for objects that become garbage before the next collection cycle, the deallocation cost is zero, as the garbage object will be neither visited nor copied.

Mark-compact collectors

The copying algorithm has excellent performance characteristics, but it has the drawback of requiring twice as much memory as a mark-sweep collector. The mark-compact algorithm combines mark-sweep and copying in a way that avoids this problem, at the cost of some increased collection complexity. Like mark-sweep, mark-compact is a two-phase process, where each live object is visited and marked in the marking phase. Then, marked objects are copied such that all the live objects are compacted at the bottom of the heap. If a complete compaction is performed at every collection, the resulting heap is similar to the result of a copying collector -- there is a clear demarcation between the active portion of the heap and the free area, so that allocation costs are comparable to a copying collector. Long-lived objects tend to accumulate at the bottom of the heap, so they are not copied repeatedly as they are in a copying collector.

Note

Most objects (typically around 98 percent) die young. The various garbage collection strategies perform very differently depending on the mix of short-lived and long-lived objects. Copying collectors work very well when most objects die young, because objects that die young never need to be copied at all. However, the copying collector deals poorly with long-lived objects, repeatedly copying them back and forth from one semi-space to another. Conversely, mark-compact collectors do very well with long-lived objects, because long-lived objects tend to accumulate at the bottom of the heap and then do not need to be copied again. Mark-sweep and mark-compact collectors, however, expend considerably more effort examining dead objects, because they must examine every object in the heap during the sweep phase.

Early JDKs used a single-threaded mark-sweep or mark-sweep-compact collector. JDKs 1.2 and later employ a hybrid approach, called generational collection, where the heap is divided into several sections based on an object's age, and different generations are collected separately using different collection algorithms. Generational garbage collection turns out to be very effective, although it introduces several additional bookkeeping requirements at runtime.

The Generational Collection Algorithm

A generational collector divides the heap into multiple generations. Objects are created in the young generation, and objects that meet some promotion criteria, such as having survived a certain number of collections, are then promoted to the next older generation. A generational collector is free to use a different collection strategy for different generations and perform garbage collection on the generations separately.

Minor collections

Generational collection makes garbage collection pauses shorter by not collecting all generations at once. When the allocator is unable to fulfill an allocation request, it first triggers a minor collection, which only collects the youngest generation. Since many of the objects in the young generation will already be dead and the copying collector does not need to examine dead objects at all, minor collection pauses can be quite short and can often reclaim significant heap space. If the minor collection frees enough heap space, the user program can resume immediately. If it does not free enough heap space, it proceeds to collect higher generations until enough memory has been reclaimed.

Intergenerational references

A generational tracing collector starts from the root set, but does not traverse references that lead to objects in the older generation, which reduces the size of the object graph to be traced. What if an object in the older generation references a younger object, which is not reachable through any other chain of references from a root?

To address this problem, generational collectors must explicitly track references from older objects to younger objects and add these old-to-young references into the root set of the minor collection. There are two ways to create a reference from an old object to a young object. Either one of the references contained in an old object is modified to refer to a young object, or a young object that refers to other young objects is promoted into the older generation. The mutator and garbage collector can work together to maintain a comprehensive list of old-to-young references as they are created. When objects are promoted into the older generation, the garbage collector can note any old-to-young references that are created as a result of the promotion, which leaves only the tracking of intergenerational references created by pointer modifications.

The garbage collector needs to have a comprehensive set of old-to-young references when it wants to perform a minor collection. One way to do this would be to trace the old generation, but this clearly has significant overhead. Somewhat better would be to linearly scan the old generation looking for references to young objects. This approach is faster than tracing and has better locality, but is still considerable work.

The garbage collector can track old-to-young references that arise through modifying references held within existing objects in several ways. It could track them in the same manner as maintaining reference counts in reference-counting collectors (the compiler could generate additional instructions surrounding pointer assignments) or could use virtual memory protection on the old generation heap to trap writes to older objects. Another potentially more efficient virtual memory approach would be to use the page modification dirty bits in the old generation heap to identify blocks to scan for objects containing old-to-young pointers.

With a little cleverness, it is possible to avoid the overhead of tracking every pointer modification and inspecting it to see if it crosses generational boundaries. For example, it is not necessary to track stores to local or static variables, because they will already be part of the root set. It may also be possible to avoid tracking pointer stores within constructors that simply initialize fields of newly created objects (so-called initializing stores), as (almost) all objects are allocated in the young generation. In any case, the runtime must maintain an explicit set of references from old objects to young ones and add these references to the root set when collecting the young generation.

In the Figure below, the arrows represent references between objects in the heap. The red arrows represent old-to-young references that must be added to the root set for a minor collection. The blue arrows represent references to old objects, either from the root set or from the young generation, which don't need to be traced when collecting only the young generation.

[Figure: intergenerational references between young-generation and old-generation objects]

Card marking

The Sun JDKs use an optimized variant of an algorithm called card marking to identify modifications to pointers held in fields of old-generation objects. In this approach, the heap is divided into a set of cards, each of which is usually smaller than a memory page. The JVM maintains a card map, with one bit (or byte, in some implementations) corresponding to each card in the heap. Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set. At garbage collection time, the mark bits associated with cards in the old generation are examined, and dirty cards are scanned for objects containing references into the younger generation. Then the mark bits are cleared. There are several costs to card marking -- additional space for the card map, additional work on each pointer store, and additional work at garbage collection time. Card marking can add as little as two or three machine instructions per non-initializing heap pointer store, and entails scanning any objects on dirty cards at minor collection time.
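
In Java-like pseudocode, the write barrier amounts to something like the sketch below; the card size, heap size, and map layout are illustrative, and HotSpot emits the equivalent inline as those few machine instructions:

class CardTable {
    static final int HEAP_SIZE = 64 * 1024 * 1024;    // toy 64MB heap (illustrative)
    static final int LOG2_CARD_SIZE = 9;              // 512-byte cards (illustrative)
    static final byte[] cardMap = new byte[HEAP_SIZE >>> LOG2_CARD_SIZE];

    // Conceptually invoked on every non-initializing heap pointer store.
    static void writeBarrier(int fieldAddress) {
        cardMap[fieldAddress >>> LOG2_CARD_SIZE] = 1;  // mark the card dirty
    }
}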

The JDK GC

By default, the 1.4.1 JDK divides the heap into two sections, a young generation and an old generation. (Actually, there is also a third section, the permanent space, which is used for storing loaded class and method objects.) The young generation is divided into a creation space, often called Eden, and two survivor semi-spaces, using a copying collector.

The old generation uses a mark-compact collector. Objects are promoted from the young generation to the old generation after they have survived copying a certain number of times. A minor collection will copy live objects from Eden and one of the survivor semi-spaces into the other survivor space, potentially promoting some objects to the older generation. A major collection will collect both the young and old generations. The System.gc() method always triggers a major collection, which is one reason to use System.gc() sparingly, if at all: major collections can take much longer than minor collections. There is no way to programmatically trigger a minor collection.

Incremental collection

The incremental collection option has been part of the JDK since 1.2. Incremental collection reduces garbage collection pauses at the expense of throughput, making it desirable only in cases where shorter collection pauses are very important, such as near-realtime systems.

The algorithm used by the JDK for incremental collection, the Train algorithm, creates a new section of the heap between the old and young generations. These heap sections are divided into "trains," each of which is divided into a sequence of "cars." Each car can be collected separately. Effectively, each train car constitutes a separate generation, which means that not only must old-to-young references be tracked, but also references from older trains to younger trains and older cars to younger cars. This creates significant extra work for the mutator and garbage collector, but permits much shorter collection pauses.

Parallel and concurrent collectors

The new collectors in JDK 1.4.1 are all designed to address the issue of garbage collection on multiprocessor systems. Because most garbage collection algorithms stop the world for some period of time, a single-threaded garbage collector can quickly become a scalability bottleneck, because all but one processor are idle while the garbage collector has suspended the user program threads. Two of the new collectors are designed to reduce collection pause time -- the parallel copying collector and the concurrent mark-sweep collector. The other, the parallel scavenge collector, is designed for higher throughput on large heaps.

The parallel copying collector, enabled by the -XX:+UseParNewGC JVM option, is a young-generation copying collector that divides the work of garbage collection among as many threads as there are CPUs. The concurrent mark-sweep collector, enabled by the -XX:+UseConcMarkSweepGC option, is an old-generation mark-sweep collector that stops the world briefly for an initial mark phase (and again later for a brief re-mark phase) and then allows the user program to resume while the garbage collector threads execute concurrently with the user program. The parallel copying collector and concurrent mark-sweep collector are basically concurrent versions of the default copying and mark-compact collectors. The parallel scavenging collector, enabled by -XX:+UseParallelGC, is a young-generation collector optimized for very large (gigabyte and larger) heaps on multiprocessor systems.

Tuning GC Options

The first place to start with tuning garbage collection is to examine the verbose GC output. This will give you information on the frequency, timing, and duration of garbage collection operations. The simplest form of garbage collection tuning is simply expanding the maximum heap size (-Xmx). Copying collection becomes more efficient as the heap grows, so by expanding the heap, you reduce the cost of collection on a per-object basis. In addition to increasing the maximum heap size, you can also use the -XX:NewRatio option to increase the proportion of space assigned to the young generation. You also can specify the size of the young generation explicitly with the -Xmn option.
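
Putting those options together, a tuning experiment might start with an invocation along these lines; MyApp and the sizes are purely illustrative:

java -verbose:gc -Xmx512m -Xmn128m MyApp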

GC Performance

Object pooling is a straightforward concept -- maintain a pool of frequently used objects and grab one from the pool instead of creating a new one whenever needed. The theory is that pooling spreads the allocation costs over many more uses. This makes sense when the object creation cost is high, or when the pooled object represents a limited and costly resource, as with database connections or threads. However, the number of situations where these conditions apply is fairly small.
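
For the cases where pooling does make sense, the essential shape is small. The sketch below keeps pre-created resources in a blocking queue; the class and its sizing policy are illustrative, and a production pool would also need validation, timeouts, and shutdown handling:

import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(Collection<T> preallocated) {
        // Start with a fixed set of expensive-to-create resources.
        idle = new ArrayBlockingQueue<>(preallocated.size(), false, preallocated);
    }

    T acquire() throws InterruptedException {
        return idle.take();               // block until a pooled resource is free
    }

    void release(T resource) {
        idle.offer(resource);             // return the resource for reuse
    }
}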

Explicit nulling is simply the practice of setting reference variables to null when you are finished with them. The idea behind nulling is that it assists the garbage collector by making objects unreachable earlier. Or at least that's the theory.

Unfortunately, in many cases the above techniques can do more harm than good to your program's performance.

Finalizers

Objects with finalizers (those that have a non-trivial finalize() method) have significant overhead compared to objects without finalizers, and should be used sparingly. Finalizeable objects are both slower to allocate and slower to collect. At allocation time, the JVM must register any finalizeable objects with the garbage collector, and (at least in the HotSpot JVM implementation) finalizeable objects must follow a slower allocation path than most other objects. Collection is slower as well: it takes at least two garbage collection cycles (in the best case) before a finalizeable object can be reclaimed, and the garbage collector has to do extra work to invoke the finalizer. The result is more time spent allocating and collecting objects and more pressure on the garbage collector, because the memory used by unreachable finalizeable objects is retained longer. Combine that with the fact that finalizers are not guaranteed to run in any predictable timeframe, or even at all, and you can see that there are relatively few situations for which finalization is the right tool to use.

If you must use finalizers, there are a few guidelines you can follow that will help contain the damage. Limit the number of finalizeable objects, which will minimize the number of objects that have to incur the allocation and collection costs of finalization. Organize your classes so that finalizeable objects hold no other data, which will minimize the amount of memory tied up in finalizeable objects after they become unreachable, as there can be a long delay before they are actually reclaimed. In particular, beware when extending finalizeable classes from standard libraries.
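
In practice, "hold no other data" usually means splitting the class: a tiny finalizeable object guards just the scarce resource, while the bulk data lives in ordinary objects that can be reclaimed in the first collection. The sketch below is hypothetical; the native methods are placeholders:

class NativeHandle {
    private final long handle;

    NativeHandle(long handle) { this.handle = handle; }

    // Last-resort cleanup: only this tiny object pays the finalization costs.
    @Override protected void finalize() throws Throwable {
        try { freeNative(handle); } finally { super.finalize(); }
    }

    private static native void freeNative(long handle);
}

class ImageBuffer {
    private final NativeHandle handle;                 // the only finalizeable part
    private final byte[] pixels = new byte[1 << 20];   // bulk data, no finalizer

    ImageBuffer(long nativeHandle) { this.handle = new NativeHandle(nativeHandle); }
}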

Explicit Calls

A third category where developers often mistakenly think they are helping the garbage collector is the use of System.gc(), which triggers a garbage collection (actually, it merely suggests that this might be a good time for a garbage collection). Unfortunately, System.gc() triggers a full collection, which includes tracing all live objects in the heap and sweeping and compacting the old generation. This can be a lot of work. In general, it is better to let the system decide when it needs to collect the heap, and whether or not to do a full collection. Most of the time, a minor collection will do the job. Worse, calls to System.gc() are often deeply buried where developers may be unaware of their presence, and where they might get triggered far more often than necessary. If you are concerned that your application might have hidden calls to System.gc() buried in libraries, you can invoke the JVM with the -XX:+DisableExplicitGC option to prevent calls to System.gc() from triggering a garbage collection.

Immutability

Making objects immutable eliminates entire classes of programming errors. One of the most common reasons given for not making a class immutable is the belief that doing so would compromise performance. While this is true sometimes, it is often not -- and sometimes the use of immutable objects has significant, and perhaps surprising, performance advantages.

Many objects function as containers for references to other objects. When the referenced object needs to change, we have two choices: update the reference (as we would in a mutable container class) or re-create the container to hold a new reference (as we would in an immutable container class). Assuming the containing object is small, allocating a new immutable object has some hidden performance advantages that come from the way generational garbage collectors work, having to do with the relative age of objects.

public class MutableHolder {
    private Object value;
    public Object getValue() { return value; }
    public void setValue(Object o) { value = o; }
}

public class ImmutableHolder {
    private final Object value;
    public ImmutableHolder(Object o) { value = o; }
    public Object getValue() { return value; }
}


In most cases, when a holder object is updated to reference a different object, the new referent is a young object. If we update a MutableHolder by calling setValue(), we have created a situation where an older object references a younger one. On the other hand, by creating a new ImmutableHolder object instead, a younger object references an older one. The latter situation, where most objects point to older objects, is much more gentle on a generational garbage collector. If a MutableHolder that lives in the old generation is mutated, every object on the card that contains the MutableHolder must be scanned for old-to-young references at the next minor collection. The use of mutable references for long-lived container objects therefore increases the work done to track old-to-young references at collection time. (The card marking section above describes the write barrier used by the generational collector in current Sun JVMs.)

References

Dynamic compilation and performance measurement

Garbage collection in the HotSpot JVM