Report from Java Virtual Machine Language Summit 2019

The twelfth JVM Language Summit (JVM LS) ended today. As usual, it was a hardcore event with technical presentations on virtual machines and the languages that run on them. As usual, the summit was held in Santa Clara, on the Oracle campus. As usual, far more people wanted to come than there were seats: the number of participants does not exceed 120. As usual, there was no marketing, only the technical guts.

This was my third summit, and every time I attend with great pleasure, despite the terrible jet lag. Here you can not only listen to talks but also get to know people from the JVM world better, take part in informal conversations, ask questions at workshops, and generally feel involved in great achievements.

If you did not attend the summit, don't worry. Most talks are posted on YouTube almost immediately after the summit; in fact, they are already available. To make them easier to navigate, I will briefly describe all the talks and workshops that I managed to attend.

July 29

Ghadi Shayban - Clojure Futures

This talk was not about how futures are compiled in Clojure, as many assumed, but simply about the evolution of the language, the intricacies of code generation, and the problems the developers run into. For example, it turned out that in Clojure it is important to null out local variables after their last use: if the head of a lazily generated list is held in a local variable, then as the list is traversed, the nodes already visited cannot be collected by the garbage collector, and the program may crash with OutOfMemoryError. In general, the C2 JIT compiler itself treats variables as dead after their last use, but the specification does not guarantee this and, for example, the HotSpot interpreter does not do it.
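To make the head-retention problem concrete, here is a minimal Java sketch of the same idea. It is not Clojure's actual generated code; the LazyCell class and the range and sum helpers are hypothetical names invented for illustration. The point is the single line that nulls the head parameter before the traversal starts.

```java
import java.util.function.Supplier;

class LocalsClearingSketch {
    /** One cell of a lazily generated list; the tail is computed on first access and memoized. */
    static final class LazyCell {
        final long value;
        private Supplier<LazyCell> tailSupplier; // dropped once the tail has been realized
        private LazyCell tail;

        LazyCell(long value, Supplier<LazyCell> tailSupplier) {
            this.value = value;
            this.tailSupplier = tailSupplier;
        }

        LazyCell next() {
            if (tailSupplier != null) {
                tail = tailSupplier.get();
                tailSupplier = null;
            }
            return tail;
        }
    }

    /** Lazily produces from, from + 1, ..., to - 1; returns null for an empty range. */
    static LazyCell range(long from, long to) {
        return from >= to ? null : new LazyCell(from, () -> range(from + 1, to));
    }

    static long sum(LazyCell head) {
        LazyCell cursor = head;
        // Explicit "locals clearing": if this slot kept pointing at the first cell
        // (as it may in the interpreter), every realized cell would stay reachable.
        head = null;
        long total = 0;
        while (cursor != null) {
            total += cursor.value;
            cursor = cursor.next();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(range(0, 1_000_000)));
    }
}
```

Whether omitting the assignment actually exhausts the heap depends on the heap size and on whether the method runs interpreted or compiled, which is exactly why a language implementation cannot rely on the JIT here.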

It was also interesting to learn how dynamic dispatch of function calls is implemented. I also learned that until recently Clojure targeted JVM 6 and only recently switched to JVM 8. Now the compiler authors are looking at invokedynamic.

Alan Bateman and Rickard Bäckman - Project Loom Update

Project Loom brings lightweight threads (fibers) to Java. A year ago, Alan and Ron already talked about this project, and back then it seemed that everything was going very well and was almost ready. However, the project has not yet officially landed in Java and is still being developed in a separate fork of the repository. Of course, it turned out that a lot of details needed to be settled.

Many standard APIs, from ReentrantLock.lock to Socket.accept, have already been adapted for fibers: if such a call is made inside a fiber, the execution state is saved, the stack is unwound, and the operating system thread is freed for other tasks until some event wakes the fiber (for example, ReentrantLock.unlock). However, the good old synchronized block, for example, still does not work this way, and it seems this cannot be fixed without a serious refactoring of all synchronization support in the JVM. Stack unwinding also does not work if there are native frames on the stack between the start of the fiber and the suspension point. In both cases nothing blows up, but the fiber does not release its thread.
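The two locking styles contrasted above look like this in ordinary Java. This sketch deliberately uses plain threads, because the fiber API is still changing and I do not want to guess at it; the class and method names are made up for illustration. On a stock JDK both methods behave the same. According to the talk, the difference would only show up inside a fiber: blocking in lock() would free the OS thread, while blocking on the monitor would keep it occupied.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockStylesSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Object monitor = new Object();
    private int viaLock;
    private int viaMonitor;

    /** java.util.concurrent style: per the talk, already adapted for fibers. */
    void incrementWithLock() {
        lock.lock();
        try {
            viaLock++;
        } finally {
            lock.unlock();
        }
    }

    /** Built-in monitor style: per the talk, a blocked fiber would keep its thread here. */
    void incrementWithMonitor() {
        synchronized (monitor) {
            viaMonitor++;
        }
    }

    /** Runs both styles from several plain threads and returns {viaLock, viaMonitor}. */
    static int[] run(int threads, int perThread) throws InterruptedException {
        LockStylesSketch s = new LockStylesSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    s.incrementWithLock();
                    s.incrementWithMonitor();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join(); // join makes the workers' writes visible to this thread
        }
        return new int[] { s.viaLock, s.viaMonitor };
    }
}
```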

There are many questions about how Fiber relates to the old java.lang.Thread class. A year ago, the idea was to make Fiber a subclass of Thread. That has now been abandoned, and Fiber is an independent entity, because emulating all the behavior of an ordinary thread in every fiber is quite expensive. Thread.currentThread() inside a fiber returns a generated stand-in rather than the real thread on which everything is executing. The stand-in behaves quite well, though (it may slow things down). The important idea is never, under any circumstances, to expose inside the fiber the actual carrier thread on which it is running. That would be dangerous, since a fiber can easily move to another thread. The deception will continue.

Interestingly, the project participants have already pushed some preparatory changes into the main JDK repository to make their lives easier. For example, in Java 13 the doPrivileged method was rewritten from native code entirely in Java, which made it roughly 50 times faster. What does this have to do with Loom? This method very often appears in the middle of the stack, and while it was native, fibers with such a stack could not be suspended. One way or another, the project is already paying off.

On the project page you can read the documentation and download the source tree, and there are also binary builds that you can download and play with today. Let's hope that everything will be integrated in the coming years.

Brian Goetz - Workshop "Project Amber"

A workshop on Project Loom ran in parallel, but I went to Amber. We briefly discussed the goals of the project and the main JEPs being worked on: Pattern matching, Records, and Sealed types. Then the whole discussion descended into the narrow issue of scoping. I talked about this at the Joker conference last year; in principle, nothing very new was said. I tried to push the idea of implicit union types, as in if(obj instanceof Integer x || obj instanceof Long x) use(x.longValue()), but I did not see any enthusiasm.
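For comparison, this is roughly how the same check has to be written in today's Java, without pattern matching or union types. The class and method names are hypothetical; the cast to Number works only because Integer and Long happen to share that supertype, which is the kind of reasoning the implicit union type would have done for you.

```java
class UnionTypeWorkaround {
    /** Accepts only Integer or Long and widens the value to long. */
    static long toLong(Object obj) {
        if (obj instanceof Integer || obj instanceof Long) {
            // Both alternatives share the supertype Number, so a single cast covers them.
            return ((Number) obj).longValue();
        }
        throw new IllegalArgumentException("Expected Integer or Long, got: " + obj);
    }
}
```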

Jean Christophe Beyler, Arthur Eubanks and Man Cao - Thread Sanitizing for Java

By all accounts, a wonderful project from Google for finding data races, that is, reads and writes of the same non-volatile field or array element from different threads without an established happens-before relationship. The project was originally written as an LLVM module for native code, and it has now been adapted for HotSpot. It is an official OpenJDK project with its own mailing list and repository.
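A minimal sketch of the kind of code such a tool is meant to flag, next to a correctly synchronized variant. All names here are invented for illustration, and the example says nothing about how TSan itself is invoked. The first counter is incremented from two threads with no happens-before edge between the accesses; the second is only touched while holding a lock.

```java
class DataRaceSketch {
    private int racy;     // plain field, written by two threads with no synchronization
    private int guarded;  // plain field, but only touched while holding `lock`
    private final Object lock = new Object();

    /** Runs `step` perThread times on each of two threads and waits for both. */
    static void hammer(Runnable step, int perThread) throws InterruptedException {
        Runnable loop = () -> {
            for (int i = 0; i < perThread; i++) {
                step.run();
            }
        };
        Thread a = new Thread(loop);
        Thread b = new Thread(loop);
        a.start();
        b.start();
        a.join();
        b.join();
    }

    /** Data race: concurrent read-modify-write, so updates may be lost. */
    int racyTotal(int perThread) throws InterruptedException {
        racy = 0;
        hammer(() -> racy++, perThread);
        return racy;
    }

    /** No race: every access to `guarded` is ordered by the monitor on `lock`. */
    int guardedTotal(int perThread) throws InterruptedException {
        guarded = 0;
        hammer(() -> { synchronized (lock) { guarded++; } }, perThread);
        return guarded;
    }

    public static void main(String[] args) throws InterruptedException {
        DataRaceSketch sketch = new DataRaceSketch();
        System.out.println("racy:    " + sketch.racyTotal(1_000_000));
        System.out.println("guarded: " + sketch.guardedTotal(1_000_000));
    }
}
```

The guarded total is always twice perThread. The racy total can be anything up to that value and varies from run to run, which is why such bugs are so hard to find by testing alone.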

According to the authors, the tool already works quite well: you can build it and play with it. Moreover, it finds races not only in Java code but also in the code of native libraries. It does not look for races in the code of the virtual machine itself, because all synchronization primitives there are written in their own way and TSan cannot detect them. According to the authors, TSan does not give false positives.

The main problem is performance. For Java code, only the interpreter is currently instrumented, so JIT compilation is disabled completely, and the interpreter, which is slow already, becomes several times slower. But if you have enough resources (Google, of course, does), you can occasionally run your test suites under TSan. There are also plans to add instrumentation to the JIT, but that is a much more serious intervention in the JVM.

Someone asked whether disabling JIT compilation affects the result, since some races may not show up in the interpreter. The speaker did not rule this out, but said that they had already found a huge number of races that will take a very long time to sort out. So be careful when running your project under TSan: you may learn an unpleasant truth.

Brian Goetz - Valhalla Update

Everyone is waiting for value types in Java, but no one knows when they will arrive. Still, progress is getting more and more serious. There are already test binary builds for the current L2 milestone. Under the current plans, full Valhalla will arrive at the L100 milestone, but the authors remain optimistic and believe that more than two percent has been done.

So, from the language point of view, we have classes with the inline modifier, which the virtual machine handles specially. Instances of such classes can be embedded in other objects, and flat arrays of inline class instances are also possible. An instance has no header, which means it has no identity: the hash code is computed from the fields, == also compares fields, and an attempt to synchronize on such an instance or call Object.wait() on it throws IllegalMonitorStateException. Writing null to a variable of such a type will not work, of course. However, the authors offer an alternative: if you have declared an inline class Point, you can declare a field or variable of type (surprise, surprise!) Point?, and then it will hold a full-fledged heap object (like boxing) with a header and identity, and null fits there too.
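The inline modifier only exists in the prototype builds, so the sketch below does not use it. Instead it emulates, with an ordinary class on a stock JDK, the field-based equality and hashing that an inline class would get from the virtual machine. The PointLike name is hypothetical, and the flattening and the missing header cannot be emulated this way at all.

```java
import java.util.Objects;

/** An ordinary identity class that mimics field-based equality by hand. */
final class PointLike {
    final int x;
    final int y;

    PointLike(int x, int y) {
        this.x = x;
        this.y = y;
    }

    /** What == would mean for an inline class: a comparison of the fields. */
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof PointLike)) {
            return false;
        }
        PointLike that = (PointLike) other;
        return x == that.x && y == that.y;
    }

    /** Hash code derived from the fields only, never from identity. */
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

With this ordinary class, new PointLike(1, 2) == new PointLike(1, 2) is still false, because == compares identities; the inline class described in the talk would make that comparison go by fields as well.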

The specialization of generics and the migration of existing classes (for example, Optional) to inline classes without breaking existing code remain serious open questions (yes, people do write null into variables of type Optional). Nevertheless, the picture is taking shape, and there is light at the end of the tunnel.

David Wrighton and Neal Gafter - Value Types in the CLR

It was a surprise to me that the very same Neal Gafter, co-author of the original Java Puzzlers, now works at Microsoft on the .NET runtime. It was also a surprise to see a talk about the CLR (that is, the .NET runtime) at JVM LS. But learning from the experience of colleagues from other worlds is always useful. The talk covered the varieties of references and pointers in the CLR, the bytecode instructions used for value types, and how nicely generic functions like reduce get specialized. It was interesting to learn that one of the goals of value types in .NET is interop with native code. Because of this, the layout of fields in value types is strictly fixed and can be mapped onto a native structure without any transformation. The JVM has never had such a requirement; as for what to do about native interop, see below.

Vladimir Ivanov and John Rose - Vectors and the Numerics on the JVM

Again an update of last year's talk. And again the question of why nothing has been released yet, if everything looked quite good a year ago.

A vector is a collection of several numbers that the hardware can represent with a single vector register, such as zmm0 for AVX-512. You can load data from arrays into vectors, perform operations on them such as element-wise multiplication, and store them back. All operations that have matching processor instructions are intrinsified by the JIT compiler into those instructions. The number of operations is simply huge. If an instruction is missing, a slower fallback implementation is used. Ideally, intermediate vector objects are never created, thanks to escape analysis. All the standard computational algorithms vectorize beautifully, using the full power of your processor.

Unfortunately, the authors are having a hard time without Valhalla: escape analysis is fragile and can easily fail. These vectors simply have to be inline classes; then all the problems will disappear. It is unclear whether this API can even be released before the first version of Valhalla, although it looks much closer to ready. Among the problems mentioned was the difficulty of maintaining the code. There are many repetitive pieces for different register sizes and data types, so most of the code is generated from templates, and maintaining it is painful.

Usage is not perfect either. Java has no operator overloading, so the math looks ugly: instead of max(va-vb*42, 0) you have to write va.lanewise(SUB, vb.lanewise(MUL, 42)).lanewise(MAX, 0). It would be nice to have access to the AST of lambdas, as in C#. Then it would be possible to generate a custom operation from a lambda, like MYOP = binOp((va, vb) -> max(va-vb*42, 0)), and use it.
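To spell out what the lanewise chain computes, here is the scalar equivalent in plain Java, one element ("lane") per loop iteration. This is not the Vector API itself, which is not part of a released JDK at the time of writing; the class and method names are made up. It is only the reference semantics that the vectorized version is expected to match.

```java
class LanewiseScalarSketch {
    /** Computes max(a[i] - b[i] * 42, 0) for every index i. */
    static int[] maxSubMul(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            // MUL, then SUB, then MAX: the same order as in the lanewise chain
            out[i] = Math.max(a[i] - b[i] * 42, 0);
        }
        return out;
    }
}
```

The vector version would process several such elements per instruction instead of one per iteration.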

July 30

The second day was all about compilation.

Mark Stoodley - From AOT to JIT and beyond!

An IBM employee and member of the OpenJ9 JVM project talked about their experience with JIT and AOT compilation. Both have their problems. JIT means slow startup because of warm-up, plus CPU time spent on compilation. AOT means suboptimal performance because there is no profile (profiling is possible, but it is non-trivial, and the profile at compile time does not always match the profile at run time); it is also harder to use and ties you to the target platform, OS, and garbage collector. Some of the problems can be solved by combining the approaches: start with AOT-compiled code and then finish the job with JIT. A good alternative to all of this is caching JIT output. If you have many virtual machines (hello, microservices), they all talk to a separate service, the JIT compiler (yes, JITaaS), where everything is done properly, with orchestration and load balancing. This service does the compiling. Quite often it can return ready-made code for a method, because that method has already been compiled for another JVM. This greatly improves warm-up, takes the compilation load off your service's JVM, and generally reduces total resource consumption.

In general, JITaaS could be the next buzzword in the JVM world. Unfortunately, I didn't catch whether you can play with it right now or whether it is still closed development.

Christian Wimmer - Improving GraalVM Native Image

A GraalVM Native Image is a Java application compiled into native code that runs without the JVM (unlike modules compiled with an AOT compiler such as jaotc). More precisely, it is not quite a Java application. To work correctly it needs a closed world: all code must be visible at compile time, with no Class.forName. Reflection and method handles are allowed, but at compile time you have to explicitly specify which classes and methods will be used through reflection.
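Here is a tiny sketch of the kind of code that violates the closed-world assumption. The class and method names are hypothetical. On a regular JVM this works for any class on the class path, because classes are resolved by name at run time; an ahead-of-time image builder cannot know which classes the string will name, so they would have to be declared up front.

```java
class ReflectionSketch {
    /** Instantiates a class known only by name at run time, via its no-arg constructor. */
    static Object instantiate(String className) throws ReflectiveOperationException {
        // The class name is just data here: nothing in the code tells a compiler
        // which class (and which constructor) will be needed.
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        String name = args.length > 0 ? args[0] : "java.util.ArrayList";
        System.out.println(instantiate(name).getClass());
    }
}
```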

Another fun thing is class initialization. Many classes are initialized during compilation. That is, by default your static fields are computed by the compiler and the result is written into the built image, and when you start the application it is simply read back. This is needed for better compilation quality: constant folding becomes possible when the values of static fields are known to the compiler. With JIT everything is fine: the interpreter performs static initialization, and then the code can be compiled with the constants known. When building a native application, you have to resort to tricks. This, of course, leads to fun psychedelic effects. Classes are normally initialized in the order in which they are first accessed, but at compile time this order is unknown, so initialization may happen in a different order. If there are circular references between class initializers, you may see the JVM and the native image behave differently.
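A minimal illustration of why circular initializers are order-sensitive. The class names are invented, and the sketch only shows the behavior on a regular JVM. Whichever class is touched first ends up with the larger value, because the other one sees the default 0 of a class whose initialization is still in progress.

```java
class InitOrderSketch {
    static class A {
        static int VALUE = B.VALUE + 1; // triggers initialization of B
    }

    static class B {
        static int VALUE = A.VALUE + 1; // triggers initialization of A
    }

    /** Touches A before B and returns {A.VALUE, B.VALUE}. */
    static int[] touchAFirst() {
        int a = A.VALUE; // starts A's initializer, which runs B's, which reads A.VALUE as 0
        int b = B.VALUE;
        return new int[] { a, b };
    }

    public static void main(String[] args) {
        int[] values = touchAFirst();
        // On a JVM this prints A=2, B=1; touching B first would give A=1, B=2.
        System.out.println("A=" + values[0] + ", B=" + values[1]);
    }
}
```

If the image builder happened to initialize B first at build time, the values baked into the image would be swapped relative to what the same program computes on a JVM.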

Workshop: Thomas Schatzl - HotSpot GC

We went through all the pain associated with garbage collectors. Unfortunately, I missed most of it. I remember that returning memory to the OS was discussed, including the Xmx option that everyone loathes. There is good news: Java 13 adds a new option, -XX:SoftMaxHeapSize. So far only the ZGC collector supports it, but G1 may catch up. It sets a heap size limit that should not be exceeded except in emergencies, when there is no other way. Thus you can set a large Xmx (say, equal to the size of all RAM) and some reasonable SoftMaxHeapSize. The JVM will then stay within the soft limit most of the time, but under peak load it will not throw OutOfMemoryError and will take more memory from the OS instead. When the load drops, the memory is returned.

Mei-Chin Tsai - JIT and AOT in the CLR

Mei-Chin Tsai from Microsoft talked about the specifics of JIT and AOT compilation in the CLR. They have been developing AOT compilation for a long time, but initially (ngen.exe) it ran on the target machine, roughly at first launch (if you have Windows, look for *.ni.dll files in the Windows folder). The resulting files depend on the version of the local Windows and even on other DLLs. Accordingly, if a dependency is updated, all native modules must be recompiled. The second generation (crossgen) introduced precompilation by application and module authors that is relatively independent of hardware, OS version, and dependencies. This slowed the code down, because calls into dependencies now had to be truly virtual. That problem was solved by bringing in the JIT and recompiling hot code while the application runs. The talk then covered tiered compilation (it seems to be in its infancy in the CLR, while Java has been developing it for at least ten years) and future plans to make AOT truly cross-platform.

Wei Kuai and Xiaoming Gu - Speed JVM Performance with JWarmUp

Colleagues from Alibaba presented their approach to the JVM warm-up problem. They use the JVM for many web services. In principle, very fast startup is not that important, because the load balancer can always wait until the machine has booted and only then start sending requests to it. The problem, however, is that the machine does not warm up without requests: the code implementing the request-handling logic is not called, so it does not get compiled. It is compiled when the first requests arrive, so no matter how long the balancer waits, performance dips on the first requests. Previously they tried to solve this by sending fake requests to a starting service before routing real requests to it. The approach is interesting, but it is rather hard to generate a fake stream that triggers compilation of all the necessary code.

Deoptimization is a separate problem. Suppose that in the first thousand requests some if always took the first branch, so the JIT compiler dropped the second branch altogether and put a deoptimization trap there to reduce code size. Then the 1001st request takes the second branch, deoptimization kicks in, and the whole method goes back to the interpreter. While statistics are gathered again, and while the method is compiled by the C1 compiler and then, with a full profile, by the C2 compiler, users experience a slowdown. And later another if in the same method can deoptimize, and everything starts all over again.

JWarmUp solves the problem as follows. During the first run of the service, a compilation log is recorded for several minutes: it captures which methods were compiled along with the necessary profiling information on branches, types, and so on. When the service is restarted, all classes from the log are initialized immediately after startup, and the logged methods are compiled using the earlier profile. As a result, the compiler works hard at startup, after which the balancer starts sending requests to this JVM. By that time all of its hot code has already been compiled.

It is worth noting that this does not solve the fast startup problem. Startup can be even slower, because many methods are compiled, some of which may only be needed minutes after the start. But the log turns out to be reusable: unlike with AOT, you can bring the service up on a different architecture or with a different garbage collector and reuse the earlier log.

The authors have been trying to get JWarmUp into OpenJDK for a long time. So far without success, but the work is moving along. The main thing is that a complete patch is freely available on the Code Review server, so you can easily apply it to the HotSpot sources and build a JVM with JWarmUp yourself.

Juan Fumero - TornadoVM

This is a research project from Manchester, but the authors claim that it is already being used in some places. It is also an add-on for OpenJDK that makes it quite easy to offload certain Java code to a GPU, iGPU, or FPGA, or simply to parallelize it across the cores of your processor. To compile for the GPU, they use GraalVM, into which they have built their own backend, TornadoJIT. A properly written Java method is transparently moved to the appropriate device. True, they say that compiling for an FPGA can take several hours, but if your task takes a month to compute, then why not. Some benchmarks (for example, the discrete Fourier transform) run more than a hundred times faster than plain Java, which is to be expected. The project is fully available on GitHub, where you can also find research publications on the topic.

Maurizio Cimadamore - Deconstructing Panama

The same old song: a long-running project, a presentation at every summit, everything looked pretty much ready a year ago, and yet there is still no release. It turned out that the focus has shifted since then.

The idea of the project is better interop with native code. Everyone knows how painful it is to use JNI. Very painful. Project Panama reduces this pain to nothing: jextract generates Java classes from the *.h files of a native library, and these classes make calling native methods quite convenient. On the C/C++ side there is no need to write a single line at all. On top of that, everything has become much faster: the overhead of Java->native and native->Java calls has dropped severalfold. What more could you want?

There is one long-standing problem: passing arrays of data to native code. The recommended way so far is DirectByteBuffer, which has a lot of problems. One of the most serious is the unmanaged lifetime (the buffer goes away only when the garbage collector picks up the corresponding Java object). Because of this and other problems, people use Unsafe, which, with due diligence, can easily crash the entire virtual machine.

This means we need a new, proper way to access memory outside the Java heap: allocation, structured accessors, explicit deallocation. Structured accessors mean you do not have to compute offsets yourself if you need to write, for example, struct { byte x; int y; }[5]. Instead, you describe the layout of this structure once and then create, for example, a VarHandle that can read all the x values while skipping over y. Of course, there must always be a bounds check, as with ordinary Java arrays. In addition, access to an already closed region must be prohibited. And this turns out to be a non-trivial task if we want to keep performance at the Unsafe level and allow access from multiple threads. In short, watch the video, it is very interesting.
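To show what "calculating the offsets yourself" means, here is a sketch of the struct { byte x; int y; }[5] example done by hand with today's direct ByteBuffer. It does not use the new Panama API. The class name is hypothetical, and the layout constants are an assumption based on the usual C alignment rules (x at offset 0, three bytes of padding, y at offset 4, eight bytes per element); a real native library may lay the struct out differently.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class StructArraySketch {
    // Assumed layout of struct { byte x; int y; }, not taken from any real header.
    static final int X_OFFSET = 0;
    static final int Y_OFFSET = 4; // after x and 3 bytes of alignment padding
    static final int STRIDE = 8;   // size of one element including padding
    static final int COUNT = 5;

    /** Allocates off-heap storage for the whole array in native byte order. */
    static ByteBuffer allocate() {
        return ByteBuffer.allocateDirect(COUNT * STRIDE).order(ByteOrder.nativeOrder());
    }

    static void setX(ByteBuffer buf, int index, byte value) {
        buf.put(index * STRIDE + X_OFFSET, value);
    }

    static byte getX(ByteBuffer buf, int index) {
        return buf.get(index * STRIDE + X_OFFSET);
    }

    static void setY(ByteBuffer buf, int index, int value) {
        buf.putInt(index * STRIDE + Y_OFFSET, value);
    }

    static int getY(ByteBuffer buf, int index) {
        return buf.getInt(index * STRIDE + Y_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer points = allocate();
        for (int i = 0; i < COUNT; i++) {
            setX(points, i, (byte) i);
            setY(points, i, i * 1000);
        }
        // Reads every x while skipping over y, using hand-computed offsets.
        for (int i = 0; i < COUNT; i++) {
            System.out.println(getX(points, i) + " " + getY(points, i));
        }
    }
}
```

The buffer does give bounds checks, but the offsets are easy to get wrong and the memory is freed only when the garbage collector gets to it. Those are the two things the structured accessors and explicit deallocation are meant to fix.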

Workshop: Vladimir Kozlov - Metropolis Project

The Metropolis project brings together all the attempts to rewrite parts of the JVM in Java. Its main part today is the Graal compiler. It has matured a lot over the past years, and there is already serious talk of fully replacing the aging C2. There used to be a bootstrap problem: Graal started slowly, because it had to be JIT-compiled or interpreted itself. Then AOT compilation appeared (yes, the main goal of the AOT compilation project is to bootstrap Graal itself). But with AOT, Graal eats up a decent chunk of the heap of the Java application, which may not want to share its heap at all. Now they have learned to turn Graal into a native library using Graal Native Image, which finally made it possible to isolate the compiler from the shared heap. The peak performance of code compiled by Graal is still a problem on some benchmarks. For example, Graal lags behind C2 in intrinsics and vectorization. However, thanks to very powerful inlining and escape analysis, it simply crushes C2 on functional code, where many immutable objects and many small functions are created. If you write in Scala and are still not using Graal, go and use it. Moreover, in recent JDK versions this is quite trivial, a matter of a couple of flags; everything is already in the box.

July 31

Kevin Bourrillion - Nullness Annotations for Java

Kevin announced a new project but asked us not to talk about it publicly and not to post a recording of his talk on YouTube. So sorry.

Dmitry Petrashko - Gradual typing for Ruby at Scale with Sorbet

Sorbet is a gradual type checker for Ruby, developed at Stripe, which has a large Ruby codebase.

Lightning Talks

- . Remi Forax , , . , :


Erik Meijer - Differentiable Programming

The talk touched on ML and AI and mentioned Getafix, a tool from Facebook.

Next up: the OpenJDK Committer Workshop.

