Thousands of things to fix in Java since version one: a long interview with Oracle's Sergey Kuksenko

Sergey Kuksenko is a performance engineer who has worked with Java since version 1.0. Over that time he has taken part in developing mobile, client and server applications as well as virtual machines. He has been working on Java performance since 2005 and currently works at Oracle on improving JDK performance. He is one of the most popular speakers at Joker and JPoint.

This post is a long interview with Sergey, covering the following topics:

Cult of Performance;
When and what needs to be optimized, the initial design of the language and library;
Promising areas for further optimization;
How to participate in the development and what can be broken by optimizations;
Compiler tricks, register allocation;
Is it possible to assemble a cat from minced meat;
When tests run for five days straight, and other routine;
How to become a performance engineer;
Preparing a talk for the next Joker.

About the cult of performance

Oleg: You’re a longtime speaker of ours, and this isn’t our first interview. Tell me a little: who are you these days, and what do you do?

Sergey: I’m the same person I was many years ago, and I do the same thing. I work on the Java Performance team and am responsible for the performance of Oracle’s Java machine, OpenJDK.

Oleg: Then I have a slightly trollish question: you’re a performance engineer, and your talks are about all kinds of performance. Don’t you think the performance issue is somewhat overrated? Everyone fusses over it, but is it even necessary?

Sergey: That’s a good question. It depends on how you look at it. On the one hand, this kind of attention from the audience may well be excessive. On the other hand, from a business standpoint, performance is money.

It is real money that people spend on hardware and on clouds at Amazon. If you don’t process your requests fast enough, that’s it: you lose customers, lose money, lose everything else. So the demand for performance is, of course, always there. The question is how important it is in each particular case. Not to mention high-frequency trading.

Oleg: By the way, do you think Java is suitable for this?

Sergey: Have you ever had a chance to meet Peter Lawrey?

Oleg: Is that the CEO of Chronicle Software, the developers of OpenHFT?

Sergey: He’s a very well-known guy from London who speaks at a lot of conferences. They do high-frequency trading in Java and get along just fine.

Oleg: Are they doing it in pure Java, or is it native code called from Java? There is a difference, after all.

Sergey: I don’t know it at that level of detail; he didn’t say. In principle, if you want to, everything you need can be achieved in Java itself.

Oleg: Interesting. If you take the Python community, for example, the cult of performance is much weaker there. How did it come about in our community? Maybe you provoked it with your talks? You, Shipilev, Pangin, Ivanov and so on.

Sergey: I don’t know how it happened. The cult of performance is much stronger at Russian conferences than at American ones. Maybe this reflects the audience itself. Here, people want to work on performance more because they find it interesting. In America, they would rather work on whatever pays more. But that’s a hypothesis, guesswork. It just turned out that way.

When and what needs to be optimized

Oleg: You said there is always a demand for performance. At what point do you need to start thinking about it? Only once thunder strikes?

Sergey: That’s a general, abstract question. It’s better to go back to Alexey Shipilev’s keynote from one of the previous conferences, where he laid all this out well enough.

Oleg: Yes, I remember the “curve named after Sh.”

Sergey: You need to deal with performance right away, but it depends at what level. You don’t have to write benchmarks immediately. It is well known, for example, that a trivial API-level choice between Set as a plain set and SortedSet already imposes fundamental algorithmic constraints on us.

If we put SortedSet into the API (even though nobody needs it to be sorted) and it then spread all over our system, pulling it back out will be long and painful.
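A minimal sketch of the Set versus SortedSet point. All type and method names here (LooseDirectory, StrictDirectory, activeUsers and so on) are hypothetical and not taken from the interview. When the API promises only Set, the implementation stays free to use a hash-based set; once SortedSet appears in the signature, every implementation has to maintain ordering, and callers may start depending on it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical API, for illustration only.
interface LooseDirectory {
    // Promises only "a set": the backing implementation can change later.
    Set<String> activeUsers();
}

interface StrictDirectory {
    // Promises ordering: callers may come to rely on first(), headSet(), ...
    SortedSet<String> activeUsers();
}

class HashDirectory implements LooseDirectory {
    private final Set<String> users = new HashSet<>();

    void add(String name) { users.add(name); }

    @Override
    public Set<String> activeUsers() { return users; }
}

class TreeDirectory implements StrictDirectory {
    // The SortedSet contract forces an ordered structure (here a TreeSet,
    // with O(log n) insertion instead of the expected O(1) of a HashSet).
    private final SortedSet<String> users = new TreeSet<>();

    void add(String name) { users.add(name); }

    @Override
    public SortedSet<String> activeUsers() { return users; }
}
```

With LooseDirectory, swapping HashSet for another Set implementation is invisible to callers. With StrictDirectory, any code that calls first() or headSet() would break if the ordering guarantee were ever dropped.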

The question starts at the design level: it is a question of minimal constraints. You should commit to the weakest constraints possible so that you have room to maneuver later. For example, when I tinkered with various parts of Java, some extremely bad words came to mind. I would want to do something with one of the base classes, but I couldn’t, because the API is fixed: it can no longer be changed, it is already out in the wild. And to pull off some trick and speed things up, you need to hide some details.

Case study: I once tinkered with the java.math.BigDecimal class. There was strong demand from various sides to speed it up somehow. There is a fairly common special case where our BigDecimal is not really “Big”, it is just a Decimal, and you need to compute with it.

Now, of course, a suitable wrapper has been made for this. But if BigDecimal had no public constructor sticking out, only static methods and factories, we could make BigDecimal abstract and return two different implementations, each working the way it needs to. That is impossible because the constructor is exposed. As a result, we have to do an extra runtime check inside, which lets us take a fast path in some cases.
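The design Sergey describes can be sketched roughly as follows. This is not how java.math.BigDecimal is implemented: the Decimal, SmallDecimal and LargeDecimal types are hypothetical, and the scale is left out to keep the sketch short, so it handles integers only. The point is that with no public constructor, a static factory can pick the representation once, at creation time.

```java
import java.math.BigInteger;

// Hypothetical type for illustration; not java.math.BigDecimal.
abstract class Decimal {
    // No public constructor: callers must go through the factory.
    Decimal() {}

    static Decimal of(BigInteger unscaled) {
        // bitLength() excludes the sign bit, so < 64 means "fits in a long".
        if (unscaled.bitLength() < 64) {
            return new SmallDecimal(unscaled.longValue());
        }
        return new LargeDecimal(unscaled);
    }

    abstract Decimal add(Decimal other);

    abstract BigInteger toBigInteger();
}

final class SmallDecimal extends Decimal {
    private final long value;

    SmallDecimal(long value) { this.value = value; }

    @Override
    Decimal add(Decimal other) {
        if (other instanceof SmallDecimal) {
            long o = ((SmallDecimal) other).value;
            long sum = value + o;
            // Overflow happened only if both operands differ in sign from the sum.
            if (((value ^ sum) & (o ^ sum)) >= 0) {
                return new SmallDecimal(sum);
            }
        }
        // Slow path: fall back to arbitrary precision.
        return Decimal.of(toBigInteger().add(other.toBigInteger()));
    }

    @Override
    BigInteger toBigInteger() { return BigInteger.valueOf(value); }
}

final class LargeDecimal extends Decimal {
    private final BigInteger value;

    LargeDecimal(BigInteger value) { this.value = value; }

    @Override
    Decimal add(Decimal other) {
        return Decimal.of(value.add(other.toBigInteger()));
    }

    @Override
    BigInteger toBigInteger() { return value; }
}
```

Callers only ever see Decimal, so a third representation could be added later without breaking anyone.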

Oleg: Does it follow that, when developing a standard library, it’s worth giving up constructors and using builders everywhere?

Sergey: It’s too late now.

Oleg: If it weren’t too late, would it be a good idea?

Sergey: It would give more room for maneuver. Look: we write new, and this new sits outside the constructor. That gives two operations: first we create an object, then we call the constructor that fills it in. Sometimes it would be very useful to hide the object creation itself and create a different object from the one requested on the outside. This is a language restriction that goes back to the early days of Java.

Oleg: Well, these days everyone uses DI frameworks, which let you plug in proxies however you like and add anything, bypassing this limitation. Could something like that, a built-in dependency injection container, have been added in the original design of the language?

Sergey: I have a very specific opinion about the initial design of the language. If you recall the history of Java 1.0, it came out under quite serious time pressure; everything had to be done quickly.

There are thousands of things that I personally would like to see fixed, going back to the very first version. But I’m afraid that if even one, two or three of those thousand had been picked and work on them had started back at the time of the first Java, Java would never have shipped. It is the standard example of the best being the enemy of the good.

What else can be optimized in Java

Oleg: Ordinary people can only fix things in their own project, while you, as JDK performance engineers, affect hundreds of thousands of projects at once. So the question is: after more than 20 years of Java development, are there still areas in the JDK where intervention by core engineers can have a noticeable effect? And how noticeable is this “noticeable effect”?

Sergey: Firstly, Java now runs on very different hardware from, say, 10 years ago. Today’s hardware and the hardware of 10 years ago are two very different things, so it makes sense to do various optimizations.

Secondly, it is of course wonderful when a performance engineer sits down, speeds something up, gets huge numbers, reports to his bosses and secures a bonus for those speedups. But a huge amount of work goes into new projects. A feature is being added, and the performance engineer’s task is not to speed the feature up but to make sure that everything is OK with it. Or, if it is not OK, to come up with a fix.

Oleg: How can you make sure? You don’t formally verify the code. What does “make sure” mean?

Sergey: Making sure that everything is OK from a performance point of view comes down to the subjective expert opinion of a performance engineer, who writes a report saying that “everything is fine with this feature”. Depending on the size of the feature, this sometimes takes very little work and sometimes a great deal of effort. It starts with simply sitting down, looking at what is being done there, benchmarking that area, running the benchmarks, seeing what comes out, and making a reasonable, informed decision.

Oleg: And in terms of performance and new features, is Java moving forward at all? Is there anything going on there? Because our hardware hasn’t changed much, for example, if we talk about Intel.

Sergey: Over what period hasn’t it changed?

Oleg: For example, the last 10 years.

Sergey: Well, was there AVX-512 on the hardware of a decade ago?

Oleg: No. And it probably isn’t always present on modern hardware either?

Sergey: I certainly don’t have it. We have it in our lab, but those machines are all taken by the compiler folks. They are still wiring up support for it, so I haven’t looked.

Oleg: Can AVX-512 support be considered an example of a typical feature?

Sergey: Probably, yes. As for what I do myself: we had a large body of work driven by modern requirements to add new cryptographic algorithms. This is an area where ten-year-old cryptographic algorithms simply cannot be relied on anymore. We need new algorithms and longer keys. And new cryptographic algorithms are being added, I would say, constantly.

Oleg: Are they hardware-accelerated in some way?

Sergey: It all depends on the specific algorithm. Some algorithms accelerate very well. By the way, 10 years ago this would not have worked on Intel hardware, but about 5-6 years ago good instructions appeared, up to hardware-accelerated AES blocks. All of that was implemented with minimal delay.

Oleg: What about GPUs? They can multiply matrices too, can’t they?

Sergey: GPUs are a separate conversation. For that we have the Panama project, where all this work is being done, and someday it will reach the Java mainline with all the goodies.

Oleg: I have a couple of acquaintances who work, roughly speaking, in financial mathematics. At some point they always switch to C++ for computation and claim that it is very inconvenient to use all these optimizations and hardware features from a managed platform. Can this be improved?

Sergey: We also see strong demand for this, and there are a number of internal requirements. For example, to make things work better in the field of machine learning. As a rule, it comes down to plain matrix multiplication, which can be offloaded to the GPU. Work on this is ongoing, let’s put it that way.

We have two large umbrella projects, Valhalla and Panama, which are meant to gather features like GPU support. At the junction of Valhalla and Panama sits the Vector API, which works with our SIMD / SSE / AVX instructions directly from Java code. That, and Valhalla itself with its inline types, are all big steps in that direction.

What can be broken by optimization, how to participate in development

Oleg: The umbrellas you mentioned are similar to each other. Can one project affect another, including in terms of code and performance profile? For example, you refactored something on your side, and poor Ron Pressler sits in a corner in the evening, shedding tears and fixing his tests?

Sergey: This happens all the time. A concrete example is the Vector API. For the Vector API to work well, our native vectors must eventually become value types, or inline types, as they are now called in Java. You could make a workaround in HotSpot and implement it somehow, but we want a general solution. On the other hand, the key feature of inline types is precisely that you don’t have to care about the layout of the data, while for the Vector API the layout of the data is extremely important.

That is because it maps directly onto AVX-512 and the like. Clearly some extra work is needed, some optimizations that would leave the inline type a normal type but give it a hardware-bound layout. Naturally, the projects overlap. If you look at the groups of people driving Panama and driving Valhalla, they overlap by more than half.

Oleg: Purely organizationally: say you have a project with some performance problem, but it sits at the junction of several projects. What do you do next? How is it resolved? It turns out to be a trade-off between projects and people rather than between abstract tasks.

Sergey: Everything is very simple here: if it is a performance problem with a feature that is still being designed, you go to the people designing it and say: “Here’s the situation, what are we going to do? Let’s do it differently.” A discussion begins, and the problem gets solved.

If the code already exists and already works, then ideally you fix the problem yourself. If you can’t fix it fully, you build a prototype, go to the code owner again and say: “Here’s the prototype, what are we going to do?” From there it is decided case by case.

Oleg: There are also interested people who cannot take part in this process: the end users.

Sergey: The only sense in which they cannot take part is that Oracle won’t pay them a salary for it. If you don’t need the salary, come to OpenJDK and participate.

Oleg: How realistic is that? OpenJDK is full of damn geniuses like you; where are ordinary people, and where are you? Suppose something is slow for me. What should I do, and how?

Sergey: If you don’t know what the problem is, that’s a separate matter: whether anyone will look for a solution for you depends on the area, the example, and so on. Even if you don’t know what the problem is, it probably still makes sense to write to OpenJDK and ask. If it is something that immediately clicks in someone’s head, people will pick it up. If nobody finds it interesting, it will hang there unanswered.

Oleg: Suppose I know the problem and even know what needs to be fixed.

Sergey: If you know the problem, you come to OpenJDK, sign all the necessary paperwork and propose a patch; it gets reviewed and merged.

Oleg: Is it that simple?

Sergey: Well, yes, with a little bureaucracy and a bit of waiting. Yesterday Tagir (lany) picked up a small fix that I had abandoned. He just wanted it to be finished, so he started polishing it himself. He says: “Damn, what’s going on, I’ve done everything, posted it, and nobody is reviewing it.” Well, yes, nobody is reviewing. It’s July, and half of the Java office is on vacation. They’ll come back from vacation and do it.

Oleg: Are vacations in the USA at about the same time of year as they usually are in Russia?

Sergey: No, the vacation system in the USA is completely different from the one in Russia. Firstly, vacations are significantly shorter. Also, in the USA the vacation system is tied to schools. When your kids are out of school, that’s when you take vacation. As soon as the school holidays begin, all of America is on the move. And since classes here end in mid-June and start again in mid-August, the window for vacations is not that big, only two months.

Compiler tricks, register allocation

Oleg: Has it ever happened that you optimized something on your side, and after that users had to write their code differently? Roughly speaking, if the substring operation used to reference a range of the original string and now makes a full copy, that change alters the way you write code.

Sergey: Surely it has, but I’m not going to give specific examples right now. The question is what people rely on when they write code. If they need to squeeze out maximum performance and use all kinds of compiler-specific tricks to get it, they should be prepared for the compiler to evolve over time, and they will have to keep adjusting their code to the current state of the compiler. And that is still the mild case.

Suppose that, some 20 years from now, Graal suddenly becomes the main compiler for HotSpot: then those poor guys will have to rewrite everything. This only happens if you have taken on that kind of technical debt, namely tracking changes in the compiler. It is much easier to write correct code with no direct ties to the compiler, using more or less normal, general implementations.

By the way, about compilers, and not just Java compilers but compilers in general. There is Moore’s law, which is not a law at all but simply an empirical observation that the number of transistors doubles every year and a half.

And there is a law of exactly the same kind (Proebsting’s Law): the performance of unmodified code grows by 4 percent every one and a half to two years. That 4 percent is what end users get for free purely from the evolution of compilers. Not from hardware, but from compilers.

Oleg: I wonder where those percentages come from. Is it some kind of initial inefficiency? But then that stock of inefficiencies will run out someday.

Sergey: No, it’s just a matter of technology developing. I left compilers when I started working on performance. But I did work on them once, and the biggest discovery for me was made in 2005 or 2006. I only found out about it in 2008, because I hadn’t read the paper in time.

A very important task in any code generator is register allocation. It is known that in the general case this problem is NP-complete. It is very hard to solve, so all compilers run some approximate algorithm, of varying quality.

And then a paper comes out in which the authors prove that in certain cases, which cover a huge number of compilers and a huge number of internal representations with certain restrictions, an exact polynomial algorithm exists for the register allocation problem. Hooray, let’s go!

That was in 2005; compilers written earlier could not have known about it.
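To illustrate how a restriction can turn a hard problem into an easy one, here is a minimal sketch of one well-known restricted case. It is an assumption chosen for illustration, not the algorithm from the paper Sergey mentions. In straight-line code each value’s live range is an interval, and coloring intervals greedily in order of their start points uses the minimum possible number of registers. LiveRange and IntervalAllocator are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model: a value is live on the half-open interval [start, end).
final class LiveRange {
    final String name;
    final int start;
    final int end;
    int register = -1; // filled in by the allocator

    LiveRange(String name, int start, int end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }
}

final class IntervalAllocator {
    // Assigns registers greedily in order of start point and returns the
    // number of registers used. For interval graphs this order is optimal.
    static int allocate(List<LiveRange> ranges) {
        List<LiveRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((LiveRange r) -> r.start));

        // Ranges that currently hold a register, ordered by where they end.
        PriorityQueue<LiveRange> active =
                new PriorityQueue<>(Comparator.comparingInt((LiveRange r) -> r.end));
        Deque<Integer> free = new ArrayDeque<>();
        int used = 0;

        for (LiveRange r : sorted) {
            // Release the registers of ranges that ended before this one starts.
            while (!active.isEmpty() && active.peek().end <= r.start) {
                free.push(active.poll().register);
            }
            // Reuse a freed register if there is one, otherwise take a new one.
            r.register = free.isEmpty() ? used++ : free.pop();
            active.add(r);
        }
        return used;
    }
}
```

The whole thing runs in O(n log n). Real compilers deal with control flow, spilling and register constraints, none of which this sketch attempts.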

Oleg: Now you can make a new allocator for Java?

Sergey: Now that a theoretical solution exists, the allocator can be rewritten. I didn’t go into the details, but I know that the guys from Excelsior implemented the algorithm.

Oleg: We recently interviewed Cliff Click, and he talked about the insanely complex and insanely ingenious allocator he wrote for Java. Don’t you want to write another one?

Sergey: No.

Oleg: Is the existing one any good?

Sergey: No, it isn’t. From my utilitarian point of view, I look at the assembly and sometimes see: “Well, yes, the registers were allocated badly here.” If I go and kick our compiler folks and we rewrite the allocator, what do we get? We get some gain, but I’m unlikely to see it anywhere except in those examples where I noticed the inefficient register allocation. As long as there are no huge failures in this area, there is always something else to work on that yields a bigger gain.

Oleg: Are there areas of work in the JDK where all this under-the-hood compiler or performance magic breaks through to the surface? You say that you just need to write normal code and everything will be fine, but that sounds suspicious.

Sergey: Everything will be fine until you need something super-duper fast. If you need it really fast, be prepared to keep rewriting. Right now, if you take an abstract large application, by and large the way the code is written plays almost no role in terms of performance.

On the one hand, once the garbage collector kicks in, it eats its 10-20%; on the other hand, the application architecture starts to show. The huge problem I have seen in a lot of applications is that they keep moving data around. We took data from here, moved it there, did some transformations there. In general, every program does just that: it moves data from one place to another in some way. But if your program moves data around too much, the compiler will not help.

Oleg: You could try to track some simple things, like: this piece of memory changes owners and moves between objects in this direction.

Sergey: No, this is a design issue. The data isn’t just moved, it is moved with modifications; something is done to it along the way. The biggest benefit in real, large applications comes from asking: is all this moving really necessary? Doing seven moves instead of ten is already good.
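A toy sketch of what “fewer moves” can mean in practice. The Shuffling class and its methods are hypothetical and not from the interview. Both methods produce the same result, but the first copies every element into two intermediate lists along the way, while the second does the same transformations in a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class Shuffling {
    // Three steps, two intermediate lists: the data is copied at every step.
    static List<String> manyCopies(List<String> input) {
        List<String> trimmed = new ArrayList<>();
        for (String s : input) {
            trimmed.add(s.trim());
        }

        List<String> nonEmpty = new ArrayList<>();
        for (String s : trimmed) {
            if (!s.isEmpty()) {
                nonEmpty.add(s);
            }
        }

        List<String> upper = new ArrayList<>();
        for (String s : nonEmpty) {
            upper.add(s.toUpperCase(Locale.ROOT));
        }
        return upper;
    }

    // Same result in one pass, with no intermediate collections.
    static List<String> onePass(List<String> input) {
        List<String> result = new ArrayList<>(input.size());
        for (String s : input) {
            String t = s.trim();
            if (!t.isEmpty()) {
                result.add(t.toUpperCase(Locale.ROOT));
            }
        }
        return result;
    }
}
```

In a real system the moves are usually spread across layers (DTOs, mappers, caches), which is why Sergey calls it a design question rather than something a compiler can fix. Whether the single-pass version is measurably faster in a given application would still need to be benchmarked.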


Assembling a cat from minced meat

Oleg: We recently held the Hydra conference on distributed computing. Many people there care a great deal about things like a cost model: determining the cost of each operation, very granularly and very precisely. People really want to write out all the instructions, add up the cost of each, and see how many cycles the code will take. I wonder how well this approach works in today’s reality. And if it doesn’t work, what should one do?

Sergey: Well, in half a percent of cases it may even work. Let me try to explain. I would refer to one of my old presentations, where I showed a meat grinder. Meat goes in and minced meat comes out. The minced meat comes out in parallel, and if the kitchen is large, several meat grinders are working there, with all those pieces of meat distributed among them. How are you going to count cycles per second and all that?

Oleg: Task: "how to assemble a cat from minced meat."

Sergey: Something like that, yes. You need minced meat, and you get it at the output. But how did it happen, in what order? There is no order. Modern hardware is a dynamic system, from the processor all the way up to clusters. It is one big dynamic system that is constantly being tuned and optimized on the fly, and it is impossible to calculate.

Oleg: And if you try to make it work linearly, will that cost performance?

Sergey: If you force it to work linearly, what does that give us? It will be predictable, but it will hurt speed. For how many years have we seen no gigahertz growth in processors?

Oleg: Well, somewhere around five?

Sergey: I’d say they haven’t really grown for ten years either, just fluctuations at the marketing level. And yet processors somehow keep getting faster. How? Because the meat grinder has holes that the meat comes out of, and the next processor has more holes, so the meat gets through faster. Provided, that is, that you turn the handle fast enough and feed in the right amount of meat.

Oleg: The number of cores is growing too.

Sergey: It isn’t only about cores: processor performance is growing even per single core. Not as fast as during the gigahertz race, but it is growing. New features appear in microarchitectures, some hardware solutions, and things get a little faster.

Five days for tests and other tasks of a performance engineer

Oleg: You probably build OpenJDK three times a day. How realistic is it to speed up a project build with hardware? C++ probably isn’t fast to build. If you buy the most top-end desktop processor, will it help at all?

Sergey: I honestly don’t know. I have never worried about that question. On my laptop, building OpenJDK takes 15 minutes. And I always have something else to do in parallel for those 15 minutes.

Oleg: But then it turns out that you need to think everything through very carefully and only send for a rebuild what is truly ready. You can’t change two characters and rerun all the tests.

Sergey: But that doesn’t matter! I don’t send things for a rebuild like that, because my changes are always somewhat bigger than two characters. I never find myself changing two characters, making a build, running it, looking at it, then a minute later changing two characters again and rebuilding. It can happen, but I personally haven’t run into it. My usual routine is this: I make a build, then sit down, write a script (or don’t write one, because I already have it), and launch a run that takes three hours at best. Sometimes it finishes overnight, but there are things you have to wait five days for.

Oleg: Oh, and what could it be?

Sergey: For example, a minimal, very limited run of the entire Java microbenchmark suite.

Oleg: When is this needed?

Sergey: For example, this is necessary for the Valhalla project.

Oleg: So you changed something in the project, no matter what. How do you understand how far the effect of your changes has spread? How many tests do you need to run? You can’t run all tests with all profiles on all the hardware in the world.

Sergey: You need to know what has changed. A local example: I worked on a cryptographic algorithm. I took these algorithms, looked at them, wrote benchmarks and saw that they could be sped up. I proposed a fix, it was confirmed, and that was it. We stay within those benchmarks: here are the 2-3 algorithms affected by that piece of code, here are various key lengths and input sizes to get a rough overall picture. We ran them, and that was it: short and simple.

Or take our wonderful inline types, which sooner or later will reach mainline Java and which I will talk about at Joker. There the situation is completely different. It is known that some basic Java operations need changes in semantics there, with additional runtime checks appearing. That is already a big red flag: runtime checks cannot cost zero. But will it even be visible in real life? A small check, executed in parallel with other work on a modern out-of-order processor, is next to nothing.

But that is guesswork, and it has to be verified. For example, you need to look for regressions. And what do you look for them on? We have about two dozen large basic benchmarks and about two thousand small ones. Let’s run them all and see. Run the baseline, run the modified version, and it all runs for 5-6 days; then we sit for another week and simply compare the results to see whether there is a regression. It’s fine if you see that something got 2 times slower: then you urgently need to kick someone and look at what happened. But if 3% shows up on some microbenchmark, you bang your head against the wall. Where did that 3% come from?
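The comparison step can be sketched as follows. This is a deliberately naive illustration with hypothetical names (RegressionCheck, findRegressions), not Oracle’s tooling. It assumes throughput-style scores where higher is better, and it ignores run-to-run noise, which is exactly what makes a 3% difference so hard to interpret in practice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RegressionCheck {
    // Relative change of the modified build versus the baseline.
    // Negative values mean the modified build is slower.
    static double relativeChange(double baseline, double modified) {
        return (modified - baseline) / baseline;
    }

    // Returns the benchmarks whose score dropped by more than the threshold
    // (for example 0.03 for 3%), mapped to their relative change.
    static Map<String, Double> findRegressions(Map<String, Double> baseline,
                                               Map<String, Double> modified,
                                               double threshold) {
        Map<String, Double> regressions = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            Double after = modified.get(e.getKey());
            if (after == null) {
                continue; // benchmark missing from the modified run
            }
            double change = relativeChange(e.getValue(), after);
            if (change < -threshold) {
                regressions.put(e.getKey(), change);
            }
        }
        return regressions;
    }
}
```

A realistic version would compare distributions across repeated runs rather than single scores, since a difference within the noise band is not evidence of a regression.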

Oleg: Did I understand correctly that you determine the right place intuitively?

Sergey: We don’t have intuition, we have regression testing. Our performance regression testing has several levels. Naturally, for frequent commits and promotions we run a small, specially selected subset, and for rarer events a large one.

We have a database with all the trends, where you can go and see what happened. The routine of a performance engineer is this: you arrive in the morning, and a performance regression lands on you, in a particular build compared to another build, on certain benchmarks. You have to figure out what is going on, so you sit and figure it out.

Naturally, the search for regressions is automated, the rechecking of regressions is automated, and so is even the detection of where the performance problem arose, in the class libraries or in HotSpot itself, which also speeds up the work.

How to get into performance

Oleg: Let’s say some Vasya Pupkin wants to become a performance engineer; everyone wants to be one these days. Tell me, is that even a meaningful goal, and how does one get into this whole field?

Sergey: This is a very interesting question. What should abstract Vasya do? The question here is why performance engineers are needed and who can afford the luxury of having a performance engineer.

As a rule, it comes down to the domain: these are companies tied to a large customer base, be it Oracle, Twitter or Netflix, or companies that need to squeeze a few cents out of their hardware. Squeeze a couple of cents out of your hardware, and across the whole system it adds up to tens or hundreds of thousands in savings; that’s a perfectly normal situation. As a rule, such companies grow their performance engineers in-house, for the company’s own needs.

If the company doesn’t have such a specific project and is more of a generalist, then keeping a performance engineer there is simply wasteful: how often would one be needed? You just need to write normal code so that everything works, and that matters much more.

As for how to become a performance engineer: don’t be shy about digging. There is a classic test: if people are given two pieces of code and asked to do some kind of performance review, more than half of developers will wrap them in benchmarks, measure, find out which is faster, and stop. If they stop there, performance is not their thing.

And if they say “This one is faster because ...”, having dug further and found the correct explanation of why it is faster, that’s good.

Preparing a talk for the next Joker

[The dialogue in this section was lost in extraction. The surviving fragments show that it covers Sergey’s Joker talk on inline types (previously called value types), passing by value, Rémi Forax, Kotlin, the Valhalla prototypes mvt (minimum value types), LW1 and LW2, invokedynamic, and how inline types interact with generics.]

Sergey will present his talk on Java and Valhalla at Joker 2019, held on the 25th and 26th.

Source: https://habr.com/ru/post/463455/
