Questioning the reality of generics

July 27th, 2010

I’ve been meaning to write about this for a while, since I keep saying this and people keep getting surprised. Now maybe I’m totally wrong here, and if that’s the case it would be nice to hear some good arguments for that. Here’s my current point of view on the subject anyway.

A specter is haunting the Java community – the specter of generics.

Java introcued a feature called generics in Java 5 (this feature is generally known under the name of parametric polymorphism in the literate). Before Java 5 it wasn’t possible to create a reusable collection that would ensure the type safety at compile time of what you put in to that collection. You could create a collection of for example Strings and have that working correctly, but if you wanted to have a collection of anything, as long as that anything was the same type, you were restricted to doing runtime checks, or just having good tests.

Java 5 made it possible to add type parameters to any other type, which means you could create more specific collections. There are still problems with these – they interact badly with native arrays for example, and wildcards (Java’s way of implementing co= and contravariance) have ended up being very hard for Java developers to use correctly.

Java and C# both added generic types at roughly the same time. The C# version of generics differed in a few crucial ways, though. The most important difference in implementation is that C# generics are reified, while Java generics use type erasure. And this is really the gist of this blog post. Because over and over I hear people lament the lack of reified generics in Java, citing how good C# and the CLR is to have this feature. But is that really the case? Is reified generics a good thing? Of course, that always depends on who is asking the question. Reified might well be good for one person but not another. Here you will hear my view.

Reified? Huh?

So what does reified generics mean, anyway? It is probably easiest to explain compared to the Java implementation that uses type erasure. Slightly simplified: in Java generics doesn’t exist at runtime. It is purely a fiction that the compiler uses to handle type checking and make sure you don’t do anything bad with your collection. After the generics have been type checked, they are used to generate casts and type checks in the code using generics, some metadata is inserted into the class file format, and then the generic information is thrown away.

In contrast, on the CLR, generic classes exist as specific versions of their class. The same class with different generic type arguments are really different classes. There are no casts happening at the implementation level, and the CLR will as a result generate more specific code for the generic code. Reflection and dynamic type checks is also possible on the CLR. Having reified generics means basically that they exist at runtime, that the virtual machine knows about them and handles them correctly.

Multi-language virtual machines

The last twenty years something interesting has happened. Both out hardware and software has gotten mature enough that a new generation of virtual machines have entered the market. Traditionally, virtual machines for languages were made for specific languages, such as Pascal, Lisp and Smalltalk, and possibly except for SECD and the Warren machine, there haven’t really been any virtual machines optimized to running more than one language well. The JVM didn’t start that way either, but it turned out to be more well suited for it than expected, and there are lots of efforts to make it an even better platform. The CLR, Parrot, LLVM and Rubinius are other examples of things that seem to become environments rather than just implementation strategies for languages.

This is very exciting, and I think it’s a really good thing. We are solving very complex problems where the component problems are best solved in different ways. It seems like a weird assumption that one programming language is the best way of solving all problems. But there is also a cost associated with using more than one language. So having virtual machines act as platforms, where a sharked chunk of libraries are available, and the cost of implementation is low, makes a lot of sense.

In summary, I feel that the JVM was the first step towards a real viable multi-language virtual machine, and we are currently in the middle of the evolution towards that point.

Solving the problems

So why not add reified generics to the JVM at this point? It could definitely be done, and using an approach similar to the CLR, where libraries are divided into pre and post reified makes the path quite simple from an implementation standpoint. On the user side, there would be a new proliferation of libraries to learn – but maybe that’s a good thing. There is a lot of cruft in the Java standard libraries that could be cleaned up. There are some sticky details, like how to handle the API’s that were designed for erased generics, but those problems could definitely be solved. It would also solve some other problems, such as making it possible for Scala to pattern match on type parameters and solving part of the problem with abstracting over primitive types. And it’s absolutely possible to do. It would probably make the Java language into a better language.

But is it the only solution? At this point, making this kind of change would complicate the API’s to a large degree. The reflection libraries would have to be completely redesigned (but still kept around for backwards compatibility). The most probable result would be a parallel hierarchy of classes and interfaces, just like in the CLR.

Refified generics are generally being proposed in discussions about three different things. First, performance, second, making it easier for some features in Scala and other statically typed languages on the JVM, and thirdly to handle primitives and primitive arrays a bit better. Of these, the first one is the least common, and the least interesting by far. JVM performance is already nothing short of amazing. The second point I’ll come back to in the last section. The third point is the most interesting, since there are other solutions here, including unify primitives with objects inside the JVM, by creating value types. This would solve many other problems for language implementors on the JVM, and enable lots of interesting features.

The short stick

I believe in a multi language future, and I believe that the JVM will be a core part of that future. Interoperability is just too expensive over OS boundaries – you want to be on the same platform if possible. But for the JVM to be a good environment for more than one language, it’s really important that decisions are made with that in mind. The last few years of fantastic progress from languages like Rhino, Jython, JRuby, Groovy, Scala, Fantom and Clojure have shown that it’s not only possible, but benificial for everyone involved to focus on JVM languages. JSR 223, 292 and several others also means the JVM is more and more being viewed as a platform. This is good.

Generics is a complicated language feature. It becomes even more complicated when added to an existing language that already has subtyping. These two features don’t play very well together in the general case, and great care has to be taken when adding them to a language. Adding them to a virtual machine is simple if that machine only has to serve one language – and that language uses the same generics. But generics isn’t done. It isn’t completely understood how to handle correctly and new breakthroughs are happening (Scala is a good example of this). At this point, generics can’t be considered “done right”. There isn’t only one type of generics – they vary in implementation strategies, feature and corner cases.

What this all means is that if you want to add reified generics to the JVM, you should be very certain that that implementation can encompass both all static languages that want to do innovation in their own version of generics, and all dynamic languages that want to create a good implementation and a nice interfacing facility with Java libraries. Because if you add reified generics that doesn’t fulfill these criteria, you will stifle innovation and make it that much harder to use the JVM as a multi language VM.

I’m increasingly coming to the conclusion that multi language VM’s benefit from being as dynamic as possible. Runtime properties can be extracted to get performance, while static properties can be used to prove interesting things about the static pieces of the language.

Just let generics be a compile time feature. If you don’t there are two alternatives – you are an egoist that only care about the needs of your own language, or you think you have a generic type system that can express all other generic type systems. I know which one I think is more likely.

By Ola Bini | In: blogging | tags: generics, java, jvm, mlvm, multi language vm, parametric polymorphism, programming language design, programming languages, reified generics. | #

17 Comments, Comment or Ping

Debasish Ghosh

Nice post! Just thinking if we could have a pluggable/on-demand reification scheme for cases where we need more performance or better interoperability with statically typed languages like Scala.

July 27th, 2010
John H

Thanks for posting this, a very interesting perspective.

July 27th, 2010
Pedro

Good post.

I am writing a language. I have used scheme as a language host till now to write the prototype. I want to move forward and built it over JVM.

Can you recommend me any book or resource about building language on top of the JVM?

July 27th, 2010
Casper Bang

My personal opinion is that you are mixing the issues regarding VM implementations vs. the Java type-system. While I can understand you from a VM/language geek point of view, it does not do much for the average developer trying to grasp Java’s hybrid implementation of parametric polymorphism and co/contra-variance.

In C# [where the types are reified], it comes as no surprice that you can write code like MyComparer : IComparer, IComparer. Those two types are truly different, so of course you are allowed to implement both of these.

In Java however, erasure causes a clash in the signatures and a weird error message that forces the user to understand implementation details of the runtime underneath. It’s thus hard not to consider erasure a hack! That begs the question, is it more desirable to leak complexities into the type-system or the runtime system beneath?

For me at least, the answer will always be the latter. Unfortunately, Java’s type-system has taken all the beating over the years and, as is apparent by project Coin’s mission-statement and the closure wars, is the root of Java’s staleness.

July 27th, 2010
Osvaldo Pinali Doederlein

Casper’s beat me; it seems to me, too, that you have some bias towards dynamic behavior but that’s a completely unrelated issue. BTW, it doesn’t matter how amazing JVM performance already is: performance is never good enough. Why settle for 100ms when your algorithm could run in 30ms?

For one thing, if you need something like a ArrayList, the overhead of boxing will totally destroy your performance compared to a reified version (like the “manually-reified” GNU Trove4J collections, or even a raw array int[]). Advanced GCs won’t change that, not in a real-world scenario – most microbenchmarks favor optimizations that won’t always happen in the real world, e.g. the boxed objects living only in the young gen, or having dense allocation patterns that don’t create huge cache-miss penalties. A reified collection or primitive array always has top performance, no questions asked.

On multi-language VMs: Java is indeed moving very fast, at least in some aspects but not in everything. The JVM’s core typesystem remains pitiful. Features like richer set of numeric types and operations, lightweight objects (structs / tuples / whatever), by-value arrays, by-value fields of object type, references to primitive types…, are important for many languages (whether we like these languages or not) and they can be proven to provide important performance advantages – even if that’s increasingly for niche codes, and we can debate to death about the tradeoff for other desirable language properties – nothing changes the crude fact that Low Level Code Performs Better.

“Performs” is not always faster; sometimes it’s just lighter. A Java process consumes ungodly amounts of RAM compared to a native process, and a big part of the reason is the enormous space tradeoffs that the VM must do to get speed: ultra-aggressive inlining, fat in-memory code layout and metadata (to support profiling, dynamic loading, OSR etc.), lots of overhead for advanced GCs (plenty free heap space, remsets, multiple regions, compaction, concurrency and parallelism, safepoints, JNI, write barriers…) – you wouldn’t depend so much on this mountain of complexity if the core platform supported some down-and-dirty, C-like programming. (And I’m not talking unsafe operations… these btw are already supported by sun.misc.Unsafe, that is already a de-facto API for core-lib programmers.)

July 27th, 2010
Osvaldo Pinali Doederlein

(In my comment above, “like a ArrayList” should be “like a ArrayList[Integer]”, where the blog system ate the part that looked like an HTML tag.)

July 27th, 2010
Casper Bang

Sanitation system of this blog ate my angle brackets as well. In the second paragraph it should thus say:

MyComparer : IComparer<decimal>, IComparer<int>

July 27th, 2010
Kris Nuttycombe

Another thing about erasure that you didn’t mention, and which I perceive to be a virtually unmitigated good thing (and this from the perspective of someone who loves static types) is the effect that it has on library design.

Simply put, type-casing (i.e. using reflection upon the type of an object at runtime) is a design smell in my book – and this is even more true when you start talking about type parameters. There are essentially two primary situations where type-casing is commonplace – the first is in places where you run into the “expression problem” of wanting to add behavior to an existing class hierarchy in a way that varies by the type parameter, and the second is in cases where type information is “lost” across some sort of hard boundary – say serialization, or ORM of some sort. There is also a third case, less common, which arises when you wish to abstract over some potentially infinite type (like, for example, an arbitrarily nested set of parametrically polymorphic tuples which you want to flatten into a one-dimensional data structure) but in such a case reflecting upon the types doesn’t really buy you anything more than reflecting upon the elements.

The important thing to recognize is that any time you think you might *want* reified generics, what you are actually stating is that there is some piece of type information that you perceive is unknowable at compile time, and I think that in practice this is rarely the truth. To satisfy the first case above, one could use a type disjunction (for example, Either in Scala) so that the correct handling of all the possible members of the set of types involved can be enforced by the compiler. With respect to the second problem, any time you cross some such boundary you’re *already* in the land of runtime casts and reflection, and as such I think that reflecting upon type parameters is unlikely to add substantial value.

In library design, the presence of reified types would lend itself to reliance upon the ability to reflect upon those types, and as such I think that their absence is a good thing. The more one can avoid coupling one’s design to a certain set of assumptions about the runtime behavior of the program, the better, since it is very likely that at some point in a given piece of software’s lifespan, those assumptions will be proven perhaps disastrously wrong.

July 27th, 2010
Martin Odersky

@debashish “Just thinking if we could have a pluggable/on-demand reification scheme for cases where we need more performance or better interoperability with statically typed languages like Scala.”

Such a scheme exists in Scala 2.8: it’s called manifests. The upcoming scala version for .NET will model .NET reified generics as manifests.
That way we avoid forking tha language between the two systems.

For those who do now know them already: Manifests are type descriptors passed as implicit parameters.

Overall I agree with Ola: reified generics can be be represented in a language that does not know it by manifests or something similar, but in the end I prefer a simpler VM architecture with type erasure.

July 27th, 2010
Debasish Ghosh

@Martin: Right! I was thinking of manifests only when I wrote the comment. It’s there in Scala and could be embraced as a general strategy for interoperability amongst statically typed languages that embrace the JVM. That way we have the JVM architecture simple and dynamic and leave it to the languages to plug in their reification scheme.

July 28th, 2010
Jason Zaugg

Manifests are neat, but they necessarily pollute the signature of methods. When you’re implementing a method in an interface, it’s not always possible to add the implicit parameter.

trait Functor[M[_]] { def fmap[A, B](a: M[A], f: A => B): M[B] }

object ArrayFunctor extends Functor[Array] {
def fmap[A](a: Array[A], f: (A => B)) = {
// need a Manifest[B] to create the result array, but can’t get it.
}
}

If you are willing and able to modify the interface you are implementing, you can abstract over the constraint. But it’s tricky to propagate these constraints upwards through the callers.

trait Functor[M[_], ConstraintB[_]] {
def fmap[A, B: ConstraintB](a: M[A], f: A => B): M[B]
}

trait Unbounded[X]
implicit def Unbounded[X] = new Unbounded[X]{}

object ListFunctor extends Functor[List, Unbounded] {
def fmap[A, B: Unbounded](a: List[A], f: (A => B)) = a map f
}

object ArrayFunctor extends Functor[Array, Manifest] {
def fmap[A, B: Manifest](a: Array[A], f: (A => B)) = a map f
}

July 28th, 2010
Joakim Ohlrogge

Thanks for posting this. I was one of those that wanted a more elaborate explanation of why reified generics is not a good thing. I thoroughly enjoyed the read right until the last few lines. I can think of at least one other reason why someone would want reified generics and that is ignorance. If you just run into a problem like not being able to pattern match on type in Scala for instance you may look for solutions already in place somewhere else you may not be able to see the downsides for other languages (or at least think that they are solvable).
The egoist argument sort of swings both ways, there are things that are not possible to do in a statically typed language because the type info is not available in runtime. So that path is a bit crippled on the JVM.

Luckily I’m not a type-taliban (nor a dynamic desperado) so I don’t loose much sleep over the pattern-matching thing. I am way more upset about no tail-call in the JVM which seems to mess things up for dynamic languages as well (but I am as ignorant of the downsides of that too).

I do agree though that reified generics also has problems and that it is possibly not adding enough good to make up for everything else it brings with it. Do you know about other solutions than reified or erasure?

Thanks again for posting this. Very interesting perspective!

July 29th, 2010
Casper Bang

For sake of completeness, it should also be mentioned that reified generics in Java doesn’t mix very well at all with arrays (since generic collections are not co-variant). But again, this post seems to be more about the internals of the JVM than Java and it’s generics impl.

August 2nd, 2010
Richard Cordova

Great discussion, I wish it were more representative of the average level of discourse regarding type-erasure. Thanks for posting.

November 2nd, 2010
Jon Harrop

Generic hash tables are an example of where reified generics and value types can combine to give huge performance improvements.

November 3rd, 2010
Richard Gomes

Actually, type information do exist at runtime, under certain circumstances.

I’ve written an article about this at
http://www.jquantlib.org/index.php/Using_TypeTokens_to_retrieve_generic_parameters

It’s admittedly a trick, sort of dark magic if you will, but possible.

In the end of the article I briefly present a void discussion of pros and cons.
This is void because the correct thing would be having “proper” reified generics in Java, which is not the case, unfortunately.

Thanks

Richard Gomes

January 14th, 2011
Tom Shackell

My personal experience has been that Java’s lacks both reified generics and value types (structs) can be a real performance pain for certain types of problems.

In particular I was recently writing a system that needed to store large amounts of numeric data in hashtables (like gigabytes of it).

So out the window goes the standard hashtable classes. The boxes end up consuming far too much memory. So instead I had to write my own hashtable. Not so bad except that I needed different versions for different types of data I was using. How does one achieve that in Java? Answer, you copy-paste the code multiple times. Now you have multiple versions of virtually identical code to maintain.

Secondly the ideal layout for the type of hashtable I was using would be keys-and-values interleaved. That gives the best cache performance. This is entirely possible in .Net using an array of structs. However, unless your keys and values happen to be same type you are a bit stuck in Java. You can mix longs and doubles in the same array, with a bit of work … but Longs and Object references? … no not possible, you just have to suck up the poor cache performance.

It’s these kinds of things that can make writing high performance code very difficult on the JVM. And ultimately why I believe reified generics and struct types would be worth the extra JVM complexity.

December 26th, 2012

Reply to “Questioning the reality of generics”

« Emerging Languages camp – day 2

Preannouncing Seph »

July 27th, 2010