October 3, 2012

Java Collections API Enhancements: Thanks to Closures – Lambda Expressions



Friends, in the last tutorial we had a detailed introduction to Java 8’s Feature of Closures – Lambda Expressions. During the discussion, we understood the issues with the plain old Anonymous Inner Classes, learnt the Java Closures (Lamba Expressions) Syntax, and also practiced some of our own Java Lambda Expression examples along with the conceptual and practical understanding of Functional Interfaces, Method References, Constructor References and Default Methods.


In this Java Tutorial we are going to discuss about Java 8’s modification to the Java Collections API. The Java Collections Framework is being enhanced in order to get the benefits out of the latest Java 8 Feature that is Closures. If you are new to the concept of Java Closures or Lambda Expressions, I recommend you to go through my previous post: Introduction to Java Closures – Lambda Expressions.

Java Lambda Expressions would surely change some of our programming habits and also the way we look at the language, including the various Java APIs. When a feature like Lambda Expression is added to a programming language, it becomes extremely important to utilize the new feature to empower the overall programming model along with the existing set of libraries. With addition of Closures to Java, the existing Java Collection Framework will start looking weaker and outdated. The Java Collections framework was introduced in Java 1.2, and since then its core interfaces have never been changed. This is because, the Java Collections framework is so widely used, that any changes to it will surely broke many existing functionalities, and that’s why it is not easy to completely rewrite the Java Collections API. There was another option to keep the existing Collections API as is, and add an additional Lambda Expression friendly version of the API, but that would lead to tremendous amount of changes in the existing code, which depends upon the Collections API. Also applications will have to maintain two different versions of the library, and what if somebody wants to use a mix of old and new features? To overcome these challenges, Java 8 has added new set of methods to the existing collection classes and interfaces. Having these methods under the belt, the Java Collections framework will work as it used to be; and will also have an additional potential to support Java’s Lambda Expressions or Closures.







The Existing Behavior: 

No doubt, the existing Java Collections API is nice and we are very much habitual to use it. But as stated above, having Lambda Expressions in the toolbox we can naturally start noticing the some of the shortcoming of the exiting framework. Lets have a look at below problem.

We want to print scores of all the students with name as “Tom” and print their respective scores. To model this, I will iterate through the list of Student and create a new List of Student who have name as “Tom”, which will be iterated over to print scores of individual students.

List<Student> studentsNamedAsTom = new ArrayList<>();
for (Student student : students) {
 if (student.getName().equals("Tom")) {
  studentsNamedAsTom.add(student);
 }
}
  
for (Student student : studentsNamedAsTom) {
 System.out.println("name: " + student.getName() + " -> Score: "+ 
                           student.getScore());
}


I know, I could have combined the functionality in a single for loop, but I intentionally wanted to keep them split across the loops, so that I can anytime change any loop without affecting the other and possibly you may consider, both the for loops belong to two different methods. Now, lets try to identify the problems associated with this code. 


First of all, as a client to the Collection (list of Student), I have to ask for an iterator (through the for-each loop) and iterate through it. I will have to take care of iteration logic and putting conditions between iterations. Ideally a client should only be concerned about What To do with the collection and not about How To do.

The studentsNamedAsTom is just a temporary object and it is only concerned about to pass values from one for loop to the other, or possibly from one method to other.  These temporary objects are overhead for the memory and mostly referred as Garbage Objects. For complex programs we end up creating bunch of such a garbage object, which are just meant to temporarily hold the values.

Now think of a scenario, the student list contains thousands of records, that mens the first for loop will iterate those many number of times. But assume, only 40th and 55th number students are “Tom”. There is no point in iterating the list after 55 elements. Suppose in the second loop, I want to print only those “Tom”s who have scores more than 80, and there might be only one student matching this. As a client I have no control to avoid such an unwanted iterations.

All of these operations take place sequentially (one after the other). If we want to optimize the behavior by creating multiple threads, we will have to take of the concurrency along with the logic of iterations and operation and it will surely make the code look complex.

Now its time to discuss the Java 8’s Collections Framework Features, and how they solve the above mentioned issues.



Mode of Iterations:

As discussed above, when a client wants to operate on a collection, it has to access the iterator, manually iterate through it, and also has to put the functional logic in the iterations. This approach is basically simple and straight, as the operations are sequential and the elements are processed in the order in which they appear in the collection. This kind of iterations is called as External Iterations.


The upcoming additions to the Java 8’s Collections API will make it to support Internal Iterations. In the Internal Iterations, the client abstracts the functional behavior and passes it directly to a method on collections in order to get it applied to all the elements of a collection. The library will be responsible for applying the behavior to the elements of collections. Hence client has to bother about ‘what’ and not about ‘how’. Lets have a look at the below example.


List<Student> studentsNamedAsTom =
           students.filter (student -> student.getName.equals("Tom"))
           .into (new ArrayList<>());


This is just a single statement, but its capable of doing a lot more than what our first for loop did. Before we get into these details, first understand what’s exactly happening here. Client provides the filter method with an implementation of Predicate (a functional interface). Instead providing anonymous inner class, we provide a Lambda Expression implementation for Predicate and pass it to the method. The library will internally iterate through the Collection and apply Predicate on it. This keeps the client from the iteration details and client can only concentrate on the ‘What’ and not ‘How’.

In case of internal iterations the library has full control over the iterations and it becomes possible for the libraries to use parallelism or optimize the memory usage in order to process the elements more efficiently. The client and the library can share the control on behaviors amongst each other and make the operation more efficient. Apart from this, the Internal Iteration makes the program very simple and readable. Below is a set of examples, which shows how easy it is to alter the program behavior without increasing any kind of iterative complexity.


//Set grade = “A” for students with score > 80
students.filter(s -> s.getScore() > 80)
         .forEach(s -> {
               s.setGrade(“A”);
               System.out.println("name: " + s.getName()+" -> Grade:"+s.getGrade());       
          });

//Create sublist of students having grade "A" and name starts with "N"
List <Student> sublist =
                 students.filter(student -> student.getGrade().equals("A") 
         && student.getName().startsWith("N"))
         .into(new ArrayList<>());

Now, in the subsequent sections, we will discuss the potentials of the Java Collection Frameworks internal iteration mechanism.



Benefits of Laziness:

We have seen in the plain collections example, that both of the for loops iterates through the entire collection they have, no matter what exactly we are looking for. When we put conditional statements in the iterations, naturally the condition will be applied from first through last elements in the collection. The condition may holds good only for first few elements and will not be matched for the rest of the iterations. This kind of operations is called as Eagar Processing and often results in a big performance toll for the programs. Following quote is the only solution for this.


Laziness can be a big Performance Advantage - Brian Goetz"

Brian Goetz (Oracle’s Java Language Architect) believes in this and his Java 8’s Project Lambda will surely make us too believe it. (Sometimes I feel proud of myself. No really!! It took 15 years for Java Collections to acquire this property, which I have been successfully holding since my birth). Eagar processing may sometimes sound expensive, because in simple words, when we put a condition the program doesn’t know about how the matched elements are going to be used by next block of the code. In such cases Lazy Processing is quite helpful, where we can process only what we need. In case of our plain collection example the first for loop iterates through the entire list of students and before the ‘for’ loop ends the second list of students is completely ready with all the matching elements populated in it. Below program does the same thing with a newer approach.


List<Student> studentsNamedAsTom = 
        students.filter (student -> student.getName.equals("Tom"))
 .into (new ArrayList<>());


What happens when we simply run the above code?
The answer is NOTHING. 
Because like many of the developers, some of the new methods on the Collections API are ‘Lazy’ and they don’t complete their tasks until the last minute. These developers and methods both are actually smarter, because at the last minute they have the most concrete requirements, and they can do what exactly is required unlike those who work a lot before the requirements are final. 


Now, the serious answer is also, NOTHING.

When we run the above statement, neither the collection is filtered nor the studentsNamedAsTo has anything in it. These things will actually trigger when we start iterating the studentsNamedAsTom. When the first iteration on studentsNamedAsTom is processed the Student collection is actually iterated for those many numbers of iterations, which are sufficient to provide studentsNamedAsTom with its first element. For second iteration of studentsNamedAsTom, the student collection will further be iterated until it gives second element to studentsNamedAsTom. If we decide to stop here, there wont be any extra iteration on Students. This behavior improves the performance significantly. 


This is possible because the studentsNamedAsTom is not actually a concrete collection object but it is a stream of data values, which are Itarable. When an iterator asks for a next element on stream, the stream will request it to the source collection. All the ‘lazy’ methods returns a stream, instead of concrete collection objects, this also reduces number of garbage objects created by the program and improves the memory performance.



With the help of stream, we can actually form pipeline lazy methods, one after the other. Each method takes stream as a kind of input and delivers processed stream as an output, which is taken by next method in the pipeline. This helps us plug-in and out any operation anytime, without affecting the code complexity. The advantage of the pipeline is that the code becomes more compact and readable.




More About Streams and Laziness:


As discussed above, the lazy operating methods produce steams of data values. The most important thing with streams is that, they don’t require storage. When a method returns a stream and next method takes that stream to process further, object is added on the memory. Streams just carry data from source through a pipeline of operations. Streams cannot modify the original source collection.

There are many stream operations, which can be applied lazily, which means we don’t need to iterate through the entire stream. We can just iterate through what we need, this saves the further processing which is required to generate further data in the stream. Also, as the streams are continuous flow of data, there are no bounds applied to it. Streams can hold infinite data. We can even have a stream of infinitely long numbers, which was never possible by the older Collections API. Lets have a look at a example program below, we are calculating sum of the scores of Classroom “A” students.




int sum = students.getFilter(s -> s.getClassRoom.equals("A"))
    .map(s -> s.getScore())
    .sum();

As the filter and map methods are lazy, the source will not be read until the call to sum method and there is no need to maintain intermediate objects.


When normally, we iterate through collections, we cannot change the source collections. While doing so we get ConcurrentModificationException. The same rule applies for the new set of methods. Hence when we pass lambda expressions to the collections methods, we should ensure that the lambda expressions are not modifying the source collection.



Support for Parallelism:

Normal operations on collections – such as Iterating a collection with Iterator, accessing each element, applying some filter and setting a new value to an element or creating sub collection of those elements - are sequential operations. That means all these operations are carried out in series (one-after-other). And for the same, there is a huge scope of performance improvements, if they same operations are carried out in parallel. We can perform the same operations by creating multiple threads, but then it adds complexity to the program. An extra care is required to be taken when we create multiple threads to process a single collection, because there is always a possibility of Concurrent Modification.

The new modification on the Java 8 Collections API makes it quite easier for developers. It has operations that have in-built support for parallelism, it gives control to the client, whether it wants to use parallelism, but most importantly, it keeps client far away from the internal complexities of the implementation of parallelism.

Java SE 7 had introduced a very exciting feature of Fork Join Framework, which works on the Work Stealing Algorithm. It divides a task into multiple subtasks and each subtask to further fine-grained subtasks until it is no more divisible. Then the fine-grained subtasks are performed sequentially and their results are combined to generate the outcome of the task. For more information on the fork join framework, please visit Introduction To Fork Join Framework with examples. The implementation details of division of tasks, subtask operations, and aggregation of the subtasks results are no doubt very complex, but the collection framework hides that behind the ‘parallel’ method. This method is simply a kind of parallelism switch, which you can put, and remove anywhere in the pipeline. Below is the modified, total score calculator program, where you can see, it takes nothing more than a single method call to plug-in parallelism in your operations.

int sum = students.getFilter(s -> s.getClassRoom.equals("A"))
    .parallel()
    .map(s -> s.score)
    .sum();

We have come to the end of this article. We emphasized more on the conceptual understandings than the implementation details of the features, because the Java 8 Collection Framework modification is still under the development and there are chances of changes to the information we have at this point. As the development progresses further, the detailed information of various methods, and Interfaces will be open, and then we can have a much-detailed overview of the Java Collections Framework.