Introduction

This post explores the Java Stream distinct method. We use the distinct method to remove duplicates from a stream so that it contains only unique elements.

Stream distinct method

The distinct method is a stateful intermediate operation. Calling distinct() returns a new stream containing only the distinct elements. It uses the object's (the stream element's) equals() method to identify duplicates. Hence, the stream elements must have equals() and hashCode() implemented.

Stability

For ordered streams, it performs the distinct operation in a stable fashion, i.e., when there are two or more elements with the same value (at different positions), it retains the first and removes the others. But for unordered streams, we cannot make any stability guarantees.

Stream#distinct - A simple example

Let us create a list of strings, create a stream out of it and call the distinct method on the stream.

List<String> list = List.of("a", "b", "a", "b", "ab", "cd", "ab");
list.stream()
        .distinct()
        .forEach(System.out::println);

As you would have guessed, the output is

a
b
ab
cd

Since the stream created from a List is ordered, the selection of distinct elements will be stable. For example, there are two ‘a’s - one at position 0 and one at position 2. It will always preserve the first one and remove the subsequent ones.

Stream#distinct with a custom class

Let us use the distinct method on a stream of custom class instances. Shown below is a simple Person class with name and age. It has a public constructor and getters for the instance variables. Most importantly, it has the equals() and hashCode() methods implemented. I have also overridden the toString() method.

import java.util.Objects;

public class Person {
    private String name;
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        Person other = (Person) obj;
        return Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name);
    }

    @Override
    public String toString() {
        return name + " " + age;
    }
}
List<Person> people = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Ben", 47),
        new Person("Mathew", 43)
);

people.stream()
        .distinct()
        .forEach(System.out::println);

We created a list of persons with some duplicate objects. Note that as per the equals() implementation, two person instances are the same if they have the same name, irrespective of their age. Hence, this will output:

Adam 21
John 28
Ben 45
Mathew 43

How does Stream’s distinct method work

As we saw earlier, Stream.distinct is a stateful intermediate operation. Hence, it maintains state: it has to remember which elements it has already seen so that it can remove the duplicates.

Let us add a peek() statement before and after the call to the distinct method.

people.stream()
        .peek(p -> System.out.println("Before distinct " + p))
        .distinct()
        .peek(p -> System.out.println("After distinct " + p))
        .forEach(System.out::println);

I have shown the output with empty lines added for better understanding.

Before distinct Adam 21
After distinct Adam 21
Adam 21

Before distinct John 28
After distinct John 28
John 28

Before distinct Adam 22

Before distinct Ben 45
After distinct Ben 45
Ben 45

Before distinct Ben 47

Before distinct Mathew 43
After distinct Mathew 43
Mathew 43

First, we have the person Adam. Since it is the first object, it prints the before and after peek statements and finally prints the person. The same applies for John as well. The third person is Adam again and it prints the before peek statement. When it goes through the distinct part of the stream pipeline, it gets filtered out as the distinct step has already seen an equal Adam object. Since it does not get past the distinct step, there is no corresponding after peek print.

Though the internal details of the distinct method cannot be guaranteed or relied upon (they can change in the future), currently it uses a LinkedHashSet internally. A LinkedHashSet extends HashSet, which is backed by a HashMap. So it uses the working mechanism of a map to filter duplicates: it inserts the objects into the underlying HashMap, and if an object is already present in the map, it is filtered out.
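To make the mechanism concrete, here is a minimal sketch (not the JDK's actual implementation) that mimics a stateful distinct step: a Set remembers the elements seen so far, and only first occurrences pass through. The `DistinctSketch` class and `firstOccurrence` method are illustrative names of my own.

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

public class DistinctSketch {
    // A rough sketch of a stateful distinct step: remember seen elements
    // in a Set and let only first occurrences through.
    // Note: a stateful predicate like this is only safe on sequential streams.
    static <T> Predicate<T> firstOccurrence() {
        Set<T> seen = new LinkedHashSet<>();
        return seen::add; // Set.add returns true only for elements not seen before
    }

    public static void main(String[] args) {
        List<String> result = Stream.of("a", "b", "a", "b", "ab", "cd", "ab")
                .filter(firstOccurrence())
                .collect(Collectors.toList());
        System.out.println(result); // [a, b, ab, cd]
    }
}
```

The `seen::add` trick works because Set.add returns false for a duplicate, exactly the signal the filter needs.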

If we add a print statement as the first line of the equals and hashCode methods, we can see them being called before inserting into the map. It will call hashCode to determine the correct bucket and will call equals when there is a bucket collision. This happens either for a new object that resulted in the same hash code as some other object (as two unequal objects can have the same hash code) or when it gets a duplicate object. See the post on how a HashMap works in Java to learn more about this.
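Instead of print statements, we can use counters to observe the same calls. The sketch below (with an illustrative `InstrumentedPerson` class of my own) counts how often hashCode and equals are invoked while distinct() filters a stream with one duplicate name:

```java
import java.util.*;
import java.util.stream.*;

public class InstrumentedPerson {
    // Counters instead of print statements, so the calls are easy to observe.
    static int hashCodeCalls = 0;
    static int equalsCalls = 0;

    final String name;
    final int age;

    InstrumentedPerson(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object obj) {
        equalsCalls++; // called when there is a hash-bucket collision
        return obj instanceof InstrumentedPerson
                && Objects.equals(name, ((InstrumentedPerson) obj).name);
    }

    @Override
    public int hashCode() {
        hashCodeCalls++; // called per element to locate its bucket
        return Objects.hash(name);
    }

    static long countUnique() {
        return Stream.of(
                        new InstrumentedPerson("Adam", 21),
                        new InstrumentedPerson("Adam", 22),
                        new InstrumentedPerson("John", 28))
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        System.out.println("unique=" + countUnique()
                + " hashCode calls=" + hashCodeCalls
                + " equals calls=" + equalsCalls);
    }
}
```

With two "Adam"s and one "John", distinct() keeps two elements; hashCode runs at least once per element, and equals runs at least once, when the duplicate "Adam" collides with the first.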

Using distinct on an unordered stream

When we use the distinct() method on an unordered stream, we cannot guarantee which element it will retain. Since unordered streams have no encounter order, the distinct step is free to choose any one of the elements (among the duplicates) and retain it. Hence, the selection is not stable.

Let us use the same list of people and create a parallel, unordered stream.

List<Person> peopleList = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Ben", 47),
        new Person("Mathew", 43)
);
/*
Can pick any Adam and any Ben
 */
peopleList.parallelStream()
        .unordered() //Make the stream unordered
        .distinct()
        .forEach(System.out::println);

If you run this several times, you will find that it can pick either of the two Adams and either of the two Bens. Also, since the stream is parallel and unordered, the order can change each time.

John 28
Adam 21 //or Adam 22
Mathew 43
Ben 45 //or Ben 47

Distinct on a stream with DISTINCT characteristics

What if we use distinct on a stream created from a TreeSet?

Let us create a TreeSet with a comparator that is not consistent with equals. It orders the persons by their age, and hence two persons with the same age will be considered equal by the TreeSet (a contrived example).

List<Person> peopleList = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Ben", 47),
        new Person("Mathew", 43)
);
Set<Person> peopleSet = new TreeSet<>(new Comparator<Person>() {
    @Override
    public int compare(Person o1, Person o2) {
        return Integer.compare(o1.getAge(), o2.getAge());
    }
});
peopleSet.addAll(peopleList);

System.out.println(peopleSet.size()); //6
peopleSet.stream()
        .distinct()
        .forEach(System.out::println);

If we run this, the output will be:

Adam 21
Adam 22
John 28
Mathew 43
Ben 45
Ben 47

It has given us all the Person objects sorted by age without removing the duplicates. Why didn’t the distinct method work in this case? Why didn’t it use the Object’s equals method and remove the duplicates?

Each stream has a set of characteristics which is derived from the spliterator of the collection. The spliterator of the TreeSet states:

The Spliterator reports Spliterator#SIZED, Spliterator#DISTINCT, Spliterator#SORTED and Spliterator#ORDERED

This says that any stream created out of a TreeSet is DISTINCT, SORTED, ORDERED and SIZED (has a finite size). Since it already has the DISTINCT characteristic, the distinct step in the stream pipeline will be skipped as an optimization, hence resulting in the above output.

Similarly, the spliterator of a HashSet has the SIZED and DISTINCT characteristics.
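We can verify these characteristics directly by querying each collection's spliterator. A quick check, assuming a plain List, HashSet and TreeSet:

```java
import java.util.*;

public class CharacteristicsCheck {
    public static void main(String[] args) {
        List<Integer> list = List.of(1, 2, 3);
        Set<Integer> hashSet = new HashSet<>(list);
        Set<Integer> treeSet = new TreeSet<>(list);

        // A List's spliterator does not report DISTINCT...
        System.out.println(list.spliterator()
                .hasCharacteristics(Spliterator.DISTINCT)); // false
        // ...but the spliterators of both Set implementations do.
        System.out.println(hashSet.spliterator()
                .hasCharacteristics(Spliterator.DISTINCT)); // true
        System.out.println(treeSet.spliterator()
                .hasCharacteristics(Spliterator.DISTINCT)); // true
    }
}
```

This is exactly why a distinct() call on a stream over a List does real work, while on a stream over a Set it can be skipped.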

Distinct on a parallel ordered stream

Earlier we saw an example of using distinct on a parallel, unordered stream. We should generally avoid using distinct on a parallel ordered stream. If we do, since the stream is ordered, it has to preserve the stability we saw earlier (it has to pick the first among the duplicate elements as per the encounter order). This is very expensive, as it requires a full barrier and buffering overhead.

There are two options:

  • When stability (and ordering) is not required, convert the ordered stream to an unordered by calling the unordered() method (as we did).
  • When ordering is needed, switch to a sequential pipeline by calling the sequential() method.
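Both options side by side, as a small sketch (the `ParallelDistinct` class name is mine):

```java
import java.util.*;
import java.util.stream.*;

public class ParallelDistinct {
    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 2, 1, 3, 2, 4);

        // Option 1: drop ordering, so distinct need not preserve encounter order.
        Set<Integer> anyOrder = nums.parallelStream()
                .unordered()
                .distinct()
                .collect(Collectors.toSet());
        System.out.println(anyOrder.size()); // 4 unique elements

        // Option 2: keep ordering, but run the pipeline sequentially.
        List<Integer> stable = nums.parallelStream()
                .sequential()
                .distinct()
                .collect(Collectors.toList());
        System.out.println(stable); // [1, 2, 3, 4]
    }
}
```

Note that sequential() switches the entire pipeline to sequential execution; a stream cannot run partly parallel and partly sequential.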

Difference between Stream’s distinct vs collecting into a Set

In all the examples, after calling distinct, we just printed the elements. You might wonder why we need the distinct method if we can just collect the stream into a Set like:

list.stream()
        .collect(Collectors.toSet());

Collecting any stream into a Set will have the same effect as using distinct. It will use the Object’s equals and hashCode to remove duplicates.

But the real use of distinct is when you have expensive stream pipeline steps after it. Or, you want the rest of the steps to be applied only once for an element. Example: You have a very expensive map operation on the stream. In this case, you can use the distinct method to remove the duplicates, thus applying the map operation only once per input element.
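We can see the saving by counting the mapper invocations. In this sketch (the class and method names are illustrative), a counter stands in for an expensive map operation; placing distinct() before it means the mapper runs once per unique element rather than once per input element:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.*;

public class DistinctBeforeExpensiveMap {
    static final AtomicInteger mapperCalls = new AtomicInteger();

    // Stand-in for an expensive computation; counts its invocations.
    static String expensiveMap(String s) {
        mapperCalls.incrementAndGet();
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "b", "ab", "cd", "ab");

        // distinct() first: the expensive mapper runs once per unique element.
        List<String> result = input.stream()
                .distinct()
                .map(DistinctBeforeExpensiveMap::expensiveMap)
                .collect(Collectors.toList());

        System.out.println(result);                       // [A, B, AB, CD]
        System.out.println(mapperCalls.get() + " calls"); // 4 calls, not 7
    }
}
```

Collecting into a Set at the end would deduplicate the results, but only after the mapper had already run on all seven inputs.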

Conclusion

We saw the Java Stream distinct method in this post. First, we learned what Stream's distinct method does, with examples. Second, we learned how it works internally and used it on an unordered stream. Third, we saw that distinct will be skipped if the stream has the DISTINCT characteristic. Finally, we also saw the dangers of using it on a parallel, ordered stream.

You can also check out the other Java stream posts.