Java Stream Distinct By Property

Introduction

The Java Stream API has a distinct() method, which returns a stream with duplicates removed. It uses each element’s equals method to keep track of the unique elements it has seen and filters out the rest. However, we cannot use the distinct() method if we want to apply the distinct logic by a certain field/property. This post explains various ways to apply distinct by a property.

Distinct By a Property

Let us use the simple Person class shown below. It has a name and an age. Its equals method compares only the name, so two people with the same name (irrespective of their age) are considered equal.

import java.util.Objects;

public class Person {
    private String name;
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        Person other = (Person) obj;
        return Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name);
    }

    @Override
    public String toString() {
        return name + " " + age;
    }
}

Recap of distinct method in Stream

From a stream of Person objects, Stream’s distinct method returns a stream with duplicate Person instances (as per the equals method) removed.

List<Person> people = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Mathew", 45),
        new Person("Smith", 47)
);

//removes duplicates by equals
people.stream()
        .distinct()
        .forEach(System.out::println);

This will output:

Adam 21
John 28
Ben 45
Mathew 45
Smith 47

The third Person in the list (Adam) is filtered out because the stream pipeline has already seen a Person object equal to it.

Now what if we want to apply distinct by a different property or field? Say we want a stream of Person objects with unique ages; in other words, out of all the people with the same age, we want only one.
We cannot use the distinct() method for this, as it always filters according to the object’s equals method.

Now we will look at various ways to remove duplicates (get distinct objects) by a custom property.

1 – Collectors.toMap

We can use the Collectors.toMap method to collect the stream elements into a map keyed by the property/field. Since a map can have only one value per key, we choose the first stream element for each key. From the resulting map, calling values() gives us the collection of unique/distinct persons as per the field we keyed by.

List<Person> people = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Mathew", 45), 
        new Person("Smith", 47)
);

Collection<Person> uniqPeopleByAge = people.stream()
        .collect(Collectors.toMap(Person::getAge, Function.identity(),
                (person1, person2) -> person1))
        .values();
System.out.println(uniqPeopleByAge);

We use the person’s age as the map’s key. For two people with the same age, we pick the first one; hence, between Ben and Mathew, it picks Ben.

This outputs:

[Adam 21, Adam 22, John 28, Ben 45, Smith 47]

Note that since we did not use the distinct method (and hence the equals method), we have two ‘Adam’ entries in the result because they have different ages, even though the two objects are ‘equal’ as per the equals method.
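
Also note that the values() view of a plain HashMap does not guarantee encounter order. If order matters, the four-argument Collectors.toMap overload accepts a map factory, and a LinkedHashMap preserves the order in which keys were first seen. A minimal sketch (using a record as a stand-in for the Person class above):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DistinctByAgeOrdered {

    // Minimal stand-in (a record) for the article's Person class.
    record Person(String name, int age) {}

    // Distinct by age, keeping the first person seen for each age.
    // The LinkedHashMap factory preserves the encounter order of the keys.
    static List<Person> distinctByAge(List<Person> people) {
        return new ArrayList<>(people.stream()
                .collect(Collectors.toMap(Person::age, Function.identity(),
                        (p1, p2) -> p1, LinkedHashMap::new))
                .values());
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Adam", 21), new Person("John", 28),
                new Person("Adam", 22), new Person("Ben", 45),
                new Person("Mathew", 45), new Person("Smith", 47));
        distinctByAge(people).forEach(System.out::println);
    }
}
```

With this variation, the result comes back in the order 21, 28, 22, 45, 47 rather than in the HashMap’s arbitrary key order.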

2 – Create a wrapper class

We create a wrapper class over the Person class and override the equals and hashCode as per our need.

private static class PersonWrapper {
    private Person person;

    private PersonWrapper(Person person) {
        this.person = person;
    }

    public Person getPerson() {
        return person;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        PersonWrapper other = (PersonWrapper) obj;
        return Objects.equals(person.getAge(), other.person.getAge());
    }

    @Override
    public int hashCode() {
        return Objects.hash(person.getAge());
    }
}

The PersonWrapper class wraps a Person object. Its equals and hashCode are redefined to suit our need (comparing by the person’s age). Thus, two PersonWrapper instances are equal if the two wrapped Person objects have the same age.

List<Person> people = List.of(
        new Person("Adam", 21),
        new Person("John", 28),
        new Person("Adam", 22),
        new Person("Ben", 45),
        new Person("Mathew", 45),
        new Person("Smith", 47)
);
people.stream()
        .map(PersonWrapper::new)
        .distinct()
        .map(PersonWrapper::getPerson)
        .forEach(System.out::println);

In the stream pipeline, we map a Person to a PersonWrapper (a.k.a wrap), apply the distinct() operation, and unwrap to get the Person object back.

This outputs:

Adam 21
John 28
Adam 22
Ben 45
Smith 47

The disadvantage of this approach is that we have to create one wrapper class for each property by which we want to apply the distinct operation (i.e., one class per distinct-by-property use case). This can quickly get out of hand, with lots of classes.
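
One way to soften this is a single generic wrapper that stores an extracted key alongside the element and delegates equals and hashCode to that key. The sketch below is our own variation, not part of the original post (the KeyWrapper name and the distinctBy helper are made up):

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

public class KeyWrapperDemo {

    // Hypothetical generic wrapper: equality and hashing are
    // delegated to a key extracted from the wrapped value.
    static final class KeyWrapper<T, K> {
        private final T value;
        private final K key;

        KeyWrapper(T value, Function<? super T, ? extends K> keyExtractor) {
            this.value = value;
            this.key = keyExtractor.apply(value);
        }

        T unwrap() {
            return value;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof KeyWrapper<?, ?> other && Objects.equals(key, other.key);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(key);
        }
    }

    // Distinct by an arbitrary key: wrap, distinct, unwrap.
    static <T, K> List<T> distinctBy(List<T> items, Function<? super T, ? extends K> keyExtractor) {
        return items.stream()
                .map(item -> new KeyWrapper<T, K>(item, keyExtractor))
                .distinct()
                .map(KeyWrapper::unwrap)
                .toList();
    }

    public static void main(String[] args) {
        // Quick demo: strings deduplicated by their first character.
        System.out.println(distinctBy(List.of("apple", "avocado", "banana"), s -> s.charAt(0)));
    }
}
```

This keeps the wrap/distinct/unwrap shape of the PersonWrapper approach while needing only one class, at the cost of some generics noise.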

3 – Use a stateful filter by accumulating distinct objects in a Set

The following three approaches use a stateful filter that accumulates the keys of distinct objects in a Set. For each element in the stream, we use a function to extract the key and insert it into the Set. The Set’s add method returns true if the added element was not already present. Hence, if we get back true, this key is being seen for the first time and we let the object pass through the stream pipeline; otherwise, we filter it out.

3a – DistinctByKey class

In this option, we will create a class that has:

  1. A mapping function to extract the key/property by which we want to apply the distinct operation.
  2. A Set to hold the keys seen so far.

public class DistinctByKey<T> {
    private final Function<T, Object> function;
    private final Set<Object> seenObjects;

    public DistinctByKey(Function<T, Object> function) {
        this.function = function;
        this.seenObjects = new HashSet<>();
    }

    public boolean filterByKey(T t) {
        return seenObjects.add(function.apply(t));
    }
}

The class is generic to make it reusable in many scenarios. The filterByKey method takes an element, applies the mapping function to get the property value (by which we are applying distinct), and returns the result of adding it to the set: false if the key was already present, true otherwise.

We would use this as:

people.stream()
        .filter(new DistinctByKey<>(Person::getAge)::filterByKey)
        .forEach(System.out::println);

In the above code, we create a new instance of DistinctByKey in the filter step, passing the method reference Person::getAge since we want persons with distinct ages. The method reference to that instance’s filterByKey method forms the Predicate passed to the filter method.

As expected, this results in:

Adam 21
John 28
Adam 22
Ben 45
Smith 47

Note that we do not create one DistinctByKey instance per stream element. Only one DistinctByKey is created (the filter method receives a single Predicate), and this is essential since we are accumulating state in that instance.

The above can be written as (simplified):

DistinctByKey<Person> distinctByKey = new DistinctByKey<>(Person::getAge);
people.stream()
        .filter(distinctByKey::filterByKey) //person -> distinctByKey.filterByKey(person)
        .forEach(System.out::println);

Now, all the distinct logic to filter duplicates resides in the DistinctByKey class.

A major drawback of this approach is that it violates the contract of the filter operation: the predicate passed to filter is expected to be stateless.

3b – Making DistinctByKey implement a Predicate

We can simplify the above code slightly by making the DistinctByKey class implement Predicate, which lets us avoid the explicit method reference to filterByKey.

public class DistinctByKey<T> implements Predicate<T> {
    private final Function<T, Object> function;
    private final Set<Object> seenObjects;

    public DistinctByKey(Function<T, Object> function) {
        this.function = function;
        this.seenObjects = new HashSet<>();
    }

    @Override
    public boolean test(T t) {
        return seenObjects.add(function.apply(t));
    }
}

people.stream()
        .filter(new DistinctByKey<>(Person::getAge))
        .forEach(System.out::println);

3c – Higher order function

Actually, we do not need a separate class like DistinctByKey at all. Instead, we can create a static (generic) method and pass the key-extraction function to it.

public static <T> Predicate<T> distinctByKey(Function<T, Object> function) {
    Set<Object> seen = new HashSet<>();
    return t -> seen.add(function.apply(t));
}

It accepts a function that extracts the property from an object and returns a new Predicate that captures a Set. We pass the returned Predicate to the filter method; hence, throughout the stream pipeline’s execution, the filter inserts each extracted key into the same Set (the seen variable).

people.stream()
        .filter(distinctByKey(Person::getAge))
        .forEach(System.out::println);

Another important point is that the distinctByKey method is called only once. If we called it for each element in the stream, each call would create its own Set, and the filtering would not work.
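
To illustrate why this matters, the sketch below (using plain integers as the stream elements) contrasts the correct usage with a broken variant that calls distinctByKey once per element:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class DistinctByKeyScope {

    static <T> Predicate<T> distinctByKey(Function<T, Object> function) {
        Set<Object> seen = new HashSet<>();
        return t -> seen.add(function.apply(t));
    }

    // Correct: distinctByKey runs once, so a single Set is shared
    // across all elements of the stream.
    static long sharedSetCount(List<Integer> values) {
        return values.stream()
                .filter(distinctByKey(Function.identity()))
                .count();
    }

    // Broken: a fresh Predicate (and a fresh Set) is created per element,
    // so every add() returns true and nothing is filtered out.
    static long perElementSetCount(List<Integer> values) {
        return values.stream()
                .filter(v -> distinctByKey(Function.identity()).test(v))
                .count();
    }

    public static void main(String[] args) {
        List<Integer> ages = List.of(21, 28, 21, 45, 45);
        System.out.println(sharedSetCount(ages));     // 3
        System.out.println(perElementSetCount(ages)); // 5
    }
}
```

The shared-Set version deduplicates as intended; the per-element version silently passes everything through.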

Parallel streams

We used a HashSet in the above example. For parallel streams, we need a data structure that supports concurrent insertions; hence, we can use the key set of a ConcurrentHashMap via ConcurrentHashMap.newKeySet().

private static <T> Predicate<T> distinctByKey(Function<? super T, ?> function) {
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    return t -> seen.add(function.apply(t));
}

The function’s type parameter is widened to ? super T so that a key extractor defined on a supertype of T can also be passed.

Some of the disadvantages of this approach are:

  1. The ConcurrentHashMap-backed set is more expensive and adds overhead, and it is used even for sequential streams.
  2. As stated earlier, accumulating state in a filter violates the stream contract and is not recommended.
  3. For an ordered parallel stream, any element among the duplicates may be preserved, whereas the stream’s distinct method guarantees stability by keeping the first element among duplicates.
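
As a rough illustration of the thread-safe variant, the sketch below (with a record standing in for the Person class) counts distinct ages over a parallel stream; the count is stable even though which duplicate survives is not:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;

public class ParallelDistinctByKey {

    // Minimal stand-in (a record) for the article's Person class.
    record Person(String name, int age) {}

    static <T> Predicate<T> distinctByKey(Function<? super T, ?> function) {
        // ConcurrentHashMap.newKeySet() tolerates concurrent add() calls.
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(function.apply(t));
    }

    static <T> long countDistinctBy(List<T> items, Function<? super T, ?> keyExtractor) {
        return items.parallelStream()
                .filter(distinctByKey(keyExtractor))
                .count();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Adam", 21), new Person("John", 28),
                new Person("Adam", 22), new Person("Ben", 45),
                new Person("Mathew", 45), new Person("Smith", 47));

        // Five distinct ages; whether Ben or Mathew would survive a full
        // collect depends on thread timing, but the count is deterministic.
        System.out.println(countDistinctBy(people, Person::age)); // 5
    }
}
```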

Conclusion

In this post, we learnt how to apply distinct by a custom property or field. Make sure you check out the Stream distinct method if you haven’t already.

