Apache Commons Lang CharSet and CharSetUtils

The Apache Commons Lang CharSet class represents a set of characters. The CharSetUtils class has utility methods that operate on strings using the same set syntax that CharSet uses. In this post, we shall learn about the CharSet and CharSetUtils classes in Apache Commons Lang.

Common CharSets

The CharSet class exposes some commonly used CharSet instances as public static fields.

  • EMPTY - an empty CharSet with no characters
  • ASCII_ALPHA - a CharSet which represents ASCII alphabetic characters (a-z and A-Z)
  • ASCII_ALPHA_LOWER - a CharSet which represents ASCII lower-cased alphabetic characters (a-z)
  • ASCII_ALPHA_UPPER - a CharSet which represents ASCII upper-cased alphabetic characters (A-Z)
  • ASCII_NUMERIC - a CharSet which represents numeric characters (0-9)
System.out.println(CharSet.EMPTY); //[]
System.out.println(CharSet.ASCII_ALPHA); //[a-z, A-Z]
System.out.println(CharSet.ASCII_ALPHA_LOWER); //[a-z]
System.out.println(CharSet.ASCII_ALPHA_UPPER); //[A-Z]
System.out.println(CharSet.ASCII_NUMERIC); //[0-9]

CharSet - contains method

Once we have a CharSet instance, we can check if a particular character is part of the CharSet by using the contains method.

Let us get the CharSet for the ASCII alphabets and check if it has lower and upper case ‘b’ in it. Both the below calls print true.

CharSet asciiAlpha = CharSet.ASCII_ALPHA;
System.out.println(asciiAlpha.contains('b')); //true
System.out.println(asciiAlpha.contains('B')); //true

If we check if it has the character ‘2’, then it prints false, as the ASCII alphabetic characters do not have numbers in them.

System.out.println(asciiAlpha.contains('2')); //false

Another example using ASCII lower case alphabets is shown below.

CharSet asciiLower = CharSet.ASCII_ALPHA_LOWER;
System.out.println(asciiLower.contains('a')); //true
System.out.println(asciiLower.contains('A')); //false
System.out.println(asciiLower.contains('6')); //false
System.out.println(asciiLower.contains('@')); //false

CharSet instance - construction

To construct a CharSet instance with custom characters (other than the ones available as public static fields), we can use the getInstance static factory method. It takes a varargs of strings, each of which describes a set of characters in the set syntax (for example: a-z, A-Z, 0-9).

Basic examples of using CharSet

If we pass an empty string, it builds and returns a CharSet with no characters in it.

CharSet emptyCharSet = CharSet.getInstance("");
System.out.println(emptyCharSet); //[]

To build a CharSet with just one character, we can pass the character as a string. The CharSet will then have only that character and nothing else.

CharSet singleChar = CharSet.getInstance("a");
System.out.println(singleChar); //[a]
System.out.println(singleChar.contains('a')); //true
System.out.println(singleChar.contains('b')); //false
System.out.println(singleChar.contains('A')); //false

For a range of characters, we pass the first character, a hyphen and the last character. In the below example, the CharSet has characters ‘a’ to ‘d’.

CharSet multiChars = CharSet.getInstance("a-d");
System.out.println(multiChars); //[a-d]
System.out.println(multiChars.contains('a')); //true
System.out.println(multiChars.contains('d')); //true
System.out.println(multiChars.contains('e')); //false

For an arbitrary list of characters, we pass them as a string in any order. In the below CharSet, we have the characters ‘a’, ‘d’, ‘e’ and ‘f’ only.

CharSet multiChars = CharSet.getInstance("adef");
System.out.println(multiChars); //[d, f, a, e]
System.out.println(multiChars.contains('a')); //true
System.out.println(multiChars.contains('b')); //false
System.out.println(multiChars.contains('e')); //true

Negations

We can also specify negations by using the caret symbol (^). In the below code, we have a CharSet which has all the characters except ‘a’, ‘b’, ‘c’ and ‘d’.

CharSet negated = CharSet.getInstance("^a-d");
System.out.println(negated); //[^a-d]

System.out.println(negated.contains('a')); //false 
System.out.println(negated.contains('d')); //false
System.out.println(negated.contains('e')); //true
System.out.println(negated.contains('E')); //true
System.out.println(negated.contains('3')); //true
System.out.println(negated.contains('@')); //true

CharSet - Specifying Combinations

We can even combine multi characters (ranges) with individual characters. Here, we have specified two sets:

  1. First set has characters [a-d] i.e., ‘a’, ‘b’, ‘c’ and ‘d’
  2. Second set has only characters ‘x’ and ‘y’.
CharSet combinations = CharSet.getInstance("a-dxy");
System.out.println(combinations); //[a-d, x, y]

System.out.println(combinations.contains('a')); //true
System.out.println(combinations.contains('d')); //true
System.out.println(combinations.contains('x')); //true
System.out.println(combinations.contains('y')); //true
System.out.println(combinations.contains('z')); //false

Since it takes a varargs, we can even specify the above equivalently as:

CharSet combinations = CharSet.getInstance("a-d", "xy");
System.out.println(combinations); //[a-d, x, y]

Matching order of the specified strings

The getInstance method processes each string from left to right and, at each position, tries the following patterns in this order to split the string into character groups.

  1. Negated multi character range (like ^a-d).
  2. Normal multi character range (like a-d).
  3. Negated single character range (like ^a).
  4. Normal single character range (like a).
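The matching order above can be illustrated with a small tokenizer. This is only a sketch of the idea written with plain Java, not the library's source; the class name SetSyntaxTokenizer and the method tokenize are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it is not part of Apache Commons Lang.
class SetSyntaxTokenizer {

    // Splits a set-syntax string into rule tokens, trying the four patterns
    // in order at each position and moving left to right.
    static List<String> tokenize(String spec) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < spec.length()) {
            int remaining = spec.length() - pos;
            int size;
            if (remaining >= 4 && spec.charAt(pos) == '^' && spec.charAt(pos + 2) == '-') {
                size = 4; // negated range such as ^a-d
            } else if (remaining >= 3 && spec.charAt(pos + 1) == '-') {
                size = 3; // normal range such as a-d
            } else if (remaining >= 2 && spec.charAt(pos) == '^') {
                size = 2; // negated single character such as ^a
            } else {
                size = 1; // normal single character such as a
            }
            tokens.add(spec.substring(pos, pos + size));
            pos += size;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("^a-dl-n")); // [^a-d, l-n]
        System.out.println(tokenize("a-dxy"));   // [a-d, x, y]
        System.out.println(tokenize("a-z^"));    // [a-z, ^]
    }
}
```

Note how a trailing caret has nothing left to negate, so it falls through to the last pattern and is treated as a literal character.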

Union of specified ranges

When we pass multiple rules/ranges, it does a union on them. For example, as shown below, there are two ranges

  1. All characters except ‘a’, ‘b’, ‘c’ and ‘d’
  2. Characters ‘l’, ‘m’ and ’n'

The final resultant CharSet has all characters except ‘a’, ‘b’, ‘c’ and ‘d’. Hence, here specifying the second set wasn’t needed.

CharSet combinations = CharSet.getInstance("^a-dl-n");
System.out.println(combinations); //[^a-d, l-n]
System.out.println(combinations.contains('a')); //false
System.out.println(combinations.contains('l')); //true
System.out.println(combinations.contains('e')); //true

As another example, consider the set ^a-da-e, which is split into the ranges ^a-d and a-e:

CharSet combinations = CharSet.getInstance("^a-da-e");
System.out.println(combinations); //[^a-d, a-e]
System.out.println(combinations.contains('a')); //true
System.out.println(combinations.contains('d')); //true
System.out.println(combinations.contains('f')); //true
System.out.println(combinations.contains('1')); //true
System.out.println(combinations.contains('@')); //true

The first rule/set has all characters except ‘a’, ‘b’, ‘c’ and ‘d’ and the second has characters a to e. Together, we have a CharSet which has all the characters.
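To see why the union behaves this way, here is a minimal sketch that models each rule as an inclusive range with an optional negation and treats a character as a member if any one rule accepts it. The names UnionOfRules, Rule, range and not are assumptions made for this illustration and are not library classes.

```java
// Hypothetical model for illustration; it is not part of Apache Commons Lang.
class UnionOfRules {

    // One rule: an inclusive range [start, end], optionally negated.
    static final class Rule {
        final char start;
        final char end;
        final boolean negated;

        Rule(char start, char end, boolean negated) {
            this.start = start;
            this.end = end;
            this.negated = negated;
        }

        // A negated rule accepts exactly the characters outside its range.
        boolean matches(char ch) {
            boolean inRange = ch >= start && ch <= end;
            return inRange != negated;
        }
    }

    static Rule range(char start, char end) {
        return new Rule(start, end, false);
    }

    static Rule not(char start, char end) {
        return new Rule(start, end, true);
    }

    // Union semantics: the character is in the set if any rule accepts it.
    static boolean contains(char ch, Rule... rules) {
        for (Rule rule : rules) {
            if (rule.matches(ch)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Models "^a-dl-n": l-n adds nothing that ^a-d did not already accept.
        System.out.println(contains('a', not('a', 'd'), range('l', 'n'))); // false
        System.out.println(contains('l', not('a', 'd'), range('l', 'n'))); // true
        // Models "^a-da-e": a-e fills the gap left by ^a-d.
        System.out.println(contains('a', not('a', 'd'), range('a', 'e'))); // true
        System.out.println(contains('@', not('a', 'd'), range('a', 'e'))); // true
    }
}
```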

Some specific cases when building a CharSet

If we specify the same range more than once, only one will be kept.

CharSet c = CharSet.getInstance("a-ea-e");
System.out.println(c); //[a-e]

Also, if we swap the start and end, it will restore the proper order as shown below.

CharSet c = CharSet.getInstance("e-a");
System.out.println(c); //[a-e]

To add the negation character itself to the CharSet, we can either put it last in the string or pass it as a separate element.

CharSet negationCharAndAsciiLower = CharSet.getInstance("a-z^");
System.out.println(negationCharAndAsciiLower); //[^, a-z]
System.out.println(negationCharAndAsciiLower.contains('^')); //true
System.out.println(negationCharAndAsciiLower.contains('a')); //true
System.out.println(negationCharAndAsciiLower.contains('A')); //false

negationCharAndAsciiLower = CharSet.getInstance("^", "a-z"); 
System.out.println(negationCharAndAsciiLower); //[^, a-z]
System.out.println(negationCharAndAsciiLower.contains('^')); //true
System.out.println(negationCharAndAsciiLower.contains('a')); //true
System.out.println(negationCharAndAsciiLower.contains('A')); //false

CharSet equals

The CharSet equals method compares two CharSet instances and returns true only if they represent the same set of characters and in the same way.

CharSet c1 = CharSet.getInstance("a-c");
CharSet c2 = CharSet.getInstance("a-c");
System.out.println(c1.equals(c2)); //true

CharSet c3 = CharSet.getInstance("a-d");
System.out.println(c1.equals(c3)); //false

CharSet c4 = CharSet.getInstance("abc");
System.out.println(c1.equals(c4)); //false

In the last example, though the CharSets c1 and c4 represent the same set of characters (‘a’, ‘b’ and ‘c’), they are defined in different ways (one range versus three single characters) and hence it returns false.

CharSet hashCode and toString

We have already seen toString in action, as the previous examples print CharSet instances. Internally, CharSet keeps its definitions in a Set (a HashSet) and toString invokes toString on that set, which is why the printed order of the definitions can differ from the order in which we specified them. CharSet also provides an implementation of hashCode.

CharSetUtils

Now let us look at the following methods in the CharSetUtils:

  • containsAny
  • count
  • delete
  • keep
  • squeeze

All these methods take a string and a varargs of charset in the set syntax (as used to build a CharSet instance) and not the actual CharSet instance.

CharSetUtils#containsAny

It checks if any of the characters from the passed varargs of charset are present in the string. If yes, it returns true; otherwise it returns false.

System.out.println(CharSetUtils.containsAny("abcd", "a-b")); //true
System.out.println(CharSetUtils.containsAny("abcd", "e-g")); //false
System.out.println(CharSetUtils.containsAny("abcd", "a-b", "e-g")); //true

In the first line, the charset has the characters ‘a’ and ‘b’. Since at least one of them is present in the passed string, it returns true. In the second case, the charset has ‘e’, ‘f’ and ‘g’; none of them is present in the string and hence it returns false. Finally, we pass two charsets; characters ‘a’ and ‘b’ from the first one are present in the string, so it returns true.

Some more examples are as follows:

System.out.println(CharSetUtils.containsAny("12", "a-b")); //false
System.out.println(CharSetUtils.containsAny("12", "2-5")); //true

System.out.println(CharSetUtils.containsAny("abc@", "a", "@")); //true

CharSetUtils#count

The CharSetUtils#containsAny method returns only a boolean, i.e., whether any of the characters from the passed charset(s) are present in the string. The count method returns how many characters of the string are present in the charset(s). Shown below is the usage of the count method for the same set of examples from above.

System.out.println(CharSetUtils.count("abcd", "a-b")); //2
System.out.println(CharSetUtils.count("abcd", "e-g")); //0
System.out.println(CharSetUtils.count("abcd", "a-b", "e-g")); //2

System.out.println(CharSetUtils.count("12", "a-b")); //0
System.out.println(CharSetUtils.count("12", "2-5")); //1

System.out.println(CharSetUtils.count("abc@", "a", "@")); //2
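The relationship between containsAny and count can be sketched with plain Java as a single scan over the string. This is a simplified illustration, not the library's implementation: the class MembershipScan is hypothetical, and a Predicate stands in for the charset built from the set syntax.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration; it is not part of Apache Commons Lang.
class MembershipScan {

    // Counts the characters of str that the set accepts.
    static int count(String str, Predicate<Character> inSet) {
        int matches = 0;
        for (char ch : str.toCharArray()) {
            if (inSet.test(ch)) {
                matches++;
            }
        }
        return matches;
    }

    // Stops at the first accepted character instead of counting them all.
    static boolean containsAny(String str, Predicate<Character> inSet) {
        for (char ch : str.toCharArray()) {
            if (inSet.test(ch)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Predicate<Character> aToB = ch -> ch >= 'a' && ch <= 'b'; // stands in for "a-b"
        Predicate<Character> eToG = ch -> ch >= 'e' && ch <= 'g'; // stands in for "e-g"

        System.out.println(containsAny("abcd", aToB)); // true
        System.out.println(count("abcd", aToB));       // 2
        System.out.println(containsAny("abcd", eToG)); // false
        System.out.println(count("abcd", eToG));       // 0
    }
}
```

In this sketch, containsAny is true exactly when count is greater than zero.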

CharSetUtils#delete

The delete method takes a string and a varargs of charset and deletes from the string every character that is present in the charset.

System.out.println(CharSetUtils.delete("abcd", "a-b")); //cd
System.out.println(CharSetUtils.delete("abcd", "e-g")); //abcd
System.out.println(CharSetUtils.delete("abcd", "a-b", "e-g")); //cd

System.out.println(CharSetUtils.delete("12", "a-b")); //12
System.out.println(CharSetUtils.delete("12", "2-5")); //1

System.out.println(CharSetUtils.delete("abc@", "a", "@")); //bc
  1. The charset is ‘a’ and ‘b’ and hence it deletes those characters from the string - the result being cd.
  2. We have the charset as ‘e’, ‘f’ and ‘g’ and none of those characters are present in the string and hence it deletes nothing.
  3. It removes characters ‘a’ and ‘b’.
  4. None of the characters from the charset is present in the string and hence it deletes nothing.
  5. Of the characters in the charset (‘2’ to ‘5’), only ‘2’ is present in the string. It is deleted and the result is “1”.
  6. Finally, it deletes the characters ‘a’ and ‘@’ from the string.

CharSetUtils#keep

The keep method is the inverse of the delete method. Rather than deleting the characters, it keeps only the characters that are present in the charset.

System.out.println(CharSetUtils.keep("abcd", "a-b")); //ab
System.out.println(CharSetUtils.keep("abcd", "e-g")); //""
System.out.println(CharSetUtils.keep("abcd", "a-b", "e-g")); //ab

System.out.println(CharSetUtils.keep("12", "a-b")); //""
System.out.println(CharSetUtils.keep("12", "2-5")); //2

System.out.println(CharSetUtils.keep("abc@", "a", "@")); //a@

From the output, we can see that each result is the complement of the corresponding output from the delete method: the characters that delete removes are exactly the ones that keep retains.

In the first call, since the charset has characters ‘a’ and ‘b’, it retains only those characters in the string. In the second call, none of the characters from the charset are present in the string and hence it retains none of the characters resulting in an empty string. I’ll skip the explanation of the other calls.
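One way to picture delete and keep as complements is a single filtering loop controlled by a boolean flag. The following is a sketch under that assumption, written with plain Java; DeleteKeepSketch and its methods are hypothetical names, and a Predicate stands in for the charset.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration; it is not part of Apache Commons Lang.
class DeleteKeepSketch {

    // Retains a character only when its membership in the set equals keepIfMember.
    static String filter(String str, Predicate<Character> inSet, boolean keepIfMember) {
        StringBuilder result = new StringBuilder(str.length());
        for (char ch : str.toCharArray()) {
            if (inSet.test(ch) == keepIfMember) {
                result.append(ch);
            }
        }
        return result.toString();
    }

    // delete drops the members of the set.
    static String delete(String str, Predicate<Character> inSet) {
        return filter(str, inSet, false);
    }

    // keep drops everything except the members of the set.
    static String keep(String str, Predicate<Character> inSet) {
        return filter(str, inSet, true);
    }

    public static void main(String[] args) {
        Predicate<Character> aToB = ch -> ch >= 'a' && ch <= 'b'; // stands in for "a-b"
        System.out.println(delete("abcd", aToB)); // cd
        System.out.println(keep("abcd", aToB));   // ab
    }
}
```

Every character of the input ends up in exactly one of the two results, which is why the outputs of the two methods never overlap.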

CharSetUtils#squeeze

The squeeze method collapses consecutive repetitions of a character into a single occurrence, but only for characters that are in the charset.

System.out.println(CharSetUtils.squeeze("aabccd", "a-d")); //abcd
System.out.println(CharSetUtils.squeeze("aabccd", "a-b")); //abccd
System.out.println(CharSetUtils.squeeze("abbceffg", "a-b", "e-g")); //abcefg

System.out.println(CharSetUtils.squeeze("1223", "1-3")); //123

System.out.println(CharSetUtils.squeeze("abbc@@", "a", "@")); //abbc@

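In the first call, the repetitions of characters ‘a’ and ‘c’ have been removed. In the second call, the repetition of character ‘a’ is removed but not that of ‘c’, as ‘c’ is not in the charset. Similarly, in the last call, the repeated ‘b’ is left as is and only the repeated ‘@’ is squeezed.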
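The squeezing rule can be sketched in a few lines of plain Java: a character is skipped only when it repeats the character immediately before it and it belongs to the set. This is an illustration of the behaviour seen in the outputs above, not the library's code; SqueezeSketch is a hypothetical name and a Predicate stands in for the charset.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration; it is not part of Apache Commons Lang.
class SqueezeSketch {

    // Drops a character when it repeats the previous one and is in the set.
    static String squeeze(String str, Predicate<Character> inSet) {
        StringBuilder result = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            boolean repeatsPrevious = i > 0 && ch == str.charAt(i - 1);
            if (repeatsPrevious && inSet.test(ch)) {
                continue; // consecutive duplicate of a character in the set
            }
            result.append(ch);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Predicate<Character> aToD = ch -> ch >= 'a' && ch <= 'd'; // stands in for "a-d"
        Predicate<Character> aToB = ch -> ch >= 'a' && ch <= 'b'; // stands in for "a-b"

        System.out.println(squeeze("aabccd", aToD)); // abcd
        System.out.println(squeeze("aabccd", aToB)); // abccd
    }
}
```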

Conclusion

This concludes the post on the Apache Commons Lang CharSet and CharSetUtils. Check out the other useful utilities in the Apache Commons Lang.