Apache Commons Lang CharSet and CharSetUtils
The Apache Commons Lang CharSet class represents a set of characters. The CharSetUtils class has utility methods that operate on strings using the same set syntax. In this post, we shall learn about the CharSet and CharSetUtils classes in Apache Commons Lang.
Common CharSets
The CharSet class exposes some commonly used CharSet instances as public static fields.
- EMPTY - an empty CharSet with no characters
- ASCII_ALPHA - a CharSet which represents ASCII alphabetic characters (a-z and A-Z)
- ASCII_ALPHA_LOWER - a CharSet which represents ASCII lower-cased alphabetic characters (a-z)
- ASCII_ALPHA_UPPER - a CharSet which represents ASCII upper-cased alphabetic characters (A-Z)
- ASCII_NUMERIC - a CharSet which represents numeric characters (0-9)
System.out.println(CharSet.EMPTY); //[]
System.out.println(CharSet.ASCII_ALPHA); //[a-z, A-Z]
System.out.println(CharSet.ASCII_ALPHA_LOWER); //[a-z]
System.out.println(CharSet.ASCII_ALPHA_UPPER); //[A-Z]
System.out.println(CharSet.ASCII_NUMERIC); //[0-9]
CharSet - contains method
Once we have a CharSet instance, we can check if a particular character is part of the CharSet by using the contains method.
Let us get the CharSet for the ASCII alphabets and check if it has lower and upper case ‘b’ in it. Both the below calls print true.
CharSet asciiAlpha = CharSet.ASCII_ALPHA;
System.out.println(asciiAlpha.contains('b')); //true
System.out.println(asciiAlpha.contains('B')); //true
If we check if it has the character ‘2’, then it prints false, as the ASCII alphabetic characters do not have numbers in them.
System.out.println(asciiAlpha.contains('2')); //false
Another example using ASCII lower case alphabets is shown below.
CharSet asciiLower = CharSet.ASCII_ALPHA_LOWER;
System.out.println(asciiLower.contains('a')); //true
System.out.println(asciiLower.contains('A')); //false
System.out.println(asciiLower.contains('6')); //false
System.out.println(asciiLower.contains('@')); //false
CharSet instance - construction
To construct a CharSet instance with custom characters (other than the ones available as public static fields), we can use the getInstance static factory method. We pass a varargs of strings each representing the set of characters (example: a-z, A-Z, 0-9).
Basic examples of using CharSet
If we pass an empty string, it builds and returns a CharSet with no characters in it.
CharSet emptyCharSet = CharSet.getInstance("");
System.out.println(emptyCharSet); //[]
To build a CharSet with just one character, we can pass the character as a string. The CharSet will then have only that character and nothing else.
CharSet singleChar = CharSet.getInstance("a");
System.out.println(singleChar); //[a]
System.out.println(singleChar.contains('a')); //true
System.out.println(singleChar.contains('b')); //false
System.out.println(singleChar.contains('A')); //false
For a range of characters, we pass the first character, a hyphen and the last character. In the below example, the CharSet has the characters ‘a’ to ‘d’.
CharSet multiChars = CharSet.getInstance("a-d");
System.out.println(multiChars); //[a-d]
System.out.println(multiChars.contains('a')); //true
System.out.println(multiChars.contains('d')); //true
System.out.println(multiChars.contains('e')); //false
For an arbitrary list of characters, we pass them as a string in any order. In the below CharSet, we have the characters ‘a’, ‘d’, ‘e’ and ‘f’ only.
CharSet multiChars = CharSet.getInstance("adef");
System.out.println(multiChars); //[d, f, a, e]
System.out.println(multiChars.contains('a')); //true
System.out.println(multiChars.contains('b')); //false
System.out.println(multiChars.contains('e')); //true
Negations
We can also specify negations by using the caret symbol (^). In the below code, we have a CharSet which has all the characters except ‘a’, ‘b’, ‘c’ and ‘d’.
CharSet negated = CharSet.getInstance("^a-d");
System.out.println(negated); //[^a-d]
System.out.println(negated.contains('a')); //false
System.out.println(negated.contains('d')); //false
System.out.println(negated.contains('e')); //true
System.out.println(negated.contains('E')); //true
System.out.println(negated.contains('3')); //true
System.out.println(negated.contains('@')); //true
CharSet - Specifying Combinations
We can even combine multi characters (ranges) with individual characters. Here, we have specified two sets:
- First set has characters [a-d] i.e., ‘a’, ‘b’, ‘c’ and ‘d’
- Second set has only characters ‘x’ and ‘y’.
CharSet combinations = CharSet.getInstance("a-dxy");
System.out.println(combinations); //[a-d, x, y]
System.out.println(combinations.contains('a')); //true
System.out.println(combinations.contains('d')); //true
System.out.println(combinations.contains('x')); //true
System.out.println(combinations.contains('y')); //true
System.out.println(combinations.contains('z')); //false
Since it takes a varargs, we can even specify the above equivalently as:
CharSet combinations = CharSet.getInstance("a-d", "xy");
System.out.println(combinations); //[a-d, x, y]
Matching order of the specified strings
The getInstance method uses the following matching order to split each string into character groups, and it processes the string from left to right.
- Negated multi character range (like ^a-d)
- Normal multi character range (like a-d)
- Negated single character (like ^a)
- Normal single character (like a)
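The four rules above can be sketched in plain Java. This is a minimal illustration, not the Commons Lang source: the class name SetSyntaxSketch and the method parse are hypothetical, and the definitions are kept as plain strings in a List, so the printed order follows the input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the matching order; not the Commons Lang implementation. */
final class SetSyntaxSketch {

    /** Splits a set-syntax string into definitions, trying the four rules in order. */
    static List<String> parse(String spec) {
        List<String> definitions = new ArrayList<>();
        int pos = 0;
        int len = spec.length();
        while (pos < len) {
            int remaining = len - pos;
            if (remaining >= 4 && spec.charAt(pos) == '^' && spec.charAt(pos + 2) == '-') {
                // Rule 1: negated multi character range, e.g. ^a-d
                definitions.add(spec.substring(pos, pos + 4));
                pos += 4;
            } else if (remaining >= 3 && spec.charAt(pos + 1) == '-') {
                // Rule 2: normal multi character range, e.g. a-d
                definitions.add(spec.substring(pos, pos + 3));
                pos += 3;
            } else if (remaining >= 2 && spec.charAt(pos) == '^') {
                // Rule 3: negated single character, e.g. ^a
                definitions.add(spec.substring(pos, pos + 2));
                pos += 2;
            } else {
                // Rule 4: normal single character, e.g. a
                definitions.add(spec.substring(pos, pos + 1));
                pos += 1;
            }
        }
        return definitions;
    }

    public static void main(String[] args) {
        System.out.println(parse("a-dxy"));   // [a-d, x, y]
        System.out.println(parse("^a-dl-n")); // [^a-d, l-n]
        System.out.println(parse("a-z^"));    // [a-z, ^]
    }
}
```

In the last call, the caret is the final character, so only the single-character rule can match it. This is why placing ^ at the end adds the caret itself instead of negating anything.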
Union of specified ranges
When we pass multiple rules/ranges, it does a union on them. For example, as shown below, there are two ranges:
- All characters except ‘a’, ‘b’, ‘c’ and ‘d’
- Characters ‘l’, ‘m’ and ‘n’
The final resultant CharSet has all characters except ‘a’, ‘b’, ‘c’ and ‘d’. The characters ‘l’, ‘m’ and ‘n’ are already covered by the first range, so specifying the second set wasn’t needed here.
CharSet combinations = CharSet.getInstance("^a-dl-n");
System.out.println(combinations); //[^a-d, l-n]
System.out.println(combinations.contains('a')); //false
System.out.println(combinations.contains('l')); //true
System.out.println(combinations.contains('e')); //true
As another example, consider the ranges [^a-da-e].
CharSet combinations = CharSet.getInstance("^a-da-e");
System.out.println(combinations); //[^a-d, a-e]
System.out.println(combinations.contains('a')); //true
System.out.println(combinations.contains('d')); //true
System.out.println(combinations.contains('f')); //true
System.out.println(combinations.contains('1')); //true
System.out.println(combinations.contains('@')); //true
The first rule/set has all characters except ‘a’, ‘b’, ‘c’ and ‘d’ and the second has characters a to e. Together, we have a CharSet which has all the characters.
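One way to picture the union is to treat every range as a predicate over characters and to accept a character when any predicate matches. The sketch below assumes this model. UnionSketch, range, notRange and contains are hypothetical names, and the code uses only the JDK, not the CharSet class.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: a union of ranges modelled as "any predicate matches". */
final class UnionSketch {

    /** Matches characters from start to end, inclusive. */
    static IntPredicate range(char start, char end) {
        return c -> c >= start && c <= end;
    }

    /** Matches every character outside start..end (a negated range). */
    static IntPredicate notRange(char start, char end) {
        return range(start, end).negate();
    }

    /** A character is in the union if at least one range accepts it. */
    static boolean contains(char c, IntPredicate... ranges) {
        for (IntPredicate r : ranges) {
            if (r.test(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IntPredicate notAToD = notRange('a', 'd'); // stands in for ^a-d
        IntPredicate aToE = range('a', 'e');       // stands in for a-e

        // Every character is accepted by one of the two ranges.
        for (char c : new char[] {'a', 'd', 'f', '1', '@'}) {
            System.out.println(c + " -> " + contains(c, notAToD, aToE)); // true each time
        }

        // With ^a-d and l-n, 'a' is accepted by neither range.
        System.out.println(contains('a', notAToD, range('l', 'n'))); // false
    }
}
```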
Some specific cases when building a CharSet
If we specify the same range more than once, only one will be kept.
CharSet c = CharSet.getInstance("a-ea-e");
System.out.println(c); //[a-e]
Also, if we swap the start and end, it will restore the proper order as shown below.
CharSet c = CharSet.getInstance("e-a");
System.out.println(c); //[a-e]
To add the negation character itself to the CharSet, we can either put it last or pass it as a separate element.
CharSet negationCharAndAsciiLower = CharSet.getInstance("a-z^");
System.out.println(negationCharAndAsciiLower); //[^, a-z]
System.out.println(negationCharAndAsciiLower.contains('^')); //true
System.out.println(negationCharAndAsciiLower.contains('a')); //true
System.out.println(negationCharAndAsciiLower.contains('A')); //false
negationCharAndAsciiLower = CharSet.getInstance("^", "a-z");
System.out.println(negationCharAndAsciiLower); //[^, a-z]
System.out.println(negationCharAndAsciiLower.contains('^')); //true
System.out.println(negationCharAndAsciiLower.contains('a')); //true
System.out.println(negationCharAndAsciiLower.contains('A')); //false
CharSet equals
The CharSet equals method compares two CharSet instances and returns true only if they represent the same set of characters and are defined in the same way.
CharSet c1 = CharSet.getInstance("a-c");
CharSet c2 = CharSet.getInstance("a-c");
System.out.println(c1.equals(c2)); //true
CharSet c3 = CharSet.getInstance("a-d");
System.out.println(c1.equals(c3)); //false
CharSet c4 = CharSet.getInstance("abc");
System.out.println(c1.equals(c4)); //false
In the last example, though CharSets c1 and c4 represent the same set of characters (‘a’, ‘b’ and ‘c’), they are defined in different ways and hence it returns false.
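If what matters is whether two sets accept the same characters, regardless of how they were defined, one possible approach is to compare membership for every char value. The sketch below is only an illustration of that idea: SameCharactersSketch and sameCharacters are hypothetical names, and plain JDK predicates stand in for the contains method of two CharSet instances.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: compares two character sets by membership, not by definition. */
final class SameCharactersSketch {

    /** Returns true if both predicates agree on every possible char value. */
    static boolean sameCharacters(IntPredicate first, IntPredicate second) {
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (first.test(c) != second.test(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        IntPredicate asRange = c -> c >= 'a' && c <= 'c';             // stands in for "a-c"
        IntPredicate asList = c -> c == 'a' || c == 'b' || c == 'c';  // stands in for "abc"
        IntPredicate wider = c -> c >= 'a' && c <= 'd';               // stands in for "a-d"

        System.out.println(sameCharacters(asRange, asList)); // true
        System.out.println(sameCharacters(asRange, wider));  // false
    }
}
```

The loop visits all 65,536 char values, which is inexpensive for an occasional check but would be wasteful in a hot path.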
CharSet hashCode and toString
We have already seen toString implicitly, as I’ve printed CharSets in the previous examples. Internally, a CharSet uses a Set (a HashSet) to maintain its set of definitions, and toString invokes toString on that set, so the order of the definitions in the printed output is not guaranteed. The class also provides an implementation of hashCode.
CharSetUtils
Now let us look at the following methods in the CharSetUtils:
- containsAny
- count
- delete
- keep
- squeeze
All these methods take a string and a varargs of set-syntax strings (the same syntax used to build a CharSet instance), not an actual CharSet instance.
CharSetUtils#containsAny
It checks if any of the characters from the passed varargs of charset are present in the string. If yes, it returns true; otherwise, it returns false.
System.out.println(CharSetUtils.containsAny("abcd", "a-b")); //true
System.out.println(CharSetUtils.containsAny("abcd", "e-g")); //false
System.out.println(CharSetUtils.containsAny("abcd", "a-b", "e-g")); //true
In the first line, the charset has the characters ‘a’ and ‘b’. Since both of them are present in the passed string, it returns true. In the second case, the charset has ‘e’, ‘f’ and ‘g’, none of which are present in the string, and hence it returns false. Finally, we pass two charsets; the characters ‘a’ and ‘b’ are present in the string, so it returns true.
Some more examples are as follows:
System.out.println(CharSetUtils.containsAny("12", "a-b")); //false
System.out.println(CharSetUtils.containsAny("12", "2-5")); //true
System.out.println(CharSetUtils.containsAny("abc@", "a", "@")); //true
CharSetUtils#count
The CharSetUtils#containsAny method returns only a boolean, i.e., whether any of the characters from the passed charset(s) are present in the string. The count method returns how many characters of the string are in the charset(s), counting each occurrence. Shown below is the usage of the count method for the same set of examples from above.
System.out.println(CharSetUtils.count("abcd", "a-b")); //2
System.out.println(CharSetUtils.count("abcd", "e-g")); //0
System.out.println(CharSetUtils.count("abcd", "a-b", "e-g")); //2
System.out.println(CharSetUtils.count("12", "a-b")); //0
System.out.println(CharSetUtils.count("12", "2-5")); //1
System.out.println(CharSetUtils.count("abc@", "a", "@")); //2
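The relationship between the two methods can be sketched with plain JDK code. This is an assumption-laden illustration rather than the library's implementation: CountSketch, range, count and containsAny are hypothetical names, and a predicate stands in for the parsed set syntax.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch of count and containsAny over a membership predicate. */
final class CountSketch {

    /** Matches characters from start to end, inclusive (stands in for "start-end"). */
    static IntPredicate range(char start, char end) {
        return c -> c >= start && c <= end;
    }

    /** Counts every occurrence of a character that belongs to the set. */
    static int count(String str, IntPredicate inSet) {
        int matches = 0;
        for (int i = 0; i < str.length(); i++) {
            if (inSet.test(str.charAt(i))) {
                matches++;
            }
        }
        return matches;
    }

    /** In this sketch, containsAny is simply "count is greater than zero". */
    static boolean containsAny(String str, IntPredicate inSet) {
        return count(str, inSet) > 0;
    }

    public static void main(String[] args) {
        System.out.println(count("abcd", range('a', 'b')));       // 2
        System.out.println(count("hello", range('k', 'p')));      // 3 (l, l, o)
        System.out.println(containsAny("abcd", range('e', 'g'))); // false
    }
}
```

A real implementation of containsAny could stop at the first match instead of counting the whole string; the sketch trades that shortcut for brevity.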
CharSetUtils#delete
The delete method takes a string and a varargs of charset and deletes every character of the string that is present in the charset.
System.out.println(CharSetUtils.delete("abcd", "a-b")); //cd
System.out.println(CharSetUtils.delete("abcd", "e-g")); //abcd
System.out.println(CharSetUtils.delete("abcd", "a-b", "e-g")); //cd
System.out.println(CharSetUtils.delete("12", "a-b")); //12
System.out.println(CharSetUtils.delete("12", "2-5")); //1
System.out.println(CharSetUtils.delete("abc@", "a", "@")); //bc
The calls above work as follows:
- The charset is ‘a’ and ‘b’ and hence it deletes those characters from the string - the result being cd.
- We have the charset as ‘e’, ‘f’ and ‘g’ and none of those characters are present in the string and hence it deletes nothing.
- It removes characters ‘a’ and ‘b’.
- None of the characters from the charset is present in the string and hence it deletes nothing.
- Only the character ‘2’ of the string is in the charset (‘2’ to ‘5’), so it is deleted and the result is “1”.
- Finally, it deletes the characters ‘a’ and ‘@’ from the string.
CharSetUtils#keep
The keep method is the inverse of the delete method. Rather than deleting the characters, it keeps only the characters that are present in the charset.
System.out.println(CharSetUtils.keep("abcd", "a-b")); //ab
System.out.println(CharSetUtils.keep("abcd", "e-g")); //""
System.out.println(CharSetUtils.keep("abcd", "a-b", "e-g")); //ab
System.out.println(CharSetUtils.keep("12", "a-b")); //""
System.out.println(CharSetUtils.keep("12", "2-5")); //2
System.out.println(CharSetUtils.keep("abc@", "a", "@")); //a@
From the output, we can see that each result is the complement of the corresponding output from the delete method.
In the first call, since the charset has characters ‘a’ and ‘b’, it retains only those characters in the string. In the second call, none of the characters from the charset are present in the string and hence it retains none of the characters resulting in an empty string. I’ll skip the explanation of the other calls.
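Because keep and delete differ only in which characters survive, both can be sketched as one filter with a flag. The code below is a plain-JDK illustration under that assumption. KeepDeleteSketch, range, filter, keep and delete are hypothetical names, and a predicate stands in for the parsed set syntax.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: keep and delete as two modes of the same filter. */
final class KeepDeleteSketch {

    /** Matches characters from start to end, inclusive (stands in for "start-end"). */
    static IntPredicate range(char start, char end) {
        return c -> c >= start && c <= end;
    }

    /** Copies a character only when its membership equals keepMatches. */
    static String filter(String str, IntPredicate inSet, boolean keepMatches) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (inSet.test(c) == keepMatches) {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String keep(String str, IntPredicate inSet) {
        return filter(str, inSet, true);
    }

    static String delete(String str, IntPredicate inSet) {
        return filter(str, inSet, false);
    }

    public static void main(String[] args) {
        IntPredicate aToB = range('a', 'b');
        System.out.println(keep("abcd", aToB));   // ab
        System.out.println(delete("abcd", aToB)); // cd
    }
}
```

Every character of the input lands in exactly one of the two results, which is why the outputs of keep and delete always complement each other.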
CharSetUtils#squeeze
The squeeze method collapses consecutive repetitions of any character that is in the charset into a single occurrence. Characters outside the charset are left untouched.
System.out.println(CharSetUtils.squeeze("aabccd", "a-d")); //abcd
System.out.println(CharSetUtils.squeeze("aabccd", "a-b")); //abccd
System.out.println(CharSetUtils.squeeze("abbceffg", "a-b", "e-g")); //abcefg
System.out.println(CharSetUtils.squeeze("1223", "1-3")); //123
System.out.println(CharSetUtils.squeeze("abbc@@", "a", "@")); //abbc@
The repetitions of the characters ‘a’ and ‘c’ have been removed in the first call. In the second call, the repetition of character ‘a’ is removed but not ‘c’ as ‘c’ is not in the charset.
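The squeezing behaviour can be approximated in a few lines of plain Java. This is a simplified sketch and not the library's code: SqueezeSketch, range and squeeze are hypothetical names, and a predicate stands in for the parsed set syntax.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: drops a character when it repeats the previous one and is in the set. */
final class SqueezeSketch {

    /** Matches characters from start to end, inclusive (stands in for "start-end"). */
    static IntPredicate range(char start, char end) {
        return c -> c >= start && c <= end;
    }

    static String squeeze(String str, IntPredicate inSet) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            boolean repeatsPrevious = i > 0 && c == str.charAt(i - 1);
            if (repeatsPrevious && inSet.test(c)) {
                continue; // consecutive repeat of a character in the set: skip it
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(squeeze("aabccd", range('a', 'd'))); // abcd
        System.out.println(squeeze("aabccd", range('a', 'b'))); // abccd
        System.out.println(squeeze("abbc@@", range('a', 'a').or(range('@', '@')))); // abbc@
    }
}
```

In the last call, ‘b’ is repeated but is not in the set, so both occurrences stay, while the repeated ‘@’ is collapsed.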
Conclusion
This concludes the post on the Apache Commons Lang CharSet and CharSetUtils. Check out the other useful utilities in the Apache Commons Lang.