Regular Expressions in Java

1) Introduction

Regular expressions are patterns of characters used to perform operations on a given input, such as finding particular text, replacing text with other text, or validating text. For example, we can use a regular expression to check whether user input is valid for a field like an email ID or a telephone number.

2) java.util.regex Package

The java.util.regex package provides the classes needed to use regular expressions in a Java application. The package was introduced in Java 1.4. It consists of three main classes, namely,

  • Pattern
  • Matcher
  • PatternSyntaxException

3) Pattern Class

A regular expression, which is specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern can then be used to create an instance of the Matcher class, which contains various built-in methods that help in performing a match against the regular expression. Many Matcher objects can share the same Pattern object.

Let us now discuss the important methods in the Pattern class.

a) compile() method

Before working with a regular expression, we need its compiled form, which we obtain by calling the Pattern.compile() method. It returns a new Pattern object. Note that compile() is a static method, so we don't need an instance of the Pattern class to call it.

There are two forms of compile() method,

  • compile(String regex)
  • compile(String regex, int flags)

In the first form of compile(), we pass the regular expression to be compiled. The second form takes an additional int parameter that specifies the match flags to apply. The flags are constants of the Pattern class such as CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE and CANON_EQ, and several flags can be combined with the bitwise OR operator (|).
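As a minimal sketch of the second form, the following example combines two flags with bitwise OR. The class name CompileFlagsDemo and the sample inputs are chosen only for illustration.

```java
import java.util.regex.Pattern;

class CompileFlagsDemo {
    // First form: matching is case-sensitive by default.
    static final Pattern PLAIN = Pattern.compile("java");

    // Second form: flags are int constants combined with bitwise OR.
    static final Pattern RELAXED =
            Pattern.compile("^java$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    public static void main(String[] args) {
        System.out.println(PLAIN.matcher("JAVA").matches());       // false
        System.out.println(RELAXED.matcher("JAVA").matches());     // true
        // With MULTILINE, ^ and $ also match at line boundaries.
        System.out.println(RELAXED.matcher("C\nJava\nGo").find()); // true
    }
}
```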

b) matcher() method

The matcher() method creates a new Matcher object that matches the given input against this pattern. The input can be any CharSequence, such as a String. The syntax is as follows,

matcher(CharSequence input)

c) matches() method


The Pattern class provides a matches() method, which is a static method. This method returns true only if the entire input text matches the pattern. Internally, it compiles the regular expression and matches the input against it, which is equivalent to calling Pattern.compile(regex).matcher(input).matches(). The syntax for this static matches() method is,

Pattern.matches(pattern, inputSequence);

Let us see a simple example,

RegExpTest.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExpTest {
    public static void main(String[] args) {

        String inputStr = "Computer";
        String pattern = "Computer";
        boolean patternMatched = Pattern.matches(pattern, inputStr);
        System.out.println(patternMatched);

    }
}

The matches() method returns true in this case and hence we get the output as true. If we had given the input string as "ComputerScience", the matches() method would have returned false, because the pattern does not cover the entire input.

When we pass a pattern and an input string to the static Pattern.matches() method, the pattern is compiled into a Pattern object that is used for that one matching operation. This is inefficient when the same pattern is used repeatedly, because the pattern is compiled again on every call. Hence, it's better to compile the pattern once and use the non-static matches() method in the Matcher class. (This matches() method is discussed along with the Matcher class in the forthcoming section.)
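The idea of compiling once and reusing the Pattern can be sketched as follows. The class name ReusePatternDemo and the helper isComputer() are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class ReusePatternDemo {
    // Compiled once and shared. Pattern instances are immutable,
    // so the same object can be reused for any number of inputs.
    private static final Pattern WORD = Pattern.compile("Computer");

    static boolean isComputer(String input) {
        // A new Matcher is created per input; the Pattern is not recompiled.
        return WORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"Computer", "ComputerScience", "computer"};
        for (String input : inputs) {
            System.out.println(input + ": " + isComputer(input));
        }
    }
}
```

Only the first input matches: "ComputerScience" is longer than the pattern, and "computer" differs in case.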

d) pattern() method

The pattern() method returns the regular expression string from which this pattern was compiled.

e) split() method

This method splits the given input text around matches of the pattern. It returns a String array.

There are two forms of split() method,

  • split(CharSequence input)
  • split(CharSequence input, int limit)

In the second form, a positive limit argument specifies the maximum number of resultant strings that split() returns. When the limit is reached, the last string contains the rest of the input.

Let us see a simple example for the split() method,

String input = "playingrowinglaughingsleepingweeping";
Pattern pattern = Pattern.compile("ing");
// Split around each "ing", producing at most 4 strings
String[] str = pattern.split(input, 4);
for (String st : str) {
    System.out.println(st);
}

In the above code, we specified the limit on the number of Strings to be returned as 4. Hence the pattern is applied at most three times, and the fourth string holds the remaining input unsplit.

The output for the above code is,

play
row
laugh
sleepingweeping
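To compare the effect of different limit values, here is a small sketch. The class name SplitLimitDemo, the helper show() and the shortened input are assumptions made for this example. A limit of zero (which is what the one-argument form uses) drops trailing empty strings, while a negative limit keeps them.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    static final Pattern ING = Pattern.compile("ing");

    // Splits the input with the given limit and formats the resulting array.
    static String show(String input, int limit) {
        return Arrays.toString(ING.split(input, limit));
    }

    public static void main(String[] args) {
        String input = "playingrowing";
        System.out.println(show(input, 0));   // [play, row]
        System.out.println(show(input, 2));   // [play, rowing]
        System.out.println(show(input, -1));  // [play, row, ]
    }
}
```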

f) flags() method

This method returns this pattern's match flags, which were specified when the pattern was compiled.
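The pattern() and flags() methods can be tried out with a short sketch like the one below. The class name PatternInfoDemo and the helper hasFlag() are hypothetical. Since flags() returns an int bit mask, an individual flag is tested with the bitwise AND operator.

```java
import java.util.regex.Pattern;

class PatternInfoDemo {
    static final Pattern P =
            Pattern.compile("ca.", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true if the given flag bit is set in the pattern's flags.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(P.pattern());                           // ca.
        System.out.println(hasFlag(P, Pattern.CASE_INSENSITIVE));  // true
        System.out.println(hasFlag(P, Pattern.MULTILINE));         // false
    }
}
```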

4) Matcher Class

The Matcher class contains various built-in methods such as matches(), find(), group(), replaceFirst() and replaceAll(). They help us to check whether the desired pattern occurs in the given text, to search for the pattern in the text, or to replace occurrences of the pattern with some other set of characters as per the requirement.

Let us now discuss the important methods in the Matcher class.

a) matches() method

The matches() method available in the Matcher class is used to match an input text against a pattern. This method returns true only if the entire input text sequence matches the pattern. Consider the following example,

String input = "Java1.4, Java1.5, Java1.6";
Pattern pattern = Pattern.compile("Java");
Matcher matcher = pattern.matcher(input);
boolean patternMatched = matcher.matches();

In this case the value of patternMatched will be false, since the entire input string "Java1.4, Java1.5, Java1.6" does not match the regular expression "Java". Let us use the same Pattern object with another Matcher object and see how it works. Consider the following code,

String input1 = "Java";
Matcher matcher1 = pattern.matcher(input1);
boolean patternMatched1 = matcher1.matches();

Here, the matches() method returns true since the entire input sequence matches the pattern "Java". The matches() method is therefore suited to validating a complete input, rather than to searching for a word inside a longer text.

Let us see another example to know about some other methods available in Matcher class,

String input = "Java1.4, Java1.5, Java1.6";

Pattern pattern = Pattern.compile("Java");
Matcher matcher = pattern.matcher(input);

while (matcher.find()) {
    System.out.println(matcher.group() + ": " + matcher.start() + ": " + matcher.end());
}

The output of the above code is,

Java: 0: 4
Java: 9: 13
Java: 18: 22


In the above example code, we see the usage of the find(), group(), start() and end() methods.

Now, let us look at the purpose of those methods.

b) find() method

The find() method in the Matcher class scans the input for the next subsequence that matches the pattern and returns true if one is found.
It has two forms,

  • find()
  • find(int start)

In our example, we used the first form of find(). Each call continues from where the previous match ended, so calling it in a loop visits every occurrence of the pattern "Java" in the input String "Java1.4, Java1.5, Java1.6". It returns false once no further match exists.

The second form of find() resets the matcher and starts the search at the given index of the input.
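The second form can be illustrated with a small sketch. The class name FindFromIndexDemo and the helper indexOfFrom() are hypothetical names introduced for this example; the helper returns the start index of the first match found at or after the given position, or -1 if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromIndexDemo {
    private static final Pattern JAVA = Pattern.compile("Java");

    // Searches for "Java" beginning at index 'start' of the input.
    static int indexOfFrom(String input, int start) {
        Matcher matcher = JAVA.matcher(input);
        return matcher.find(start) ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String input = "Java1.4, Java1.5, Java1.6";
        System.out.println(indexOfFrom(input, 0));   // 0
        System.out.println(indexOfFrom(input, 1));   // 9
        System.out.println(indexOfFrom(input, 10));  // 18
        System.out.println(indexOfFrom(input, 19));  // -1
    }
}
```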

c) group() method

The group() method in the Matcher class returns the piece of input that was matched by the most recent successful match.

d) start() method and end() method

The start() method returns the index of the first character of the most recent match, and end() returns the index just after its last matched character. In the output above, the first occurrence of "Java" therefore reports start 0 and end 4.
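The replaceFirst() and replaceAll() methods mentioned at the start of this section can be sketched in the same way. The class name ReplaceDemo, the helper method names and the replacement text "JDK" are assumptions made only for illustration.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    private static final Pattern JAVA = Pattern.compile("Java");

    // Replaces only the first occurrence of the pattern.
    static String replaceFirstJava(String input) {
        return JAVA.matcher(input).replaceFirst("JDK");
    }

    // Replaces every occurrence of the pattern.
    static String replaceAllJava(String input) {
        return JAVA.matcher(input).replaceAll("JDK");
    }

    public static void main(String[] args) {
        String input = "Java1.4, Java1.5, Java1.6";
        System.out.println(replaceFirstJava(input)); // JDK1.4, Java1.5, Java1.6
        System.out.println(replaceAllJava(input));   // JDK1.4, JDK1.5, JDK1.6
    }
}
```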

5) PatternSyntaxException Class

PatternSyntaxException is an unchecked exception which is thrown when there is any syntax error in a regular expression pattern. It has various methods like getDescription(), getIndex(), getMessage() and getPattern() which enable us to get the details of the error.
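A minimal sketch of handling this exception is shown below. The class name SyntaxErrorDemo and the helper describeError() are hypothetical. The exact description text comes from the JDK and may vary, so the example only prints it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Returns null if the regex compiles, otherwise a short error summary.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " at index " + e.getIndex()
                    + " in pattern " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("[a-z]")); // valid pattern, prints null
        System.out.println(describeError("[a-z"));  // the closing ']' is missing
    }
}
```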

We have just seen very simple examples to understand the basics of regular expressions in Java, and the purpose of a few often-used methods. With this basic knowledge, we shall discuss regular expressions in more depth in the following sections.

6) Matching any single character

The '.' character is used to match any single character (by default, line terminators are excluded unless the DOTALL flag is set). Suppose we use the pattern 'ca.'. This pattern would match string inputs like 'car', 'cat', 'cap' etc., because they start with ca followed by another single character. Consider an example to understand this,

Pattern patternObj = Pattern.compile("ca.");
Matcher matcher = patternObj.matcher("cap");
if(matcher.matches()){
    System.out.println("The given input matched the pattern");
}

The output of the above code is,

The given input matched the pattern

7) Matching Special characters

Suppose we need to specify the '.' character in our pattern to indicate that the string input should contain a literal '.' character. In a regular expression, however, '.' has a special meaning. So we have to tell the regex engine that we don't mean the metacharacter '.', by escaping it with a '\' (backslash character), which is itself a metacharacter. Consider the following example,

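// '.' matches any character; '\\.' matches a literal dot
Pattern patternObj = Pattern.compile(".est\\.java");
Matcher matcher1 = patternObj.matcher("test.java");
Matcher matcher2 = patternObj.matcher("nest.java");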

if(matcher1.matches() && matcher2.matches()){
    System.out.println("Both the inputs matched the pattern");
}

The output of this code is,

Both the inputs matched the pattern

Hence, '.' means any character, and '\.' means the literal '.' character. We can use the backslash in the same way wherever we need a metacharacter to be treated as an ordinary character.
(Note: In a Java String literal, a backslash '\' must itself be escaped with another backslash, so the regular expression '\.' is written as "\\." in source code.)
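The difference between an unescaped and an escaped dot can be sketched as follows. The class name EscapeDotDemo, the patterns and the inputs are chosen only for illustration. The sketch also uses Pattern.quote(), which escapes a whole string so that it is treated literally.

```java
import java.util.regex.Pattern;

class EscapeDotDemo {
    // Unescaped: both dots match any single character.
    static final Pattern ANY = Pattern.compile(".est.java");
    // Escaped: the second dot must be a literal '.' in the input.
    static final Pattern LITERAL = Pattern.compile(".est\\.java");
    // Pattern.quote() turns the whole string into a literal pattern.
    static final Pattern QUOTED = Pattern.compile(Pattern.quote("test.java"));

    public static void main(String[] args) {
        System.out.println(ANY.matcher("testXjava").matches());     // true
        System.out.println(LITERAL.matcher("testXjava").matches()); // false
        System.out.println(LITERAL.matcher("nest.java").matches()); // true
        System.out.println(QUOTED.matcher("test.java").matches());  // true
        System.out.println(QUOTED.matcher("testXjava").matches());  // false
    }
}
```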

8) Matching particular characters

In a given text, we may need to match specific desired characters. We have already seen that the '.' character will match any single character, but now our requirement is to match only 'c' or 's' followed by some other desired characters. In such situations, we can enclose the desired characters in square brackets [], a metacharacter pair that denotes a character set from which exactly one character must appear at that position in the given text. Consider the following piece of code that illustrates the same,

Pattern patternObj = Pattern.compile("[CcSs]at");
Matcher matcher = patternObj.matcher("Sat");
if(matcher.matches()){
    System.out.println("The given input matched the pattern");
}

The output of this code is,

The given input matched the pattern

In the above example, we have used the pattern '[CcSs]at' wherein Cc is used to match C and c, and Ss matches S and s. Hence this would match inputs such as Cat, cat, Sat and sat.

9) Matching range of characters

In regular expressions, we use the metacharacter '-', i.e. a hyphen, inside [] to specify a range of characters. For example, we can specify the range of lowercase letters as '[a-z]' and the range of uppercase letters as '[A-Z]'.

Consider a situation where we need the input to start with a number from 0 to 3, then followed by any alphabet from a to z and then followed by another number that might range from 7 to 9. The following code can be used to validate the input against such a pattern,

Pattern patternObj = Pattern.compile("[0-3][a-z][7-9]");
Matcher matcher = patternObj.matcher("2a8");
if(matcher.matches()){
    System.out.println("The given input matched the pattern ");
}

10) Matching characters apart from a specific list

We can use the '^' metacharacter as the first character inside [] to specify one or more characters that we do not expect to match. Let us revisit our previous example with a pattern that makes use of the metacharacter '^',

Pattern patternObj = Pattern.compile("[^3-9][a-z][7-9]");

The use of the '^' character inside [] indicates that the characters specified in it are not expected in the input.

This pattern will match the input "2a8" as in our previous example. Note, however, that it is not strictly equivalent to "[0-3][a-z][7-9]": the class [^3-9] rejects 3 and accepts any character other than 3 to 9, including letters and symbols.
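To compare the range pattern of the previous section with the negated pattern, we can run both against a few inputs. The class name NegatedClassDemo and the extra inputs "3a8" and "xa8" are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    static final Pattern RANGE = Pattern.compile("[0-3][a-z][7-9]");
    static final Pattern NEGATED = Pattern.compile("[^3-9][a-z][7-9]");

    public static void main(String[] args) {
        String[] inputs = {"2a8", "3a8", "xa8"};
        for (String input : inputs) {
            System.out.println(input
                    + " range=" + RANGE.matcher(input).matches()
                    + " negated=" + NEGATED.matcher(input).matches());
        }
        // 2a8 range=true negated=true
        // 3a8 range=true negated=false   ('3' is excluded by [^3-9])
        // xa8 range=false negated=true   ('x' is not in 3-9, so it is accepted)
    }
}
```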

11) Use of other Metacharacters in Regular Expression

We have already discussed the use of the '.' and '^' metacharacters. Let us see the purpose of other metacharacters.

\d

This matches a numeric digit. It is the same as using the character set [0-9].

\D

This matches any character which is non-numeric. It is the same as the use of [^0-9].

\s

This matches a single whitespace character.

\S

This matches any character which is not a whitespace character.

\w

This matches a word character. It is equivalent to the character class [A-Za-z0-9_].

\W

This matches a character that is not a word character. It is equivalent to the
negated character class [^A-Za-z0-9_].

[...]

This matches a single character from those listed between the square brackets.

[a-f[s-z]]

This specifies the union of two sets of characters, which is the same as [a-fs-z], i.e. it matches any character that is either from a to f or from s to z.

[a-m&&[f-z]]

This specifies the intersection of two sets of characters, written with the && operator. It is the same as [f-m], i.e. it matches any character from f to m.

[^...]

This matches any single character except those specified inside the square brackets [].
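A few of these metacharacters can be tried out with the sketch below. The class name CharClassDemo, the helper matches() and the sample inputs are chosen only for illustration. Note that each backslash is doubled inside the Java String literals, and that intersection is written with the && operator.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Convenience wrapper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Two digits, one whitespace character, then one or more word characters.
        System.out.println(matches("\\d\\d\\s\\w+", "42 items_9")); // true
        // One or more non-digit characters.
        System.out.println(matches("\\D+", "abc"));                 // true
        System.out.println(matches("\\D+", "ab1"));                 // false
        // Union: a to f or s to z.
        System.out.println(matches("[a-f[s-z]]", "t"));             // true
        System.out.println(matches("[a-f[s-z]]", "h"));             // false
        // Intersection: only f to m.
        System.out.println(matches("[a-m&&[f-z]]", "g"));           // true
        System.out.println(matches("[a-m&&[f-z]]", "c"));           // false
    }
}
```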

12) POSIX Character Classes

The java.util.regex package supports a set of POSIX character classes, which are shortcuts for commonly used character sets, so that we need not spell out the entire set in the pattern. By default, they match US-ASCII characters only.

\p{Lower}

It can be used to match any single lowercase alphabetic character. It is the same as using [a-z].

\p{Upper}

It can be used to match any single uppercase alphabetic character. It is the same as using [A-Z].

\p{Alpha}

It is used to match any alphabetic character. It serves the same purpose as [A-Za-z].

\p{Digit}


It is used to match any single digit. It serves the same purpose as [0-9].

\p{Punct}

It matches a punctuation character. It is the same as using [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].

\p{Graph}

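It matches a visible character. It is the same as using [\p{Alnum}\p{Punct}], where \p{Alnum} is an alphanumeric character, i.e. [\p{Alpha}\p{Digit}].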

\p{Print}

It matches a printable character. It is the same as using [\p{Graph}\x20], i.e. a visible character or a space.

\p{ASCII}

It can be used to match any of the ASCII characters. It is the same as using [\x00-\x7F].

\p{XDigit}

It matches a single hexadecimal digit. It is the same as using [0-9a-fA-F].

\p{Space}

It is used to match a single whitespace character. It is the same as using [ \t\n\x0B\f\r].

\p{Blank}

It matches a single space character or a tab character.

\p{Cntrl}

It matches a control character. It is the same as using [\x00-\x1F\x7F].
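Let us see a small sketch that combines several POSIX character classes. The class name PosixClassDemo, the pattern and the inputs are assumptions made only for this illustration. The last check shows that, by default, these classes cover US-ASCII characters only.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // One uppercase letter, one or more lowercase letters,
    // one punctuation character, then one or more hexadecimal digits.
    static final Pattern LABEL =
            Pattern.compile("\\p{Upper}\\p{Lower}+\\p{Punct}\\p{XDigit}+");

    static final Pattern LETTERS = Pattern.compile("\\p{Alpha}+");

    public static void main(String[] args) {
        System.out.println(LABEL.matcher("Color#FF00aa").matches()); // true
        System.out.println(LABEL.matcher("color#FF00aa").matches()); // false
        System.out.println(LETTERS.matcher("cafe").matches());       // true
        // \u00e9 (e with acute accent) is outside US-ASCII.
        System.out.println(LETTERS.matcher("caf\u00e9").matches());  // false
    }
}
```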

13) Conclusion

This article is just an introduction to Regular expressions. We have seen the basics on how to use regular expressions to perform useful operations such as search, replace and validation for a given input. Regular Expressions can be effectively used to suit our application needs that involve text manipulation.

About Krishna Srinivasan

He is the Founder and Chief Editor of JavaBeat. He has more than 8 years of experience in developing Web applications. He writes about Spring, DOJO, JSF, Hibernate and many other emerging technologies in this blog.
