Java Regex 

Regex (short for REGular EXpressions) are strings that define a pattern for searching or manipulating other strings. They are useful when we know the format of what we want to look for or process in an input string but are not sure what the actual input is.

They are widely used in software development in areas such as:

  1. Validation - checking emails, phone numbers and ZIP codes
  2. Searching - extracting information, e.g. people's contacts, from large texts
  3. Compilers - recognizing tokens when checking that source code conforms to a programming language's syntax, and highlighting errors

Java provides support for regex through the java.util.regex package, via the Pattern class, the Matcher class, the PatternSyntaxException class and the MatchResult interface.

Pattern class - a compiled representation of a regular expression.

Matcher class - implements the MatchResult interface and performs matching operations on a given string using a Pattern.

PatternSyntaxException - an unchecked exception thrown when a regex with invalid syntax is compiled.
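As a small sketch of how an invalid regex surfaces at runtime, the snippet below compiles a deliberately broken pattern and catches the resulting exception. PatternSyntaxException extends IllegalArgumentException, so the compiler does not force us to catch it. The class name SyntaxCheck and the helper isValidRegex are made up for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheck {

    // Hypothetical helper: returns true if the regex compiles
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() explains the error, getIndex() gives its approximate position
            System.out.println("Invalid regex: " + e.getDescription() + " at index " + e.getIndex());
            return false;
        }
    }

    public static void main(String... args) {
        System.out.println(isValidRegex("\\d{1,3}")); // well-formed regex
        System.out.println(isValidRegex("[abc"));     // unclosed character class
    }
}
```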

Whereas the String class has methods like split() and matches() that accept a regex, these typically compile the regex on every call. Where heavy regex operations are required, compiling a Pattern once and reusing it with Matcher avoids that repeated work.
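A minimal sketch of the two approaches: the first method hands the regex string to String.split() on each call, while the second reuses a Pattern that was compiled once. The class and method names are illustrative only, and no timings are claimed here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {

    // Compiled once and reused for every call
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Convenient, but the regex string is passed to String.split() on each call
    static String[] splitWithString(String line) {
        return line.split("\\s*,\\s*");
    }

    // Reuses the precompiled pattern
    static String[] splitWithPattern(String line) {
        return COMMA.split(line);
    }

    public static void main(String... args) {
        String line = "Math , English,Science ,Commerce";
        System.out.println(Arrays.toString(splitWithString(line)));
        System.out.println(Arrays.toString(splitWithPattern(line)));
    }
}
```

Both calls return the same four subject names; the only difference is where the pattern gets compiled.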

 

Regex symbols - some of the commonly used symbols in regular expressions

Symbol         Description
^expression    Matches expression at the beginning of the input (or of each line in MULTILINE mode).
expression$    Matches expression at the end of the input (or of each line in MULTILINE mode).
.              Matches any single character except a line terminator, e.g. '.m' means any single character followed by m, so em and 4m both match.
[abc]          Matches either a, b, or c.
[^abc]         '^' as the first character inside the brackets negates the pattern: matches one character that is NOT between the brackets.
ab             Matches a followed by b.
a|b            Matches either a or b.
[0-9]          Specifies a range: matches one character from the range 0 to 9.
[a-z0-9]       Matches one character from the range a-z OR any digit from 0-9.
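To make these symbols concrete, here is a short sketch that tries a few of them against sample strings. The helper found() is a made-up convenience that reports whether a regex matches anywhere in the input; the sample inputs are invented for illustration.

```java
import java.util.regex.Pattern;

class SymbolDemo {

    // Hypothetical helper: true if the regex matches anywhere in the input
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String... args) {
        System.out.println(found("^Math", "Math 72%"));      // true: input starts with Math
        System.out.println(found("^Math", "English, Math")); // false: Math is not at the start
        System.out.println(found("45%$", "Commerce 45%"));   // true: input ends with 45%
        System.out.println(found(".m", "4m"));               // true: any character, then m
        System.out.println(found("[abc]", "xyz"));           // false: no a, b or c present
        System.out.println(found("[^abc]", "abc"));          // false: every character is a, b or c
        System.out.println(found("cat|dog", "hotdog"));      // true: dog is present
        System.out.println(found("[a-z0-9]", "#7"));         // true: 7 is in the range 0-9
    }
}
```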

 

Metacharacters/meta symbols in regex - predefined shorthands used to ease writing regex

Symbol    Description
\d        Matches any digit (equivalent to [0-9]).
\D        Matches any non-digit (equivalent to [^0-9]).
\w        Matches any word character (equivalent to [a-zA-Z_0-9]).
\W        Matches any non-word character.
\s        Matches any whitespace character (equivalent to [ \t\n\x0B\f\r]).
\S        Matches any non-whitespace character.
\b        Matches a word boundary, i.e. the position between a word character and a non-word character.
\B        Matches a non-word boundary.
\A        Matches the beginning of the input.
\Z        Matches the end of the input, before the final line terminator if there is one (\z matches the very end).
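The sketch below shows a few of these shorthands at work. The three helper methods and the sample inputs are hypothetical and exist only to illustrate \D, \s and \b; note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class MetaDemo {

    // \D matches every non-digit, so removing those leaves only the digits
    static String keepDigits(String input) {
        return input.replaceAll("\\D", "");
    }

    // \s+ matches a run of whitespace, which we replace with a single space
    static String collapseWhitespace(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    // \b on both sides means the word must not be part of a longer word
    static boolean containsWord(String word, String input) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(input).find();
    }

    public static void main(String... args) {
        System.out.println(keepDigits("+254 (700) 123-456"));           // 254700123456
        System.out.println(collapseWhitespace("  end \t of\n term  ")); // end of term
        System.out.println(containsWord("term", "end of term"));        // true
        System.out.println(containsWord("term", "terminal"));           // false
    }
}
```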

 

 

Quantifiers in regex - used to specify how many times the desired characters should occur

Symbol    Description
X?        Matches 0 or 1 occurrence of X (equivalent to X{0,1}).
X*        Matches 0 or more occurrences of X (equivalent to X{0,}).
X+        Matches 1 or more occurrences of X (equivalent to X{1,}).
X{n}      Matches exactly n occurrences of X.
X{m,n}    Matches at least m and at most n occurrences of X.
X{n,}     Matches n or more occurrences of X.
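As an illustration of quantifiers in a validation setting, the sketch below checks a ZIP code. It assumes the US format of five digits with an optional hyphen and four more digits; the class name, the helper isZip and that format choice are assumptions made for this example only.

```java
import java.util.regex.Pattern;

class QuantifierDemo {

    // Exactly five digits, optionally followed by a hyphen and exactly four digits
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    static boolean isZip(String input) {
        // matches() requires the whole input to match the pattern
        return ZIP.matcher(input).matches();
    }

    public static void main(String... args) {
        System.out.println(isZip("90210"));          // true
        System.out.println(isZip("90210-1234"));     // true
        System.out.println(isZip("9021"));           // false: only four digits
        System.out.println(isZip("90210-12"));       // false: suffix needs four digits
        System.out.println("aaa".matches("a{2,3}")); // true: between 2 and 3 occurrences
        System.out.println("".matches("a*"));        // true: zero occurrences are allowed
        System.out.println("".matches("a+"));        // false: at least one a is required
    }
}
```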

 

 

Example - using word boundaries and grouping

Suppose we have the end of term report for John summarized as follows.

“John's end of term scores were as follows;Math 72%,English 65%,Science 85% and Commerce 45%.”

Our task is to find John’s average score at the end of the term. Simple task? Yes.

Marks are represented as percentages between 0 and 100, e.g. 0%, 20% or even 100%, so each mark has one to three digits and we can expect the regex \d{1,3} to work. Let’s look at the code below and the output we get.

package devsought;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex1 {

    public static void main(String... args) {
        String str = "John's end of term scores were as follows;Math 72%,English 65%,Science 85% and Commerce 45%.";
        // Match every run of one to three digits
        Pattern pattern = Pattern.compile("\\d{1,3}");
        Matcher matcher = pattern.matcher(str);
        int sum = 0;
        int count = 0;
        // find() moves to the next match on each call
        while (matcher.find()) {
            String subsequence = matcher.group();
            sum += Integer.parseInt(subsequence);
            count++;
            System.out.println(subsequence);
        }
        // Integer division: the average is truncated to a whole number
        System.out.println("Total score=" + sum + " ,avg=" + (sum / count));
    }
}

The above code outputs:

72
65
85
45
Total score=267 ,avg=66

 

We are able to extract the scores from the string and add them up using a running sum variable, while keeping count of the number of marks we process, and finally compute the total and the average.

Awesome. But is this a correct implementation? Definitely NO.

What if our string had more information, like John's overall position in the class? For example, a string as below.

"John's end of term scores were as follows;Math 72%,English 65%,Science 85% and Commerce 45% . Overall class position number 12 ".

With our earlier regex, the number 12, which represents the position rank, would be counted as a mark, which is NOT correct.
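To see the problem concretely, the sketch below collects everything the earlier regex matches in the longer string. The class name NaiveMatch and the helper numbers() are hypothetical; the regex is the same \d{1,3} used in the first program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveMatch {

    // Collects every match of \d{1,3}, just as the first program does
    static List<String> numbers(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\d{1,3}").matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String... args) {
        String str = "John's end of term scores were as follows;Math 72%,English 65%,"
                + "Science 85% and Commerce 45% . Overall class position number 12";
        // The class position shows up alongside the marks
        System.out.println(numbers(str)); // [72, 65, 85, 45, 12]
    }
}
```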

We therefore need to enhance our regex as below.

(\b\d{1,3}\b)(%) - this matches every number of one to three digits in our string that is immediately followed by a %, since that is what denotes a mark as a percentage. The overall class position 12 will not be matched since it is not followed by %, which is our format for marks. It is also of no interest in our program, since it holds other information that we are filtering out of the marks calculation. The \b is a word-boundary matcher: it matches at a position where a word character (a-z, A-Z, 0-9 or _) sits next to a non-word character or the start or end of the input. In the portion 72%, 72 is a word made of digits and % is not a word character, so there is a boundary between them. Because of the leading \b, the digits must also start at a word boundary, so a word like d88% would not be matched as a valid mark.

We also employ grouping so as to extract the integer portion of our marks.

Groups enable us to treat parts of our regex as a unit: we can apply quantifiers to a group, and reference the group later to manipulate our string or extract specific portions of our match results. We create groups by using parentheses.

The first group represents the integer (marks) portion, which is what we are interested in, whereas the second group contains the percentage sign, which differentiates marks from other words in our string.
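Before the full program, here is a small sketch of how the two groups can be read back from a match and referenced in a replacement. The class name GroupDemo and its helper methods are made up for illustration; the regex is the one described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    private static final Pattern MARK = Pattern.compile("(\\b\\d{1,3}\\b)(%)");

    // Returns the digits of the first mark in the input, or null if there is none
    static String firstMark(String input) {
        Matcher matcher = MARK.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    // $1 in the replacement refers back to whatever group 1 captured
    static String spellOutPercent(String input) {
        return MARK.matcher(input).replaceAll("$1 percent");
    }

    public static void main(String... args) {
        Matcher matcher = MARK.matcher("Math 72%");
        if (matcher.find()) {
            System.out.println(matcher.group());  // 72% (the whole match, same as group(0))
            System.out.println(matcher.group(1)); // 72
            System.out.println(matcher.group(2)); // %
        }
        System.out.println(spellOutPercent("Math 72%,English 65%")); // Math 72 percent,English 65 percent
        System.out.println(firstMark("position number 12"));         // null
    }
}
```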

Below is our improved program.

package devsought;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex2 {

    public static void main(String... args) {
        String str = "John's end of term scores were as  follows;Math 72%, English 65%,Science 85% and Commerce 45% .Overall class position number 12";
        // Group 1 captures the digits, group 2 the % sign that marks them as a score
        Pattern pattern = Pattern.compile("(\\b\\d{1,3}\\b)(%)");
        Matcher matcher = pattern.matcher(str);
        int sum = 0;
        int count = 0;
        while (matcher.find()) {
            // Only the digits in group 1 are added to the total
            sum += Integer.parseInt(matcher.group(1));
            count++;
        }
        // Integer division: the average is truncated to a whole number
        System.out.println("Total score=" + sum + " ,avg=" + (sum / count));
    }
}

The above code still gives the same total score and average score per subject, as below. The class position 12 is no longer counted.

Total score=267 ,avg=66

About the Author - John Kyalo Mbindyo (BSc Computer Science) is a Senior Application Developer currently working at NCBA Bank Group, Nairobi, Kenya. He is passionate about making programming tutorials and sharing his knowledge with other software engineers across the globe. You can learn more about him and follow him on GitHub.