development

Matching multiline regular expressions in Java

Regular expressions are one of those things that I understand but don’t use often enough to have really mastered. I use them even less in my Java programming. Today I found myself banging my head into my cubicle wall trying to parse out a file that looked something like this:
—-START
something=foo
anotherthing=bar
—-END
—-START
something=baz
anotherthing=ran outta foo words
—-END
I needed to parse this out and create objects representing each section. Initially I started trying to read this in line by line, keeping track of where I was in relation to the markers, concatenating string buffers, etc. Then I realized how retarded that approach was, and how using a regular expression would make it a lot simpler.

Brushing aside the mental cobwebs I looked up a couple of references of the Java API and studied up on my friends Pattern and Matcher. I had trouble finding any examples for my specific case, where what I wanted to match spanned multiple lines. At first I thought I’d found the answer with the promising sounding Pattern.MULTILINE argument to Pattern.compile(), but that has to do with matching ^ and $. Without that option, those operators only match at the beginning or end of the text being parsed, with them it will allow them to work within the text at newline boundaries.

Turned out what I was looking for was the Pattern.DOTALL argument. By default, the dot operator does not match newlines, with this argument it does. An alternative is to prefix the regex pattern with (?s). It has the same effect, and the mnemonic stands for “single-line” mode, which is what its called in Perl.

So, to extract the relevant sections from the input above, you can do the following:

Pattern pattern = Pattern.compile(“—-START\n(.*?)\n—-END\n”, Pattern.DOTALL);
Matcher matcher = pattern.matcher(theInput);

while(matcher.find()) {
System.out.println(matcher.group(1) + “\n”);
}

That will print out:

something=foo
anotherthing=bar

something=baz
anotherthing=ran outta foo words

You can get the same result with the alternate method, using the embedded (?s) operator in the Pattern declaration:

Pattern pattern = Pattern.compile(“(?s)—-START(.*?)—-END”);

For further reference, have a look here.

Leave a Reply

Your email address will not be published. Required fields are marked *