Friday, January 28, 2011

Java and regular expressions

Lately I needed a solution for the following in Java:
in a String, remove everything from the beginning up to and including a certain pattern. The String holds the content of a text file that has already been read in (imagine you want to remove the header of an HTML document, up to and including the <body> tag).

I had expected this to be an easy case (I consider myself quite familiar with regular expressions, though Java is not my main programming language), but the following
string.replaceFirst(".*<body>","")
did not work. All it did was remove the text on the line containing <body>, up to and including the tag; everything on the lines before it was left in place.
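To make the behaviour concrete, here is a small sketch of the failing attempt. The class name, the helper stripHeaderNaive and the sample HTML are made up for illustration; only the replaceFirst call comes from the case above.

```java
class NaiveHeaderStrip {
    // Hypothetical helper: applies the pattern from the post unchanged.
    // Strings are immutable, so the result has to be taken from the return value.
    static String stripHeaderNaive(String html) {
        return html.replaceFirst(".*<body>", "");
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the file content.
        String html = "<html>\n<head><title>t</title></head>\n<body>\n<p>Hello</p>\n</body>\n</html>";

        // Only the "<body>" on the third line is removed;
        // the <html> and <head> lines above it are still there.
        System.out.println(stripHeaderNaive(html));
    }
}
```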

Looking for an explanation, I found this very nice page about flags in regular expressions, and the answer is highlighted there: "By default . does not match line terminators." So a pattern flag is needed to let the dot match across the line breaks in the string. Here is the solution:
string.replaceFirst("(?s).*<body>","")
(?s) makes the dot match any character, including line terminators.
Read the page above to find out more about the other flags (this is very likely also described in the Java documentation, but I couldn't find it there as quickly or as clearly described as on the page above).
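As a minimal sketch of the fix, the same thing can be written with the embedded (?s) flag or with a precompiled pattern and the Pattern.DOTALL constant from java.util.regex. The class name, the helper names and the sample input are my own, not from any library. The third helper is an extra I am adding as an assumption about what one usually wants: .* is greedy, so with (?s) it runs to the last <body> in the string, while .*? stops at the first one.

```java
import java.util.regex.Pattern;

class DotallHeaderStrip {
    // Embedded flag, as in the post: (?s) switches on DOTALL for the whole pattern.
    static String stripHeaderEmbedded(String html) {
        return html.replaceFirst("(?s).*<body>", "");
    }

    // Equivalent form with a precompiled pattern and the DOTALL constant.
    private static final Pattern HEADER = Pattern.compile(".*<body>", Pattern.DOTALL);

    static String stripHeaderCompiled(String html) {
        return HEADER.matcher(html).replaceFirst("");
    }

    // Reluctant variant (my addition): stops at the first <body> instead of the last.
    static String stripHeaderReluctant(String html) {
        return html.replaceFirst("(?s).*?<body>", "");
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the file content.
        String html = "<html>\n<head></head>\n<body>\n<p>Hello</p>\n</body>\n</html>";

        // Both calls drop everything up to and including <body>.
        System.out.println(stripHeaderEmbedded(html));
        System.out.println(stripHeaderCompiled(html));
    }
}
```

For a normal HTML file with a single <body> tag the greedy and reluctant forms give the same result; the difference only shows up when the text "<body>" occurs more than once.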
