First of all, regular what?
Regular expressions are based on a formal language that is used to specify patterns for matching strings or sub-strings in text.
To find simple information in a text file, many people still use the search function available in any up-to-date editor. But what if your search scenario is not so simple? What if you are looking not for one specific digit, for example, but for all digits in your text file? Or what if you want to find and remove all tags from your document? Good news! You do not have to be a programmer to do all that stuff. You simply need a wee bit of knowledge about regular expressions and a clear picture of what you are looking for in your text file. Today’s blog post will give you an idea of how regular expressions work, and after reading it you will be ready to write your first regular expression, promise!
So I can write my own RegExs?
Sure, there are a number of online tools which help you to write, test and fix your regular expressions. RegEx Pal is one of them, for example. The way such tools function is rather straightforward. There usually is a field where you write your regular expression and another field where you insert the text to which you want to apply your regular expression. Normally, the sub-strings that are matched by your regular expression will be highlighted in the text field (see the example below).
Warm up with some RegEx basics
Below you will find some basic rules to help you start writing your first regular expression. The table contains various examples of regular expressions. Bold text shows the sub-strings that each regular expression matches.
|[a-z]||**das** F**aultier**, **die** F**aultiere**|
|[A-Z]||das **F**aultier, die **F**aultiere|
|^da||**da**s Faultier hängt an einem Ast|
|t$||das Faultier hängt an einem As**t**|
|oo*w||w**oow** w**ow** w**ooooow**|
|wo+w||**woow** **wow** **wooooow**|
|b.cken||**backen** **bocken** **bücken**|
[ ] square brackets match any one character from the set listed inside them
^ matches the beginning of the line
$ matches the end of the line
? makes the preceding character in the regular expression optional
* matches the preceding character zero or more times
+ matches one or more occurrences of the preceding character
. matches any single character
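If you would rather check these rules in a script than in an online tester, here is a minimal sketch in Python using the standard `re` module. Python is only a choice made for illustration; the tools used in this post are RegEx Pal and Notepad++. The helper name `matches` is invented for this example, and the `Faultiere?` pattern is an extra example for the ? rule, which has no row in the table.

```python
import re


def matches(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return re.findall(pattern, text)


# Sample strings taken from the table above
sloths = "das Faultier, die Faultiere"
branch = "das Faultier hängt an einem Ast"
wows = "woow wow wooooow"
verbs = "backen bocken bücken"

print(matches(r"[A-Z]", sloths))       # ['F', 'F']
print(matches(r"^da", branch))         # ['da']
print(matches(r"t$", branch))          # ['t']
print(matches(r"oo*w", wows))          # ['oow', 'ow', 'ooooow']
print(matches(r"wo+w", wows))          # ['woow', 'wow', 'wooooow']
print(matches(r"b.cken", verbs))       # ['backen', 'bocken', 'bücken']

# Extra example (not in the table): the final "e" is optional
print(matches(r"Faultiere?", sloths))  # ['Faultier', 'Faultiere']
```

Note the difference between `oo*w` and `wo+w`: the first pattern does not include the leading "w", so it only matches from the first "o" onwards.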
To practice, choose a regular expression from the table above and paste or type it into the RegEx Pal tool. Don’t worry, you’ll do great!
Time to put RegEx to use!
Now that you are warmed up, you can solve your first task with the help of regular expressions. We like to use the Notepad++ editor. If you like, you can download a free version here: Notepad++
Please, follow the steps below:
- open a text file in your Notepad++ editor or use our ‘sloth’ sample text
- click the Search menu
- click Find
- tick the Regular expression option in the Search Mode section at the bottom of the dialog
GREAT! You are now in the Regular expression search mode. This means that you can write a regular expression in the Find what field and Notepad++ will find all strings and sub-strings that match your pattern.
Today’s task is to find all capital letters and all digits in your file. Using the information from the section “Warm up with some RegEx basics”, write your regular expression in the Find what field and click Find All in Current Document.
Tool: Notepad++, poem "Faultier" by Ingrid Drewing
If everything went well, you will see something like the example in our screenshot below. Notepad++ has found all the lines that match the following pattern: [A-Z0-9]
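The same search can be sketched outside Notepad++. The snippet below is a rough Python equivalent, assuming the pattern `[A-Z0-9]` for "any capital letter or any digit". The sample text and the function name `find_matching_lines` are made up for illustration; the sloth sample text from this post is not reproduced here.

```python
import re

# Hypothetical sample text, not the poem used in the post
sample = (
    "Das Faultier schläft 15 Stunden.\n"
    "es hängt am ast.\n"
    "Im Jahr 2023 sah ich 3 Faultiere."
)

# One character class: any capital letter A-Z or any digit 0-9
pattern = re.compile(r"[A-Z0-9]")


def find_matching_lines(text):
    """Return (line number, matches) for every line with at least one match."""
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        hits = pattern.findall(line)
        if hits:
            results.append((number, hits))
    return results


for number, hits in find_matching_lines(sample):
    print(number, hits)
# 1 ['D', 'F', '1', '5', 'S']
# 3 ['I', 'J', '2', '0', '2', '3', '3', 'F']
```

The second line is skipped because it contains neither a capital letter nor a digit. Keep in mind that `[A-Z]` covers only the 26 unaccented letters, so capital umlauts such as Ä, Ö and Ü would need to be added to the set explicitly.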
Real Life Scenarios for RegEx
Sure, regular expression syntax is full of other interesting and more complicated stuff that we cannot describe within a short how-to blog post.
The computational linguists in our company like to use regular expressions in machine translation scenarios for data cleansing, data anonymization and other tasks connected with information extraction.
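To give a flavour of what such a cleansing or anonymization step might look like, here is a small Python sketch. It is an illustration only and not a description of any actual production pipeline: the function name `clean_segment`, the sample sentence and the choice of "#" as a mask are all assumptions. It also uses one piece of syntax not covered above: `[^>]` means "any character except >".

```python
import re

# A tag: "<", then one or more characters that are not ">", then ">"
TAG = re.compile(r"<[^>]+>")
# Any single digit
DIGIT = re.compile(r"[0-9]")


def clean_segment(segment):
    """Remove markup tags, then mask every digit with '#'."""
    without_tags = TAG.sub("", segment)
    return DIGIT.sub("#", without_tags)


print(clean_segment("<b>Call</b> 555-0100 before <i>noon</i>."))
# Call ###-#### before noon.
```

A simple pattern like this is fine for a quick clean-up of well-behaved text, but it does not understand document structure, so heavily nested or malformed markup is usually better handled by a proper parser.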
Want to find out more about using regular expressions or other ways of data preparation for machine translation? Check out our digital qualification section or contact us directly.
Cover picture: by Javier Mazzeo on Unsplash