Day 15: Regular Expressions (Regex) in Python

On Day 15 of our Python blog series, we'll learn about regular expressions, commonly known as regex, in Python. Regular expressions are powerful tools for pattern matching and string manipulation.

What are Regular Expressions?

Regular expressions, often abbreviated as regex or regexp, are sequences of characters that define a search pattern. They are used to match, search, and manipulate text based on specific patterns. Regular expressions are widely used across programming languages and text processing tools for tasks such as validating input and extracting data from text.

Regular expressions consist of both literal characters and special characters (metacharacters) that have special meanings. These special characters allow for powerful pattern matching capabilities, including matching specific characters, character classes, repetitions, and more.
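As a small illustration of the three ideas above (literal characters, character classes, and repetitions), here is a minimal sketch using Python's re module, which is introduced in the next section. The sample sentence and the variable names are made up for this example.

```python
import re

# Sample text invented for this illustration.
text = "Order 66 shipped on 2024-01-15 to cat, cot and cut."

# Literal characters match themselves.
literal_hits = re.findall(r"shipped", text)
print(literal_hits)   # ['shipped']

# A character class [aou] matches exactly one character from the set.
class_hits = re.findall(r"c[aou]t", text)
print(class_hits)     # ['cat', 'cot', 'cut']

# The + quantifier repeats the preceding item one or more times.
digit_runs = re.findall(r"\d+", text)
print(digit_runs)     # ['66', '2024', '01', '15']
```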

However, they can also be challenging to understand and master due to their syntax and versatility.

Using Regular Expressions in Python:

Python provides the built-in re module for working with regular expressions. To use it, import the module first:

import re
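Once the module is imported, a handful of functions cover most everyday matching, searching, and manipulation. The following is a rough sketch of four of them; the sample string is invented for illustration.

```python
import re

# Sample text invented for this illustration.
text = "room 15 is open, room 16 is closed"

# re.search scans the whole string and returns the first match (or None).
first = re.search(r"room \d+", text)
print(first.group())        # room 15

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"is", text)
print(at_start)             # None

# re.findall returns every non-overlapping match as a list of strings.
all_rooms = re.findall(r"room \d+", text)
print(all_rooms)            # ['room 15', 'room 16']

# re.sub replaces every match with the given replacement text.
masked = re.sub(r"\d+", "N", text)
print(masked)               # room N is open, room N is closed
```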

Example: Pattern to Match a URL

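import re

# Raw string, so the backslashes reach the regex engine unchanged.
url_pattern = r'https?://(?:www\.)?[\w-]+\.\w+'

urls = [
    "https://www.learnpython.com",
    "http://learn.com",
    "invalid-url.com",  # no http:// or https:// prefix, so it will not match
]

for url in urls:
    # re.match only tries to match at the beginning of the string.
    if re.match(url_pattern, url):
        print(f"{url} is a valid URL.")
    else:
        print(f"{url} is not a valid URL.")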

Breaking r'https?://(?:www\.)?[\w-]+\.\w+' into its components:

  • r: marks a raw string literal in Python, so backslashes such as those in \. and \w are passed to the regex engine unchanged instead of being treated as string escapes.

  • https?: matches the protocol part of a URL, where http is followed by an optional s. The ? quantifier makes the preceding s optional, allowing the pattern to match both http and https.

  • ://: matches the colon and two forward slashes that follow the protocol part of a URL.

  • (?:www\.)?: matches an optional "www." subdomain. (?:...) is a non-capturing group, and the ? quantifier after it makes the entire group optional, allowing URLs with or without the "www." prefix.

  • [\w-]+: matches one or more word characters (letters, digits, or underscores) or hyphens. This part matches the domain name.

  • \.: matches a literal dot (.), separating the domain name from the top-level domain (TLD). Without the backslash, a dot would match any character.

  • \w+: matches one or more word characters (letters, digits, or underscores). This part matches the top-level domain (TLD), such as .com, .org, or .net.
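The breakdown above can also be written directly into the pattern. As a sketch, the re.VERBOSE flag lets a pattern span several lines with comments; the names verbose_pattern, compact_pattern, and samples are chosen here for illustration only.

```python
import re

# Same pattern as in the example, spread out with one component per line.
# With re.VERBOSE, whitespace and #-comments inside the pattern are ignored.
verbose_pattern = re.compile(r"""
    https?        # http followed by an optional s
    ://           # literal colon and two slashes
    (?:www\.)?    # optional www. prefix (non-capturing group)
    [\w-]+        # domain name: word characters or hyphens
    \.            # literal dot
    \w+           # top-level domain
""", re.VERBOSE)

compact_pattern = re.compile(r'https?://(?:www\.)?[\w-]+\.\w+')

samples = ["https://www.learnpython.com", "http://learn.com", "invalid-url.com"]

# Both forms should agree on every sample.
for s in samples:
    print(s, bool(verbose_pattern.match(s)), bool(compact_pattern.match(s)))
```

Writing the pattern this way trades compactness for readability, which can help when a regex grows beyond a few components.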

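Putting it all together, the regular expression r'https?://(?:www\.)?[\w-]+\.\w+' matches text that starts with "http://" or "https://", followed by an optional "www." prefix, a domain name, and a top-level domain (TLD). The protocol is required, which is why "invalid-url.com" is rejected. The pattern does not cover paths, ports, query strings, or additional subdomains, and re.match only checks that the string begins with such a URL, so treat it as a simple starting point for recognizing and extracting URLs in text rather than a complete validator.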
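To make the difference between extracting and validating concrete, here is a minimal sketch using the same pattern. The sample sentence and variable names are assumptions made for this example; re.findall pulls URLs out of longer text, while re.fullmatch requires the whole string to match.

```python
import re

url_pattern = re.compile(r'https?://(?:www\.)?[\w-]+\.\w+')

# Sample sentence invented for this illustration.
text = "Visit https://www.learnpython.com or http://learn.com for details."

# Extraction: findall returns every substring that matches the pattern.
found = url_pattern.findall(text)
print(found)  # ['https://www.learnpython.com', 'http://learn.com']

# Validation: match accepts any string that merely starts with a URL,
# while fullmatch requires the entire string to fit the pattern.
candidate = "http://learn.com is down"
prefix_ok = bool(url_pattern.match(candidate))
whole_ok = bool(url_pattern.fullmatch(candidate))
print(prefix_ok, whole_ok)  # True False

exact_ok = bool(url_pattern.fullmatch("http://learn.com"))
print(exact_ok)  # True
```

If stricter checks are needed, re.fullmatch is usually the safer choice for validation, since re.match would accept trailing text after the URL.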