Definitive guide to Apache mod_rewrite regular expressions

Discussion in 'Apache' started by Ladadadada, Oct 6, 2007.

  1. #1
    As there appears to be much confusion on this board about how regular expressions work in Apache mod_rewrite, I thought I'd write up a decent guide to them. I will not be covering Apache directives or configuration files; this is (almost) purely about regular expressions.

    A regular expression (regex for short) is a pattern that is tested against an input string. In the case of Apache mod_rewrite, the pattern is the first argument after the keyword RewriteRule and the input string is most frequently the URI that your user is requesting. There are examples of both below.

    The purpose of a regex is to describe a subset of all possible strings. In Apache mod_rewrite we take the input string and test it against the regex to see if the regex describes the input string. We can take some action depending on whether the input string matches the regex or not. At its simplest, this consists of determining whether one string exists inside the other.

    Input string: "/page2.html"
    Regex : "/page2.html"
    Matches : TRUE

    Input string: "/page2.html"
    Regex : "age"
    Matches : TRUE

    Input string: "/page3.html"
    Regex : "/page2.html"
    Matches : FALSE
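
    The three checks above can be reproduced outside Apache. This is a minimal sketch, assuming Python's re module as a stand-in for the PCRE engine that mod_rewrite uses (the two agree on the basic features covered in this guide). The helper name matches() is made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None

print(matches("/page2.html", "/page2.html"))  # True
print(matches("age", "/page2.html"))          # True
print(matches("/page2.html", "/page3.html"))  # False
```

    re.search() is used rather than re.match() because an unanchored regex may match anywhere in the input, which is how the examples in this guide behave.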

    Inside a regex pattern a character can be one of two sorts: a literal character or a metacharacter. In the above examples, all the characters were literal characters except one: the dot (.) is a metacharacter. It matched the dot in the input only because a dot matches any character, a literal dot included. Metacharacters affect the rest of the regex pattern in varying ways.

    Metacharacters:
    . * + { } ? [ ] - ^ $ | \ ( )

    A dot (.) will match any single character.

    Input string: "/page2.html"
    Regex : "..........."
    Matches : TRUE

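    Input string: "/p2.html"
    Regex : "..........."
    Matches : FALSE

    The eight-character input above is too short for eleven dots. An input longer than eleven characters, such as "/aPageWithMoreThanElevenCharacters.html", would still match, because the pattern is not anchored and the first eleven characters of the input satisfy it (anchoring with ^ and $ is covered further down).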
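
    An unanchored run of eleven dots needs at least eleven characters, not exactly eleven. A quick check, again assuming Python's re module in place of Apache's PCRE; matches() and eleven_dots are illustrative names:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

eleven_dots = "." * 11  # the same pattern as "..........."

print(matches(eleven_dots, "/page2.html"))  # True: exactly 11 characters
print(matches(eleven_dots, "/p2.html"))     # False: only 8 characters
# True: the pattern is unanchored, so the first 11 characters are enough
print(matches(eleven_dots, "/aPageWithMoreOrLessThanElevenCharacters.html"))
```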

    A star (*) will modify the pattern to mean zero or more of the previous character. The star modifier works with both literal and meta characters preceding it. The pattern (.*) will match any input string.

    Input string: "/page2.html"
    Regex : ".*"
    Matches : TRUE

    Input string: ""
    Regex : ".*"
    Matches : TRUE

    A plus (+) is much like the star (*) except that it matches one or more of the previous character. The plus modifier works with both literal and meta characters preceding it. The pattern (.+) will match any input string other than the empty string.

    Input string: "/page2.html"
    Regex : ".+"
    Matches : TRUE

    Input string: ""
    Regex : ".+"
    Matches : FALSE
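
    The only difference between the star and the plus is whether zero repetitions are acceptable. A small sketch (Python's re standing in for PCRE; the "ab*a" and "ab+a" patterns are extra examples, not from the guide above):

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches(".*", "/page2.html"))  # True
print(matches(".*", ""))             # True: zero characters is fine for a star
print(matches(".+", "/page2.html"))  # True
print(matches(".+", ""))             # False: a plus needs at least one character

print(matches("ab*a", "aa"))  # True: zero 'b's between the two 'a's
print(matches("ab+a", "aa"))  # False: at least one 'b' is required
```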

    The curly brackets ({}) with a number inside them are much like the star and the plus except that they match the number inside the curly brackets of the previous character. The curly brackets modifier works with both literal and meta characters preceding it. Curly brackets can define a range using the start and end values of the range inside the curly brackets separated by a comma.

    Input string: "abba"
    Regex : "ab{2}a"
    Matches : TRUE

    Input string: "abbbba"
    Regex : "ab{2,4}a"
    Matches : TRUE

    Input string: "abbbbbbbba"
    Regex : "ab{2,4}a"
    Matches : FALSE
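
    The last example fails because eight 'b's is more than the range allows and there is no other 'a' followed by two to four 'b's and another 'a'. The three cases can be checked like this (a sketch using Python's re as a stand-in for PCRE; matches() is an illustrative helper):

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches("ab{2}a", "abba"))          # True: exactly two 'b's
print(matches("ab{2,4}a", "abbbba"))      # True: four 'b's is inside the range
print(matches("ab{2,4}a", "abbbbbbbba"))  # False: eight 'b's is outside the range
print(matches("ab{2,4}a", "aba"))         # False: one 'b' is too few
```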

    A question mark (?) is much like the star and the plus except that it matches zero or one of the previous character. The question mark modifier works with both literal and meta characters preceding it.

    Input string: "a"
    Regex : ".?"
    Matches : TRUE

    Input string: ""
    Regex : "."
    Matches : FALSE

    Input string: ""
    Regex : ".?"
    Matches : TRUE

    The square brackets ([]) will match any one of the characters inside them. Most metacharacters lose their special meaning inside square brackets: a dot (.) inside square brackets matches only a literal dot, not any character. The exceptions (the dash, the caret and the backslash) are covered below.

    Input string: "a"
    Regex : "[ab]"
    Matches : TRUE

    Input string: "b"
    Regex : "[ab]"
    Matches : TRUE

    Input string: "c"
    Regex : "[ab]"
    Matches : FALSE

    Input string: "c"
    Regex : "[ab.]"
    Matches : FALSE

    Input string: "."
    Regex : "[ab.]"
    Matches : TRUE

    The star and plus modifiers act on the entire contents of the square brackets.

    Input string: "abababbbaab"
    Regex : "[ab]+"
    Matches : TRUE
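
    Inside square brackets a dot is only a literal dot, which is easy to confirm. A short sketch, assuming Python's re module behaves like Apache's PCRE here (it does for simple bracket expressions); matches() is an illustrative helper:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches("[ab]", "a"))   # True
print(matches("[ab]", "c"))   # False
print(matches("[ab.]", "c"))  # False: the dot in the brackets is a literal dot
print(matches("[ab.]", "."))  # True

# A modifier after the brackets repeats the whole set, not just one letter.
print(matches("[ab]+", "abababbbaab"))  # True
```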

    The dash (-) character inside square brackets ([]) describes ranges if it is in between the two literal characters at the start and end of the range. A dash as the very first or very last character in the square brackets will be interpreted as a literal dash and not as a range.

    Input string: "lotsoflowercasealphabeticcharacters"
    Regex : "[a-z]+"
    Matches : TRUE

    Input string: "UPPERCASE AND SPACES"
    Regex : "[a-z]+"
    Matches : FALSE

    Input string: "UPPERCASElowercase123456789"
    Regex : "[0-9a-zA-Z]+"
    Matches : TRUE

    A caret or circumflex (^) when used inside square brackets as the first character negates the set: the brackets then match any single character that is not listed.

    Input string: "UPPERCASE AND SPACES"
    Regex : "[^a-z]+"
    Matches : TRUE

    Input string: "UPPERCASElowercase123456789"
    Regex : "[^0-9a-zA-Z]+"
    Matches : FALSE
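
    Ranges, the literal dash and the negating caret can be tried together. A sketch with Python's re module standing in for PCRE; the "some-title" input is an extra example chosen to show the literal dash:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches("[a-z]+", "lotsoflowercasealphabeticcharacters"))  # True
print(matches("[a-z]+", "UPPERCASE AND SPACES"))                 # False: matching is case sensitive
print(matches("[^a-z]+", "UPPERCASE AND SPACES"))                # True: anything that is not a-z
print(matches("[^0-9a-zA-Z]+", "UPPERCASElowercase123456789"))   # False: nothing outside the set

# A dash at the very end of the brackets is a literal dash.
print(matches("^[a-z-]+$", "some-title"))  # True
print(matches("^[a-z]+$", "some-title"))   # False: the dash is not in a-z
```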

    A caret or circumflex (^) when used as the first character of a pattern matches the start of the input string.

    Input string: "page2.html"
    Regex : "^page"
    Matches : TRUE

    Input string: "page2.html"
    Regex : "^html"
    Matches : FALSE

    The dollar sign ($) when used as the last character of a pattern matches the end of the input string.

    Input string: "page2.html"
    Regex : "page$"
    Matches : FALSE

    Input string: "page2.html"
    Regex : "html$"
    Matches : TRUE
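
    The four anchor examples above, as a runnable check (a sketch; Python's re is assumed to treat ^ and $ the same way as Apache's PCRE for single-line input such as a URI):

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches("^page", "page2.html"))  # True: the input starts with "page"
print(matches("^html", "page2.html"))  # False: "html" is not at the start
print(matches("page$", "page2.html"))  # False: "page" is not at the end
print(matches("html$", "page2.html"))  # True: the input ends with "html"
```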

    The pipe character (|) means the value on the left or the value on the right. It can be used on individual characters or entire strings. It works with both literal and meta characters.

    Input string: "page2.html"
    Regex : "(page2.html|page3.html)"
    Matches : TRUE

    Input string: "page3.html"
    Regex : "(page2.html|page3.html)"
    Matches : TRUE

    Input string: "page4.html"
    Regex : "(page2.html|page3.html)"
    Matches : FALSE

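    Where a modifier sits relative to the parentheses matters. The pattern below allows an input that is all lowercase or all uppercase, but not a mixture:

    Input string: "lowercase"
    Regex : "^([a-z]+|[A-Z]+)$"
    Matches : TRUE

    Input string: "UPPERCASE"
    Regex : "^([a-z]+|[A-Z]+)$"
    Matches : TRUE

    Input string: "lowercaseANDUPPERCASE"
    Regex : "^([a-z]+|[A-Z]+)$"
    Matches : FALSE

    With the plus outside the parentheses, as in "^([a-z]|[A-Z])+$", the choice between the two sides is made again for every character, so the mixed-case string would match.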
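
    With alternation, the position of the plus decides whether the choice is made once for the whole string or once per character. A sketch comparing the two placements (Python's re in place of PCRE; the variable names are made up):

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

per_character = "^([a-z]|[A-Z])+$"  # each character may be lower OR upper case
whole_string = "^([a-z]+|[A-Z]+)$"  # the whole input is lower OR upper case

print(matches(per_character, "lowercaseANDUPPERCASE"))  # True
print(matches(whole_string, "lowercaseANDUPPERCASE"))   # False
print(matches(whole_string, "lowercase"))               # True
print(matches(whole_string, "UPPERCASE"))               # True

print(matches("(page2.html|page3.html)", "page4.html"))  # False
```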

    The backslash (\) enables you to turn any metacharacter into a literal character by placing the backslash in front of it. This is known as escaping. If you want to match the backslash itself, you can precede it with another backslash. A backslash-character sequence is treated as a single character by any modifiers following it. A backslash preceding a literal punctuation character is the same as just the literal character. A backslash preceding a letter or digit is different: many such sequences have a special meaning of their own (\d matches any digit, for example), so only escape the characters that need it.

    Input string: "..........."
    Regex : "\.+"
    Matches : TRUE

    Input string: "Any other string"
    Regex : "\.+"
    Matches : FALSE

    Input string: "\"
    Regex : "\\"
    Matches : TRUE
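
    Escaping can be checked the same way. In this sketch the patterns are written as Python raw strings (r"...") so that Python itself leaves the backslashes alone; re.escape() is Python's helper for escaping a literal string, and Apache has no equivalent directive, so it is shown only as a way to see which characters need a backslash:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches(r"\.+", "..........."))       # True: one or more literal dots
print(matches(r"\.+", "Any other string"))  # False: no dots in the input
print(matches(r"\\", "\\"))                 # True: the input is a single backslash

# A backslash before a letter is not always "just the letter".
print(matches(r"\d", "d"))  # False: \d means "any digit"
print(matches(r"\d", "7"))  # True

print(re.escape("page2.html"))  # page2\.html
```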

    The parentheses () allow you to group parts of the pattern together. You can see an example of this in the section on the pipe character above. As a bonus, they also allow you to reference the parts of the input that were matched inside the parentheses later.

    Input string: "/blog/category/apache"
    RewriteRule ^/blog/category/([a-zA-Z]+) /blog/index.php?category=$1
    Result: /blog/index.php?category=apache
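
    To see how $1 is filled in, the rule can be imitated in a few lines. This is only a rough model, not how Apache implements it: rewrite() is a hypothetical helper that supports nothing beyond $N back-references, and Python's re module is assumed to behave like PCRE for this pattern.

```python
import re

def rewrite(pattern, substitution, path):
    """Imitate a single RewriteRule: return the substitution with $N filled in,
    or the path unchanged when the pattern does not match."""
    match = re.search(pattern, path)
    if match is None:
        return path  # the rule does not apply
    # Replace each $N in the substitution with the text captured by group N.
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))), substitution)

rule = ("^/blog/category/([a-zA-Z]+)", "/blog/index.php?category=$1")

print(rewrite(rule[0], rule[1], "/blog/category/apache"))  # /blog/index.php?category=apache
print(rewrite(rule[0], rule[1], "/about"))                 # /about
```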

    Tricky bits:
    Some common pitfalls that people run into when dealing with regexes.

    Input string: "ccccccccc"
    Regex : "[ab]*"
    Matches : TRUE

    Why? Because the star matches ZERO or more, and there are zero 'a's and 'b's at the start of the input string. The pattern matches the empty string, so it succeeds.
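
    Looking at what was actually matched makes this pitfall visible: the match succeeds, but it is empty. A sketch using Python's re module as a stand-in for PCRE:

```python
import re

star_match = re.search("[ab]*", "ccccccccc")

print(star_match is not None)    # True: the pattern matched...
print(repr(star_match.group()))  # '' ...but it matched nothing at all
print(star_match.span())         # (0, 0): an empty match at the very start

# With a plus, at least one 'a' or 'b' is required, so there is no match.
plus_match = re.search("[ab]+", "ccccccccc")
print(plus_match)  # None
```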

    Input string: "ccccccccc"
    Regex : "[ab.]+"
    Matches : FALSE

    Why? Because a dot inside square brackets is a literal dot, not "any character". The pattern needs at least one 'a', 'b' or '.' and the input string contains none of them.

    Input string: "/blog/2007/10/06/some-title"
    RewriteRule /(.*)/(.*)/(.*) /index.php?var1=$1&var2=$2&var3=$3

    Why is this a problem? Because (.*) matches everything, including the slashes! The star is greedy, so the first (.*) grabs as much as it can: var1 will actually equal "blog/2007/10", var2 will equal "06" and var3 will equal "some-title". A safer way to match variables delimited by slashes is to use ([^/]+) instead of (.*). From our rules earlier, this means "match one or more of anything that is not a slash".
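
    Printing the captured groups shows where each piece of the URL ends up. A sketch with Python's re module in place of PCRE (both use greedy matching by default); the leading ^ on the second pattern is an addition made here for clarity:

```python
import re

path = "/blog/2007/10/06/some-title"

# Greedy (.*) groups swallow slashes, so the split is not where you expect.
greedy = re.search("/(.*)/(.*)/(.*)", path)
print(greedy.groups())  # ('blog/2007/10', '06', 'some-title')

# ([^/]+) cannot cross a slash, so each group is exactly one path segment.
safer = re.search("^/([^/]+)/([^/]+)/([^/]+)", path)
print(safer.groups())  # ('blog', '2007', '10')
```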

    Input string: "cccccccabbaccccccccc"
    Regex : "[ab]+"
    Matches : TRUE

    Why? Because the pattern is not anchored with a caret or a dollar and hence it does match the "abba" in the middle of the 'c's. Changing the regex to "^[ab]+", "[ab]+$" or "^[ab]+$" would mean that it would not match this input string anymore.
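
    The part of the input that the unanchored pattern latched onto can be printed, along with the result for each anchored variant. A sketch using Python's re module as a stand-in for PCRE:

```python
import re

text = "cccccccabbaccccccccc"

unanchored = re.search("[ab]+", text)
print(unanchored.group())  # abba

# Each anchored variant fails, because the input neither starts nor ends with a/b.
anchored_results = {
    pattern: re.search(pattern, text) is not None
    for pattern in ["^[ab]+", "[ab]+$", "^[ab]+$"]
}
print(anchored_results)  # every value is False
```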

    For a very good mod_rewrite reference, I recommend http://www.ilovejackdaniels.com/cheat-sheets/mod_rewrite-cheat-sheet/

    (Sorry, I'm still too new to be allowed live links. I'll edit this post once I'm allowed...)
     
    Ladadadada, Oct 6, 2007 IP
  2. hans

    hans Well-Known Member

    Messages:
    2,923
    Likes Received:
    126
    Best Answers:
    1
    Trophy Points:
    173
    #2
    thanks for your detailed regex guide - something i have needed for years. I'll study it and may come back with a question if needed.
     
    hans, Oct 6, 2007 IP
  3. Ladadadada

    Ladadadada Peon

    Messages:
    382
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    0
    #3
    hans: Feel free to ask questions. I have a feeling that I haven't quite covered everything yet anyway...

    Now that I can create links... that cheatsheet is clickable.
     
    Ladadadada, Oct 15, 2007 IP
  4. Dehisce

    Dehisce Peon

    Messages:
    234
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Actually this is the best .htaccess tutorial I have seen.

    Wicked !
     
    Dehisce, Oct 15, 2007 IP
  5. drnibbles

    drnibbles Peon

    Messages:
    346
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    0
    #5
    WOWOWOWOWOW.. this is great, I have been looking for something like this. Straightforward plain English. Thanks a million - rep for you
     
    drnibbles, Oct 15, 2007 IP
  6. hans

    #6
    question about my hotlink protection, which replaces hotlinked images with a standard logo
    my hotlink protection/image replacing works for all verified hotlinkers
    except for the rule below

    RewriteCond %{HTTP_REFERER} ^http://([^.]+\.)?blogspot\.com/ [NC,OR]

    I would expect images hotlinked from all sub/sub.sub domains/users of blogspot.com to be replaced by my logo.

    it works for: ( without the spaces in URL )

    http: // nature-wallpaper . blogspot . com/
    http: // elblogdesephiroth . blogspot . com/

    BUT NOT for a variation of the already existing working referrer above:

    http: // www . elblogdesephiroth . blogspot . com/

    any hint why or what to change ??

    would the replacing of the + by the * in the rule work for any number of subdomains or sub.sub.domains of that above domain ??
     
    hans, Oct 26, 2007 IP
  7. Ladadadada

    #7
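    No. In this case, the reason it doesn't match is down to the question mark. The question mark tells the regex to match 1 or 0 of the preceding block, which is everything inside the parentheses (). The block ([^.]+\.) matches a single label followed by a dot, such as "www.", so the rule allows at most one subdomain level. Changing the + to a * does not help, because [^.] still cannot match the dot between "www" and "elblogdesephiroth".

    If you change the ? to a * the regex will match any number of "label followed by a dot" blocks (including 0 of them), followed by blogspot.com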
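
    The difference between the variants discussed here can be checked directly. A sketch, assuming Python's re module in place of Apache's PCRE, with re.IGNORECASE standing in for the [NC] flag; the variable names are made up, and the referrers are the ones from the question with the spaces removed:

```python
import re

as_posted = r"^http://([^.]+\.)?blogspot\.com/"    # at most one subdomain level
star_inside = r"^http://([^.]*\.)?blogspot\.com/"  # + changed to *: still one level
star_outside = r"^http://([^.]+\.)*blogspot\.com/" # ? changed to *: any number of levels

def matches(pattern, referer):
    """Return True if the referrer matches, ignoring case like [NC]."""
    return re.search(pattern, referer, re.IGNORECASE) is not None

one_level = "http://nature-wallpaper.blogspot.com/"
two_levels = "http://www.elblogdesephiroth.blogspot.com/"

for name, pattern in [("as posted", as_posted),
                      ("star inside", star_inside),
                      ("star outside", star_outside)]:
    print(name, matches(pattern, one_level), matches(pattern, two_levels))
# as posted True False
# star inside True False
# star outside True True
```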
     
    Ladadadada, Oct 27, 2007 IP
  8. hans

    #8
    thanks for your fast solution,
    hot link protection working perfectly now :)

    would it be safe to have all such hot-link rules with a * instead of the ? just to be safe that sites creating funny sub.domains or sub.sub.accounts always are fully included ??
     
    hans, Oct 27, 2007 IP
  9. Ladadadada

    #9
    Yeah, as long as you have blogspot\.com at the end of the regex then only people on blogspot.com will be able to hotlink your images. :)
     
    Ladadadada, Oct 27, 2007 IP
  10. hans

    #10
    ??
    i miss your point here

    i want all blogspot ppl NOT to hotlink or to have the hotlinked images to be replaced by my logo
    just as it works NOW after the * changes made

    i have ONE such rule for EACH hotlinker domain out there - like hi5, myspace, spaces, blogger, etc

    for all of them i have the solution originally posted by me - hence with ? instead of *, as they do not yet follow your solution.

    hence my question for ALL other domains where i want to prevent hot-linking was: whether I can safely use your above * solution shown by you for ALL hotlinking domains to include all/any/every possible sub.sub.domain or sub.useraccount instead of the ? solution ???
     
    hans, Oct 27, 2007 IP
  11. Ladadadada

    #11
    Heh... if I'd thought about it for very long I would have realised that's what you were doing. For some reason I assumed you wanted to allow blogspot users to hotlink and deny everyone else. At the time, all I was thinking about was how to match the blogspot users, not what to do with them once we have separated them out from the rest of the internet.

    I noticed that you have [OR] after the RewriteCond so presumably you have some more RewriteConds after that one and eventually you will have a RewriteRule that runs if one of the RewriteConds matches. The RewriteRule will either look something like:

    RewriteRule /images/.* - [F,L]

    to send a 403 Forbidden HTTP response code, or:

    RewriteRule /images/.* /images/logo.png [L]

    to send your logo instead of the image they were hotlinking.
     
    Ladadadada, Oct 27, 2007 IP
  12. hans

    #12
    :)
    yes that's what i do - now u got it!

    exclude a bunch of hot link domains successfully for years - each on a line
    allow the world to hotlink because of my mRSS feeds used by some (smart) site owners and the major SE with all their variations of image search and caching files ..

    and replace the hotlink-protected images for the domains ruled by mod_rewrite by my own banner (at least i get free advertisement tens of thousands of times each month).

    the reason why i replace by banner instead of simply denying hot-linking is that my server has to work anyway and work needs some reward :) - free banner advertisement across the web is my reward for my apache's work ... :) without hotlink protection i had some 100'000+ hotlinked images - mostly fullsize wallpapers - across the web each month, sucking xx GB of server traffic/bandwidth per month.
     
    hans, Oct 27, 2007 IP