What Every Developer Should Know About URLs

I have recently written about the value of fundamentals in software development. I am still firmly of the opinion that you need to have your fundamentals down solid, if you want to be a decent developer. However, several people made a valid point in response to that post, in that it is often difficult to know what the fundamentals actually are (be they macro or micro level). So, I thought it would be a good idea to do an ongoing series of posts on some of the things that I consider to be fundamental – this post is the first instalment.

Being a developer in this day and age, it would be almost impossible for you to avoid doing some kind of web-related work at some point in your career. That means you will inevitably have to deal with URLs at one time or another. We all know what URLs are about, but there is a difference between knowing URLs like a user and knowing them like a developer should know them.

As a web developer you really have no excuse for not knowing everything there is to know about URLs, because there is just not that much to them. But I have found that even experienced developers often have some glaring holes in their knowledge of URLs. So, I thought I would do a quick tour of everything that every developer should know about URLs. Strap yourself in, this won't take long :).

The Structure Of A URL

This is easy, starts with HTTP and ends with .com right :)? Most URLs have the same general syntax, made up of the following nine parts:

<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>

Most URLs won't contain all of the parts. The most common components, as you undoubtedly know, are the scheme, host and path. Let's have a look at each of these in turn:

  • scheme - this basically specifies the protocol to use to access the resource addressed by the URL (e.g. http, ftp). There is a multitude of different schemes. A scheme is official if it has been registered with the IANA (like http and ftp), but there are many unofficial (not registered) schemes which are also in common use (such as sftp, or svn). The scheme must start with a letter and is separated from the rest of the URL by the first : (colon) character. That's right, the // is not part of the separator but is in fact the beginning of the next part of the URL.
  • username - this, along with the password, the host and the port, forms what's known as the authority part of the URL. Some schemes require authentication information to access a resource, and this is the username part of that authentication information. The username and password are very common in ftp URLs. They are less common in http URLs, but you do come across them fairly regularly.
  • password - the other part of the authentication information for a URL, it is separated from the username by another : (colon) character. The username and password will be separated from the host by an @ (at) character. You may supply just the username or both the username and password e.g.:
    ftp://some_user@blah.com/
    ftp://some_user:some_password@blah.com/
    

    If you don't supply the username and password and the URL you're trying to access requires one, the application you're using (e.g. browser) will supply some defaults.

  • host - as I mentioned, it is one of the components that makes up the authority part of the URL. The host can be either a domain name or an IP address. As we all should know, the domain name will resolve to an IP address (via a DNS lookup) to identify the machine we're trying to access.
  • port - the last part of the authority. It basically tells us what network port a particular application on the machine we're connecting to is listening on. As we all know, for HTTP the default port is 80. If the port is omitted from an http URL, this is assumed.
  • path - is separated from the URL components preceding it by a / (slash) character. A path is a sequence of segments separated by / characters. The path basically tells us where on the server machine a resource lives. Each of the path segments can contain parameters which are separated from the segment by a ; (semi-colon) character e.g.:
    http://www.blah.com/some;param1=foo/crazy;param2=bar/path.html

    The URL above is perfectly valid, although this ability of path segments to hold parameters is almost never used (I've never seen it personally).

  • parameters - talking about parameters, these can also appear after the path but before the query string, also separated from the rest of the URL and from each other by ; characters e.g.:
    http://www.blah.com/some/crazy/path.html;param1=foo;param2=bar

    As I said, they are not very common.

  • query - these on the other hand are very common as every web developer would know. This is the preferred way to send some parameters to a resource on the server. These are key=value pairs and are separated from the rest of the URL by a ? (question mark) character and are normally separated from each other by & (ampersand) characters. What you may not know is the fact that it is legal to separate them from each other by the ; (semi-colon) character as well. The following URLs are equivalent as far as the spec is concerned (although plenty of sites and frameworks only split on &):
    http://www.blah.com/some/crazy/path.html?param1=foo&param2=bar
    
    http://www.blah.com/some/crazy/path.html?param1=foo;param2=bar
  • fragment - this is an optional part of the URL and is used to address a particular part of a resource. We usually see these used to link to a particular section of an html document. A fragment is separated from the rest of the URL with a # (hash) character. When requesting a resource addressed by a URL from a server, the client (i.e. browser) will usually not send the fragment to the server (at least not where HTTP is concerned). Once the client has fetched the resource, it will then use the fragment to address the relevant part.
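
As a rough illustration of the parts listed above, here is a minimal sketch using Python's standard urllib.parse module. The URL, username, password and port are made up for the example. Note that urlparse only splits off parameters attached to the last path segment, and that splitting a query string on semi-colons has to be requested explicitly (this assumes a Python version whose parse_qs accepts the separator argument).

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that uses all nine parts; every value is made up.
url = ("http://some_user:some_pass@www.blah.com:8080"
       "/some/crazy/path.html;param1=foo?param2=bar&param3=baz#section2")

parts = urlparse(url)

components = {
    "scheme": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "parameters": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:10} {value}")

# The query string holds key=value pairs, normally separated by &.
ampersand_pairs = parse_qs(parts.query)

# Splitting on ; instead must be asked for via the separator argument.
semicolon_pairs = parse_qs("param2=bar;param3=baz", separator=";")
```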

That's it, all you need to know about the structure of a URL. From now on you no longer have any excuse for calling the fragment – "that hash link thingy to go to a particular part of the html file".

Special Characters In URLs

There is a lot of confusion regarding which characters are safe to use in a URL and which are not, as well as how a URL should be properly encoded. Developers often try to infer this stuff from general knowledge (e.g. the / and : characters should obviously be encoded since they have special meaning in a URL). There is no need to guess. You should know this stuff solid, and it's simple. Here is the lowdown.

There are several sets of characters you need to be aware of when it comes to URLs. Firstly, the characters that have special meaning within a URL are known as reserved characters, these are:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

What this means is that these characters are normally used in a URL as-is and are meaningful within a URL context (i.e. separate components from each other etc.). If a part of a URL (such as a query parameter), is likely to contain one of these characters, it should be escaped before being included in the URL. I have spoken about URL encoding before, check it out, we will revisit it shortly.

The second set of characters to be aware of is the unreserved set. It is made up of the following characters:

"-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

The characters can be included as-is in any part of the URL (note that they may not be allowed as part of a particular component of a URL). This basically means you don't need to encode/escape these characters when including them as part of a URL. You CAN escape them without changing the semantics of a URL, but it is not recommended.

The third set to be aware of is the 'unwise' set, i.e. it is unwise to use these characters as part of a URL. It is made up of the following characters:

"{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

These characters are considered unwise to use in a URL because gateways are known to sometimes modify such characters, or they are used as delimiters. That doesn't mean that these characters will always be modified by a gateway, but it can happen. So, if you include these as part of a URL without escaping them, you do this at your own risk. What it really means is you should always escape these characters if a part of your URL (i.e. like a query param) is likely to contain them.

The last set of characters is the excluded set. It is made up of all ASCII control characters, the space character, as well as the following characters (known as delimiters):

"<" | ">" | "#" | "%" | '"'

The control characters are non-printable US-ASCII characters (i.e. hexadecimal 00-1F as well as 7F). These characters must always be escaped if they are included in a component of a URL. Some, such as # (hash) and % (percent), have special meaning within the context of a URL (they can really be considered equivalent to the reserved characters). Other characters in this set have no printable representation and therefore escaping them is the only way to represent them. The <, > and " characters should be escaped since these characters are often used to delimit URLs in text.

To URL encode/escape a character we simply append its 2 character ASCII hexadecimal value to the % character. So, the URL encoding of a space character is %20 – we have all seen that one. The % character itself is encoded as %25.

That's all you need to know about various special characters in URLs. Of course aside from those characters, alpha-numerics are allowed and don't need to be encoded :).
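
To make the encoding rule concrete, here is a small sketch in Python. The percent_encode_char helper is hypothetical and only there to show the mechanism (a % followed by two hex digits); in practice you would use quote from the standard urllib.parse module. One caveat: quote follows the newer RFC 3986, so it will also encode some characters that appear in the unreserved list above, such as ! and *.

```python
from urllib.parse import quote, unquote


def percent_encode_char(ch):
    """Hypothetical helper: % plus two hex digits for each byte of ch."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


space = percent_encode_char(" ")
percent = percent_encode_char("%")
print(space, percent)

# A made-up query parameter value containing reserved characters.
# safe="" tells quote() to encode "/" too, which it leaves alone by default.
encoded_value = quote("a b&c=d/e", safe="")
decoded_value = unquote(encoded_value)
print(encoded_value, decoded_value)
```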

A few things you have to remember. A URL should always be in its encoded form. The only time you should decode parts of the URL is when you're pulling the URL apart (for whatever reason). Each part of the URL must be encoded separately. This should be pretty obvious: you don't want to try encoding an already constructed URL, since there is no way to distinguish when reserved characters are used for their reserved purpose (they shouldn't be encoded) and when they are part of a URL component (which means they should be encoded). Lastly, you should never try to double encode/decode a URL. Consider that if you encode a URL once but try to decode it twice, and one of the URL components contains the % character, you can destroy your URL e.g.:

http://blah.com/yadda.html?param1=abc%613

When encoded it will look like this:

http://blah.com/yadda.html?param1=abc%25613

If you try to decode it twice you will get:

  1. http://blah.com/yadda.html?param1=abc%613

    Correct

  2. http://blah.com/yadda.html?param1=abca3

    Stuffed
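
The same accident can be reproduced in a few lines of Python. This is only a sketch: the parameter values are made up, and urlencode from the standard library stands in for whatever your framework uses to encode query components before the URL is assembled.

```python
from urllib.parse import urlencode, unquote

# Hypothetical parameter values; the second one contains reserved characters.
params = {"param1": "abc%613", "param2": "a&b=c"}

# Encode each component first, then assemble the URL around the result.
url = "http://blah.com/yadda.html?" + urlencode(params)
print(url)

# Pull the still-encoded value of param1 back out of the URL.
encoded_value = url.split("?", 1)[1].split("&")[0].split("=", 1)[1]

decoded_once = unquote(encoded_value)   # the original value
decoded_twice = unquote(decoded_once)   # one decode too many
print(decoded_once, decoded_twice)
```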

By the way I am not just pulling this stuff out of thin air. It is all defined in RFC 2396 (since obsoleted by RFC 3986), you can go and check it out if you like, although it is by no means the most entertaining thing you can read. I'd like to hope my post is somewhat less dry :).

Absolute vs Relative URLs

The last thing that every developer should know is the difference between an absolute and relative URL as well as how to turn a relative URL into its absolute form.

The first part of that is pretty easy, if a URL contains a scheme (such as http), then it can be considered an absolute URL. Relative URLs are a little bit more complicated.

A relative URL is always interpreted relative to another URL (hence the name :)), this other URL is known as the base URL. To convert a relative URL into its absolute form we firstly need to figure out the base URL, and then, depending on the syntax of our relative URL we combine it with the base to form its absolute form.

We normally see a relative URL inside an html document. In this case there are two ways to find out what the base is.

  1. The base URL may have been explicitly specified in the document using the HTML <base> tag.
  2. If no base tag is specified, then the URL of the html document in which the relative URL is found should be treated as the base.

Once we have a base URL, we can try and turn our relative URL into an absolute one. First, we need to try and break our relative URL into components (i.e. scheme, authority (host, port), path, query string, fragment). Once this is done, there are several special cases to be aware of, all of which mean that our relative URL wasn't really relative.

  • if there is no scheme, authority or path, then the relative URL is a reference to the base URL
  • if there is a scheme then the relative URL is actually an absolute URL and should be treated as such
  • if there is no scheme, but there is an authority (host, port), then our relative URL is likely a network path, we take the scheme from our base URL and append our "relative" URL to it separating the two by ://

If none of those special cases occurred then we have a real relative URL on our hands. Now we need to proceed as follows.

  • we inherit the scheme, and authority (host, port) from the base URL
  • if our relative URL begins with /, then it is an absolute path, we append it to the scheme and authority we inherited from the base using appropriate separators to get our absolute URL
  • if the relative URL does not begin with / then we take the path from the base URL, discarding everything after the last / character
  • we then take our relative URL and append it to the resulting path, we now need to do a little further processing which depends on the first several characters of our relative URL
  • if there is a ./ (dot slash) anywhere in a resulting path we remove it (this means our relative URL started with ./ i.e. ./blah.html)
  • if there is a ../ (dot dot slash) anywhere in the path then we remove it as well as the preceding segment of the path i.e. all occurrences of "<segment>/../" are removed, keep doing this step until no more ../ can be found anywhere in the path (this means our relative path started with one or more ../ i.e. ../blah.html or ../../blah.html etc.)
  • if the path ends with .. then we remove it and the preceding segment of the path, i.e. "<segment>/.." is removed (this means our relative path was .. (dot dot))
  • if the path ends with a . (dot) then we remove it (this most likely means our relative path was . (dot))

At this point we simply append any query string or fragment that our relative URL may have contained to our URL using appropriate separators and we have finished turning our relative URL into an absolute one.
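
The path-merging steps above can be sketched in Python roughly as follows. The merge_paths function is a hypothetical helper written for this illustration, not a library call. It only deals with the path, so the scheme, authority, query string and fragment still need to be handled around it. It also makes one choice the steps above don't spell out: a ../ that would climb above the root is silently dropped rather than treated as an error.

```python
def merge_paths(base_path, relative_path):
    """Hypothetical helper: resolve relative_path against base_path."""
    if relative_path.startswith("/"):
        # An absolute path replaces the base path entirely.
        merged = relative_path
    else:
        # Keep the base path up to and including its last "/".
        merged = base_path[: base_path.rfind("/") + 1] + relative_path

    segments = merged.split("/")
    resolved = []
    for segment in segments:
        if segment == ".":
            continue  # "./" adds nothing
        if segment == "..":
            # Drop the preceding segment, but never climb above the root.
            if len(resolved) > 1:
                resolved.pop()
            continue
        resolved.append(segment)

    # A trailing "." or ".." leaves us pointing at a directory.
    if segments[-1] in (".", ".."):
        resolved.append("")
    return "/".join(resolved)


base_path = "/yadda1/yadda2/yadda3"
for relative in ["rel1", "/rel1", "../rel1", "./rel1", "..", "."]:
    print(relative, "->", merge_paths(base_path, relative))
```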

Here are some examples of applying the above algorithm:

1)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: rel1

final absolute: http://www.blah.com/yadda1/yadda2/rel1

2)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: /rel1

final absolute: http://www.blah.com/rel1

3)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ../rel1

final absolute: http://www.blah.com/yadda1/rel1

4)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ./rel1?param2=baz#bar2

final absolute: http://www.blah.com/yadda1/yadda2/rel1?param2=baz#bar2

5)
base: http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar
relative: ..

final absolute: http://www.blah.com/yadda1/
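
If you would rather not do this by hand, the five examples above can be checked with urljoin from Python's standard urllib.parse module. This is just a sketch and assumes a reasonably recent Python 3. The last line also shows the network path special case, using a made-up host (www.other.com).

```python
from urllib.parse import urljoin

base = "http://www.blah.com/yadda1/yadda2/yadda3?param1=foo#bar"

# The relative URLs from examples 1 to 5.
relatives = ["rel1", "/rel1", "../rel1", "./rel1?param2=baz#bar2", ".."]

resolved = {relative: urljoin(base, relative) for relative in relatives}
for relative, absolute in resolved.items():
    print(f"{relative:25} {absolute}")

# No scheme but an authority: only the scheme is inherited from the base.
network_path = urljoin(base, "//www.other.com/rel1")
print(network_path)
```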

Now you should be able to confidently turn any relative URL into an absolute one, as well as know when to use the different forms of relative URL and what the implications will be. For me this has come in handy time and time again in my web development endeavours.

There you go, that's really all there is to know about URLs. It's all relatively simple (forgive the pun :)) so no excuse for being unsure about some of this stuff next time. Talking about next time, one of the most common things you need to do when it comes to URLs is recognise if a piece of text is in fact a URL, so next time I will show you how to do this using regular expressions (as well as show you how to pull URLs out of text). It should be pretty easy to construct a decent regex now that we've got the structure and special characters down. Stay tuned.

Images by jackfre2, saucebomb and Steve Nilsen

  • http://www.dutor.net/ dutor

    Thanks a lot.
    Anticipating the following posts.

  • http://www.michevan.id.au/ Evan

    You might want to look at RFC3986 which obsoletes RFC2396. http://www.ietf.org/rfc/rfc3986.txt

    The encoding aspect starts getting a little more complicated if you also look at section 17.13.4 of the HTML 4 specification – http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4

    Specifically, you’ll very often see the practice of using a plus sign instead of “%20” to represent a space. I’ve not seen a browser or web framework that does not handle this.

    Another little gotcha when putting URLs into HTML code (e.g. as the “href” attribute of an “a” element) is replacing the ampersands separating your parameters with the HTML entity “&amp;” (see http://www.w3.org/TR/html4/appendix/notes.html#ampersands-in-uris ). You more or less have to do HTML escaping after doing the URL escaping. :-)

    I never knew about the ability to have parameters on a path component. Weird! Has anyone ever seen this used in real life? I wonder what software would break if given one of those?

    Regarding pulling URLs out of text using regular expressions, you might want to have a look at the code twitter recently released, which is what they use to do just that.
    Java version: http://github.com/mzsanford/twitter-text-java
    Ruby version: http://github.com/mzsanford/twitter-text-rb

    I’m sure it would contain some interesting edge cases.

  • Nishith Desai

    Great post!!! Carry on great work.

  • http://daniel.hahler.de/ Daniel

    I’ve only skimmed the article, but it appears that you are not considering “scheme relative, but absolute” URLs, like “//google.com/foo/”; they’re meant to be relative to the current scheme (e.g. “https” – “http” by default for browsers).

    • http://www.skorks.com Alan Skorkin

      Hey Daniel,

      I actually do mention those as a special case:

      “…if there is no scheme, but there is an authority (host, port), then our relative URL is likely a network path, we take the scheme from our base URL and append our “relative” URL…”

      Is that what you meant?

      • Malcolm

        I think he means when a URL in a webpage begins with //, it is relative to the scheme. A network path begins with \\ (forward slash vs backslash).

        Using this format allows you to use the same URL on an https page as an http page. For instance, if I want to include jQuery from the Google Ajax library, but my page may or may not be secure, I would set the src of the script tag to //ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js. That way, if the page is encrypted, so is the script and vice versa. I only need one URL.

  • http://www.rooftopsolutions.nl/ Evert

    I hope unicode is also on your list of fundamentals.

    • http://www.skorks.com Alan Skorkin

      Hi Evert,

      It certainly is. I think that these days you can get away without knowing unicode (or a bit about encoding in general) for quite a while, but it will catch up with you eventually and in the meantime you’ll have no idea what you’re doing when it comes to things like string encoding, which you do use every day even if you don’t realise. That’s what happened with me, so I have first hand experience :).

  • http://blog.est.im/ est
    • http://www.skorks.com Alan Skorkin

      Cheers for that, it was very useful to actually find out in what context the ‘network style’ (i.e. ones that include the hostname) relative urls come into play.

  • Jan Aagaard
    • http://www.skorks.com Alan Skorkin

      Ahh yes very true, thanks for the correction.

  • http://lebenplusplus.de/ Gabriel

    I looked at the specification to see if there is a standard way to encode non-US-ASCII values (German Umlauts for example). Section “2.1 URI and non-ASCII characters” answered my question, however not the answer I was looking for: It’s not defined and can’t be specified in the URL in a standardized way. The character encoding used is application specific. Meh.

    • http://www.skorks.com Alan Skorkin

      Hi Gabriel,

      Yeah that is actually a historical issue, a lot of stuff relating to the web is very anglo-centric. This is why, for example, we have such a mess when it comes to character encodings. URLs are just another example of that.

  • blueshifter

    You missed a mention of the potential pitfalls, e.g. if you have http://example.com/foo as the base url and ../../../../../../etc/passwd as the relative one.

    • http://www.skorks.com Alan Skorkin

      Very good point, something like that should be considered an error.

  • Jesse R. Castro

    As a general rule, while you *CAN* send a password in a URL, you never should.

    • http://www.skorks.com Alan Skorkin

      Hi Jesse,

      That is sensible advice if you’re coding it, but often you can’t avoid it when you’re the user and some server is making you do it :).

  • Proogie

    Please note that just b/c a URL starts with “http” does *not* make it absolute. Example:

    http%3A%2F%2Fexample.org%2F

    This is *relative* url.

    See the RFC on relative URLs: http://www.ietf.org/rfc/rfc1808.txt

    • http://www.skorks.com Alan Skorkin

      Thanks for that, that’s three RFCs now that have been mentioned that deal just with URLs, it’s a crazy world :).

  • nolochemical

    Great work, it’s good to see knowledge refreshed with clarity. x2 @Evert good point.

  • iama
    • http://www.skorks.com Alan Skorkin

      I try to make things as concise as I can (given that I am normally pretty verbose :)), unfortunately some things – even in summary form – will be pretty long. The only thing to do in that case is break them up into a series of articles.

  • http://www.localvox.com Trevor Sumner

    It’s more business level, but I think it’s also important to discuss URLs in the context of SEO. How to use “-” to separate keywords and make URLs keyword dense. How to prevent duplicate content penalties. Smart folder organization. etc.

    • http://www.skorks.com Alan Skorkin

      Hi Trevor,

      I agree with you, it is very relevant, but then again I like learning about internet marketing :), many developers seem to have an intrinsic hate of anything to do with SEO including the word itself.

  • http://www.littledetails.co.uk simon

    good info on urls, learned a few new technical things, although i was expecting something more on how best to include urls in a website for maximum search engine impact. here is something i wrote on that subject: http://www.littledetails.co.uk/winning-website-designs/seo-url-structure.php

  • lee doolan

    I would like to point out that one use for the fragment identifier is to cause a CDN refresh of, for instance, an updated .css or .js file. You might load a javascript file like this:

    where the fragment id is changed every time the contents of the somescript.js file changes.

    • lee doolan

      This is what I meant:
      <script type="text/javascript" src="http://cdn.mysite.com/path/jsfile.js#12345678"></script>

      • http://www.skorks.com Alan Skorkin

        Ahh yes that makes sense, was a little confusing before :). Although, if you version your js and css files and increment every time you release, would you ever need to do this.

      • Dejay Clayton

        I don’t think it works like that. The fragment identifier is not sent to the server during the HTTP request, therefore it should not affect the caching of anything.

        Depending upon how the CDN is setup, the query string parameters might cause the cache to refresh. For instance, Akamai is typically configured to return 304 Not Modified for query string parameters it hasn’t seen recently. Thus, if Akamai is serving:

        image.jpg

        The first time that Akamai sees “image.jpg?” or “image.jpg?foo=bar”, Akamai will refresh the cache, but not for subsequent requests with the same parameters. You can verify this with Live HTTP Headers.

        • Dejay Clayton

          Sorry for the confusion, I mean to say that Akamai typically returns 304 Not Modified for query string parameters it has seen recently, and 200 OK for query string parameters it hasn’t seen recently.

  • Anon

    The character lists are really hard to read. How about just separating them with spaces instead of throwing in quotation marks and pipes?

    “{” | “}” | “|” | “\” | ???

    • http://www.skorks.com Alan Skorkin

      That’s how they do it in the RFCs, it seems to be the most familiar format for most people.

  • http://softwarepdp.com Matt

  • Matthew W

    Someone already pointed at the newer RFC I think, but a more detailed heads-up: your info about classes of reserved characters is out of date and somewhat incorrect – might wanna check out “2.2. Reserved Characters” in http://www.ietf.org/rfc/rfc3986.txt, in particular the two different reserved character classes gen-delims and sub-delims.

    Knowing about sub-delims is useful if you want to delimit sub-components of URI path components for your own application-specific purposes, and you still want to use a single level of percent-encoding for escaping your delimiter character(s) within these subcomponents. In this case you are pretty much limited to using sub-delims as your delimiters. They’re characters where the percent-encoded version must not be assumed equivalent to the non-percent-encoded character.

  • David Beaudoin

    Nice article with lots of useful info. If I may add something to the pile, it’s to your benefit to always add the ending “/” in your URLs. A little known fact is that when you go to a website without the trailing slash — i.e. google.com instead of google.com/ — your browser makes two requests instead of one.

    I know it’s probably super nitpicky and doesn’t matter in the grand scheme of things, but I thought it was a neat factoid when someone told it to me. :)

    • Dejay Clayton

      Be careful when you use the word “always” :)

      This all depends upon how the web server is configured. It’s usually good practice for web servers to 301 REDIRECT PERMANENT when a URL is requested that doesn’t end in “/”, when the URL represents a directory or other non-file resource. But this is not mandatory, and in fact, doesn’t make sense when you’re requesting a file resource or parameterized request:

      http://www.google.com/search?hl=en&q=hello/

      is different than (and incorrect compared to):

      http://www.google.com/search?hl=en&q=hello

  • Dejay Clayton

    Excellent summary, you wouldn’t believe how many web developers don’t know these basic facts.

    One useful point about the “base” tag – you can download HTML documents from the web, insert a “base” tag at the top with the original site’s URL, and you’ll be able to render the document properly with all of the CSS and images being served from the original site. This is useful if you want to debug web pages or change them locally for design purposes.

    • http://www.skorks.com Alan Skorkin

      Hi Dejay,

      Now that is very useful tip, thanks for sharing. I didn’t even consider this, but it makes perfect sense that this would work, definitely handy.

  • Rob Whelan

    I’ve seen the semicolon-separated params before — that’s how Tomcat (and other Java servlet containers, I believe) stick a session key into an existing URL (impl of URL rewriting instead of cookies).
    You’ll see something like http://example.com/path/page.jsp;jsessionid=12345?query=string

    I always figured that was some kind of obscure & little-used option in the standard (but never bothered to look it up), and so it is….

    If you google for “semicolon jsessionid” or something similar you’ll find a ton of people confused by it and/or frustrated by software somewhere that didn’t correctly implement the standard, and so chokes on these URLs.

    • http://www.skorks.com Alan Skorkin

      Hi Rob,

      Of course you’re right, considering how much java web development I’ve done I should have picked up on that. I guess that’s what you get for doing a bunch of Ruby for the last year :).

  • Mark

    There is a difference (at least in normal usage) between ‘escaping’ a character and ‘encoding’ a character. Escaping a character usually means to precede it with a special escape character, like a \ or a ~, to prevent the normal special handling of that character from occurring, allowing you to use a character that normally has a special meaning.

    When you started talking about escaping characters above, I thought that there was some means to escape these characters that I had never seen. But, alas, you were only saying escape when you meant encode.

    • http://www.skorks.com Alan Skorkin

      You’re right, I used the two terms interchangeably, but it should be URL encoding every time I used the word escaping.

  • http://www.boolean.co.nz Boolean Value

    An interesting article, something that a lot of people deal with on a daily basis, and yet never really think twice about.

    Thanks for taking the time to make a concise summary with a well written style.

  • Dejay Clayton

    Semi-colons are not equivalent to query string parameters, and most sites won’t work properly when you replace ampersand with semi-colon. Try the following:

    http://www.google.com/search?hl=en&q=hello (works)

    http://www.google.com/search?hl=en;q=hello (doesn’t work)

    • http://www.skorks.com Alan Skorkin

      They are equivalent according to the spec it is just that most sites don’t treat them as such.

  • Dejay Clayton

    URLs such as the following make it seem that URL fragments are sent to server:

    http://www.google.com/#hl=en&q=hello

    For example, how could the above URL return search results for “hello” if the URL fragment is never sent to the server?

    In such cases, the sites usually have JavaScript that examine the URL, including its fragment. Load the above URL and then type the following into the address bar of your browser:

    javascript:alert (document.location)

    Web servers can send back the received URL in a hidden field such as:

    Then, JavaScript on the page can perform the following logic:

    if (document.location != document.all["document.location"].value) { transformFragmentsIntoQueryStringAndRedirect(); }

    The sneaky thing about this is that Google can create links to affiliate sites in which the affiliate site servers won’t ever get the information denoted by the URL fragment, but yet Google’s JavaScript can then send that information to their own servers.

    • Dejay Clayton

      Sorry, my above post had a hidden input field that was stripped from my example. It should be:

      [input type="hidden" name="document.location" value="http://www.google.com/" /]

      Replace the brackets with angles.

  • http://qtp.blogspot.com Sachin

    this is absolutely incredible info on URL I have ever read..thanks

  • Hitesh Chavda

    Simple and to the point article.
    Waiting for the next fundamental thing.

    Please, keep it up.

  • Tom

    Very informative, I’ll bookmark this to forward it to my clients when needed.

  • http://denova.com Dan Powell

    An article on url basics good enough to attract dozens of comments on the advanced aspects is amazing, Alan. It made me look at your other posts, and I realized you’re the same person who wrote “Software Development And The Sunk Cost Fallacy”. You’ve got a new reader.

    • http://www.skorks.com Alan Skorkin

      Hi Dan,

      Thanks very much, hope you keep enjoying the stuff I’ll write in future.

  • http://www.newviewit.com Website Design

    Excellent review! It’s amazing how much you forget after a few years of neglecting a certain area such as URL structure. Bookmarked!

  • Pingback: What Every Developer Should Know About URLs | Dev Loom

  • Pingback: Everything You Ever Wanted to Know About URLs but were Afraid to Ask | The Minority Report

  • Pingback: Find out why a program can't seem to access to a file | Amit Agarwal

  • Pingback: The Full Details on URLs

  • Jeff L

    Something else nice to know about URLs is that there is a special one set aside for use in posts like this (some of your other comments have used it):
    http://example.com/

  • Geeks in Minutes

    Very nice article and great insight in to URL structure.

    Thanks

  • Pingback: davidvoegtle.net » Blog Archive » Daily links 05/07/2010

  • Pingback: Frank Carver's Punch Barrel / What Every Developer Should Know About URLs

  • stinky

    I don’t think the part about having to escape plus signs is right. As someone already pointed out, plus signs are widely used to encode spaces in query params.

    And you should never use semicolons. Remember when Rails tried to introduce these with URLs like http://example.org/posts/1;edit ? Hell broke loose because WebKit (I think) didn’t understand the semicolon and just stripped that part.

    You shouldn’t use semicolons to separate query params either, because many sites and frameworks just pick queries apart with a simple .split('&'), so all params would end up baked into one.

    All that may be in the standard, but if browsers don’t implement it and there’s not a chance they ever will, we shouldn’t use or teach it. Much of what the RFCs say can be part of an email address, won’t work with actual mail servers/clients either. And it shouldn’t, because it’s pure craziness. Or look at SGML syntax — I’m glad we’re only using a small subset of that in HTML, even if pre-5 HTML accepts much more in theory.
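
    To illustrate the query-splitting problem, here is a small Python sketch. It assumes Python 3.10 or later, where urllib.parse.parse_qs splits on "&" by default and accepts a separator argument; the query string is a made-up example.

```python
from urllib.parse import parse_qs

query = "id=1;action=edit"

# Naive parsing of the kind described above: split on "&" only.
naive = dict(pair.split("=", 1) for pair in query.split("&"))

# parse_qs also splits on "&" by default, so both params end up in one value.
default_sep = parse_qs(query)

# Semicolons are only honoured when the separator is requested explicitly.
semicolon_sep = parse_qs(query, separator=";")

print(naive)          # {'id': '1;action=edit'}
print(default_sep)    # {'id': ['1;action=edit']}
print(semicolon_sep)  # {'id': ['1'], 'action': ['edit']}
```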

    • http://www.moonfruit.com adam lounds

      @stinky sorry, but if the framework I’m using supports semicolons then I’ll use them. Why should the fact that there are broken frameworks out there (that I’m not using) affect the code I write?

      I’m not sure that a semicolon is valid outside the query string – http://example.org/posts?id=1;action=edit is valid tho

  • http://blog.izs.me/ Isaac Z. Schlueter

    This is a good write-up. URL parsing is relatively trivial when you know what you’re dealing with, but URL resolution is definitely not. (You’ve covered the “standard” case well, but there are SO many more non-standard cases.)

    Note that in practice, there are a LOT of edge cases, and the de facto standard deviates from the specification in a few of them. If you’re going to write a URL resolution library, the best course of action is to get a very thorough suite of tests.

    NodeJS has an extensive collection of URL resolution tests (and a library that passes them.) The goal was to resolve URLs exactly the same way that Firefox and WebKit do it. The url parsing library uses the (arguably weird) naming conventions of the browser, rather than the naming conventions of the spec (which you use here), but the logic is sound.

    It’s very liberally licensed, so feel free to use it.
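
    As a rough idea of what such a test suite looks like, here is a handful of resolution cases written against Python's urllib.parse.urljoin rather than the Node.js library mentioned above. The base URL and the cases are made up for illustration and cover only the well-behaved situations, not the browser quirks.

```python
from urllib.parse import urljoin

base = "http://example.com/a/b/c?x=1"

# Relative reference -> expected absolute URL.
cases = {
    "d": "http://example.com/a/b/d",                  # sibling of the last segment
    "../d": "http://example.com/a/d",                 # one directory up
    "/d": "http://example.com/d",                     # absolute path
    "?y=2": "http://example.com/a/b/c?y=2",           # query only, path is kept
    "#frag": "http://example.com/a/b/c?x=1#frag",     # fragment only, query is kept
    "//other.example/d": "http://other.example/d",    # scheme-relative
}

for reference, expected in cases.items():
    resolved = urljoin(base, reference)
    status = "ok" if resolved == expected else "MISMATCH"
    print(f"{reference!r:22} -> {resolved}  [{status}]")
```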

    • http://www.skorks.com Alan Skorkin

      Hi Isaac,

      That is sound advice and I agree. I am a big fan of spending the time to learn how things work under the hood, but this does not normally require you to be exhaustive. If someone has already gone to the trouble of building exactly what you need, production quality, then use it if you can. Plus if it doesn’t do what you need, since you know how things work under the hood you can always fix it :).

  • Kishore Mylavarapu

    Hey skorks. I am a student of Computer Science. I want to develop an application using JSP and Servlets with MySQL. Can you please guide me (and my whole class of 60 students) so that we know how to build a project step by step? We are good at programming and can code in Java, but our problem is that we don’t know how to start. I hope you will help me.
    Thanks in advance.

    • http://www.skorks.com Alan Skorkin

      Hi Kishore,

      There are all sorts of implications in what you ask. What kind of application do you want to develop? Do you want to just learn the basics, without regard for current best practice, or do you want to know how things are done in industry? There is value in both approaches, depending on your goals. I’ve actually been planning to blog a bit more about various interesting bits from the Java world, so you would certainly learn something from that once I write it.

      If you need help with something specific, I would be happy to see how I can help, but you have to be more specific regarding what you need. What are you trying to build, why are you trying to build it, why have you made the technology choices that you have made (java, jsp, etc.), what specifically do you not understand, what have you tried already and what hasn’t worked?

  • Pingback: phunculist

  • Pingback: Weekly Link Post 144 « Rhonda Tipton's WebLog

  • Pingback: Markus Tamm » Blog Archive » Links 11.05.2010

  • Pingback: URLs: all there is to know.. - intotheweb

  • Pingback: Tips On Submitting Your Articles | Lead Marketer

  • Pingback: Pedro Newsletter 01-06.05.2010 « Pragmatic Programmer Issues – pietrowski.info

  • http://www.e-deshi.com MukheModhu

    Hi Alan Skorkin, thanks for your nice article. Although I have a website, there is still a lot I don’t know about URLs and how they work. I am having a problem with my URL. There are two addresses for my website: one is http://www.e-deshi.com and the other is http://e-deshi.com. I am wondering why that is and how to end up with only one address. The content is the same at both addresses, but the hit counter I have placed there shows two different results.

    • http://www.skorks.com Alan Skorkin

      Hi, you’re welcome. Your issue is more of an HTTP server configuration issue. If you’re using Apache, you can configure it to 301 redirect one of your URLs to the other; this way only one will be the definitive URL. I am sure you can do the same with any other server as well.
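
      The logic behind such a redirect is small. Here is a server-agnostic sketch in Python (not Apache configuration); it assumes the www form is the one chosen as canonical, and the names CANONICAL_HOST, ALIASES and redirect_for are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.e-deshi.com"  # assumption: the www form is the preferred one
ALIASES = {"e-deshi.com"}           # other hosts that serve the same content


def redirect_for(url):
    """Return (301, canonical_url) if url uses an alias host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALIASES:
        return None
    netloc = CANONICAL_HOST
    if parts.port is not None:
        netloc += f":{parts.port}"  # keep an explicit port if one was given
    canonical = urlunsplit(
        (parts.scheme, netloc, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, canonical


print(redirect_for("http://e-deshi.com/about?x=1"))
print(redirect_for("http://www.e-deshi.com/"))
```

      With every alias redirected this way, visitors and counters only ever see one address.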

  • http://indiegamereviewer.com Indie Game Eddie

    Another huge post from you, Alan. I’ve been lurking on this site for a while and really could leave this comment on any number of entries, but seeing that this one will appeal even to my non-developer friends, I figured it was as good a place as any to leave kudos.

    Thanks for blowing another hole in my bubble of thinking I know a lot about anything :)

    • http://www.skorks.com Alan Skorkin

      You’re welcome, glad you liked it.

  • http://urlparser.com/ sberhan

    Great article. Interesting you pointed out the use of a semicolon as a separator in a query string. According to the W3C, it seems to be simply a recommendation to use a semicolon instead of “&” which is perhaps why a number of parsing functions in various languages (eg. php’s , .Net) are hard-coded to parse with “&” in the query string:
    http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2

    “We recommend that HTTP server implementors, and in particular, CGI implementors support the use of “;” in place of “&” to save authors the trouble of escaping “&” characters in this manner.”
    For example, see the results of parsing any URL via:

    http://urlparser.com/

    Besides, the recommendation to avoid “&” doesn’t help much when in fact the “&” needs to be escaped throughout the values of HTML tag attributes. For example,
    Blah is invalid. It needs to be escaped as: Blah
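
    A hedged illustration of the escaping in question, using Python's standard html.escape. The link target is a made-up example, not the markup from the original comment.

```python
from html import escape

href_with_amp = "/search?hl=en&q=hello"
href_with_semicolon = "/search?hl=en;q=hello"

# "&" inside an attribute value has to be written as "&amp;" in HTML.
link_amp = f'<a href="{escape(href_with_amp, quote=True)}">Blah</a>'

# A ";" separator needs no escaping, which is the point of the W3C note.
link_semicolon = f'<a href="{escape(href_with_semicolon, quote=True)}">Blah</a>'

print(link_amp)        # <a href="/search?hl=en&amp;q=hello">Blah</a>
print(link_semicolon)  # <a href="/search?hl=en;q=hello">Blah</a>
```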

  • Pingback: Al surfend over het web 12 juli | Contentgirls

  • http://www.breezetree.com/ Nicholas Hebb

    The format that has always puzzled me is that of WordPress URLs, e.g.,

    http://www.somesite/blog/index.php/wtf-is-this-part-called/

    What do you call that last part of the URL?

  • Pingback: Lo que todo programdor debería saber sobre… | Vientos de Libertad

  • Dejay Clayton

    Nicholas,

    That last part of the path is usually referred to as the “Extra Path” or “Path Info”. It’s not a part of the URL spec, but rather, part of the spec of servers that process URLs. For example, CGI:

    http://www.w3.org/Daemon/User/CGI/Overview.html#PATH_INFO

    It’s not part of the URL spec because only the server can determine whether the extra path info is in fact extra information or not. In your example, one server might implement “index.php” as a PHP script, and another server might have “index.php” as a regular directory that contains files to be served.
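
    A small Python sketch of that decision, under the assumption that the server can answer "is this prefix a script?". The function split_path_info and the is_script predicate are hypothetical names, not part of CGI or any server API.

```python
def split_path_info(path, is_script):
    """Split a URL path into (script_name, path_info).

    is_script is a caller-supplied predicate, because only the server
    knows which path prefix maps to an executable script.
    """
    segments = path.split("/")
    for i in range(1, len(segments) + 1):
        candidate = "/".join(segments[:i])
        if candidate and is_script(candidate):
            rest = segments[i:]
            path_info = "/" + "/".join(rest) if rest else ""
            return candidate, path_info
    # No prefix is a script: the whole path names a plain resource.
    return path, ""


path = "/blog/index.php/wtf-is-this-part-called/"

# Server A: index.php is a script, so the remainder is extra path info.
print(split_path_info(path, {"/blog/index.php"}.__contains__))

# Server B: index.php is just a directory, so there is no extra path info.
print(split_path_info(path, lambda prefix: False))
```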

  • NK Roto Moulding Mould

    it was really good.

  • http://www.thedirecttree.com John

    The world of HTML is so complex nowadays that even an experienced programmer has a hard time following the latest tags and ways of doing things…

  • http:// 

    I ACTUALLY appeared to be very pleased to find this web-site.I wanted to thank you for your precious time for the wonderful learn!! I certainly having fun with every little bit of this and I’ve you book-marked to check out latest stuff you post.

  • Private Jet Hire

    Great topic, I’ve been particularly interested in the part concerning rare symbols.

  • Nisaea

    Thanks a lot, this is a great article, gathering all the important stuff that is usually completely diluted in various docs. Totally bookmarked!

  • Sridhar

    Thanks for this article. I liked the way you presented the content in a simple to read and understand manner. Seriously, I did not know URL had so much behind it.

  • Pingback: Four short links: 7 May 2010 - O'Reilly Radar

  • Pingback: Can I have an URL query with numerical, possibly equal (non-unique) keys?

  • http://castle-soft.com/ Leandro Camaño

    Thanks a lot!