

 

Chapter 5 Part 1:

Developing with Uniform Resource Identifiers (URIs)

 

 

 

What is covered in Chapter 5, Part 1?

  1. Introduction

  2. Key Components of a URI

  3. Scheme Component

  4. Authority Component

  5. Path Component

  6. Query Component

  7. URI Types

  8. Working with System.Uri

  9. Canonicalization

  10. Comparing URIs

  11. Working with Schemes

  12. Parsing Host Names

  13. Using System.Uri in Your Application

  14. When to Use System.Uri

  15. When Not to Use System.Uri

 

 

Introduction

 

As the Web began to take shape, many problems associated with its massive growth arose. Being able to identify objects or resources in a way that avoided conflicts was one key problem for the creators of the Web. These objects were often files such as documents, graphics, or programs. Uniform Resource Identifiers (URIs) were created to solve the problem of unique object identification by specifying a universal set of namespaces that can be used to identify all resources. URIs play a critical role in network development because users often need to either interact with or refer to resources that are represented by URIs.

This chapter covers the components of a URI and introduces you to System.Uri, the Microsoft Windows .NET Framework class used to represent a URI. We’ll discuss the most common techniques used when manipulating URIs with System.Uri and then delve into the aspects that developers often struggle with, such as understanding the escaping logic, comparing URIs, exposing URIs in your application, and working with different URI schemes.

 

Key Components of a URI

 

A URI is defined in Request for Comments (RFC) 2396 as a "compact string of characters for identifying an abstract or physical resource." In general, a URI is made up of two parts:

 

  1. The scheme and
  2. The scheme-specific part.

 

In Figure 5-1, you can see that the scheme is often associated with protocols seen on the Web today, where the scheme-specific part identifies the resource. Many scheme-specific parts of a URI also have an authority, a path, and a query, but these parts are not required.

 


 

Figure 5-1: Syntax and examples of the required URI parts

 

In Figure 5-2, you’ll notice that the authority is often used to contain what is commonly considered to be the host name or host address. The path might contain file names, and the query is used to specify name/value pairs of information.

 


 

Figure 5-2: Syntax and examples of common (but not required) URI parts
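As a rough illustration of how these parts map onto a concrete URI, the following sketch uses System.Uri (introduced later in this chapter) to print each component. The class name UriParts, the Describe helper, and the sample URI are illustrative assumptions, not part of the framework.

```csharp
using System;

public static class UriParts
{
    // Returns the general components of an absolute URI as
    // "scheme|authority|path|query". The separator is arbitrary.
    public static string Describe(string text)
    {
        Uri uri = new Uri(text);
        return uri.Scheme + "|" + uri.Authority + "|" + uri.AbsolutePath + "|" + uri.Query;
    }

    public static void Main()
    {
        // Prints: http|www.contoso.com|/products/list.aspx|?name=soap
        Console.WriteLine(Describe("http://www.contoso.com/products/list.aspx?name=soap"));
    }
}
```

Note that the Query property keeps the leading question mark, so code that splits name/value pairs usually needs to trim it first.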

 

Scheme Component

 

The scheme determines the logic that’s used for parsing and, in cases where possible, for resolving the resource specified in the scheme-specific part. Scheme names are defined in lowercase. However, because URIs are not always machine generated, most applications will accept a scheme in a case-insensitive manner.
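The case-insensitive handling of schemes can be seen with System.Uri, which accepts an uppercase scheme and reports it in lowercase. The sketch below is only an illustration; the SchemeCase class and IsHttp helper are hypothetical names.

```csharp
using System;

public static class SchemeCase
{
    // True when the URI text uses the HTTP scheme, regardless of letter case.
    public static bool IsHttp(string text)
    {
        Uri uri = new Uri(text);
        // Uri.Scheme is reported in lowercase, so a plain comparison against
        // the predefined Uri.UriSchemeHttp constant ("http") is sufficient.
        return uri.Scheme == Uri.UriSchemeHttp;
    }

    public static void Main()
    {
        Console.WriteLine(IsHttp("HTTP://www.contoso.com/"));   // True
        Console.WriteLine(IsHttp("ftp://ftp.contoso.com/"));    // False
    }
}
```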

 

Authority Component

 

The authority component of a URI is defined as the top hierarchical element of the URI that governs the remainder of the namespace defined by the URI. The authority component is often preceded by a double slash (//). For example, consider the following URI:

 

http://www.contoso.com/products/list.aspx?name=soap

 

The authority in this example is www.contoso.com, and it’s responsible for the remainder of the namespace.

Many schemes designate protocols that include a default port number to be used when resolving the resource. If the port number is omitted from the authority, the default is used. In this example, the default port number for the HTTP protocol is 80. An authority could specify a non-default port number, as follows:

 

http://www.contoso.com:8080/products/list.aspx?name=soap
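The two URIs above can be compared with a short sketch that reads the authority and port through System.Uri. The PortInfo class and its Describe helper are illustrative names only.

```csharp
using System;

public static class PortInfo
{
    // Formats the authority, the effective port, and whether that port
    // is the default for the URI's scheme.
    public static string Describe(string text)
    {
        Uri uri = new Uri(text);
        return uri.Authority + " port=" + uri.Port + " default=" + uri.IsDefaultPort;
    }

    public static void Main()
    {
        // Prints: www.contoso.com port=80 default=True
        Console.WriteLine(Describe("http://www.contoso.com/products/list.aspx?name=soap"));
        // Prints: www.contoso.com:8080 port=8080 default=False
        Console.WriteLine(Describe("http://www.contoso.com:8080/products/list.aspx?name=soap"));
    }
}
```

The Port property reports 80 even when the port is omitted from the string, while the Authority property includes the port only when it differs from the scheme's default.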

 

Path Component

 

The path is used to further identify the resource within the scope of the specified scheme and authority (when an authority is present). The path is often preceded by a single slash (/) and might contain multiple path segments separated by a slash. For example, the following URI contains three path segments:

 

http://www.contoso.com/products/new/list.aspx?name=soap

 

In this example, the segments are products, new, and list.aspx.

 

Take note that list.aspx is a path segment. The general syntax for a URI does not define the notion of a file and an extension. Developers should be careful about making any assumptions based on the notion that a file name and an extension can be reliably parsed from a URI. There’s no guarantee that a segment that looks like a file and an extension is not just a directory with a dot in the middle.
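To see the segments without assuming anything about files or extensions, you can read the Segments property of System.Uri, as in the sketch below. The PathSegments class and Split helper are hypothetical names. Be aware that System.Uri reports the leading slash as its own entry and keeps the trailing slash on each intermediate segment, so the array has four entries for the three segments described above.

```csharp
using System;

public static class PathSegments
{
    // Returns the path segments exactly as System.Uri reports them.
    public static string[] Split(string text)
    {
        return new Uri(text).Segments;
    }

    public static void Main()
    {
        string[] segments = Split("http://www.contoso.com/products/new/list.aspx?name=soap");
        // Entries are "/", "products/", "new/", and "list.aspx".
        foreach (string segment in segments)
        {
            Console.WriteLine(segment);
        }
    }
}
```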

 

Query Component

 

The query, also called the query string, represents a string of information that’s interpreted by the resource. Although the query string is often used to provide information to be interpreted by the resource, developers should be aware of the fact that the query component is considered part of the URI, so its contents can be used to determine the resource that’s obtained when the URI is resolved.

Note that some URI schemes, such as HTTP, support the notion of a fragment. A fragment is a string of information that’s meant to be interpreted by the program resolving the URI and is not considered part of the URI. The fragment is separated from the URI by a crosshatch (#) character. For example, suppose the following URI and fragment are entered into a browser:

 

http://www.contoso.com/products/new/list.htm#newproducts

 

In this example, the newproducts fragment might be interpreted by the browser as a bookmark telling it which portion of the page it should display. The resource in this case is represented by http://www.contoso.com/products/new/list.htm. Because the fragment is not part of the URI, its contents are not used to determine the resource that’s obtained if the resource can be resolved.
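The split between the resource and the fragment can be sketched with System.Uri as follows. FragmentDemo and its two helpers are illustrative names; GetLeftPart(UriPartial.Path) is used here because the sample URI has no query, and it would drop a query as well as the fragment.

```csharp
using System;

public static class FragmentDemo
{
    // Returns the scheme, authority, and path, leaving out any query or fragment.
    public static string WithoutFragment(string text)
    {
        return new Uri(text).GetLeftPart(UriPartial.Path);
    }

    // Returns the fragment, including the leading '#', or an empty string.
    public static string FragmentOf(string text)
    {
        return new Uri(text).Fragment;
    }

    public static void Main()
    {
        string raw = "http://www.contoso.com/products/new/list.htm#newproducts";
        Console.WriteLine(WithoutFragment(raw)); // http://www.contoso.com/products/new/list.htm
        Console.WriteLine(FragmentOf(raw));      // #newproducts
    }
}
```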

 

URI Types

 

You’ve probably noticed by now that we’ve been careful to point out that not all resources represented by a URI can be resolved. In fact, there are two principal types of URIs:

 

  1. URLs and
  2. URNs.

 

The type that most people are familiar with is the Uniform Resource Locator (URL). A URL is a subset of a URI that identifies a resource by indicating how the resource can be accessed. For example, most of the URIs displayed thus far in this chapter fit into the URL category because you can use them to access a resource. However, URIs can also be used to name things without necessarily describing how they are accessed. In the case where the URI defines a name, it’s called a Uniform Resource Name (URN). The benefit of having a URN is that it can be used to identify a resource even after the resource ceases to exist or becomes unavailable. For example, the URI urn:people:santaclaus could be used to name Santa Claus, but it does not give us the information necessary to locate him.

The one other type distinction that’s important to understand for URIs is that of an absolute URI versus a relative URI. An absolute URI contains a scheme and a scheme-specific part. So far in this chapter, we’ve mostly been talking about absolute URIs. It’s also possible to have a relative URI, which is a URI reference that’s related to some base URI. A relative URI does not contain a scheme. In most cases, a relative URI contains only the path and query components. Because the scheme and the authority are not present, they must be known through some other means. For example, an application downloading an HTML page might find an absolute URI in the document and then find relative links within the embedded HTML. If the absolute URI is not defined in the document, the application might assume that the original URI for the downloaded HTML page is the base URI.
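Resolving a relative reference against a base URI might look like the following sketch, which uses the System.Uri constructor that takes a base Uri and a relative string. The RelativeDemo class, the Resolve helper, and the sample paths are assumptions made for illustration.

```csharp
using System;

public static class RelativeDemo
{
    // Resolves a relative reference against an absolute base URI.
    public static string Resolve(string baseText, string relativeText)
    {
        Uri baseUri = new Uri(baseText);
        Uri resolved = new Uri(baseUri, relativeText);
        return resolved.AbsoluteUri;
    }

    public static void Main()
    {
        string page = "http://www.contoso.com/products/list.htm";
        // Relative to the directory of the base document.
        Console.WriteLine(Resolve(page, "new/item.htm"));  // http://www.contoso.com/products/new/item.htm
        // ".." moves up one path segment.
        Console.WriteLine(Resolve(page, "../index.htm"));  // http://www.contoso.com/index.htm
        // A leading slash replaces the whole path but keeps scheme and authority.
        Console.WriteLine(Resolve(page, "/support?id=7")); // http://www.contoso.com/support?id=7
    }
}
```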

 

Working with System.Uri

 

System.Uri is the .NET Framework class used to represent a URI. Using the System.Uri class, you can validate, parse, combine, and compare URIs. You construct an instance of System.Uri by supplying a string representation of the URI in the constructor.

 

C#

 

try
{
    Uri uri = new Uri("http://www.contoso.com/list.htm#new");
    Console.WriteLine(uri.ToString());
}
catch (UriFormatException uex)
{
    Console.WriteLine(uex.ToString());
}

 

Visual Basic .NET

 

Try
    Dim uri As New Uri("http://www.contoso.com/list.htm#new")
    Console.WriteLine(uri.ToString)
Catch uex As UriFormatException
    Console.WriteLine(uex.ToString)
End Try

 

The code in this sample constructs a new URI instance and displays the URI to the console. Although constructing a URI is a very simple task, there are a number of things to consider when working with a URI.
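When malformed input is expected rather than exceptional, validation without a try/catch block may be preferable. The sketch below assumes a version of the .NET Framework later than 1.1 (Uri.TryCreate and UriKind were added in version 2.0); the SafeParse class and IsAbsoluteUri helper are hypothetical names.

```csharp
using System;

public static class SafeParse
{
    // Returns true when the text parses as an absolute URI.
    // Assumes .NET Framework 2.0 or later, where Uri.TryCreate is available.
    public static bool IsAbsoluteUri(string text)
    {
        Uri parsed;
        return Uri.TryCreate(text, UriKind.Absolute, out parsed);
    }

    public static void Main()
    {
        Console.WriteLine(IsAbsoluteUri("http://www.contoso.com/list.htm#new")); // True
        Console.WriteLine(IsAbsoluteUri("list.htm"));                            // False
    }
}
```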

 

Canonicalization

 

Canonicalization is the process of converting a URI into its simplest form. This process is important because there are multiple ways to express a URI in its raw or string form that ultimately canonicalize into the same URI. Consider the following example:

 

C#

 

try
{
    // Note the raw URIs are all different
    Uri uriOne = new Uri("http://www.contoso.com/Prod list.htm");
    Uri uriTwo = new Uri("http://www.contoso.com:80/Prod%20list.htm");
    Uri uriThree = new Uri("http://www.contoso.com/Prod%20list.htm");

    // The canonical representation is the same for all three
    Console.WriteLine("uriOne = " + uriOne.ToString());
    Console.WriteLine("uriTwo = " + uriTwo.ToString());
    Console.WriteLine("uriThree = " + uriThree.ToString());
}
catch (UriFormatException uex)
{
    Console.WriteLine(uex.ToString());
}

 

 

Visual Basic .NET

 

Try
    ' Note the raw URIs are all different
    Dim uriOne As New Uri("http://www.contoso.com/Prod list.htm")
    Dim uriTwo As New Uri("http://www.contoso.com:80/Prod%20list.htm")
    Dim uriThree As New Uri("http://www.contoso.com/Prod%20list.htm")

    ' The canonical representation is the same for all three
    Console.WriteLine("uriOne = " + uriOne.ToString())
    Console.WriteLine("uriTwo = " + uriTwo.ToString())
    Console.WriteLine("uriThree = " + uriThree.ToString())
Catch uex As UriFormatException
    Console.WriteLine(uex.ToString)
End Try

 

In this example, the space can be either a literal value or an “escaped” form. Also, the :80 port number can be excluded in the canonical form because it’s the default port for the scheme for this URI. The canonical representation of a URI can be obtained by calling the ToString() method of System.Uri.
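It can help to see the escaped form next to the raw input. The following sketch prints the AbsoluteUri property, which returns the escaped canonical string, for each of the three raw URIs. The CanonicalDemo class and Canonical helper are illustrative names, and the OriginalString property used in Main assumes .NET Framework 2.0 or later.

```csharp
using System;

public static class CanonicalDemo
{
    // Returns the escaped canonical form of the URI text.
    public static string Canonical(string text)
    {
        return new Uri(text).AbsoluteUri;
    }

    public static void Main()
    {
        string[] raw =
        {
            "http://www.contoso.com/Prod list.htm",
            "http://www.contoso.com:80/Prod%20list.htm",
            "http://www.contoso.com/Prod%20list.htm"
        };

        foreach (string text in raw)
        {
            Uri uri = new Uri(text);
            // OriginalString (assumed available, 2.0+) echoes the raw input;
            // AbsoluteUri is http://www.contoso.com/Prod%20list.htm each time.
            Console.WriteLine(uri.OriginalString + " -> " + uri.AbsoluteUri);
        }
    }
}
```

ToString() returns a display-oriented form in which the space appears unescaped, so AbsoluteUri is usually the safer choice when the string will be sent over the network.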

 

Comparing URIs

 

It’s often useful to compare two URIs. However, it’s important to understand that System.Uri compares URIs in their canonical form instead of in their raw form. Consider the following example:

 

C#

 

try
{
    // Note the raw URIs are different
    Uri uriOne = new Uri("http://www.contoso.com/Prod list.htm");
    Uri uriTwo = new Uri("http://www.contoso.com:80/Prod%20list.htm");

    // Comparison is based on the canonical representation
    // so uriOne and uriTwo will be equal.
    Console.WriteLine(uriOne.Equals(uriTwo));
}
catch (UriFormatException uex)
{
    Console.WriteLine(uex.ToString());
}

 

Visual Basic .NET

 

Try
    ' Note the raw URIs are different
    Dim uriOne As New Uri("http://www.contoso.com/Prod list.htm")
    Dim uriTwo As New Uri("http://www.contoso.com:80/Prod%20list.htm")

    ' Comparison is based on the canonical representation
    ' so uriOne and uriTwo will be equal.
    Console.WriteLine(uriOne.Equals(uriTwo))
Catch uex As UriFormatException
    Console.WriteLine(uex.ToString)
End Try

 

Another interesting point to note is that because the fragment isn’t considered part of the URI, it’s omitted from the URI comparison in System.Uri. For example, a comparison of http://www.contoso.com/Prodlist.htm and http://www.contoso.com/Prodlist.htm#newItems will return true because #newItems is ignored.
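If your application does need to distinguish two URIs by fragment, one possible approach is to compare the Fragment property in addition to calling Equals. The sketch below is a minimal illustration; CompareDemo, SameResource, and SameIncludingFragment are hypothetical names.

```csharp
using System;

public static class CompareDemo
{
    // Compares the canonical forms; System.Uri ignores the fragment here.
    public static bool SameResource(string a, string b)
    {
        return new Uri(a).Equals(new Uri(b));
    }

    // Adds an explicit fragment check on top of the canonical comparison.
    public static bool SameIncludingFragment(string a, string b)
    {
        Uri left = new Uri(a);
        Uri right = new Uri(b);
        return left.Equals(right) && left.Fragment == right.Fragment;
    }

    public static void Main()
    {
        string plain = "http://www.contoso.com/Prodlist.htm";
        string withFragment = "http://www.contoso.com/Prodlist.htm#newItems";
        Console.WriteLine(SameResource(plain, withFragment));          // True
        Console.WriteLine(SameIncludingFragment(plain, withFragment)); // False
    }
}
```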

 

Working with Schemes

 

As described earlier in this chapter, the scheme part of a URI is the element at the beginning of the URI that defines how the URI can be parsed and, in the case of a URL, resolved. Most schemes define a scheme-specific part that follows the general guidelines listed earlier in this chapter of having an authority, a path, and (potentially) a query component. However, schemes are not required to follow this pattern. In fact, some schemes define their own logic that does not correspond to these common parts. For example, consider the following URIs:

 

http://www.contoso.com/Prodlist.htm

mailto:cdo@contoso.com?meg=kate

 

The first URI is an example of the HTTP scheme. It defines an authority (www.contoso.com) and a path (Prodlist.htm). The second URI is an example of the MAILTO scheme. MAILTO does not define authority and path components. Rather, it defines a to component and a headers component. In this example, the to value is cdo@contoso.com, the header name is meg, and the header value is kate.

In general, System.Uri will simply look for the colon to parse the scheme from the scheme-specific part. There is one exception to this rule that developers should understand. Because it’s common for URIs of the file: scheme to be entered without the scheme, as in c:\test\test.htm, System.Uri supports the automatic conversion of local paths (c:\test\test.htm) to file: scheme URIs (file:///c:/test/test.htm). So, if you have a single character scheme, the System.Uri class will treat it as a file: scheme.
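The automatic conversion of a local path can be observed with a sketch like the one below. The LocalPathDemo class and SchemeOf helper are illustrative names. The example assumes a Windows-style drive path, and the exact strings printed may vary by framework version and platform.

```csharp
using System;

public static class LocalPathDemo
{
    // Returns the scheme that System.Uri assigns to the given text.
    public static string SchemeOf(string text)
    {
        return new Uri(text).Scheme;
    }

    public static void Main()
    {
        // A drive-letter path is entered without any scheme.
        Uri uri = new Uri(@"c:\test\test.htm");
        Console.WriteLine(uri.Scheme);      // file
        Console.WriteLine(uri.IsFile);
        Console.WriteLine(uri.AbsoluteUri); // the file: form of the path
    }
}
```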

System.Uri has an in-depth understanding of a number of the most commonly used schemes so that it can take these special cases into account. The following list represents the schemes understood by System.Uri in version 1.1 of the .NET Framework:

 

  1. FILE
  2. HTTP
  3. HTTPS
  4. FTP
  5. GOPHER
  6. MAILTO
  7. NEWS
  8. NNTP
  9. UUID
  10. TELNET
  11. LDAP
  12. SOAP
  13. Many more…

 

Although this list is expected to grow over time, the fact that schemes can be defined at any time ensures that there will be cases in which System.Uri encounters a scheme that it does not recognize. In those cases, System.Uri will fall back to using parsing logic based on the general URI components described at the beginning of this chapter. If that URI scheme follows these general component recommendations, the URI will parse just fine. However, if that unknown scheme has defined its own scheme-specific part that does not follow the common pattern, such as with MAILTO, System.Uri does not have a way of knowing how to parse out the components and will throw a UriFormatException if it can’t map the scheme into the common pattern. Consider the following example:

 

C#

 

try
{
    Console.WriteLine("Unknown scheme that follows the general pattern");
    Uri uriUnknown = new Uri("unknown://authority/path?query");
    Console.WriteLine("scheme:" + uriUnknown.Scheme);
    Console.WriteLine("authority:" + uriUnknown.Authority);
    Console.WriteLine("path and query:" + uriUnknown.PathAndQuery);

    Console.WriteLine();

    Console.WriteLine("Unknown scheme that uses a custom pattern");
    Uri uriUnknownCustom = new Uri("unknown:path.authority.query");
    Console.WriteLine("scheme:" + uriUnknownCustom.Scheme);
    Console.WriteLine("authority:" + uriUnknownCustom.Authority);
    Console.WriteLine("path and query:" + uriUnknownCustom.PathAndQuery);
}
catch (UriFormatException uex)
{
    Console.WriteLine(uex.ToString());
}

 

Visual Basic .NET

 

Try
    Console.WriteLine("Unknown scheme that follows the general pattern")
    Dim uriUnknown As New Uri("unknown://authority/path?query")

    Console.WriteLine("scheme:" + uriUnknown.Scheme)
    Console.WriteLine("authority:" + uriUnknown.Authority)
    Console.WriteLine("path and query:" + uriUnknown.PathAndQuery)

    Console.WriteLine()

    Console.WriteLine("Unknown scheme that uses a custom pattern")
    Dim uriUnknownCustom As New Uri("unknown:path.authority.query")

    Console.WriteLine("scheme:" + uriUnknownCustom.Scheme)
    Console.WriteLine("authority:" + uriUnknownCustom.Authority)
    Console.WriteLine("path and query:" + uriUnknownCustom.PathAndQuery)
Catch uex As UriFormatException
    Console.WriteLine(uex.ToString)
End Try

 

This sample outputs the following to the console:

 

 

 

Unknown scheme that follows the general pattern

scheme:unknown

authority:authority

path and query:/path?query

 

Unknown scheme that uses a custom pattern

scheme:unknown

authority:

path and query:path.authority.query

 

 

In this sample, System.Uri is able to correctly parse the unknown scheme that uses the general scheme pattern. However, in the case where the scheme-specific part is based on a custom pattern, the authority is not parsed because the logic for parsing the component parts is not defined.

In version 1.1 of the .NET Framework, there’s no way to specify custom parsing logic so that a URI scheme that does not follow the general URI pattern and is not known by System.Uri can “plug in” and provide its own parsing implementation. This lack of support for custom URI scheme parsing is expected to change in the next major release of the .NET Framework. In general, if you have to create a new scheme, it’s best to follow the general component syntax of scheme://authority/path?query because most URI parsing libraries will understand the scheme.

 

Parsing Host Names

 

Although the concept of a host is not explicitly defined as part of the URI, a host name is often referenced as the authority portion of the URI. Therefore, you should consider the following points when dealing with host names in System.Uri, especially in the case of HTTP URIs:

 

  1. System.Uri supports a fully qualified DNS name (FQDN), an IP address, or a machine name as the host name.
  2. System.Uri always converts the host name to lowercase characters as part of parsing the URI.
  3. Internet Protocol version 6 (IPv6) addresses should be entered inside square brackets for URI construction, for example, http://[::1]/path.
  4. Internet Protocol version 4 (IPv4) addresses can be entered in their conventional dot-separated format, for example, http://127.0.0.1.
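The points above can be checked with a short sketch that reads the Host and HostNameType properties. The HostDemo class and its helpers are hypothetical names, and the machine name myserver is a made-up example.

```csharp
using System;

public static class HostDemo
{
    // Returns the kind of host name that System.Uri detected.
    public static UriHostNameType KindOf(string text)
    {
        return new Uri(text).HostNameType;
    }

    // Returns the host name as System.Uri reports it after parsing.
    public static string HostOf(string text)
    {
        return new Uri(text).Host;
    }

    public static void Main()
    {
        Console.WriteLine(HostOf("http://WWW.Contoso.COM/"));  // www.contoso.com
        Console.WriteLine(KindOf("http://www.contoso.com/"));  // Dns
        Console.WriteLine(KindOf("http://myserver/"));         // Dns
        Console.WriteLine(KindOf("http://[::1]/path"));        // IPv6
        Console.WriteLine(KindOf("http://127.0.0.1"));         // IPv4
    }
}
```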

 

Using System.Uri in Your Application

 

In the course of developing your application or classes that work with a URI, there are a few key guidelines that you should follow to ensure that your application gets the best performance and security. The System.Uri type is extremely valuable when it comes to parsing and validating a URI. However, this functionality comes at a cost and there are cases when it should not be used.

 

When to Use System.Uri

 

Consider using System.Uri in your application whenever you need to parse or validate a URI in any way. Like many things in development, a URI can appear to be very simple on the surface but turn out to be quite complex when you begin to consider all the cases. We’ve seen that developers who avoid the temptation to write a “simple URI parser” are often rewarded later on in the development cycle when application logic or input assumptions change, causing the simple logic to become much more complex.

A significant cost is associated with constructing an instance of System.Uri when compared to that of simply creating a string. Most of this cost is because of the URI validation logic. Because of this cost, it’s important that methods in your application that parse a URI and then pass the URI on to any other method should always pass the URI as a System.Uri type rather than as a string. This way, you avoid a scenario where the URI is parsed multiple times as it gets converted from a string to a Uri instance, back to a string, and back to a Uri instance as it moves through the call stack.
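One way to follow this guideline is to parse the string once at the boundary of your code and accept System.Uri parameters everywhere below it. The following is only a sketch; the Pipeline class and the ParseOnce, IsSecure, and HostOf methods are hypothetical names chosen for illustration.

```csharp
using System;

public static class Pipeline
{
    // The only place where the raw string is parsed and validated.
    public static Uri ParseOnce(string text)
    {
        return new Uri(text);
    }

    // Downstream methods take a Uri, so no re-parsing is needed.
    public static bool IsSecure(Uri uri)
    {
        return uri.Scheme == Uri.UriSchemeHttps;
    }

    public static string HostOf(Uri uri)
    {
        return uri.Host;
    }

    public static void Main()
    {
        Uri uri = ParseOnce("https://www.contoso.com/products");
        Console.WriteLine(IsSecure(uri)); // True
        Console.WriteLine(HostOf(uri));   // www.contoso.com
    }
}
```

Had IsSecure and HostOf each accepted a string instead, the same URI would have been validated three times as it moved through the calls.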

 

When Not to Use System.Uri

 

Because of the cost of construction, System.Uri should not be used if you never intend to parse or validate the URI being represented. In these cases, a String type should be used to contain the URI.

In version 1.1 of the .NET Framework, System.Uri derives from the MarshalByRefObject class, which means that passing a System.Uri object as a parameter in a remote call will cause the Uri instance to be passed by reference rather than by value. Passing the Uri instance by reference can lead to unintended consequences, such as an extremely high performance cost when accessing properties that you think are local but that are really being remoted to another application, possibly on another machine. The fact that System.Uri derives from MarshalByRefObject can also lead to a security or functional issue in your code if the application is making decisions based on an assumption that the Uri instance is immutable. For these reasons, you should avoid passing a System.Uri instance as part of the signature in a remote procedure call. In cases where you must pass a URI in a remote procedure call, consider passing the URI as a String rather than as an instance of System.Uri.

It is anticipated that in the next major release of the .NET Framework, MarshalByRefObject will be removed from the System.Uri signature so that it is always passed in remote calls by value rather than by reference. Program examples related to URIs in C++, C#, and VB .NET follow in the next part.

 

 

 


 
