URL Parsing and Path Traversal

URL Parsing – Here Be Dragons

Working as a web developer means it pays to understand as much about how the internet works as possible. The majority of security vulnerabilities are arguably introduced by developers who do not understand conceptually how a given system actually operates – they code for the familiar ‘happy path’ of the most-used functionality, but fail to consider ‘edge cases’ arising from perfectly valid, but less well-known, functionality and options. It is therefore critical for developers and other technologists to appreciate how the Web actually works.

Despite the myriad of sophisticated technologies that have emerged in recent years, the Web remains fundamentally the same system it has been since it was first developed in the early 1990s – a system in which HTML documents and other web resources are hosted and made available by server systems, with clients requesting access to, updates of, or deletion of each resource over the HTTP protocol. Although the HTTP protocol, as well as HTML and other file formats, have themselves historically given rise to various vulnerabilities, a third class of vulnerability relates to the construction and parsing of the addresses within the system – the globally unique identifiers or ‘addresses’ for resources known as URIs.

 

URIs

A Uniform Resource Identifier (URI) is a string of characters that unambiguously identifies a particular resource. To guarantee uniformity, all URIs follow a predefined set of syntax rules. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols. The most common form of URI is the Uniform Resource Locator (URL), frequently referred to informally as a ‘web address’.

 

URLs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI) that is most commonly used to reference web pages (http), but is also used for file transfer (ftp), email (mailto), database access (JDBC), and many other applications. The term ‘URL’ does not refer to a formal partition of URI space; rather, URL is a useful but informal concept.

Most web browsers display the URL of a web page above the page in an address bar, although recently some browsers have elected to hide parts of the URL by default, only showing the full address when the address bar is clicked. A typical URL could have the form that we are all familiar with:

 

http://www.example.com/index.html

 

This indicates a scheme or protocol (http), a hostname (www.example.com), and a file name (index.html). Every HTTP URL conforms to the syntax of a generic URI. The URI generic syntax consists of a hierarchical sequence of five components:

 

scheme:[//[userinfo@]host[:port]]path[?query][#fragment]

 

You may not be familiar with all these elements since some are optional, and less frequently seen – only the scheme and path are mandatory. Applying this syntax to a standard URL containing all the elements, we get:

 

A URL broken down into its component parts
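
To make this concrete, here is a minimal sketch (not taken from the original article – an illustration using .NET’s built-in System.Uri class) showing how a typical parser decomposes a URL containing all of these components:

using System;

class UriComponents
{
    static void Main()
    {
        // A URL containing every component of the generic URI syntax
        var uri = new Uri("http://user:pass@www.example.com:8080/path/index.html?q=1#section");

        Console.WriteLine(uri.Scheme);       // http
        Console.WriteLine(uri.UserInfo);     // user:pass
        Console.WriteLine(uri.Host);         // www.example.com
        Console.WriteLine(uri.Port);         // 8080
        Console.WriteLine(uri.AbsolutePath); // /path/index.html
        Console.WriteLine(uri.Query);        // ?q=1
        Console.WriteLine(uri.Fragment);     // #section
    }
}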

 

Below is another URL – though probably not one similar to any you have seen before:

 

http://1.1.1.1 &@2.2.2.2:33/4.4.4.4?1.1.1.1# @3.3.3.3/

 

This is pretty difficult to get your head around initially, and code trying to make sense of it can likewise contain errors that fail to consider unusual edge cases – which is often where vulnerabilities are introduced. Before we look at these vulnerabilities in detail, we’ll take a look at how such URLs are interpreted by browsers, servers and other software.

 

URL Parsing

It is obviously fairly important that both client and server agree on which resource a given URL is intended to relate to, not least so that the client is able to access its intended resource target. They do this using:

  1. An agreed and standard syntax for URLs; and
  2. A “parser” that is able to reliably, consistently and uniquely identify a resource from a given URL

 

A parser is a software component that takes input (in this case URLs) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input while checking for correct syntax. We won’t go into too much detail here, but typically this involves a first stage of token generation or lexical analysis, in which the input character stream is split into meaningful symbols defined by a grammar of regular expressions; a second stage of parsing or syntactic analysis, which checks that the tokens form an allowable expression; and a third and final phase of semantic parsing or analysis, which works out the implications of the expression just validated and takes the appropriate action.

 

URL Parsing Issues

Due to complexity in parser implementations and minor grey areas in specifications, it turns out that not all URL parsers parse URLs in the same way. There are examples where individual parsers:

  1. Misinterpret standards;
  2. Omit subtler structures within the URL; and
  3. Are confused by less common URL forms.

 

These issues are exacerbated by the fact that a web URL may be examined and parsed at several points after its submission by the client – on proxy servers, caches and CDNs during transit, and within in-house code, third-party components and separately developed subsystems on the destination server itself. If different systems interpret URLs differently, operations can end up being performed on resources other than those intended.

Vulnerabilities, such as authorisation bypasses in particular, can arise when one parser is used for the access (authorisation) check for resource access, and then a second, separate parser is used to fulfil the request for the data retrieval.
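
To illustrate how two parsers can disagree, consider the following minimal sketch (the NaiveHost helper is hypothetical, written for illustration only): a hand-rolled host extractor that assumes the host follows the last “@” reaches a different conclusion to .NET’s standards-based System.Uri parser:

using System;

class ParserMismatch
{
    // Hypothetical naive "parser": assumes the host is whatever follows the last '@'
    static string NaiveHost(string url)
    {
        string afterScheme = url.Substring(url.IndexOf("://") + 3);
        int at = afterScheme.LastIndexOf('@');
        string hostPart = at >= 0 ? afterScheme.Substring(at + 1) : afterScheme;
        return hostPart.Split('/')[0];
    }

    static void Main()
    {
        string url = "http://permitted.com#@127.0.0.1/";

        // A standards-based parser treats everything after '#' as the fragment
        Console.WriteLine(new Uri(url).Host); // permitted.com
        Console.WriteLine(NaiveHost(url));    // 127.0.0.1
    }
}

If the first parser performs the authorisation check and the second fetches the resource, an attacker can reach 127.0.0.1 while appearing to request permitted.com.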

In 2016 a flaw was discovered in PHP’s parse_url function. When processing a URL in a form such as http://permitted.com#@127.0.0.1/ it was discovered that the function returned the incorrect host. The fixed (patched) behaviour correctly returns permitted.com as the host, treating everything after the # as the fragment; the earlier, vulnerable code instead parsed the host as 127.0.0.1.

 

These examples are fairly unusual, however – the majority of vulnerabilities in URL parsing rely on the parser acting correctly, delivering functionality that is valid but which the developer has not considered. Let’s take a look at some common exploits relating to URL parsing.

 

Exploits

We are going to look at one main exploit type in this article – Path Traversal. Vulnerabilities in URL parsing can certainly enable other exploit types, but this is perhaps the most commonly seen, and lends itself well to a discussion of URL parsing.

 

Path Traversal

Path Traversal is a relatively simple and highly impactful vulnerability that exploits the relative-path traversal capability supported by most filesystems. It can be employed by an attacker to cause the system to read or write files outside of the intended path scope.

To see how this works, imagine a scenario where a developer is passing a parameter into a function in order to return a file from disk, asking the user to provide the filename to be returned. This might be a function that performs a task such as returning a report file from a given directory on disk.

The code for this looks like the below, taking the user input “report_id” and returning that report:

 

using System.IO;

public static byte[] getReport(string report_id) {
    // Combine the user-supplied report_id with the reports directory path
    string reports_dir = @"C:\app\generated\reports";
    string filepath = Path.Combine(reports_dir, report_id);
    return File.ReadAllBytes(filepath);
}

 

On the front-end, the expectation is for reports to be accessed by their ID, e.g.:

 

http://someapp.com/reports?report_id=report_1234

 

and the naïve expectation of the developer is that where the report ID does not match a file in the reports directory, an error will be thrown and no file returned. Indeed, they know that the attacker cannot retrieve files from other directories because they have ensured that all file-paths are built up using the prefix “C:\app\generated\reports”.

 

However, the developer in this instance has only considered plain filenames being passed in, whereas the File.ReadAllBytes function will dutifully process all file-paths, including those using ‘relative path’ directives. These are the character sequences “../” and “..\” that are used to indicate “go up one directory”.

 

An attacker can pass in a report_id argument such as

 

“..\..\..\win.ini”

 

and this will be concatenated by the code into:

 

“C:\app\generated\reports\..\..\..\win.ini”

 

which will resolve to and return the contents of file “C:\win.ini”, a file outside of the application’s directory structure that the user should not have access to.

In this instance, the developer has failed to consider all edge cases of user-provided input and to validate that the input provided is in line with expectations. It may seem trivial simply to add code to reject input that contains these special characters; however, attackers are able to bypass such checks with relative ease by using encoding schemes for the input – substituting the characters for character sequences that look different, but nevertheless refer to the same file or resource. Some examples are:

 

URL Encoding

Encoded      Decoded
%2e%2e%2f    ../
%2e%2e%5c    ..\
%255c        \ (double encoded)
%u2216       ∖ (non-standard)

 

UTF-8 Character Encoding

Encoded      Decoded
%cc%b7       ̷ (NON-SPACING SHORT SLASH OVERLAY)
%cc%b8       ̸ (NON-SPACING LONG SLASH OVERLAY)
%e2%81%84    ⁄ (FRACTION SLASH)
%e2%88%95    ∕ (DIVISION SLASH)
%ef%bc%8f    ／ (FULLWIDTH SLASH)

 

Unicode Encoding

Encoded      Decoded
%c0%af       / (overlong encoding)
%c1%1c       \ (overlong encoding)
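
To see why naive filtering fails against such payloads, consider this minimal sketch (hypothetical, for illustration): a check performed on the raw, still-encoded input passes, yet once the value is URL-decoded – as web servers and frameworks routinely do before using it – the traversal sequence reappears:

using System;

class EncodingBypass
{
    static void Main()
    {
        string raw = "%2e%2e%2f%2e%2e%2fwin.ini";

        // The naive filter inspects the raw input and finds nothing suspicious
        bool looksSafe = !raw.Contains("../"); // true – the check passes

        // But the value is decoded before it reaches the filesystem
        string decoded = Uri.UnescapeDataString(raw); // "../../win.ini"
        Console.WriteLine($"{looksSafe}: {decoded}");
    }
}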

 

Prevention & Mitigation

Validate Input

The primary form of protection against path traversal attacks is an awareness of all the different exploits outlined above, and defensive coding against them – primarily by validating user-provided input so that the malicious payloads listed do not work. To avoid path traversal, it is best practice first to resolve the path to an absolute path, and then to check that this falls within the expected path.

There are various pitfalls in logic and implementation in creating such protection, but one possible algorithm for preventing directory traversal would be to:

  1. Use a built-in path canonicalization function in your development language (such as realpath() in C) that produces the canonical version of the pathname – that is, it effectively removes “..” sequences and symbolic links, as well as resolving and normalizing all characters (e.g. %20 converted to spaces).
  2. Only after this canonicalisation has occurred, and an absolute, fully qualified and normalised path for the file has been established, check that the path to the file begins with the file-path matching the ‘Document Root’ – as sketched below.
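
A minimal sketch of this canonicalise-then-check approach in C#, reworking the earlier getReport example (the names and paths are assumptions carried over from that example, not a definitive implementation):

using System;
using System.IO;

public static class SafeReports
{
    private static readonly string ReportsDir =
        Path.GetFullPath(@"C:\app\generated\reports");

    public static byte[] getReport(string report_id)
    {
        // 1. Canonicalise: resolve to an absolute path with ".." sequences removed
        string fullPath = Path.GetFullPath(Path.Combine(ReportsDir, report_id));

        // 2. Check: the canonical path must still sit inside the reports directory
        if (!fullPath.StartsWith(ReportsDir + Path.DirectorySeparatorChar,
                                 StringComparison.OrdinalIgnoreCase))
        {
            throw new UnauthorizedAccessException("Invalid report_id");
        }

        return File.ReadAllBytes(fullPath);
    }
}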

 

Indexing or Indirect References

In some cases it is possible to generate pseudo-random indexes, tokens or codes that map to corresponding files on the server. When a file is requested, it is requested via this index ID rather than by taking a filepath from the user. The filepath exists only in the lookup table on the server, in reference to the corresponding ID, and hence is immune to user-controlled input. This approach can only be used when the set of acceptable objects, such as filenames on disk, is finite or limited and known.

For example, ID 1 could map to “inbox.txt” and ID 2 could map to “profile.txt”. The input from the user can be trivially validated (it is either a numeric ID with a matching record in the index table or not) and is immune to path traversal.
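
A minimal sketch of this pattern, using the example IDs above (the directory path is an assumption carried over from the earlier report example):

using System.Collections.Generic;
using System.IO;

public static class ReportIndex
{
    // The lookup table lives on the server; the user only ever supplies an ID
    private static readonly Dictionary<int, string> Reports =
        new Dictionary<int, string>
        {
            { 1, @"C:\app\generated\reports\inbox.txt" },
            { 2, @"C:\app\generated\reports\profile.txt" },
        };

    public static byte[] GetReport(int id)
    {
        if (!Reports.TryGetValue(id, out string path))
            throw new KeyNotFoundException("Unknown report ID");

        return File.ReadAllBytes(path);
    }
}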

 

Least Privilege

The account used by the application should enjoy the minimal privileges necessary with respect to the file system. Ideally, this should be limited to files within the legitimate function of the application and current user. For example, on Linux this may involve ensuring that the web server process runs under the www-data user rather than root, and has limited or no access to system files outside of the web root from which all web requests are served. This does not prevent path traversal attacks, but mitigates their potential impact should they occur.

 

Attack Surface Reduction

The converse of the above is that just as the web server should only be able to access the web root, the webroot itself should only contain files that the web-server needs to serve – specifically you should ensure that you store library, include, and utility files outside of the web document root, wherever possible.

 

Document Segregation

Segregating documents of different sensitivity onto separate file-servers or file partitions allows you to prevent mixing public documents and more sensitive material, and minimises the risk of data exfiltration in the event that a path traversal attack is successful.

If you are restricted to a single system, then it is possible to consider using a chroot jail. A chroot on Unix/Linux operating systems is an operation that changes the apparent root directory for a running process and its children. A program that is run in such a modified environment cannot normally access files outside the designated directory tree, so running a web-server inside a chroot jail ensures that it cannot access files outside that tree. Although escapes from chroot jails have been discovered in the past, a chroot does typically provide some additional segregation; however, it should not substitute for other preventative measures and is not a panacea.

 

How can AppCheck Help?

AppCheck helps you with providing assurance in your entire organisation’s security footprint. AppCheck performs comprehensive checks for a massive range of web application vulnerabilities from first principles to detect vulnerabilities – including path traversal issues – in in-house application code. AppCheck also draws on checks for known infrastructure vulnerabilities in vendor devices and code from a large database of known and published CVEs. The AppCheck Vulnerability Analysis Engine provides detailed rationale behind each finding, including a custom narrative to explain the detection methodology, verbose technical detail, and proof-of-concept evidence through safe exploitation.

 

About AppCheck

 

AppCheck is a software security vendor based in the UK that offers a leading security scanning platform automating the discovery of security flaws within organisations’ websites, applications, networks, and cloud infrastructure.

As always, if you require any more information on this topic or want to see what unexpected vulnerabilities AppCheck can pick up in your website and applications then please get in contact with us: info@appcheck-ng.com

 
