AppSec Blog

The C14N challenge

Failing to properly validate input data is behind at least half of all application security problems. To validate input data properly, you first have to make sure that all of it is in the same standard, simple, consistent format: a canonical form. This is necessary because of all the wonderful flexibility in internationalization, data formatting and encoding that modern platforms, and especially the Web, offer. These are wonderful capabilities that attackers can take advantage of to hide malicious code inside data in all sorts of sneaky ways.

Canonicalization is a conceptually simple idea: take all data inputs and convert them into a single, simple, consistent normalized internal format before you do anything else with them. But how exactly do you do this, and how do you know that it has been done properly? What are the steps that programmers need to take to properly canonicalize data? And how do you test for it? This is where things get fuzzy as hell.

First, it doesn't help that if you google "security canonicalization" the top hits are about file naming problems. Sure, file naming is a canonicalization problem, but it is not the root of canonicalization problems. If you dig deeper, you'll find that the explanations of canonicalization don't actually explain how to do it.
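
For what it's worth, the file naming case is the one where the how-to is fairly well understood: resolve the path to its absolute, symlink-free form first, and only then check it against the directory you meant to allow. Here is a minimal Python sketch of that idea. The function name resolve_inside and the example paths are made up for illustration and are not taken from any of the sources discussed in this post.

```python
import os


def resolve_inside(base_dir: str, user_path: str) -> str:
    """Canonicalize user_path and confirm it stays under base_dir.

    Hypothetical helper: returns the canonical absolute path, or raises
    ValueError if the canonical form falls outside base_dir.
    """
    # Canonicalize the allowed directory too, so both sides of the
    # comparison are in the same form (absolute, no "..", no symlinks).
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, user_path))

    # Validate only AFTER canonicalization. A check on the raw string
    # (for example, searching for "..") can be sidestepped by symlinks
    # or by absolute paths.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the allowed directory")
    return candidate
```

With this sketch, a request for "reports/q1.txt" resolves to a path under the base directory, while "../../etc/passwd" or an absolute "/etc/passwd" raises ValueError.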

The Official (ISC)2 Guide to the CSSLP explains that

"It is recommended to decode once and canonicalize input into the internal representation before performing validation to ensure that validation is not circumvented".

But it doesn't point you to resources that show how to do this.
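
As a rough illustration of what "decode once, canonicalize, then validate" could look like in practice, here is a minimal Python sketch. It is not from the CSSLP guide. The function names and the allow-list pattern are hypothetical, and the sketch assumes the raw value has not already been percent-decoded by a web framework (most frameworks decode once for you, and decoding again would itself be a double-decoding bug).

```python
import re
from urllib.parse import unquote

# Hypothetical allow-list for a username field; a real field would have
# its own rules.
ALLOWED_USERNAME = re.compile(r"[A-Za-z0-9_]{1,32}")


def decode_once(raw: str) -> str:
    """Percent-decode exactly one time, failing on invalid UTF-8."""
    # errors="strict" makes malformed byte sequences raise
    # UnicodeDecodeError (a ValueError subclass) instead of being
    # silently replaced.
    return unquote(raw, encoding="utf-8", errors="strict")


def validate_username(raw: str) -> str:
    """Validate the canonical form, never the raw input."""
    canonical = decode_once(raw)
    if not ALLOWED_USERNAME.fullmatch(canonical):
        raise ValueError("input rejected after canonicalization")
    return canonical
```

Because the check runs on the decoded value, "%3Cscript%3E" is rejected as "<script>". A double-encoded "%253Cscript%253E" is rejected too, since one round of decoding still leaves "%" characters that the allow-list does not permit.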

Cigital's Build Security In portal, which has a lot of useful information on software security issues, isn't helpful in this case either. Stack Overflow explains what canonicalization is and why it's important, but not how to do it.

OWASP's summary of canonicalization, locale and Unicode issues in app security describes the issues well, especially from an attack perspective. To protect your app from Unicode issues, it tells developers:

A suitable canonical form should be chosen and all user input canonicalized into that form before any authorization decisions are performed. Security checks should be carried out after UTF-8 decoding is completed. Moreover, it is recommended to check that the UTF-8 encoding is a valid canonical encoding for the symbol it represents.

Um, ok, that sounds good.

To handle different input formats:

Determine your application's needs, and set both the asserted language locale and character set appropriately.

Sure.... And then:

Assert the correct locale and character set for your application. Use HTML entities, URL encoding, and so on to prevent Unicode characters being treated improperly by the many divergent browser, server, and application combinations. Test your code and overall solution extensively.

Good advice, especially that "and so on" part. But what is the programmer actually supposed to do, and how are they supposed to do it? Where's the sample code?
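
For the sake of argument, here is one way to read the OWASP advice in code. This is a hedged Python sketch, not something taken from OWASP. It decodes the bytes as strict UTF-8, which in Python rejects malformed and overlong sequences, and then folds the text into a single Unicode normalization form before any checks run. The choice of NFKC is an assumption (some applications will want NFC instead), and the function name is made up.

```python
import unicodedata


def to_canonical_text(raw: bytes) -> str:
    """Turn raw request bytes into one canonical Unicode string.

    Hypothetical helper. Raises UnicodeDecodeError if the bytes are not
    valid UTF-8, which includes overlong (non-shortest-form) sequences
    such as b"\\xc0\\xaf" for "/".
    """
    # Step 1: strict decoding. Do not use errors="ignore" or "replace"
    # here, because silently dropping bytes changes what gets validated.
    text = raw.decode("utf-8", errors="strict")

    # Step 2: normalize so that visually equivalent inputs (fullwidth
    # letters, combining accents) collapse to one representation.
    # NFKC is an assumption made for this sketch.
    return unicodedata.normalize("NFKC", text)
```

Security checks would then run on the returned string only. For example, a fullwidth solidus (U+FF0F) comes out as a plain "/", so a later check for path separators sees it.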

Microsoft's Michael Howard and David LeBlanc almost explain the whole canonicalization problem in this sample chapter from Writing Secure Code, at least for developers working on Windows apps in .NET. I am sure that the rest of the explanation is somewhere in the book, but there aren't a lot of developers who are going to read the whole book to find out.

OWASP's ESAPI documentation explains what canonicalization is all about:

Canonicalization is simply the operation of reducing a possibly encoded string down to its simplest form. This is important, because attackers frequently use encoding to change their input in a way that will bypass validation filters, but still be interpreted properly by the target of the attack.

This starts off simple. But the authors then go on to scare you sleepless with descriptions of nasty attacks based on double encoding, double encoding with multiple schemes, and nested encoding (with multiple schemes). Then they say that canonicalization is incredibly tricky, that you may have to deal with over 100 different character encodings, and with higher-level encodings such as percent-encoding, HTML-entity encoding and bbcode (whatever that is). And so you really should use ESAPI to take care of it. Which you probably should.
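
To make the double and mixed encoding cases a little less abstract, here is a small Python sketch of the detection idea: decode once, then see whether the result would decode again, or whether more than one scheme was needed. It is loosely modelled on the behaviour the ESAPI documentation describes, but it is not ESAPI's code. It covers only two schemes (percent-encoding and HTML entities), and the function names are invented for this example.

```python
import html
from urllib.parse import unquote


def _decode_pass(value: str) -> tuple[str, set[str]]:
    """Apply each decoder once and report which ones changed the value."""
    used = set()
    after_percent = unquote(value, encoding="utf-8", errors="strict")
    if after_percent != value:
        used.add("percent")
    after_entity = html.unescape(after_percent)
    if after_entity != after_percent:
        used.add("html-entity")
    return after_entity, used


def canonicalize_strict(value: str) -> str:
    """Decode one layer of encoding; reject nested or mixed encoding.

    Hypothetical helper. It is deliberately strict, so it will also
    reject some harmless inputs that happen to mix both schemes.
    """
    first, schemes = _decode_pass(value)

    # If the decoded value can be decoded again, the input was encoded
    # more than once (for example "%253C" -> "%3C" -> "<").
    _, schemes_second_pass = _decode_pass(first)
    if schemes_second_pass:
        raise ValueError("multiple (nested) encoding detected")

    # If both decoders had to fire, the input mixed schemes
    # (for example "%26lt%3B" -> "&lt;" -> "<").
    if len(schemes) > 1:
        raise ValueError("mixed encoding detected")
    return first
```

In this sketch a singly encoded "%3Cscript%3E" comes back as "<script>", ready for validation, while "%253Cscript%253E" and "%26lt%3Bscript%26gt%3B" both raise ValueError. Real input passes through many more encodings than these two, which is the argument for using a maintained library rather than rolling your own.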

If you are going to use ESAPI to take care of the problem, you can get an idea of where to start courtesy of the Security Ninja. But if you're already using other frameworks that may or may not take care of canonicalization (how would you know?), or have a lot of legacy code that you need to maintain, and you want to understand the problem and maybe try to take care of it yourself, and you don't want to read through the ESAPI source code to figure out how they did it, where do you go for answers? Maybe I haven't looked hard enough, but I haven't found them yet.

Canonicalization is far too important a problem to leave programmers to figure out and solve on their own. We need experts at SANS and OWASP to step in and provide clear, simple, actionable guidance to developers on how to properly handle input canonicalization on the different major development platforms. Telling developers that a problem is fundamental and critical, and then not explaining how to deal with it properly (without rewriting the code to use something like ESAPI), just isn't good enough.
