PHP 7: Introducing a domain name validator and making the URL validator stricter

DNS comes with a set of rules defining valid domain names. A domain name cannot exceed 255 octets (RFC 1034) and each label cannot exceed 63 octets (RFC 1035). It can contain any character (RFC 2181), but extra rules apply to hostnames (A and MX records, data of SOA and NS records): only alphanumeric ASCII characters and hyphens are allowed in labels (we’ll talk about IDNs at the end of this post), and labels can neither start nor end with a hyphen.
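To make these rules concrete, here is a minimal userland sketch of a hostname check. The function name `is_valid_hostname_sketch` is hypothetical and not part of PHP. It assumes the name is given in dotted text form, where the 255-octet limit corresponds to at most 253 characters, and it ignores a single trailing dot.

```php
<?php

/**
 * Hypothetical helper (not part of PHP): checks a name against the
 * hostname rules described above.
 */
function is_valid_hostname_sketch(string $name): bool
{
    // Ignore a single trailing dot (fully qualified form)
    if (substr($name, -1) === '.') {
        $name = substr($name, 0, -1);
    }

    // 253 characters in dotted form match the 255-octet limit on the wire
    if ($name === '' || strlen($name) > 253) {
        return false;
    }

    foreach (explode('.', $name) as $label) {
        // Each label must be 1 to 63 octets long
        if ($label === '' || strlen($label) > 63) {
            return false;
        }

        // ASCII letters, digits and hyphens only, no hyphen at either end
        if (!preg_match('/\A[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\z/', $label)) {
            return false;
        }
    }

    return true;
}

var_dump(is_valid_hostname_sketch('bc.com'));
// bool(true)

var_dump(is_valid_hostname_sketch('a-.bc.com'));
// bool(false)
```

This is only an illustration of the rules, not a replacement for a proper validator.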

Until now, PHP had no filter to check that a given string is a valid domain name (or hostname). Worse, FILTER_VALIDATE_URL did not fully enforce domain name validity (which is mandatory for schemes such as http and https) and accepted invalid URLs. FILTER_VALIDATE_URL also lacked IPv6 host support.

<?php

// PHP 5.6.3

// Label ends with a hyphen
var_dump(filter_var('http://a-.bc.com', FILTER_VALIDATE_URL));
// string(16) "http://a-.bc.com"

// Label is more than 63 octets
var_dump(filter_var('http://toolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolong.com', FILTER_VALIDATE_URL));
// string(81) "http://toolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolong.com"

// Lack of IPv6 support
var_dump(filter_var('http://[2001:0db8:0000:85a3:0000:0000:ac1f:8001]', FILTER_VALIDATE_URL));
// bool(false)

These limitations will be fixed in PHP 7. I’ve introduced a new FILTER_VALIDATE_DOMAIN filter checking domain name and hostname validity. This new filter is now used internally by the URL validator. I also added IPv6 host support in URL validation:

<?php

// PHP 7.0.0-dev

// Validate a domain name
var_dump(filter_var('mandrill._domainkey.mailchimp.com', FILTER_VALIDATE_DOMAIN));
// string(33) "mandrill._domainkey.mailchimp.com"

// Validate a hostname (here, the underscore is invalid)
var_dump(filter_var('mandrill._domainkey.mailchimp.com', FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME));
// bool(false)

// Label ends with a hyphen
var_dump(filter_var('http://a-.bc.com', FILTER_VALIDATE_URL));
// bool(false)

// Label is more than 63 octets
var_dump(filter_var('http://toolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolongtoolong.com', FILTER_VALIDATE_URL));
// bool(false)

// IPv6 hosts are now supported
var_dump(filter_var('http://[2001:0db8:0000:85a3:0000:0000:ac1f:8001]', FILTER_VALIDATE_URL));
// string(48) "http://[2001:0db8:0000:85a3:0000:0000:ac1f:8001]"
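The new filter can also be applied to the host of a URL by hand, which is roughly the kind of check the URL validator now performs for you. The sketch below assumes PHP 7 or later. The helper name `url_host_is_valid_hostname` is hypothetical.

```php
<?php

/**
 * Hypothetical helper: extracts the host of a URL and checks it against the
 * hostname rules using FILTER_VALIDATE_DOMAIN with FILTER_FLAG_HOSTNAME.
 */
function url_host_is_valid_hostname(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host component, or URL too malformed to parse
    }

    return filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) !== false;
}

var_dump(url_host_is_valid_hostname('http://bc.com/index.html'));
// bool(true)

var_dump(url_host_is_valid_hostname('http://a-.bc.com'));
// bool(false)
```

A bracketed IPv6 host is not a hostname, so this helper rejects it. Such hosts need separate handling.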

There is still a big gap in PHP’s handling of domain names and URLs: internationalized domain names are not supported at all in the core. I’ve already blogged about a userland workaround, but as IDNs become more and more popular, core support in PHP’s streams and validation is necessary (for instance, almost all French registrars support them, and even TLDs, such as the Chinese one, are available in the wild in a non-ASCII form). I’ve started a patch enabling IDN support in PHP’s streams. It works on Unix but still lacks a Windows build system. As it requires making ICU a dependency of PHP, I’ll publish a PHP RFC on this topic soon!

Internationalized Domain Name (IDN) and PHP

Currently, PHP doesn’t have any native support for IDNs: domain names containing non-ASCII characters, such as http://www.académie-française.fr.

If you try to connect to such a site, you’ll get nothing but an error:

file_get_contents('http://académie-française.fr');
// PHP Warning:  file_get_contents(): php_network_getaddresses: getaddrinfo failed: nodename nor servname provided, or not known in php shell code on line 1

RFC 3490 specifies that applications must convert an IDN into a plain ASCII representation called Punycode before making DNS and HTTP requests. If we use this representation for our domain, it works:

file_get_contents('http://www.xn--acadmie-franaise-npb1a.fr');
// Connection is OK

Fortunately, the PHP INTL extension provides a function that converts an IDN to its Punycode representation: idn_to_ascii.

This function must be applied only to the domain part of a URL. It will not work if applied to the whole URL.
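As a small illustration, here is a sketch that calls idn_to_ascii on a bare host name. The wrapper name `host_to_ascii` is hypothetical. The variant is passed explicitly because its default value has changed across PHP versions, and the wrapper returns null when the INTL extension is missing or the conversion fails.

```php
<?php

/**
 * Hypothetical helper: converts a host name to its ASCII (Punycode) form.
 * Returns null when the intl extension is missing or the conversion fails.
 */
function host_to_ascii(string $host): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null; // intl extension not loaded
    }

    // Pass the variant explicitly: its default changed across PHP versions
    $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? null : $ascii;
}

// Host only: no scheme, no path
var_dump(host_to_ascii('académie-française.fr'));
```

With INTL loaded, this should print the xn-- form of the host used in the example above.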

Let’s leverage parse_url and PECL HTTP’s \http\Url to build an IDN URL converter that will work in all cases:

/**
 * Converts a URL with an IDN host to an ASCII one.
 *
 * @param string $url
 * @return string
 */
function url_with_idn_to_ascii($url)
{
    $parts = parse_url($url);
    $parts['host'] = idn_to_ascii($parts['host']);

    return (new \http\Url($parts))->toString();
}

Use it to access our IDN URL:

file_get_contents(url_with_idn_to_ascii('http://académie-française.fr'));
// Connection is OK

According to this Stack Overflow answer and RFC 6125, it should also work for HTTPS connections to IDNs. But after some searching, I wasn’t able to find any HTTPS-enabled IDN in the wild. If you’re aware of such a setup, please post a comment with the URL!

Work is in progress in the PHP language to get native IDN support in streams and URL validation. Maybe one day I’ll be able to natively connect to classy URLs like http://kévin.dunglas.fr.

In the meantime, you can use this function to access URLs containing IDNs. On the validation side, the Symfony URL validator has built-in IDN support.

Remember that this function requires the INTL and HTTP extensions to work.
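If the PECL HTTP extension is not available, the URL can also be put back together by hand from the parse_url parts. The following is a sketch under a few assumptions: the function name `url_with_idn_to_ascii_plain` is hypothetical, only absolute URLs with a scheme and a host are converted, and anything else is returned unchanged. INTL is still needed for hosts that contain non-ASCII characters.

```php
<?php

/**
 * Hypothetical variant of url_with_idn_to_ascii() that rebuilds the URL by
 * hand instead of relying on \http\Url.
 */
function url_with_idn_to_ascii_plain(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL with a host: nothing to convert
    }

    $host = $parts['host'];

    // Only call intl when the host contains non-ASCII bytes
    if (preg_match('/[^\x00-\x7F]/', $host)) {
        $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($ascii === false) {
            return $url; // conversion failed: leave the URL untouched
        }
        $host = $ascii;
    }

    // Reassemble the components in their original order
    $result = $parts['scheme'] . '://';
    if (isset($parts['user'])) {
        $result .= $parts['user'];
        if (isset($parts['pass'])) {
            $result .= ':' . $parts['pass'];
        }
        $result .= '@';
    }
    $result .= $host;
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }

    return $result;
}

var_dump(url_with_idn_to_ascii_plain('http://example.com:8080/a/b?x=1#frag'));
// string(36) "http://example.com:8080/a/b?x=1#frag"
```

URLs whose host is already ASCII pass through unchanged, so the function can be applied to any URL before handing it to file_get_contents.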