With PHP 8.4 on the horizon I'm working on a new major release for Kint. As usual working on a debugging tool like this, I keep finding weird and wonderful PHP edge cases.
In 8.4 we're getting a new Dom API that promises to be better than the last one, so I've been working on unit testing and standardizing Kint's handling of XML representations like Dom and SimpleXMLElement.
If you don't spend all your time working with XML you may not know the difference between these two, so let me draw an analogy: Dom is to Symfony as SimpleXML is to Laravel.
It's hardly a tortured analogy: Dom is verbose but robust, while SimpleXML uses magic to be easier to use, with some non-obvious behavior as a consequence.
Basic use
You can get children with property access, and attributes with array access. Text contents can be acquired by casting to string.
$xml = <<<END
<root attribute="text">
<contents>Body</contents>
<contents>Footer</contents>
<Something-else>WTF?</Something-else>
</root>
END;
$xml = new SimpleXMLElement($xml);
assert((string) $xml['attribute'] === 'text');
assert((string) $xml->contents === 'Body');
assert((string) $xml->contents[0] === 'Body');
assert((string) $xml->contents[1] === 'Footer');
assert((string) $xml->{'Something-else'} === 'WTF?');
In the happy path that's all there is to it and the story ends there. Unfortunately, there's a lot more unhappy when it comes to SimpleXML.
Strangeness
It seems the "Simple" in SimpleXML comes not from how easy it is to use but from how easy it was to implement. On whatever random PHP version my local git repo is at, simplexml.c clocks in at ~2700 LOC, while the 5 largest ones in DOMDocument are ~2300 each.
Pretty much all SimpleXMLElement operations result in another SimpleXMLElement (Even accessing a nonexistent node) and you only get concrete results when you cast it to something else. This is the cause of a laundry list of strange behavior.
You can see an example of it above: $xml->contents and $xml->contents[0] both lead to the "same element", which would be impossible without dynamic access hijinks.
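As a small sketch of that dual nature (the two-element document here is made up for illustration), the result of a property access casts like the first matching element but counts and iterates like the whole list of matches:

```php
<?php
$xml = new SimpleXMLElement('<root><contents>Body</contents><contents>Footer</contents></root>');

// Cast to string, the property access behaves like the first match.
assert((string) $xml->contents === (string) $xml->contents[0]);

// Counted or iterated, the very same expression behaves like a list.
assert(count($xml->contents) === 2);

$texts = [];
foreach ($xml->contents as $node) {
    $texts[] = (string) $node;
}
assert($texts === ['Body', 'Footer']);
```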
Cast to Dom
Under the hood both Dom and SimpleXML use libxml pointers, so you can "cast" them from one API to the other using the simplexml_import_dom, dom_import_simplexml and (new in 8.4) Dom\import_simplexml functions. (simplexml_import_dom works on both the old DOMDocument and the new Dom\Document API.)
One use for this is when you want to stick XML together. addChild lets you specify tag names and contents one at a time, but if you already have a SimpleXMLElement to add as a child you need to detour through Dom\Document::importNode().
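Here's a minimal sketch of that detour. It uses the older DOMDocument API via dom_import_simplexml so it also runs before 8.4; the appendSimpleXml helper and the sample documents are hypothetical names for illustration, not something from Kint:

```php
<?php
// Hypothetical helper: append a deep copy of $child to $parent.
function appendSimpleXml(SimpleXMLElement $parent, SimpleXMLElement $child): void
{
    $parentNode = dom_import_simplexml($parent);
    $childNode = dom_import_simplexml($child);

    // importNode() copies the node into the parent's document; true means deep copy.
    $copy = $parentNode->ownerDocument->importNode($childNode, true);
    $parentNode->appendChild($copy);
}

$parent = new SimpleXMLElement('<root><a>1</a></root>');
$child = new SimpleXMLElement('<b attr="x"><c>2</c></b>');
appendSimpleXml($parent, $child);

// The SimpleXML view sees the change because both APIs share the same tree.
assert((string) $parent->b->c === '2');
assert((string) $parent->b['attr'] === 'x');
```

The original $child is left untouched since importNode() works on a copy.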
Cast to string
When casting a SimpleXMLElement to string you don't get the textContent you expect. Instead it gives you all the text under the current node between existing elements, but not the text of their children.
In practice, this is completely useless, and makes it impossible to reliably get text mixed with nodes in SimpleXML:
$xml = new SimpleXMLElement('<root>Hello <div>cruel</div> World</root>');
assert((string) $xml === 'Hello  World'); // Two spaces: 'Hello ' and ' World' are simply concatenated
assert((string) $xml->div === 'cruel');
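If you do need the full text including descendants, one workaround (a sketch, assuming the dom extension is available) is to hop over to Dom and read textContent:

```php
<?php
$xml = new SimpleXMLElement('<root>Hello <div>cruel</div> World</root>');

// dom_import_simplexml() gives a DOMElement view of the same node.
$node = dom_import_simplexml($xml);

// textContent includes the text of all descendants, in document order.
assert($node->textContent === 'Hello cruel World');
```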
Cast to bool
SimpleXMLElement can be cast to bool. That alone is strange: it means you can't simply check whether a function returning ?SimpleXMLElement is truthy, you have to strictly compare it to null.
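To make that concrete, here's a sketch with a hypothetical parseOrNull helper (not part of Kint or PHP) that returns null when parsing fails. A successfully parsed but empty element is still falsy, so only the strict comparison tells the two cases apart:

```php
<?php
// Hypothetical helper: parse a string, returning null on malformed XML.
function parseOrNull(string $source): ?SimpleXMLElement
{
    // Collect libxml errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $parsed = simplexml_load_string($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return false === $parsed ? null : $parsed;
}

$empty = parseOrNull('<div />');
assert($empty !== null); // Parsing succeeded...
assert(!$empty);         // ...yet the result is falsy

assert(parseOrNull('<div') === null); // Malformed input really is null
```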
The criteria for what is true or false are also strange.
$test = new SimpleXMLElement('<div />');
assert(!$test);
$test = new SimpleXMLElement('<div></div>');
assert(!$test);
$test = new SimpleXMLElement('<div>test</div>');
assert($test);
$test = new SimpleXMLElement('<div test="val"></div>');
assert($test);
$test = new SimpleXMLElement('<div><div /></div>');
assert($test);
$test = new SimpleXMLElement('<div xmlns:x="http://localhost/"><x:div /></div>');
assert(!$test);
$test = new SimpleXMLElement('<div><div xmlns="http://localhost/" /></div>');
assert($test);
It seems to be true if it has any "Contents", meaning attributes, text, or child elements. But then it ignores namespace declaration attributes and namespace aliased tags, though not explicitly namespaced elements…
Perhaps this has something to do with:
Cast to array
Casting a SimpleXMLElement to array gives you up to 2 additional elements:
- @attributes is an array (supposedly) containing all attributes
- 0 is the (non-CDATA) text content of the element if it has no child elements
$test = new SimpleXMLElement('<div attribute="value">Text</div>');
$arr = (array) $test;
assert($arr['@attributes'] === ['attribute' => 'value']);
assert($arr[0] === 'Text');
$test = new SimpleXMLElement('<div>Text and <child /></div>');
$arr = (array) $test;
assert(!isset($arr['@attributes']));
assert(!isset($arr[0]));
assert($arr['child'] instanceof SimpleXMLElement);
$test = new SimpleXMLElement('<div><![CDATA[Text]]></div>');
assert((string) $test === 'Text');
assert((array) $test === []);
It also removes the magic 0 index inconsistency from overlapping element names:
$xml = <<<END
<parent>
<child>First</child>
<child>Second</child>
</parent>
END;
$xml = new SimpleXMLElement($xml);
assert((array) $xml->child === ['First', 'Second']);
The interesting overlap with the bool cast is that the array casts also ignore namespace declarations and namespace aliased tags:
$test = new SimpleXMLElement('<div xmlns:x="http://localhost/"><x:div /></div>');
$arr = (array) $test;
assert($arr === []);
$test = new SimpleXMLElement('<div><div xmlns="http://localhost/" /></div>');
$arr = (array) $test;
assert($arr['div'] instanceof SimpleXMLElement);
So how do we deal with namespaces in SimpleXML?
Handling namespaces
Refresher for those who forgot: XML namespaces can be set on the root node, in which case they apply to all children. They can be explicitly set on specific aliases that can be added to the tags, and they can be set explicitly per element.
The default alias '' leads to a namespace of null, which means we can get "Default" information from methods like children and attributes:
$xml = <<<END
<root xmlns:localhost="http://localhost/">
<tag attrib="base" localhost:attrib="namespaced">First</tag>
<tag xmlns="http://localhost/">Second</tag>
<localhost:tag>Third</localhost:tag>
</root>
END;
$xml = new SimpleXMLElement($xml);
assert($xml->getDocNamespaces() === ['localhost' => 'http://localhost/']);
assert((array) $xml->children() === ['tag' => ['First', 'Second']]);
assert((array) $xml->children(null) === ['tag' => ['First', 'Second']]);
assert((array) $xml->children('', true) === ['tag' => ['First', 'Second']]);
assert((array) $xml->tag->attributes() === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes(null) === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes('', true) === ['@attributes' => ['attrib' => 'base']]);
These return SimpleXMLElement objects holding a reduced subset, so you can access children and attributes the same way you normally would:
assert((string) $xml->tag[1] === (string) $xml->children()->tag[1]);
assert((string) $xml->tag['attrib'] === (string) $xml->tag->attributes()['attrib']);
The interesting part is that there's an unintuitive overlap between the two calling methods:
assert((array) $xml->children() === ['tag' => ['First', 'Second']]);
assert((array) $xml->children('http://localhost/') === ['tag' => ['Second', 'Third']]);
assert((array) $xml->children('localhost', true) === ['tag' => 'Third']);
You can't stick a full xmlns attribute onto an existing attribute so in practice both calling conventions are the same for attributes:
assert((array) $xml->tag->attributes() === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes('http://localhost/') === ['@attributes' => ['attrib' => 'namespaced']]);
assert((array) $xml->tag->attributes('localhost', true) === ['@attributes' => ['attrib' => 'namespaced']]);
Of course, if your parent node has a namespace this changes things slightly:
$xml = <<<END
<root xmlns="http://default/" xmlns:localhost="http://localhost/">
<tag attrib="base" localhost:attrib="namespaced">First</tag>
<tag xmlns="http://localhost/">Second</tag>
<localhost:tag>Third</localhost:tag>
</root>
END;
$xml = new SimpleXMLElement($xml);
assert($xml->getDocNamespaces() === ['' => 'http://default/', 'localhost' => 'http://localhost/']);
assert((array) $xml->children() === ['tag' => ['First', 'Second']]);
assert((array) $xml->children(null) === ['tag' => ['First', 'Second']]);
assert((array) $xml->children('', true) === ['tag' => ['First', 'Second']]);
assert((array) $xml->children('http://default/') === ['tag' => 'First']);
assert((array) $xml->children('http://localhost/') === ['tag' => ['Second', 'Third']]);
assert((array) $xml->children('localhost', true) === ['tag' => 'Third']);
assert((array) $xml->tag->attributes() === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes(null) === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes('', true) === ['@attributes' => ['attrib' => 'base']]);
assert((array) $xml->tag->attributes('http://default/') === []);
assert((array) $xml->tag->attributes('http://localhost/') === ['@attributes' => ['attrib' => 'namespaced']]);
assert((array) $xml->tag->attributes('localhost', true) === ['@attributes' => ['attrib' => 'namespaced']]);
With a root namespace, if you're willing to input the full namespace URL every time, there is no more overlap. But this is beyond cumbersome and you'd have more fun in Dom land.
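For comparison, here's a rough sketch of the same lookups in Dom land. It uses the older DOMDocument API so it isn't tied to 8.4, and the variable names are mine:

```php
<?php
$source = <<<END
<root xmlns="http://default/" xmlns:localhost="http://localhost/">
<tag attrib="base" localhost:attrib="namespaced">First</tag>
<tag xmlns="http://localhost/">Second</tag>
<localhost:tag>Third</localhost:tag>
</root>
END;

$doc = new DOMDocument();
$doc->loadXML($source);

// Elements are matched on namespace URL plus local name; aliases don't matter.
$texts = [];
foreach ($doc->getElementsByTagNameNS('http://localhost/', 'tag') as $element) {
    $texts[] = $element->textContent;
}
assert($texts === ['Second', 'Third']);

$first = $doc->getElementsByTagNameNS('http://default/', 'tag')->item(0);
assert($first->textContent === 'First');

// Attributes are explicit about namespaces too.
assert($first->getAttribute('attrib') === 'base');
assert($first->getAttributeNS('http://localhost/', 'attrib') === 'namespaced');
```

You still type the full URL, but there's only one calling convention and no ambiguity about what you get back.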
Meanwhile plain attributes always have a namespace of null. Dom confirms this too:
$xml = <<<END
<?xml version="1.0" ?>
<root xmlns="http://default/">
<tag attrib="base" />
</root>
END;
$xml = Dom\XMLDocument::createFromString($xml);
$attrib = $xml->firstChild->firstElementChild->attributes->item(0);
assert($attrib->name === 'attrib');
assert($attrib->namespaceURI === null);
Finding children
One of the things I need to do in Kint is check if a SimpleXMLElement has any children. According to the PHP docs there's a convenient hasChildren() method we can use. Great! Right?
$xml = <<<END
<root xmlns="http://default/" xmlns:localhost="http://localhost/">
<tag attrib="base" localhost:attrib="namespaced">First</tag>
<tag xmlns="http://localhost/">Second</tag>
<localhost:tag>Third</localhost:tag>
</root>
END;
$xml = new SimpleXMLElement($xml);
assert(!$xml->hasChildren());
assert(!$xml->tag->hasChildren());
Useless! The method is defined but in my testing it always returns false. And since pretty much every operation including trying to access non-existent properties results in another SimpleXMLElement it becomes a non-trivial problem.
assert($xml->tag->nonsense instanceof SimpleXMLElement);
You might try casting it to string: if you get a non-empty string it doesn't have children, right? Except that, because of the unintuitive string concatenation behavior of parent nodes, this produces a string from just the indentation between nodes, so that's inconclusive as well.
assert(strlen((string) $xml) > 0);
assert(strlen((string) $xml->tag) > 0);
You could try casting to array but it doesn't include namespaced children and you'd have to filter out the @attributes and 0 indices.
assert(((array) $xml)['tag'] === ['First', 'Second']);
assert(((array) $xml->tag)['@attributes'] === ['attrib' => 'base']);
assert(((array) $xml->tag)[0] === 'First');
assert(((array) $xml->tag)[1] === 'Second');
You'd also have no way to distinguish between something that has a string or a single child element without recursive parsing, and we don't want to do that in Kint because we have a depth limit.
Another edge case to the rescue! children() returns a SimpleXMLElement that is just a list of children. When you cast this to an array it never has the @attributes key or the 0 key for string contents, so we can use it to check whether there are any real elements.
assert($xml->children());
assert((array) $xml->children());
assert($xml->tag->children());
assert(!(array) $xml->tag->children());
Here's the function as implemented in Kint:
function hasChildElements(SimpleXMLElement $var): bool
{
    $namespaces = \array_merge(['' => null], $var->getDocNamespaces());
    foreach ($namespaces as $nsAlias => $nsUrl) {
        if ((array) $var->children($nsAlias, true)) {
            return true;
        }
    }
    return false;
}
assert(hasChildElements($xml));
assert(!hasChildElements($xml->tag));
Combine this with string casting and you can tell fairly accurately what you're looking at, even with CDATA strings and namespaces.
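As a rough sketch of what that combination could look like: hasChildElements() is repeated from above so the snippet is self-contained, while describeNode() and its return values are hypothetical and not how Kint actually does it. Note that mixed content is reported as 'elements' and its loose text is ignored:

```php
<?php
function hasChildElements(SimpleXMLElement $var): bool
{
    $namespaces = array_merge(['' => null], $var->getDocNamespaces());
    foreach ($namespaces as $nsAlias => $nsUrl) {
        if ((array) $var->children($nsAlias, true)) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: classify a node as 'elements', 'text' or 'empty'.
function describeNode(SimpleXMLElement $var): string
{
    if (hasChildElements($var)) {
        return 'elements';
    }

    // No child elements, so the string cast is the node's whole text (CDATA included).
    return '' !== (string) $var ? 'text' : 'empty';
}

$xml = new SimpleXMLElement('<root><a>text</a><b><![CDATA[cdata]]></b><c /><d><e /></d></root>');

assert(describeNode($xml) === 'elements');
assert(describeNode($xml->a) === 'text');
assert(describeNode($xml->b) === 'text');
assert(describeNode($xml->c) === 'empty');
assert(describeNode($xml->d) === 'elements');
```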
Conclusion
- Don't use SimpleXML on a structure that mixes strings and elements under the same parent
- Use property access to get children
- Use array access to get attributes
- Use aliases to access namespaced tags and attributes (ie. $xml->children('alias', true) rather than $xml->children('http://namespace/'))
- For anything else save yourself the hassle and just use Dom