Like every interpreted language in real-world use, PHP compiles to an IR run on a VM. The opcodes the IR is made of are the closest you'll get to PHP's version of machine code.
Back in the day you used to use phpdbg or the VLD extension to dump PHP opcodes, but nowadays the opcache extension you probably have enabled will do so itself.
Making it work can be a little tricky, since at least in Debian opcache is disabled on CLI where you're probably running stuff.
The following INI values need to be set:
opcache.enable_cli=On – even if opcache is enabled none of this will work on CLI if this is not enabled
opcache.opt_debug_level=0x10000 – This shows the opcodes before optimization. There are other modes that can show them after optimization too
opcache.jit=disable and opcache.log_verbosity_level=1 may also be necessary but weren't on my system.
Lastly, to ensure PHP doesn't actually run the file we can pass -l to make it only lint, rather than execute, the file.
Our final command ends up being:
php -d opcache.enable_cli=On -d opcache.opt_debug_level=0x10000 -l test.php
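If you end up switching debug levels a lot, the command can be assembled from PHP itself. This is only a sketch: opcodeDumpCommand is a name I made up, and it assumes the php binary is on your PATH. As far as I can tell the dump is written to stderr, so you may need a 2>&1 to capture it.

```php
<?php
// Hypothetical helper: builds the lint-only opcode dump command from above.
// $debugLevel is passed straight to opcache.opt_debug_level (0x10000 = before optimizer).
function opcodeDumpCommand(string $file, int $debugLevel = 0x10000): string
{
    return sprintf(
        'php -d opcache.enable_cli=On -d opcache.opt_debug_level=0x%x -l %s',
        $debugLevel,
        escapeshellarg($file) // quoting style depends on the platform
    );
}

echo opcodeDumpCommand('test.php'), "\n";
```

The result can be handed to shell_exec or simply pasted into a terminal.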
Disassembly time
Hello world looks simple enough:
<?php
echo "Hello world";
$_main:
; (lines=2, args=0, vars=0, tmps=0)
; (before optimizer)
; /home/j/test/test.php:1-4
; return [] RANGE[0..0]
0000 ECHO string("Hello world")
0001 RETURN int(1)
We see some debug information about the code, followed by the actual opcodes with offsets. I'll focus on the opcodes themselves from here on.
Since echo is a builtin it's not surprising it has its own opcode. Let's see what happens with print:
<?php
print "Hello world";
0000 ECHO string("Hello world")
0001 RETURN int(1)
Good! It still uses the echo opcode. I'm an echo supremacist at heart too.
The return value is interesting though. Typically C programs will return 0 on success. Anything else is considered an error on pretty much any system, with bash treating non-zero exit codes as a program failure and even win32 internally treating 0 as success.
In C world this makes sense since you typically only need one value to indicate things went smoothly; it's when things go wrong that you need more information.
In PHP however, it seems the RETURN opcode is simply pasted at the end of every file, since an explicit return results in two opcodes prior to optimization:
<?php
echo "Hello world";
return 0;
0000 ECHO string("Hello world")
0001 RETURN int(0)
0002 RETURN int(1)
Actually running our program we see that regardless of our return value the exit code remains 0, indicating the program ran successfully.
This is probably because includes can return values like any function call. With the previous example saved as test2.php, a new test.php can dump what the include evaluates to:
<?php
var_dump(include "test2.php");
0000 INIT_FCALL 1 96 string("var_dump")
0001 V0 = INCLUDE_OR_EVAL (include) string("test2.php")
0002 SEND_VAR V0 1
0003 DO_ICALL
0004 RETURN int(1)
Here we see the opcodes' calling convention for a var_dump call, with SEND_VAR being used to set the first argument to the result of the INCLUDE_OR_EVAL opcode.
Running the program produces the expected result of:
Hello worldint(0)
Just to prove the implicit return at the end of the include is working, removing the return from test2.php gives us:
Hello worldint(1)
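The same behaviour can be reproduced without juggling two files by hand. A minimal sketch, assuming the temp directory is writable; includeResult is a made-up helper, not anything PHP ships with:

```php
<?php
// Hypothetical helper: writes $code to a temporary file, includes it,
// and returns whatever the include expression evaluated to.
function includeResult(string $code)
{
    $path = tempnam(sys_get_temp_dir(), 'inc');
    file_put_contents($path, $code);
    try {
        return include $path;
    } finally {
        unlink($path); // clean up even if the included code throws
    }
}

var_dump(includeResult('<?php return 0;'));    // int(0): the explicit return value
var_dump(includeResult('<?php $unused = 1;')); // int(1): the implicit RETURN int(1)
```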
If we want to produce an actual non-zero exit code we can use the exit function in PHP. Unlike echo or print, the parentheses are syntactically required when passing a status, but the exit itself still has its own opcode:
<?php
echo "Hello world";
exit(1);
0000 ECHO string("Hello world")
0001 EXIT int(1)
0002 RETURN int(1)
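To check the exit codes without squinting at $? in a shell, the two variants can be run in child processes. This is a sketch under a couple of assumptions: exec() is not disabled, and PHP_BINARY points at a CLI binary. exitCodeOf is a name I invented for the occasion.

```php
<?php
// Hypothetical helper: runs a snippet with "php -r" in a child process
// and returns that process's exit status.
function exitCodeOf(string $snippet): int
{
    $output = [];
    $status = -1;
    exec(escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($snippet), $output, $status);
    return $status;
}

var_dump(exitCodeOf('return 3;')); // a top-level return leaves the exit code alone
var_dump(exitCodeOf('exit(3);'));  // exit sets it
```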
Control flow
<?php
for ($i = 0; $i < 100; $i++) {
if ($i % 15 === 0) {
echo "FizzBuzz\n";
} elseif ($i % 3 === 0) {
echo "Fizz\n";
} elseif ($i % 5 === 0) {
echo "Buzz\n";
} else {
echo $i."\n";
}
}
Good old fizzbuzz! A prime example of control flow to test our opcodes on!
This time we're going to have something quite a bit longer:
0000 ASSIGN CV0($i) int(0)
0001 JMP 0020
0002 T2 = MOD CV0($i) int(15)
0003 T3 = IS_IDENTICAL T2 int(0)
0004 JMPZ T3 0007
0005 ECHO string("FizzBuzz
")
0006 JMP 0019
0007 T4 = MOD CV0($i) int(3)
0008 T5 = IS_IDENTICAL T4 int(0)
0009 JMPZ T5 0012
0010 ECHO string("Fizz
")
0011 JMP 0019
0012 T6 = MOD CV0($i) int(5)
0013 T7 = IS_IDENTICAL T6 int(0)
0014 JMPZ T7 0017
0015 ECHO string("Buzz
")
0016 JMP 0019
0017 T8 = CONCAT CV0($i) string("
")
0018 ECHO T8
0019 PRE_INC CV0($i)
0020 T10 = IS_SMALLER CV0($i) int(100)
0021 JMPNZ T10 0002
0022 RETURN int(1)
Well it sure looks like assembly. In opcodes 0 and 1 we have our for loop initialization, which includes a JMP to the end of the for loop. At 20 and 21 we then check whether the for loop's conditional passes with IS_SMALLER, and jump back to the beginning of the body at opcode 2 with JMPNZ.
While x86 would handle this with a CMP and JL, PHP's approach is closer to what I model this code as in my head: first evaluate the expression, then check if it's true or false.
This makes more intuitive sense to me than the strange x86 pattern of "Do every possible comparison now and pick which output to jump based on later". I'm sure there's a reason for the way x86 did it (Probably timing related) but it's hard to reason about at a higher level.
Looking at the first branch we have a MOD opcode followed by an IS_IDENTICAL and JMPZ.
As in assembly, when the condition in the if is false we jump past the block in opcode 4, inverting the logic of the control flow. Inside the block at opcode 6 we jump to opcode 19, where the for loop's incrementor PRE_INC is run before the conditional can send us back to 2 again.
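PHP happens to have goto, so the layout of the opcodes can be written back out as PHP. The following is my own hand translation, not something the compiler produces: the label names are invented, and output is collected in a string instead of echoed so the two versions can be compared.

```php
<?php
// The loop as written in the source.
function fizzbuzzLoop(int $limit): string
{
    $out = '';
    for ($i = 0; $i < $limit; $i++) {
        if ($i % 15 === 0) {
            $out .= "FizzBuzz\n";
        } elseif ($i % 3 === 0) {
            $out .= "Fizz\n";
        } elseif ($i % 5 === 0) {
            $out .= "Buzz\n";
        } else {
            $out .= $i . "\n";
        }
    }
    return $out;
}

// The same loop arranged like the opcode dump: condition at the bottom,
// inverted tests that jump past each block.
function fizzbuzzGoto(int $limit): string
{
    $out = '';
    $i = 0;                                // 0000 ASSIGN
    goto condition;                        // 0001 JMP 0020
body:
    if (!($i % 15 === 0)) goto check3;     // 0002-0004 MOD, IS_IDENTICAL, JMPZ
    $out .= "FizzBuzz\n";                  // 0005 ECHO
    goto increment;                        // 0006 JMP 0019
check3:
    if (!($i % 3 === 0)) goto check5;      // 0007-0009
    $out .= "Fizz\n";                      // 0010
    goto increment;                        // 0011
check5:
    if (!($i % 5 === 0)) goto fallback;    // 0012-0014
    $out .= "Buzz\n";                      // 0015
    goto increment;                        // 0016
fallback:
    $out .= $i . "\n";                     // 0017-0018 CONCAT, ECHO
increment:
    ++$i;                                  // 0019 PRE_INC
condition:
    if ($i < $limit) goto body;            // 0020-0021 IS_SMALLER, JMPNZ
    return $out;
}

var_dump(fizzbuzzGoto(100) === fizzbuzzLoop(100)); // bool(true)
```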
Pre-inc vs post-inc
Let's see how PHP handles pre and post increments.
<?php
$i = 8;
$x = $i++;
$y = ++$i;
echo $i;
echo $x;
echo $y;
0000 ASSIGN CV0($i) int(8)
0001 T4 = POST_INC CV0($i)
0002 ASSIGN CV1($x) T4
0003 T6 = PRE_INC CV0($i)
0004 ASSIGN CV2($y) T6
0005 ECHO CV0($i)
0006 ECHO CV1($x)
0007 ECHO CV2($y)
0008 RETURN int(1)
How dreadfully boring. No surprises here, though it's interesting that it consistently stores results of expressions in temporary variables reminiscent of registers before assigning actual variables to them.
Undefined variables
What if we don't define the variable at all? In PHP you can increment a non-existent variable and it should act as if it was defined as null:
<?php
echo $i++;
echo ++$i;
0000 T1 = POST_INC CV0($i)
0001 ECHO T1
0002 T2 = PRE_INC CV0($i)
0003 ECHO T2
0004 RETURN int(1)
Looks like the opcodes also ignore undefined variables. Interesting that the POST_INC happens before the PRE_INC (well, of course it does, it wouldn't work otherwise, duh).
Running this code produces a warning about the undefined variable, and because the null is cast to an empty string the only output we receive is 2.
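Strangely enough, declare(strict_types=1); does nothing to change either the output or the opcodes. I would have thought it would refuse to run, but strict_types only affects how scalar arguments and return values are coerced at function calls, not undefined variables.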
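Here is the same sequence with the intermediate values made visible. To keep the warning out of the way this sketch starts from an explicit null rather than an undefined variable, which is my substitution and not what the original test does.

```php
<?php
// Start from null, which is what an undefined variable evaluates to.
$i = null;
$post = $i++; // yields the old value (null); $i is now 1
$pre  = ++$i; // $i becomes 2 and the new value is yielded

// Capture what the two echo statements would print.
ob_start();
echo $post; // null is printed as an empty string
echo $pre;
$printed = ob_get_clean();

var_dump($post, $pre, $printed); // NULL, int(2), string(1) "2"
```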
Concatenation as cast
In the last branch of fizzbuzz at opcode 17 we cast CV0($i) to string by concatenating it with a newline. But what happens if we remove the concatenation?
A smaller test case:
<?php
$i = 8;
echo $i."\n";
0000 ASSIGN CV0($i) int(8)
0001 T2 = CONCAT CV0($i) string("
")
0002 ECHO T2
0003 RETURN int(1)
Here when we remove the concatenation we get:
<?php
$i = 8;
echo $i;
0000 ASSIGN CV0($i) int(8)
0001 ECHO CV0($i)
0002 RETURN int(1)
Huh. That's weird. It seems the ECHO opcode doesn't require the operand to be a string at all. It probably does some typecasting internally. What happens if we typecast it explicitly?
<?php
$i = 8;
echo (string) $i;
0000 ASSIGN CV0($i) int(8)
0001 T2 = CAST (string) CV0($i)
0002 ECHO T2
0003 RETURN int(1)
So while ECHO is capable of taking integers, explicit casts are done through the CAST opcode. I wonder what the optimizer does to this?
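From the outside at least, echo with and without the cast is indistinguishable. A quick sketch to check that for a few scalar types; echoed is a helper name I made up, and it only compares output, so it says nothing about what the VM does internally.

```php
<?php
// Hypothetical helper: returns exactly what "echo $value" would print.
function echoed($value): string
{
    ob_start();
    echo $value;
    return (string) ob_get_clean();
}

// Compare the implicit conversion done by echo with an explicit (string) cast.
foreach ([8, 1.5, true, false, null] as $value) {
    var_dump(echoed($value) === (string) $value); // bool(true) each time
}
```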
Optimizations
Oh yeah, we can change our opcache.opt_debug_level to 0x20000 to see the optimized opcodes! Let's see what it produces:
<?php
$i = 8;
echo (string) $i;
0000 ASSIGN CV0($i) int(8)
0001 ECHO CV0($i)
0002 RETURN int(1)
Hmm. It seems the optimizer knows the CAST isn't required here, but other than that it looks very similar.
In the concatenation example the only difference is that our T2 is now a T1 – probably because the optimizer is reusing temporary variables.
Looking at the optimized fizzbuzz, post-inc and undefined-variable opcodes (dumped in that order below) we see a similar reuse of temp variables not seen in the unoptimized code, and a removal of the redundant return, but no other optimizations:
0000 ASSIGN CV0($i) int(0)
0001 JMP 0020
0002 T2 = MOD CV0($i) int(15)
0003 T1 = IS_IDENTICAL T2 int(0)
0004 JMPZ T1 0007
0005 ECHO string("FizzBuzz
")
0006 JMP 0019
0007 T2 = MOD CV0($i) int(3)
0008 T1 = IS_IDENTICAL T2 int(0)
0009 JMPZ T1 0012
0010 ECHO string("Fizz
")
0011 JMP 0019
0012 T2 = MOD CV0($i) int(5)
0013 T1 = IS_IDENTICAL T2 int(0)
0014 JMPZ T1 0017
0015 ECHO string("Buzz
")
0016 JMP 0019
0017 T1 = CONCAT CV0($i) string("
")
0018 ECHO T1
0019 PRE_INC CV0($i)
0020 T1 = IS_SMALLER CV0($i) int(100)
0021 JMPNZ T1 0002
0022 RETURN int(1)
0000 ASSIGN CV0($i) int(8)
0001 T3 = POST_INC CV0($i)
0002 ASSIGN CV1($x) T3
0003 T3 = PRE_INC CV0($i)
0004 ASSIGN CV2($y) T3
0005 ECHO CV0($i)
0006 ECHO CV1($x)
0007 ECHO CV2($y)
0008 RETURN int(1)
0000 T1 = POST_INC CV0($i)
0001 ECHO T1
0002 T1 = PRE_INC CV0($i)
0003 ECHO T1
0004 RETURN int(1)
The stack
Let's get ourselves a stack going. Let's just wrap our fizzbuzz in a function, stick a debug_print_backtrace on the end, return a number, and see what happens when we call it.
<?php
function fizzbuzzdump() {
for ($i = 0; $i < 100; $i++) {
if ($i % 15 === 0) {
echo "FizzBuzz\n";
} elseif ($i % 3 === 0) {
echo "Fizz\n";
} elseif ($i % 5 === 0) {
echo "Buzz\n";
} else {
echo $i."\n";
}
}
debug_print_backtrace();
return 4;
}
$x = fizzbuzzdump();
$_main:
; (lines=4, args=0, vars=1, tmps=2)
; (before optimizer)
; test.php:1-22
; return [] RANGE[0..0]
0000 INIT_FCALL 0 272 string("fizzbuzzdump")
0001 V1 = DO_UCALL
0002 ASSIGN CV0($x) V1
0003 RETURN int(1)
fizzbuzzdump:
; (lines=26, args=0, vars=1, tmps=11)
; (before optimizer)
; test.php:3-19
; return [] RANGE[0..0]
0000 ASSIGN CV0($i) int(0)
0001 JMP 0020
0002 T2 = MOD CV0($i) int(15)
0003 T3 = IS_IDENTICAL T2 int(0)
0004 JMPZ T3 0007
0005 ECHO string("FizzBuzz
")
0006 JMP 0019
0007 T4 = MOD CV0($i) int(3)
0008 T5 = IS_IDENTICAL T4 int(0)
0009 JMPZ T5 0012
0010 ECHO string("Fizz
")
0011 JMP 0019
0012 T6 = MOD CV0($i) int(5)
0013 T7 = IS_IDENTICAL T6 int(0)
0014 JMPZ T7 0017
0015 ECHO string("Buzz
")
0016 JMP 0019
0017 T8 = CONCAT CV0($i) string("
")
0018 ECHO T8
0019 PRE_INC CV0($i)
0020 T10 = IS_SMALLER CV0($i) int(100)
0021 JMPNZ T10 0002
0022 INIT_FCALL 0 80 string("debug_print_backtrace")
0023 DO_ICALL
0024 RETURN int(4)
0025 RETURN null
I had to bring back the headers! Unlike machine code which uses pointers to jump into other functions or instructions, in PHP opcodes we have entirely different contexts for different scopes.
I don't know if this is just how opcache represents it or if it actually works like this but this is probably for the best, since I can't imagine the horrors I'd awaken to if it was possible to manually jump into the middle of another function without abiding by the calling convention.
It seems functions also have an implicit return, though the one in functions returns null (as expected), contrary to the 1 returned from include.
The only other thing to notice is the difference between DO_ICALL and DO_UCALL. In PHP, functions that take user-defined callbacks like usort are prefixed with a "u" for "user defined". I presume DO_UCALL and DO_ICALL correspond to "user-defined function call" and "internal function call", which fits what we see here: our own fizzbuzzdump is called with DO_UCALL while the builtin debug_print_backtrace gets DO_ICALL. I'm not sure exactly what else differs between them.
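The internal versus user-defined split is at least something PHP exposes through reflection. This sketch only shows that the distinction exists at the language level; that it is what drives the choice of opcode is my assumption. fizzbuzzdumpStub and functionKind are placeholder names.

```php
<?php
// Stand-in for the user-defined function from the listing.
function fizzbuzzdumpStub(): int
{
    return 4;
}

// Hypothetical helper: reports which side of the internal/user split a function is on.
function functionKind(string $name): string
{
    $function = new ReflectionFunction($name);
    return $function->isInternal() ? 'internal' : 'user';
}

echo functionKind('debug_print_backtrace'), "\n"; // internal
echo functionKind('fizzbuzzdumpStub'), "\n";      // user
```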
The hard part (Exceptions)
This isn't the first time I've dug into PHP opcodes. Last time I gave up on exceptions, since the opcodes made no sense to me. This time we have opcodes dumped from opcache directly, so let's see what we get.
<?php
try {
try {
$nonexistent->method();
$other->method();
} catch (Throwable $t) {
throw $t;
}
} catch (Exception $e) {
$y = 1;
} finally {
$x = 0;
}
$_main:
; (lines=15, args=0, vars=6, tmps=5)
; (before optimizer)
; test.php:1-15
; return [] RANGE[0..0]
0000 INIT_METHOD_CALL 0 CV0($nonexistent) string("method")
0001 DO_FCALL
0002 INIT_METHOD_CALL 0 CV1($other) string("method")
0003 DO_FCALL
0004 JMP 0007
0005 CV2($t) = CATCH string("Throwable")
0006 THROW CV2($t)
0007 JMP 0010
0008 CV3($e) = CATCH string("Exception")
0009 ASSIGN CV4($y) int(1)
0010 T6 = FAST_CALL 0012
0011 JMP 0014
0012 ASSIGN CV5($x) int(0)
0013 FAST_RET T6
0014 RETURN int(1)
EXCEPTION TABLE:
0000, 0008, 0012, 0013
0000, 0005, -, -
Oh. An exception table. That wasn't there last time.
By the looks of it this exception table is a dump of the zend_try_catch_element structs mentioned in nikic's PHP7 VM post, with the "finally" values dashed out for the inner try catch that doesn't have one.
Since the try already jumps to the finally FAST_CALL and the catch falls through I'm not sure the finally offsets are necessary here, but I guess in the case of an uncaught exception the VM would need to go to finally itself.
For this to work I guess it's checking the exception table from last to first, since any exception in the inner try should go to the catch at 5, and any exception in the catch would be caught by the outer try, where the exception table encompasses the throw at 6.
The finally seems to be handled as a type of inline function with FAST_CALL and FAST_RET. According to nikic, the FAST_CALL either stores its own opcode position so FAST_RET can return there, or the current exception. But if this is the case I again fail to see the need for the latter 2 offsets in the zend_try_catch_element since the current exception would need to go through FAST_CALL anyway.
But then as nikic says:
Exceptions are the root of all evil. […] For now we will pretend that finally blocks do not exist, as they are a whole different rabbit hole.
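Whatever the VM does internally, the observable order of events is easy to record. In this sketch I wrap the example in one more try so the escaping error can be logged, and I assign null to $nonexistent explicitly to avoid the undefined variable warning; both are my additions, as is the traceNested name.

```php
<?php
// Records which blocks run, in order, for the nested try/catch/finally above.
function traceNested(): array
{
    $trace = [];
    try {
        try {
            try {
                $nonexistent = null;
                $nonexistent->method(); // throws Error, which is not an Exception
                $trace[] = 'unreachable';
            } catch (Throwable $t) {
                $trace[] = 'inner catch';
                throw $t;
            }
        } catch (Exception $e) {
            $trace[] = 'outer catch'; // skipped: Error does not extend Exception
        } finally {
            $trace[] = 'finally';     // still runs while the Error is in flight
        }
    } catch (Error $escaped) {
        $trace[] = 'escaped: ' . get_class($escaped);
    }
    return $trace;
}

echo implode(' -> ', traceNested()), "\n"; // inner catch -> finally -> escaped: Error
```

So the finally block does run for an exception that no catch handles, which is the uncaught case guessed at above.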
Optimization
What does the exception code look like with optimizations?
$_main:
; (lines=14, args=0, vars=6, tmps=1)
; (after optimizer)
; test.php:1-15
0000 INIT_METHOD_CALL 0 CV0($nonexistent) string("method")
0001 DO_FCALL
0002 INIT_METHOD_CALL 0 CV1($other) string("method")
0003 DO_FCALL
0004 JMP 0009
0005 CV2($t) = CATCH string("Throwable")
0006 THROW CV2($t)
0007 CV3($e) = CATCH string("Exception")
0008 ASSIGN CV4($y) int(1)
0009 T6 = FAST_CALL 0011
0010 JMP 0013
0011 ASSIGN CV5($x) int(0)
0012 FAST_RET T6
0013 RETURN int(1)
EXCEPTION TABLE:
0000, 0007, 0011, 0012
0000, 0005, -, -
We see that it has dropped the JMP from the inner catch to the finally. Unfortunately this isn't the kind of clever optimization I was hoping for: it doesn't fall through to the outer catch but probably just recognized that the JMP was dead code due to the unconditional THROW.
Multiple catches
So how does PHP handle multiple catches?
<?php
try {
$nonexistent->method();
$other->method();
} catch (Throwable $t) {
throw $t;
} catch (Exception $e) {
$y = 1;
} finally {
$x = 0;
}
$_main:
; (lines=14, args=0, vars=6, tmps=1)
; (after optimizer)
; test.php:1-13
0000 INIT_METHOD_CALL 0 CV0($nonexistent) string("method")
0001 DO_FCALL
0002 INIT_METHOD_CALL 0 CV1($other) string("method")
0003 DO_FCALL
0004 JMP 0009
0005 CV2($t) = CATCH string("Throwable") 0007
0006 THROW CV2($t)
0007 CV3($e) = CATCH string("Exception")
0008 ASSIGN CV4($y) int(1)
0009 T6 = FAST_CALL 0011
0010 JMP 0013
0011 ASSIGN CV5($x) int(0)
0012 FAST_RET T6
0013 RETURN int(1)
EXCEPTION TABLE:
0000, 0005, 0011, 0012
It seems the CATCH opcode has an optional additional offset to the start of the next catch block to try matching the exception to. The exception table then just stores the start of the first catch block and lets them check one at a time.
Looking at zend_compile_try in zend_compile.c it seems to be stored as op2, but I don't know enough about Zend internals to say for sure, and my curiosity is now sated.
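The "check one at a time" behaviour is visible from userland too: the first catch block whose type matches wins. A small sketch with two made-up functions that differ only in the order of their catch blocks:

```php
<?php
// Same catch order as the example: Throwable first, Exception second.
function broadCatchFirst(Throwable $thrown): string
{
    try {
        throw $thrown;
    } catch (Throwable $t) {
        return 'Throwable';
    } catch (Exception $e) {
        return 'Exception'; // never reached: every Exception is also a Throwable
    }
}

// Narrow type first, so the second block actually gets a turn.
function narrowCatchFirst(Throwable $thrown): string
{
    try {
        throw $thrown;
    } catch (Exception $e) {
        return 'Exception';
    } catch (Throwable $t) {
        return 'Throwable';
    }
}

echo broadCatchFirst(new Exception('x')), "\n";  // Throwable
echo narrowCatchFirst(new Exception('x')), "\n"; // Exception
echo narrowCatchFirst(new Error('x')), "\n";     // Throwable
```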