It's not abstract logic, it's rollout [1]: repeatedly simulating different action sequences N steps into the future and comparing their score after the last step.

A brute-force approach would simply simulate every possible sequence and pick the highest-scoring one. RL algorithms instead pseudo-randomly sample the (typically intractably large) search space and use the results to update their policy, which these days is typically implemented as a neural network.
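To make the contrast concrete, here is a minimal sketch of both approaches on a made-up toy problem. Everything in it is an assumption for illustration: the `ACTIONS` set, the `simulate` scoring function, and the function names are hypothetical and not taken from any real system. The sampled version only keeps the best sequence it happened to find; it does not update a policy.

```python
import itertools
import random

ACTIONS = ["left", "right", "stay"]  # hypothetical toy action set


def simulate(sequence):
    """Hypothetical simulator: the score is the final position on a line."""
    position = 0
    for action in sequence:
        if action == "right":
            position += 1
        elif action == "left":
            position -= 1
    return position


def brute_force(horizon):
    """Simulate every possible sequence and keep the highest-scoring one."""
    return max(itertools.product(ACTIONS, repeat=horizon), key=simulate)


def sampled_rollouts(horizon, n_samples, seed=0):
    """Simulate only n_samples pseudo-random sequences and keep the best seen."""
    rng = random.Random(seed)
    best_sequence, best_score = None, float("-inf")
    for _ in range(n_samples):
        sequence = tuple(rng.choice(ACTIONS) for _ in range(horizon))
        score = simulate(sequence)
        if score > best_score:
            best_sequence, best_score = sequence, score
    return best_sequence, best_score
```

Brute force has to score 3^N sequences for a horizon of N, which is why it stops being an option very quickly; sampling trades the guarantee of finding the best sequence for a fixed simulation budget.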

In the Air Force examples, as long as shooting the human controller or the communication tower is not explicitly prohibited, there is nothing surprising about an RL agent trying that course of action (along with other random things like shooting rocks, prairie dogs and even itself). Doing so requires no abstract reasoning or understanding of the causal mechanism between shooting the human controller and getting a better score, just random sampling and score-keeping. If the rollout score consistently goes up after the action "shoot the human controller", any RL algorithm worth its salt will update its policy accordingly and start shooting the human controller.
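The "random sampling and score-keeping" part can be sketched in a few lines. This is a deliberately simplified stand-in, an epsilon-greedy bandit that keeps a running average score per first action, not a neural-network policy and not a model of any real Air Force simulation. The action names, the veto probability, and the scoring rule in `rollout_score` are all invented for illustration.

```python
import random

ACTIONS = ["shoot_rock", "shoot_target", "shoot_controller"]  # hypothetical


def rollout_score(first_action, rng, horizon=10):
    """Hypothetical scoring rule: every later step earns 1 point for hitting
    a target, unless a still-active controller vetoes it (50% of the time)."""
    controller_active = first_action != "shoot_controller"
    score = 1 if first_action == "shoot_target" else 0
    for _ in range(horizon - 1):
        vetoed = controller_active and rng.random() < 0.5
        if not vetoed:
            score += 1
    return score


def train(n_rollouts=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy score-keeping: no model of *why* a score is high."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for i in range(n_rollouts):
        if i < len(ACTIONS):
            action = ACTIONS[i]  # try every action once to start
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore at random
        else:
            # exploit: pick the action with the best average score so far
            action = max(ACTIONS, key=lambda a: totals[a] / counts[a])
        totals[action] += rollout_score(action, rng)
        counts[action] += 1
    return {a: totals[a] / counts[a] for a in ACTIONS}
```

Calling `train()` returns the average rollout score per first action. Under these made-up rules the agent ends up preferring `shoot_controller` purely because the numbers come out higher; nothing in the code represents the controller, the veto, or the connection between them.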

[1] https://robotics.stackexchange.com/questions/16596/what-is-t...