Hi, first author of the original NeurIPS paper here! Someone just shared this HN post with me. I'll go through the comments but happy to try to answer questions as well.
I'm still wrapping my head around this (it seems rather magical; can it really be just that?)... but I find the idea of "sampling from a distribution of trajectories that reach a terminal state whose probability is proportional to a positive reward function, by minimizing the difference between the flow coming into each state and the flow coming out of it, which by construction must be equal" to be both beautiful and elegant -- like all great ideas.
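To check that I've got the mechanics right, here's the toy picture in my head, written as a throwaway PyTorch-style sketch (the function, its arguments, and the eps are all my own made-up illustration, not anything from the paper): for each state, the flow the network predicts going in should match the flow it predicts going out, with the reward acting as an extra outgoing term at terminal states.

    import torch

    def flow_mismatch(inflows, outflows, reward=0.0, eps=1e-8):
        # inflows:  predicted flows F(s' -> s) on edges into state s
        # outflows: predicted flows F(s -> s'') on edges out of state s
        # reward:   R(s), nonzero only when s is terminal
        in_total = inflows.sum()
        out_total = outflows.sum() + reward
        # penalize the mismatch in log space so large and small flows
        # contribute more evenly
        return (torch.log(eps + in_total) - torch.log(eps + out_total)) ** 2

Is that roughly the right mental model?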
How did you and your coauthors come up with this? Trial and error? Or was there a moment of serendipitous creative insight?
--
To the moderators: My comment is now at the top of the page, but manux's comment above is more deserving of the top spot. I just upvoted it. Please consider pinning it at the top.
I think the inspiration came to me from looking at SumTrees and from having worked on Temporal Difference learning for a long time. The idea of flows came to Yoshua and me from the realization that we wanted some kind of energy conservation/preservation mechanism from having multiple paths lead to the same state.
But yes, in the moment it felt like a very serendipitous insight!
Thank you. It's nice to hear that you had one of those shower/bathtub Eureka moments!
> ...we wanted some kind of energy conservation/preservation mechanism from having multiple paths lead to the same state
Makes sense. FWIW, to me this looks like a Conservation Law -- as in Physics. I mean, it's not that the flows "must be" conserved, but that they are conserved (or go into sinks). Any physicists interested in AI should be all over this; it's right up their alley.
Could you highlight the difference between this and training a permutation-invariant or equivariant policy network using standard supervised or RL methods? Assuming I also have a way to get an invariant/equivariant loss function.
What the permutation invariance gets you is that the model doesn't arbitrarily prefer one (graph) configuration over another, but this seems tangential. The difference between this and RL is in what we do with the reward:
- RL says, give me a reward and I'll give you its max.
- GFlowNet says, give me a reward and I'll give you all its modes (via p(x) \propto R(x)).
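A toy numerical illustration of the contrast (my own sketch, nothing to do with the actual training machinery):

    import numpy as np

    R = np.array([1.0, 5.0, 4.9, 0.1])   # rewards of four terminal objects

    # "RL" answer: commit to the single maximizer
    best = int(R.argmax())                # object 1 (0-indexed)

    # "GFlowNet" answer: a sampler whose marginal is p(x) = R(x) / sum(R)
    p = R / R.sum()
    samples = np.random.choice(len(R), size=10_000, p=p)
    # roughly 45% of samples are object 1 and 45% are object 2:
    # every high-reward mode gets visited, not just the argmax

So when the reward has many near-equally-good configurations (molecules, in the paper's case), you get a diverse batch of them rather than a single greedy answer.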
Yes, ideally you would have a loss (well, a reward/energy) that is invariant, e.g. one that operates directly on the molecule rather than on some arbitrary ordering of the nodes.