Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Finding answers to given questions is important in science
Coming up with good questions is important in science
Artificial scientists can learn to answer given questions and invent new questions
Artificial scientists are biased towards the simplest, least costly experiments that still have surprising outcomes
An empirical analysis of automatic generation of interesting experiments is presented

Two important things in science: finding answers to given questions and coming up with good questions
Artificial systems can be used to implement the creative part of science
Work on artificial scientists equipped with artificial curiosity and creativity has been published for three decades
The intrinsic motivation-based adversarial system from 1990 was an artificial Q&A system designed to invent and answer questions
It uses two artificial NNs: a controller C and a world model M
M minimizes its prediction error, while C tries to find sequences of output actions that maximize the error of M
Artificial Q&A system from 1997 can ask arbitrary abstract questions with computable answers
Reward-maximizing C tries to come up with questions whose answers surprise the other
Artificial scientists maximize the sum of external rewards and intrinsic rewards
POWERPLAY framework (2011) enumerates the set of all formalisable questions
One Big Net For Everything offers a simplified NN version of POWERPLAY
Empirical investigation of two settings: (1) generation of experiments driven by model prediction error, and (2) an approach where C generates pure thought experiments in the form of weight matrices of RNNs
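
The adversarial relationship between C and M described above can be illustrated with a toy loop. This is a minimal sketch, not the paper's implementation: the four-action `environment`, the tabular `TableModel`, and the greedy `GreedyController` are all hypothetical stand-ins chosen to keep the example small. It only shows the principle that C is rewarded by M's error while M learns to remove that error.

```python
def environment(action: int) -> float:
    """Hypothetical deterministic world: each action has one fixed outcome."""
    return [0.0, 1.0, 0.5, 0.25][action]


class TableModel:
    """Stand-in for M: one scalar prediction per action (assumption, not an NN)."""

    def __init__(self, n_actions: int, lr: float = 0.5):
        self.pred = [0.0] * n_actions
        self.lr = lr

    def error(self, action: int, outcome: float) -> float:
        # Squared prediction error of M for this action.
        return (self.pred[action] - outcome) ** 2

    def update(self, action: int, outcome: float) -> None:
        # Move the prediction towards the observed outcome (reduces the error).
        self.pred[action] += self.lr * (outcome - self.pred[action])


class GreedyController:
    """Stand-in for C: picks the action whose last observed M-error was largest."""

    def __init__(self, n_actions: int, optimistic: float = 1.0):
        # Optimistic initial values make C try every action at least once.
        self.expected_error = [optimistic] * n_actions

    def act(self) -> int:
        return max(range(len(self.expected_error)),
                   key=self.expected_error.__getitem__)

    def update(self, action: int, intrinsic_reward: float) -> None:
        self.expected_error[action] = intrinsic_reward


def run(steps: int = 40):
    model = TableModel(4)
    controller = GreedyController(4)
    rewards = []
    for _ in range(steps):
        action = controller.act()
        outcome = environment(action)
        reward = model.error(action, outcome)  # C's intrinsic reward is M's error
        model.update(action, outcome)          # M minimizes that same error
        controller.update(action, reward)
        rewards.append(reward)
    return model, rewards
```

Calling `run()` returns the trained model and the sequence of intrinsic rewards. In this toy world the rewards shrink as M learns every outcome, so C eventually runs out of surprising actions.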

System allows for design of computational experiments with binary yes/no outcomes
Experiments can run for multiple time steps
Controller and model can be implemented as LSTMs
Controller has START unit to propose experiments
Experiment has HALT and RESULT units
Experiment outcome is 1 if RESULT unit > 0.5, 0 otherwise
Model predicts experiment outcome before it is executed
Reward for controller is proportional to model’s surprise
Alternative reward based on compression progress
Negative reward for inefficient experiments
Most initial experiments are thought experiments
Model and controller can be trained by backpropagation
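
The outcome and reward rules above can be sketched as follows. The notes only state that the outcome is 1 if the RESULT unit exceeds 0.5, that the reward is proportional to the model's surprise, and that inefficient experiments receive negative reward. Measuring surprise as binary cross-entropy, and the names `scale` and `step_cost` with their default values, are assumptions made for illustration.

```python
import math


def outcome_bit(result_activation: float) -> int:
    """Binary experiment outcome: 1 if the RESULT unit exceeds 0.5, else 0."""
    return 1 if result_activation > 0.5 else 0


def surprise(prediction: float, outcome: int, eps: float = 1e-7) -> float:
    """Assumed surprise measure: binary cross-entropy (in nats) of M's
    predicted probability of outcome 1 against the observed bit."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1.0 - p))


def controller_reward(prediction: float, outcome: int, runtime: int,
                      scale: float = 1.0, step_cost: float = 0.01) -> float:
    """Reward proportional to M's surprise, minus an assumed per-step cost
    that penalizes long-running (inefficient) experiments."""
    return scale * surprise(prediction, outcome) - step_cost * runtime
```

For example, `controller_reward(0.5, outcome_bit(0.9), runtime=3)` rewards an experiment whose outcome M could not predict better than chance, reduced by the cost of three time steps.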

Automatic generation of experiments encoded as NNs
Evaluated empirically
Two setups: adversarial intrinsic reward and pure thought experiments encoded as RNNs
Adversarial intrinsic reward encourages experiments in differentiable environment
Experiments aid discovery of goal states in sparse reward setting
Pure thought experiments guided by information gain reward
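
One common way to quantify information gain for a binary outcome is the KL divergence between the model's prediction after and before it has been trained on the experiment's result. The sketch below assumes exactly that, with a hypothetical one-parameter model (a single logit) updated by one gradient step. The paper's exact formulation may differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p_new: float, p_old: float, eps: float = 1e-7) -> float:
    """KL(Bernoulli(p_new) || Bernoulli(p_old)) in nats."""
    p = min(max(p_new, eps), 1.0 - eps)
    q = min(max(p_old, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def information_gain(logit: float, outcome: int, lr: float = 1.0):
    """Train the toy model on one observed outcome and measure how much its
    prediction changed. Returns (gain, updated_logit)."""
    p_old = sigmoid(logit)
    # Gradient of binary cross-entropy w.r.t. the logit is (p - outcome).
    new_logit = logit - lr * (p_old - outcome)
    p_new = sigmoid(new_logit)
    return bernoulli_kl(p_new, p_old), new_logit


def repeated_gains(n: int, logit: float = 0.0, outcome: int = 1):
    """Information gain when the same experiment is repeated n times."""
    gains = []
    for _ in range(n):
        gain, logit = information_gain(logit, outcome)
        gains.append(gain)
    return gains
```

In this toy setting, `repeated_gains(5)` yields a shrinking sequence: once the model has learned an experiment's outcome, repeating it is no longer rewarding, which pushes the controller towards new experiments.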

Reinforcement learning (RL) usually involves exploration in an environment with non-differentiable dynamics
RL methods such as policy gradients are used
A fully differentiable environment is introduced to simplify investigation and focus on self-invented experiments
Environment is a 2D force field with position and velocity states and real-valued force vectors as actions
Negative reward of -0.1 for each time step and a large reward of 100 for reaching goal state
Environment is deterministic and experiments are independent of each other
Model M is a simple MLP with parameters w
In the pure thought experiment setup, the intrinsic reward is based on information gain and is non-differentiable
Average runtime of experiments increases slightly over time
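
The deterministic 2D environment with position and velocity states, force actions, a per-step reward of -0.1, and a goal reward of 100 can be sketched as below. The Euler integration, the time step `dt`, the start state, the goal location, the goal radius, and the choice to give only the goal reward on the final step are assumptions. The notes do not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class ForceFieldEnv:
    goal: tuple = (1.0, 1.0)     # assumed goal position
    goal_radius: float = 0.1     # assumed tolerance for "reaching" the goal
    dt: float = 0.1              # assumed integration step
    step_reward: float = -0.1    # from the notes: cost per time step
    goal_reward: float = 100.0   # from the notes: reward for reaching the goal

    def reset(self):
        self.pos = [0.0, 0.0]
        self.vel = [0.0, 0.0]
        return tuple(self.pos + self.vel)

    def step(self, force):
        # Deterministic Euler update: force changes velocity, velocity changes position.
        for i in range(2):
            self.vel[i] += self.dt * force[i]
            self.pos[i] += self.dt * self.vel[i]
        distance = math.hypot(self.pos[0] - self.goal[0],
                              self.pos[1] - self.goal[1])
        done = distance <= self.goal_radius
        reward = self.goal_reward if done else self.step_reward
        return tuple(self.pos + self.vel), reward, done


def rollout(env, policy, max_steps: int = 200):
    """Run one episode; policy maps a state tuple to a 2D force vector."""
    state = env.reset()
    total = 0.0
    for t in range(1, max_steps + 1):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return t, total
```

A call such as `rollout(ForceFieldEnv(), lambda state: (1.0, 1.0))` applies a constant force and returns the episode length and total reward. Because every operation is a smooth arithmetic step, an equivalent version written in an autodiff framework would allow gradients to flow through the dynamics.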

The first setup uses feedforward NNs and a differentiable intrinsic reward function
The second setup investigates thought experiments with no environment interactions, using RNNs without inputs and an intrinsic curiosity reward based on information gain
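
A thought experiment of the second kind is fully described by the weights of an RNN that receives no input. The sketch below runs such a network until a HALT unit fires and then reads a binary outcome from a RESULT unit. The sigmoid activation, the zero initial state, the unit indices, the 0.5 threshold on the HALT unit, and the `max_steps` cap are assumptions for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def run_thought_experiment(weights, bias, max_steps: int = 100,
                           halt_unit: int = 0, result_unit: int = 1):
    """Run an input-free RNN defined by `weights` (n x n) and `bias` (n).

    Returns (outcome, runtime): outcome is 1 if the RESULT unit exceeds 0.5
    when the run ends, runtime is the number of steps executed.
    """
    n = len(weights)
    h = [0.0] * n  # assumed initial hidden state
    t = 0
    for t in range(1, max_steps + 1):
        # The next state depends only on the previous state: there is no input.
        pre = [sum(weights[i][j] * h[j] for j in range(n)) + bias[i]
               for i in range(n)]
        h = [sigmoid(v) for v in pre]
        if h[halt_unit] > 0.5:  # HALT unit fired: stop the experiment
            break
    outcome = 1 if h[result_unit] > 0.5 else 0
    return outcome, t
```

For instance, `run_thought_experiment([[0.0, 3.0], [0.0, 0.0]], [-1.0, 2.0])` describes a two-unit network in which the RESULT unit excites the HALT unit. A controller in this setting would output the entries of `weights` and `bias`, and the model would have to predict the outcome bit from those parameters alone.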

The neural Controller-Model (CM) framework is extended with the notion of arbitrary self-invented computational experiments with binary outcomes
Experiments are encoded as weight matrices of RNNs generated by the controller
Model has to predict the outcome of an experiment based on its parameters
Show that self-invented abstract experiments facilitate the discovery of rewarding goal states
Over time, controller is forced to create longer experiments
Second setup: controller generates pure abstract thought experiments in the form of RNNs
Over time, newly generated experiments result in less intrinsic information gain reward
Later experiments tend to have slightly longer runtime
Scaling these methods to more complex environments is challenging
Algorithm 2 summarizes the method described in Section 3.2
Efficient approximation of the policy gradients for the controller is achieved through an actor-critic method
Input to the LSTM history encoder is the sequence of the last 1000 experiments that have been executed
Hyperparameters listed in Table 2