Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Finding answers to given questions is important in science
- Coming up with good questions is important in science
- Artificial scientists can learn to answer given questions and invent new questions
- Artificial scientists are biased towards simpler, least costly experiments with surprising outcomes
- An empirical analysis of automatic generation of interesting experiments is presented
Paper Content
Introduction & previous work
- Two important things in science: finding answers to given questions and coming up with good questions
- Artificial systems can be used to implement creative part of science
- Artificial scientists equipped with artificial curiosity and creativity have been published for 3 decades
- Artificial Q&A system designed to invent and answer questions was the intrinsic motivation-based adversarial system from 1990
- Two artificial NNs: controller C and world model M
- M minimizes its error, C tries to find sequences of output actions that maximize the error of M
- Artificial Q&A system from 1997 can ask arbitrary abstract questions with computable answers
- Reward-maximizing C tries to come up with questions whose answers surprise the other
- Artificial scientists maximize the sum of external rewards and intrinsic rewards
- POWERPLAY framework (2011) enumerates the set of all formalisable questions
- One Big Net For Everything offers a simplified NN version of POWERPLAY
- Empirical investigation of two settings: generation of experiments driven by model prediction error and approach where C generates pure thought experiments in form of weight matrices of RNNs
Self-invented experiments encoded as neural networks
- System allows for design of computational experiments with binary yes/no outcomes
- Experiments can run for multiple time steps
- Controller and model can be implemented as LSTMs
- Controller has START unit to propose experiments
- Experiment has HALT and RESULT units
- Experiment outcome is 1 if RESULT unit > 0.5, 0 otherwise
- Model predicts experiment outcome before it is executed
- Reward for controller is proportional to model’s surprise
- Alternative reward based on compression progress
- Negative reward for inefficient experiments
- Most initial experiments are thought experiments
- Model and controller can be trained by backpropagation
Experimental evaluation
- Automatic generation of experiments encoded as NNs
- Evaluated empirically
- Two setups: adversarial intrinsic reward and pure thought experiments encoded as RNNs
- Adversarial intrinsic reward encourages experiments in differentiable environment
- Experiments aid discovery of goal states in sparse reward setting
- Pure thought experiments guided by information gain reward
Generating experiments in a differentiable environment
- Reinforcement learning (RL) usually involves exploration in an environment with non-differentiable dynamics
- RL methods such as policy gradients are used
- A fully differentiable environment is introduced to simplify investigation and focus on self-invented experiments
- Environment is a 2D force field with position and velocity states and real-valued force vectors as actions
- Negative reward of -0.1 for each time step and a large reward of 100 for reaching goal state
- Environment is deterministic and experiments are independent of each other
- Model M is a simple MLP with parameters w
- Intrinsic reward signal is non-differentiable
- Reward based on information gain
- Average runtime of experiments increases slightly over time
Pure rnn thought experiments
- Experiment setup uses feedforward NNs and a differentiable intrinsic reward function.
- Investigates thought experiments with no environment interactions, using RNNs without inputs and an intrinsic curiosity reward based on information gain.
Conclusion and future work
- Extended the neural Controller-Model (CM) framework with the notion of arbitrary self-invented computational experiments with binary outcomes
- Experiments are encoded as weight matrices of RNNs generated by the controller
- Model has to predict the outcome of an experiment based on its parameters
- Show that self-invented abstract experiments facilitate the discovery of rewarding goal states
- Over time, controller is forced to create longer experiments
- Second setup: controller generates pure abstract thought experiments in the form of RNNs
- Over time, newly generated experiments result in less intrinsic information gain reward
- Later experiments tend to have slightly longer runtime
- Scaling these methods to more complex environments is challenging
- Algorithm 2 summarizes the method described in Section 3.2
- Efficient approximation of the policy gradients for the controller is achieved through an actor-critic method
- Input to the LSTM history encoder is the sequence of the last 1000 experiments that have been executed
- Hyperparameters listed in Table 2