Alignment for Advanced ML Systems

Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Machine Intelligence Research Institute. 

Alignment in this context means making sure agents arrive at and optimize objective functions that are in the spirit of what was intended; that is, goals get reached without anyone getting hurt along the way. One of the key takeaways from this overview is that our solutions must scale with intelligence: for any new discovery, how long will it “hold” in lockstep with advances in intelligence?
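To make that scaling worry concrete for myself, here’s a tiny sketch (entirely my own toy, not something from the paper): a proxy objective that agrees with the intended objective almost everywhere, except on a narrow “loophole” that a weak optimizer almost never finds but a strong one reliably exploits. All of the numbers and function names are invented for illustration.

```python
# Toy sketch (my own, not from the paper): a proxy objective can "hold" under
# weak optimization and fail under strong optimization.
import random

def intended_utility(x):
    # What we actually want: x close to 3.
    return -(x - 3) ** 2

def proxy_reward(x):
    # What we wrote down: same shape as the intended objective, plus a narrow
    # "loophole" near x = 90 that scores very highly on the proxy but is
    # terrible for the intended goal.
    bonus = 1_000.0 if abs(x - 90) < 0.05 else 0.0
    return -(x - 3) ** 2 + bonus

def best_of_n(objective, n, low=-100.0, high=100.0):
    # Crude stand-in for optimization power: sample n points, keep the best.
    candidates = (random.uniform(low, high) for _ in range(n))
    return max(candidates, key=objective)

if __name__ == "__main__":
    random.seed(0)
    for n in (100, 10_000, 1_000_000):
        x = best_of_n(proxy_reward, n)
        print(f"optimization power n={n:>9,}: x={x:8.3f}, "
              f"proxy={proxy_reward(x):10.2f}, intended={intended_utility(x):10.2f}")
```

With little search power the proxy-optimizer lands near x = 3 and everything looks aligned; with enough search power it finds the loophole, and the intended utility collapses even as the proxy score soars.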

This report is divided into eight research topics:

  • Inductive ambiguity identification
  • Robust human imitation 
  • Informed oversight
  • Generalizable environmental goals 
  • Conservative concepts
  • Impact measures
  • Mild optimization
  • Averting instrumental incentives

I’m interested in robust human imitation, and in how we can expand this research direction to non-human agents and systems. I question the “trusted human” as a safety benchmark, and was happy to see Evans et al. cited, along with the assertion that “in reality humans are often irrational, ill-informed, incompetent, and immoral”. On a planet with countless species, shouldn’t we have some other benchmark behavioral systems on the advisory council?

In a similar vein, I’m interested in generalizable environmental goals, once again considering possibilities we may have overlooked due to our anthropocentric bias. This makes me think of sensory illusions and artifacts, and how these can change as sensors change. The paper suggests more elaborate sensor systems, and this is where I come back to what I’m dreaming up with ocotohand: thinking about the kinetic possibilities of eight spiraling appendages.
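As a note to self, here’s a minimal sketch of the sensor point (my own construction, with invented names like World and Sensor, not the paper’s formalism): a goal defined over raw sensor readings can be “satisfied” by an illusion, while the same goal defined over the environment itself cannot.

```python
# Toy contrast (my own): sensory goals vs. environmental goals, where the
# "illusion" is a spoofed sensor feed.
from dataclasses import dataclass

@dataclass
class World:
    trash_items: int          # actual state of the environment

@dataclass
class Sensor:
    spoofed: bool = False     # an "illusion": the feed no longer tracks the world

    def read_trash_count(self, world: World) -> int:
        # A spoofed or obstructed sensor reports whatever looks best.
        return 0 if self.spoofed else world.trash_items

def sensory_goal_satisfied(world: World, sensor: Sensor) -> bool:
    # Goal defined over percepts: "the camera shows no trash."
    return sensor.read_trash_count(world) == 0

def environmental_goal_satisfied(world: World) -> bool:
    # Goal defined over the environment itself: "there is no trash."
    return world.trash_items == 0

if __name__ == "__main__":
    world = World(trash_items=5)
    sensor = Sensor(spoofed=True)   # cover the lens instead of cleaning up
    print("sensory goal satisfied:      ", sensory_goal_satisfied(world, sensor))  # True
    print("environmental goal satisfied:", environmental_goal_satisfied(world))    # False
```

The asymmetry is the point: change the sensor (or fool it) and the sensory goal moves with it, while the environmental goal stays put.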

Also, so I don’t forget: I liked the reference to Everitt & Hutter (2016) and their research on data being used to guide the agent toward a utility function, rather than being measured directly for success.
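A hedged toy to pin that distinction down (my own construction, only loosely in the spirit of it; the two-hypothesis setup, action names, and likelihood rule are all invented for illustration): one agent maximizes the observed reward signal directly and so prefers tampering with it, while another treats the same data only as evidence about a latent utility function and then maximizes expected utility under its posterior.

```python
# Toy contrast (my own): reward data as the target vs. reward data as evidence
# about a latent utility function.
ACTIONS = ["clean_room", "tamper_with_sensor"]

# Two candidate utility functions the agent is uncertain between
# (all values assumed to lie in [0, 1]).
UTILITY_HYPOTHESES = {
    "wants_clean_room":  {"clean_room": 1.0, "tamper_with_sensor": 0.0},
    "wants_high_signal": {"clean_room": 0.2, "tamper_with_sensor": 1.0},
}

# The raw reward channel can be gamed: tampering makes the *signal* large.
OBSERVED_SIGNAL = {"clean_room": 0.9, "tamper_with_sensor": 1.0}

def posterior_from_data(prior, past_data):
    """Update belief over utility hypotheses from (action, signal) pairs."""
    post = dict(prior)
    for action, signal in past_data:
        for name, u in UTILITY_HYPOTHESES.items():
            # Crude likelihood: hypotheses whose utility for the taken action
            # is close to the observed signal gain weight.
            post[name] *= 1.0 - abs(u[action] - signal)
    z = sum(post.values())
    return {name: weight / z for name, weight in post.items()}

def signal_maximizer():
    # Treats the data as the success measure itself -> prefers tampering.
    return max(ACTIONS, key=lambda a: OBSERVED_SIGNAL[a])

def value_learner(past_data):
    # Uses the data only to infer the utility function, then maximizes
    # expected utility under the posterior.
    post = posterior_from_data({k: 0.5 for k in UTILITY_HYPOTHESES}, past_data)
    def expected_utility(action):
        return sum(p * UTILITY_HYPOTHESES[h][action] for h, p in post.items())
    return max(ACTIONS, key=expected_utility)

if __name__ == "__main__":
    data = [("clean_room", 0.9)]   # past feedback from the overseer
    print("signal maximizer picks:", signal_maximizer())   # tamper_with_sensor
    print("value learner picks:   ", value_learner(data))  # clean_room
```

The mechanics don’t matter much; the contrast does. Once the signal is merely evidence rather than the target, the incentive to tamper with it largely falls away.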