To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. We gave the trainers access to model-written suggestions to help them compose their responses. We trained this model using Reinforcement Learning from Human Feedback (RLHF), following the same methods as…
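
The excerpt does not say how ranked responses become a training signal for the reward model. One common approach, assumed here rather than taken from the post, is to split each ranking into (preferred, rejected) pairs and minimize a pairwise logistic (Bradley-Terry style) loss on the score difference. The sketch below uses only the Python standard library; `ranking_to_pairs`, `pairwise_loss`, and the hard-coded scores are hypothetical stand-ins for a real reward model and its outputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    `ranked` is a list of response ids ordered from best to worst, so every
    earlier item is preferred over every later one.
    """
    return list(combinations(ranked, 2))


def pairwise_loss(scores, pairs):
    """Mean of -log(sigmoid(score[preferred] - score[rejected])) over pairs.

    `scores` maps a response id to a scalar reward (illustrative values here;
    in practice these would come from the reward model being trained).
    """
    total = 0.0
    for preferred, rejected in pairs:
        margin = scores[preferred] - scores[rejected]
        # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
        # so that exp() never receives a large positive argument.
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(pairs)


# Hypothetical example: a labeler ranked three responses a > b > c.
pairs = ranking_to_pairs(["a", "b", "c"])
agreeing_scores = {"a": 2.0, "b": 0.5, "c": -1.0}     # matches the ranking
disagreeing_scores = {"a": -1.0, "b": 0.5, "c": 2.0}  # reverses the ranking
loss_agree = pairwise_loss(agreeing_scores, pairs)
loss_disagree = pairwise_loss(disagreeing_scores, pairs)
```

Under this assumed loss, scores that agree with the human ranking yield a lower value than scores that reverse it, which is the property a reward model trained on comparison data would need.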