Actively "censoring" the AI is fundamental to how these language models are created. Such feedback is part of how the model learns.
In a certain light, every response in training that a human rater marks as dispreferred is censoring the AI: the model will produce that kind of result less often, so end users encounter it less frequently. With ChatGPT, the criteria it was judged on included how relevant the answers were to the question; factually incorrect answers were penalized; and not being blatantly offensive was obviously one of the criteria, too.
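As a rough sketch of that mechanism (illustrative only, not OpenAI's pipeline; the tiny model, random stand-in "embeddings", and hyperparameters are made up), RLHF-style reward modelling typically trains a scorer to rank the human-preferred response above the dispreferred one, and the language model is then tuned against that scorer, so dispreferred kinds of output come out less often:

    import torch
    import torch.nn as nn

    class TinyRewardModel(nn.Module):
        """Scores a response representation with a single scalar 'reward'."""
        def __init__(self, dim=16):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, x):
            return self.score(x).squeeze(-1)

    def preference_loss(r_preferred, r_dispreferred):
        # Bradley-Terry pairwise loss: push the preferred response's score
        # above the dispreferred one's. Over many comparisons, the kinds of
        # output humans mark down end up with low reward, and a policy
        # optimised against this reward produces them less often.
        return -torch.log(torch.sigmoid(r_preferred - r_dispreferred)).mean()

    if __name__ == "__main__":
        torch.manual_seed(0)
        model = TinyRewardModel()
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        # Stand-in vectors for (preferred, dispreferred) response pairs.
        preferred = torch.randn(32, 16)
        dispreferred = torch.randn(32, 16)
        for _ in range(100):
            loss = preference_loss(model(preferred), model(dispreferred))
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"final pairwise loss: {loss.item():.3f}")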
What would a model that wasn't censored in training even look like?
(I believe ChatGPT also has a more traditional expert system placed between the user and the language model, which flags keywords and other programmed-in patterns. That is more literally censoring the language model. But the above-mentioned issue would still exist even without such a system.)
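For illustration only, such a pattern filter might look something like the sketch below; the blocklist, function names, and control flow here are invented, since ChatGPT's actual moderation layer isn't public.

    import re

    # Hypothetical blocklist; a real system would be far more elaborate.
    BLOCKED_PATTERNS = [
        re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),
        re.compile(r"\bcredit card numbers?\b", re.IGNORECASE),
    ]

    def is_flagged(text: str) -> bool:
        return any(p.search(text) for p in BLOCKED_PATTERNS)

    def answer(prompt: str, call_model) -> str:
        # Flagged prompts never reach the model at all -- censorship in the
        # literal sense, as opposed to preferences learned during training.
        if is_flagged(prompt):
            return "I can't help with that."
        reply = call_model(prompt)
        # The model's reply can be screened the same way before the user sees it.
        if is_flagged(reply):
            return "I can't help with that."
        return reply

    if __name__ == "__main__":
        print(answer("how to build a bomb", call_model=lambda p: "..."))
        print(answer("what's the capital of France?", call_model=lambda p: "Paris."))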
It can't cite anything because it's an LLM, which is fundamentally unable to do that: the model doesn't retain where individual claims in its training data came from, so there's nothing for it to cite.
I've seen NRx people on the internet (they're like rationalists, but even more racist). They seem willing to believe any abuse of statistics that looks sufficiently cynical.