If you got N people (say N=10) to classify different segments of the script, you'd find that they mostly agree about how to classify them, but not perfectly. You can get closer to a "gold truth" if you sit people down together to discuss the difficult cases.
Any given classifier is going to be like one individual: if it is any good, it will mostly agree with the gold truth, but sometimes it won't. It's also true that some classifications will be ambiguous, since a given segment of the script may have some characteristics of one class and some of another, or just might not fit cleanly into the schema.
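To make that concrete, here's a minimal sketch (with made-up labels, not data from the article) of turning N annotators into a majority-vote "gold truth" and flagging how much they agreed on each segment:

```python
from collections import Counter

# Hypothetical annotations (not from the article): each row is one script
# segment, each column is one of N=4 annotators' labels.
annotations = [
    ["incest", "incest", "violence", "incest"],
    ["violence", "violence", "violence", "violence"],
    ["other", "incest", "other", "other"],
]

def majority_label(labels):
    """Most common label wins; low agreement flags the ambiguous cases."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

for segment in annotations:
    label, agreement = majority_label(segment)
    print(label, f"({agreement:.0%} agreement)")
```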
A gold truth set is helpful for the process of testing a number of different models across a range of parameters and deciding what works best. A classifier that is calibrated (returns a probability of class membership) can skip cases where it knows it doesn't know what it's talking about. In the financial world, a calibrated model plus a Kelly bettor can make money trading; an uncalibrated model will almost always lose money.
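For the Kelly point, the standard binary formula is f* = p - (1 - p)/b, where p is the model's calibrated probability of winning and b is the net odds; a toy sketch with illustrative numbers:

```python
def kelly_fraction(p: float, b: float) -> float:
    """Fraction of bankroll to stake: f* = p - (1 - p) / b,
    where p is the calibrated win probability and b the net odds."""
    return p - (1 - p) / b

# A well-calibrated 60% edge at even odds stakes 20% of the bankroll.
# A model that *claims* 60% but is really 50/50 stakes the same 20%
# on bets with zero expected value, which is how it bleeds money.
print(kelly_fraction(0.60, 1.0))  # 0.2
```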
It's not quite "random" to look at the most intentionally directed legislation of several years. What's your explanation? Genuinely curious. One way or another, the clusters do exist, and the trends exist in both content and titling convention.
My leading guess would be that Pornhub made some technical change to what titles were allowed or promoted, for who-knows-what reason. One possible guess would be that they simply started to promote titles with more descriptors, or more uncommon descriptors, in an attempt to get an easy boost to search specificity.
The timing is wrong for SESTA/FOSTA, and if SESTA/FOSTA was the reason for Pornhub making a change, even in anticipation, then it seems strange for Pornhub to intentionally make a change that would tend to emphasize titles that would increase political heat.
[On edit: ... and as I said, the "professionalization" hypothesis might also have legs as something that happened in response to an ID crackdown... but that wouldn't have to be related to SESTA/FOSTA, and would have had to happen before passage.]
Definitely plausible but it underrates the changes in actual content. It's not just SEO and titling, it's actual videos that have "stepsis" etc. as themes.
SEO contributes for sure, but I would reverse the statement here: it's more about new content and less about SEO. There's a feedback-loop, race-to-the-bottom dynamic regardless.
So I'm actually kind of lost here. As I understand it, your theory is that the anticipation of SESTA/FOSTA caused there to be more professional porn and less amateur porn on Pornhub, which in turn caused a "race to the bottom" in porn titles, along both rough/violent/rapey lines and incest lines at the same time, and furthermore that the titles reflected the actual content?
So you believe SESTA/FOSTA led to an unintended increase in the actual violence you'd see if you watched random videos on Pornhub?
I don't think much of SESTA/FOSTA, but I do think that's kind of a stretch.
I did switch that—it's not clear if those three words are the only 3 words defining the category. I suspect not, however, because:
1. It then wouldn't include the word "rough", which is far more common and indicative of sexual violence.
2. Elsewhere on the page, the author includes "stepsis" as indicative of incest:
> Later titles are longer, and we start to observe a trend towards both incest ("Daughter", "Stepsis") and violence ("HARD FUCKING", "Fucked ROUGH", "Rough Fuck").
That last quote makes me think that the categories are larger than the 3 examples given, and "sexual violence" includes both the incest and violence terms.
I am the author, it's just those 3 terms for the tSNE cluster. Sorry, I can tell from some of the comments here the graphs need to be clearer. "Stepsis" is indicative of incest IMO, the "step" is a fig leaf.
I agree “stepsis” is indicative of incest, but I don’t agree “stepsis” is indicative of violence. And if you’re only using 3 words for “sexual violence”, then why did you go with “incest” instead of “rough”? Those are vastly different kinds of pornography.
Ha, if steps have consistent distances you could take the average distance at step X, generate a step of that length in some direction, and be approximately correct regardless of the actual value.
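Roughly this, as a toy sketch (random stand-in vectors, not real step embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the embeddings of consecutive reasoning steps.
steps = rng.normal(size=(10, 3072))

# Average distance between consecutive steps.
avg_len = np.linalg.norm(np.diff(steps, axis=0), axis=1).mean()

# "Generate a step of that length in some direction."
direction = rng.normal(size=3072)
direction /= np.linalg.norm(direction)
predicted_next = steps[-1] + avg_len * direction
```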
Well, you are certainly correct about how cosine sim would apply to the text embeddings, but I disagree about how useful that application is to our understanding of the model.
> In this case, cosine distance one would be in a case when it repeats word-by-word. It is not even a "similar thought" but some sort of LLM's OCD.
Observing that would be helpful in our understanding of the model!
> For anything else... cosine similarity says little. Sometimes, two steps can reach opposite conclusions but still have very high cosine similarity. In another case, it can just expand on the same solution but use different vocabulary or look from another angle.
Yes, that would be good to observe also! But here I think you undervalue the specificity of the OAI embeddings model, which has 3072 dimensions. That's quite a lot of information being captured.
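For what it's worth, here's a sketch of the kind of comparison under discussion, assuming the openai Python client and text-embedding-3-large (the 3072-dimension model); the step texts are placeholders:

```python
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# Placeholder reasoning steps; in practice these come from the model's trace.
steps = [
    "Let x be the number of apples, so 3x + 2 = 14.",
    "Solving 3x + 2 = 14 gives x = 4.",
    "Therefore there are 4 apples.",
]

resp = client.embeddings.create(model="text-embedding-3-large", input=steps)
vecs = np.array([d.embedding for d in resp.data])  # shape (3, 3072)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between consecutive steps: near 1.0 suggests repetition,
# lower values suggest the step actually moved somewhere new.
for a, b in zip(vecs, vecs[1:]):
    print(cosine(a, b))
```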
> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion (e.g. "grade insight in each step, from 1 to 5").
Totally disagree here, using embeddings is much more reliable / robust, I wouldn't put much stock in LLM output, too much going on
The distance between "dairy creamer" and "non-dairy creamer" is too small. So an embedding for one will rank high for the other as well, even though they mean precisely opposite things. For example, the embedding for "dairy free creamer" will result in a low distance from both of the concepts such that you cannot really apply a reasonable threshold.
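This is easy to check against whatever embedding model you're using; a quick sketch (the actual numbers will depend entirely on the model, I'm not claiming specific values):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
phrases = ["dairy creamer", "non-dairy creamer", "dairy free creamer"]
resp = client.embeddings.create(model="text-embedding-3-small", input=phrases)
v = np.array([d.embedding for d in resp.data])
v /= np.linalg.norm(v, axis=1, keepdims=True)

sims = v @ v.T  # pairwise cosine similarities
print(sims)     # if all three are ~equally close, no threshold separates them
```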
But in a larger frame, of "things tightly associated with coffee", they mean something extremely close. Whether these things are opposite from each other, or virtually identical, is a function of your point of view; or, in this context, the generally-meaningful level of discourse.
At scale, I expect having dairy vs non-dairy distance be very small is the more accurate representation of intent.
Of course, I also expect them to be very close, and that's the problem with relying purely on embeddings and distance: in this case, the two things mean entirely opposite preferences on the same topic.
(I think this may be why we sometimes see AI-generated search overviews give certain types of really bad answers: the underlying embedding search is returning "semantically similar" results.)
> Totally disagree here, using embeddings is much more reliable / robust, I wouldn't put much stock in LLM output, too much going on
I think either approach can be the preferable option, depending on how well the embedding space represents the text, and that is mostly dependent on the specific use case and model combination.
So if the embedding space does not correctly capture the required nuance, then it's often a viable option to get the top_n results and do the rest with an LLM plus validation calls (see the sketch below).
But I do agree with you, I would always rather work with embeddings than some LLM output. It would be such a great thing to have a rock-solid embedding space where you wouldn't even consider looking at token-predictor models.
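For reference, a rough sketch of that two-stage pattern, assuming the openai client; the model names and the yes/no prompt are just placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    resp = client.embeddings.create(model=model, input=texts)
    v = np.array([d.embedding for d in resp.data])
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def search(query, docs, top_n=5):
    """Stage 1: cheap embedding retrieval to get candidate documents."""
    q, d = embed([query]), embed(docs)
    idx = np.argsort(-(d @ q[0]))[:top_n]
    return [docs[i] for i in idx]

def validate(query, candidate):
    """Stage 2: an LLM call to catch nuance the embedding space misses
    (e.g. 'dairy' vs 'non-dairy')."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Does this document actually satisfy the query?\n"
                       f"Query: {query}\nDocument: {candidate}\nAnswer yes or no.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```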