Closed captioning has always seemed to me to be a notoriously bad data set due to misspellings and misphrasing. Has anyone tried to do (a better) speech to text of a cable news channel, for instance?
Doing frequency and sentiment analysis on this dataset would be pretty interesting.
Closed captioning does have a lot of noise in it, but we've done a lot of work to tidy that up. We also have the benefit of capturing so much data that the noise doesn't matter as much.
In real time our systems extract and generate lots of information from the closed captions. Our NLP system identifies entities (e.g. White House, Chris Brown, Amanda Berry), does frequency counts, and builds some very large graphs of entity co-occurrences (which feed our statistical learning), e.g. Rihanna is commonly associated with Chris Brown.
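Roughly the shape of the co-occurrence part, as a toy Python sketch (not our production code; it assumes entity extraction has already happened, and the entity names and counts are made up):

```python
from collections import Counter
from itertools import combinations

# Toy caption windows with entities already extracted (hypothetical data).
caption_entities = [
    ["rihanna", "chris brown"],
    ["white house", "amanda berry"],
    ["rihanna", "chris brown", "grammys"],
    ["chris brown", "grammys"],
]

entity_counts = Counter()   # how often each entity appears
cooccurrence = Counter()    # how often each unordered pair appears together

for entities in caption_entities:
    entity_counts.update(set(entities))
    for a, b in combinations(sorted(set(entities)), 2):
        cooccurrence[(a, b)] += 1

# "chris brown" and "rihanna" co-occur most often in this toy data.
for pair, count in cooccurrence.most_common(3):
    print(pair, count)
```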
We do a bunch more analysis on our graphs, including Latent Semantic Indexing (LSI), which helps drill down into quantifying the relationships between entities. Related to that, we generate TF-IDF scores for all identified entities, which gives a sense of how "important" an entity is.
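A back-of-the-envelope version of that TF-IDF scoring, again just a sketch with invented data (one "document" per program here; the real weighting is more involved):

```python
import math
from collections import Counter

# Hypothetical per-program entity mentions (program -> entities seen).
programs = {
    "cnn_newsroom": ["obama", "white house", "obama", "syria"],
    "entertainment_tonight": ["rihanna", "chris brown", "rihanna"],
    "nightly_news": ["obama", "syria", "amanda berry"],
}

def tfidf(entity: str, program: str) -> float:
    counts = Counter(programs[program])
    tf = counts[entity] / sum(counts.values())
    docs_with_entity = sum(1 for ents in programs.values() if entity in ents)
    idf = math.log(len(programs) / (1 + docs_with_entity)) + 1.0
    return tf * idf

print(tfidf("rihanna", "entertainment_tonight"))  # distinctive -> higher score
print(tfidf("obama", "cnn_newsroom"))             # common across programs -> lower
```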
By combining our large scale entity graphs (both frequency and LSI) with streams of closed captions we also do real-time topic extraction at multiple time scales, e.g. what is the topic of conversation on CNN for the last minute, the last 5 minutes, the whole program, etc.
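The multi-time-scale idea can be illustrated with a toy sliding-window counter like this (hypothetical class and data, not the real system):

```python
import time
from collections import Counter, deque

class ChannelTrends:
    """Track entity mentions for one channel and report top entities per window."""

    def __init__(self, windows_seconds=(60, 300, 3600)):
        self.windows = windows_seconds
        self.mentions = deque()  # (timestamp, entity) pairs, oldest first

    def add(self, entity, ts=None):
        self.mentions.append((ts if ts is not None else time.time(), entity))

    def top(self, window_seconds, n=5, now=None):
        now = now if now is not None else time.time()
        # Drop mentions older than the largest window we care about.
        cutoff = now - max(self.windows)
        while self.mentions and self.mentions[0][0] < cutoff:
            self.mentions.popleft()
        counts = Counter(e for t, e in self.mentions if t >= now - window_seconds)
        return counts.most_common(n)

cnn = ChannelTrends()
now = time.time()
for offset, entity in [(-200, "obama"), (-200, "white house"),
                       (-40, "syria"), (-10, "syria")]:
    cnn.add(entity, ts=now + offset)

print(cnn.top(60, now=now))   # topic of the last minute
print(cnn.top(300, now=now))  # topic of the last 5 minutes
```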
What about STT and/or machine translation for foreign channels? It is really frustrating when foreign stations don't even have CC if you're trying to learn the language and need more spoken input.
While we could use STT (some of our team have backgrounds in it), we opted to use the cleanest existing signal, i.e. closed captions.
Part of the motivation for taking a statistical NLP approach is that it gives us more flexibility for processing foreign stations / languages (we don't yet do that).
I wonder whether you could time- and geo-shift closed captions, i.e. show closed captions in two languages at once on the same TV program? That could make an interesting language-learning tool and an interesting training set for machine translation.
Associating Rihanna to Chris Brown is pretty trivial. The amount of ink that correlates them is enormous; that doesn't seem like a very interesting example. I guess it depends on what you are going for.
That is a trivial example - it easily pops out of the data with a pure frequency co-occurrence measurement, but as I said, frequency is only a starting point. LSI and co generate far more interesting graphs that can be mined in many ways.
Unfortunately not. The information extraction aspect involves heavy-duty math, where the more computational resources you can throw at the problem the better. We have a bunch of servers where all the cores on the CPUs are running at max 24 hours a day. At the moment we're not using GPUs for our matrix math (the matrices tend to be very large). For that matrix math I'm a fan of the fast ojAlgo library: http://ojalgo.org
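(The comment above mentions ojAlgo, which is Java; purely as an illustration, here's the core LSI step - a truncated SVD over a tiny entity/program count matrix - in a few lines of NumPy. The real matrices are of course vastly larger and sparse, and this is not their code.)

```python
import numpy as np

# Toy entity x program count matrix (rows: entities, columns: programs).
entities = ["rihanna", "chris brown", "obama", "syria"]
counts = np.array([
    [3, 0, 1, 0],   # rihanna
    [2, 0, 1, 0],   # chris brown
    [0, 4, 0, 3],   # obama
    [0, 2, 0, 4],   # syria
], dtype=float)

# Truncated SVD: keep the top k singular values/vectors (the "LSI" space).
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
entity_vectors = U[:, :k] * s[:k]   # each row is an entity in the reduced space

def similarity(i, j):
    a, b = entity_vectors[i], entity_vectors[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "rihanna" and "chris brown" end up close together; "rihanna" and "obama" do not.
print(similarity(0, 1))
print(similarity(0, 2))
```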
(Boxfish dev here)
We do a lot of things with the raw subtitles to clean them up. We fix common misspellings using both a basic dictionary and statistical models. We also normalize the data between various sources.
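A minimal sketch of the dictionary-lookup piece of that cleanup (the corrections table and regexes here are made up, and the statistical side isn't shown):

```python
import re

# Hypothetical corrections table; in practice this would be built from
# a dictionary plus statistical models over the caption corpus.
CORRECTIONS = {
    "whitehouse": "white house",
    "presdient": "president",
    "rihana": "rihanna",
}

def clean_caption(raw: str) -> str:
    text = raw.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # strip caption control junk
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    words = [CORRECTIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(clean_caption(">> THE PRESDIENT SPOKE AT THE WHITEHOUSE TODAY"))
# -> "the president spoke at the white house today"
```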
One of the original incarnations of Google Video was somewhat similar to this (an index of closed-captioning data from a lot of different TV streams). What they chose to do with it was different, though: they allowed you to search closed-captioned content and it would show you a few thumbnails and the time of day when those words were said on air.
My memory is kind of hazy; ISTR it's from 2005 or so.
If we're waxing lyrical, there was a research paper out of Ireland from a small telecoms research outfit, circa 1996, describing pretty much this, but using teletext subtitles.
In addition to capturing and indexing the subtitles, it also captured the video, and so allowed the captions to be used as an index to the video.
I don't doubt someone can come up with an even earlier incarnation!
I personally think there's a great deal that can be done with this data.
A few years ago, someone documented how to use an Arduino + Video Experimenter Shield to easily log closed captioning data (http://blog.makezine.com/2011/08/16/enough-already-the-ardui...). Never got around to messing with it, but I can imagine 100 interesting things to do with that data.
Seems like scraping all closed captioning would be very valuable data indeed. Is there anyone else doing something like this that provides an API or data feed?
We're providing API access to some select partners for data experiments. The API is centered around trending, topics, search and metrics but does include some restricted transcript access. Ping kevin at boxfish (me) if you've got some ideas.
We use TVEyes.com in our news monitoring product and it's very close to real-time. It does make mistakes sometimes, but I'm not aware of any perfect transcription software.
That makes sense. Would it be legal to capture the data and present it similar to a search engine? I'm guessing there is some sort of precedent for that sort of thing?
Boxfish, Twitter, YouTube, Siri, and now Ray Kurzweil @ Google... thinkers are converging on doing for every other form of content what Google did for structured documents.
The NLP trend is going to be amusing to watch at least (Siri, Summly), and whether or not its time has come in the next 5 years, I'm not certain. But I know Ray Kurzweil knows this technology is inevitable.
--
As for Boxfish, I think this is a good example of a neatly executed, well-funded startup with experienced founders and a solid space. No drama, no demo day, no immediate fires to put out, a cool $3m in the bank, a Deutsche Telekom AG subsidiary negotiating their deals for them, and "Yahoo just bought a kids startup for 17m" - the topic is hotter than others.
This is the type of startup I, for one, daydream of holding stock in or working at. It has high potential to be worth $mmms or $bn in the future - you know, that all depends and what not. But the makings are clearly there. Excellent work, guys! Congratulations.
That's a really interesting question. Hopefully by opening up the API we'll enable more people to ask and answer those questions. How does social impact upon TV? How does TV impact upon social?
BTW we do trend identification across genres, channels and broader categories. I've often wondered what insights can be gained by looking at entities / things that are trending, but not as strongly as leading news or sports events. Or by looking at the rate of change of trends, i.e. identifying slowly emerging trends?
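To make the rate-of-change idea concrete, here's one hypothetical way to score "slowly emerging" entities against already-huge ones (toy numbers, not our trending math):

```python
# Hypothetical mention counts per hour for a few entities (oldest -> newest).
history = {
    "super bowl":   [900, 950, 940, 960],   # huge but flat
    "amanda berry": [2, 3, 10, 40],         # small but accelerating
    "weather":      [50, 48, 52, 49],       # steady background
}

def emergence_score(counts, current_window=1):
    """Ratio of the latest window to the average of the preceding baseline."""
    baseline = counts[:-current_window]
    current = sum(counts[-current_window:]) / current_window
    avg_baseline = sum(baseline) / len(baseline)
    return current / (avg_baseline + 1.0)   # +1 smooths tiny baselines

ranked = sorted(history, key=lambda e: emergence_score(history[e]), reverse=True)
print(ranked)  # "amanda berry" outranks "super bowl" despite far fewer mentions
```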
While the majority of TV is pre-recorded or repeated content (think of all the repeats of The Simpsons, Real Housewives of X, etc.), we know whether a show is recorded or live and the broad categories that a given show falls into. We also break our trending calculations into different groups (News, Sports, etc.) and treat the data differently (as seen in the apps).
Also, bear in mind that while a show might be pre-recorded, it can still yield useful data. For instance, The Colbert Report and The O'Reilly Factor are usually recorded shows, yet they can talk about drastically different things from show to show, and even between segments within a show.
I grant that useful trends are more difficult to extract from sitcoms and other things like that, but just because a show isn't live doesn't mean that no useful trending information can be extracted.
We look at trending data over various periods of time, from a minute in length to much longer, so we can gather sentence-level, show-level, series-level and even channel-level topics.
You might be able to use indexing to show potential bias. If you have access to the data, how often do the major news networks use the word "Obama" versus "President" over the course of the day? Say... Fox News, CNN, CNBC, MSNBC.
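That count would be easy to run once you had the transcripts; something like this toy sketch (the captions and networks here are invented, and the hard part is really the data access):

```python
from collections import defaultdict

# Hypothetical (network, caption) pairs pulled from a day's transcripts.
captions = [
    ("fox news", "obama said today that ..."),
    ("fox news", "the president will travel to ..."),
    ("cnn", "president obama announced ..."),
    ("msnbc", "the president signed ..."),
]

terms = ("obama", "president")
counts = defaultdict(lambda: {t: 0 for t in terms})

for network, text in captions:
    for term in terms:
        counts[network][term] += text.lower().count(term)

for network, c in counts.items():
    print(network, c)
```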
We actually had a very interesting page up for this in the run up to the 2012 election, comparing and graphing mentions of Obama and Romney on each network along with sentiment analysis of what they said about them.
The page has since been taken down but here are two of our blog posts about the analysis.
I have no tablet, so I cannot try it. Is the data archived? Could I, say, go back and search for "topic X" that might have been in "newsmagazine program Y" in the past month?
Reminds me a little of Bluefin Labs (acquired by Twitter). Just hook this data up with a Twitter sentiment engine and you could come up with some interesting correlations for how people react to television.
With the new Federal regulations stipulating that anything that originates on TV must be captioned when streamed over the internet, Boxfish will be able to get a fairly comprehensive picture of what's going on.
We currently only have US TV channels. We've experimented with others, but given the limited resources of a startup we haven't had the time to expand. We're definitely interested in it, though.