Closed captioning has always seemed to me to be a notoriously bad data set due to misspellings and misphrasing. Has anyone tried to do (a better) speech to text of a cable news channel, for instance?
Doing frequency and sentiment analysis on this dataset would be pretty interesting.
Closed captioning does have a lot of noise in it, but we've done a lot of work to tidy that up. We also have the benefit of capturing so much data that the noise doesn't matter as much.
In real time our systems extract and generate lots of information from the closed captions. Our NLP system identifies entities (e.g. White House, Chris Brown, Amanda Berry), does frequency counts, and builds some very large graphs of entity co-occurrences (which feed our statistical learning), e.g. Rihanna is commonly associated with Chris Brown.
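Roughly the shape of the co-occurrence part, as a toy Python sketch (not our production code; it assumes entity extraction has already happened, and the entity names and counts are made up):

```python
from collections import Counter
from itertools import combinations

# Toy caption windows with entities already extracted (hypothetical data).
caption_entities = [
    ["rihanna", "chris brown"],
    ["white house", "amanda berry"],
    ["rihanna", "chris brown", "grammys"],
    ["chris brown", "grammys"],
]

entity_counts = Counter()   # how often each entity appears
cooccurrence = Counter()    # how often each unordered pair appears together

for entities in caption_entities:
    entity_counts.update(set(entities))
    for a, b in combinations(sorted(set(entities)), 2):
        cooccurrence[(a, b)] += 1

# "chris brown" and "rihanna" co-occur most often in this toy data.
for pair, count in cooccurrence.most_common(3):
    print(pair, count)
```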
We do a bunch more analysis on our graphs, including Latent Semantic Indexing (LSI), which helps drill down into quantifying the relationships between entities. Related to that, we generate TF-IDF scores for all identified entities, which gives a sense of how "important" an entity is.
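A back-of-the-envelope version of that TF-IDF scoring, again just a sketch with invented data (one "document" per program here; the real weighting is more involved):

```python
import math
from collections import Counter

# Hypothetical per-program entity mentions (program -> entities seen).
programs = {
    "cnn_newsroom": ["obama", "white house", "obama", "syria"],
    "entertainment_tonight": ["rihanna", "chris brown", "rihanna"],
    "nightly_news": ["obama", "syria", "amanda berry"],
}

def tfidf(entity: str, program: str) -> float:
    counts = Counter(programs[program])
    tf = counts[entity] / sum(counts.values())
    docs_with_entity = sum(1 for ents in programs.values() if entity in ents)
    idf = math.log(len(programs) / (1 + docs_with_entity)) + 1.0
    return tf * idf

print(tfidf("rihanna", "entertainment_tonight"))  # distinctive -> higher score
print(tfidf("obama", "cnn_newsroom"))             # common across programs -> lower
```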
By combining our large scale entity graphs (both frequency and LSI) with streams of closed captions we also do real-time topic extraction at multiple time scales, e.g. what is the topic of conversation on CNN for the last minute, the last 5 minutes, the whole program, etc.
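The multi-time-scale idea can be illustrated with a toy sliding-window counter like this (hypothetical class and data, not the real system):

```python
import time
from collections import Counter, deque

class ChannelTrends:
    """Track entity mentions for one channel and report top entities per window."""

    def __init__(self, windows_seconds=(60, 300, 3600)):
        self.windows = windows_seconds
        self.mentions = deque()  # (timestamp, entity) pairs, oldest first

    def add(self, entity, ts=None):
        self.mentions.append((ts if ts is not None else time.time(), entity))

    def top(self, window_seconds, n=5, now=None):
        now = now if now is not None else time.time()
        # Drop mentions older than the largest window we care about.
        cutoff = now - max(self.windows)
        while self.mentions and self.mentions[0][0] < cutoff:
            self.mentions.popleft()
        counts = Counter(e for t, e in self.mentions if t >= now - window_seconds)
        return counts.most_common(n)

cnn = ChannelTrends()
now = time.time()
for offset, entity in [(-200, "obama"), (-200, "white house"),
                       (-40, "syria"), (-10, "syria")]:
    cnn.add(entity, ts=now + offset)

print(cnn.top(60, now=now))   # topic of the last minute
print(cnn.top(300, now=now))  # topic of the last 5 minutes
```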
What about STT and/or machine translation for foreign channels? It is really frustrating when foreign stations don't even have CC if you're trying to learn the language and need more spoken input.
While we could use STT (some of our team have backgrounds in it), we opted to use the cleanest existing signal, i.e. closed captions.
Part of the motivation for taking a statistical NLP approach is that it gives us more flexibility for processing foreign stations / languages (we don't yet do that).
I wonder whether you could time- and geo-shift closed captions, i.e. show closed captions in two languages at once on the same TV program? That could make an interesting language-learning tool and an interesting training set for machine translation.
Associating Rihanna to Chris Brown is pretty trivial. The amount of ink that correlates them is enormous; that doesn't seem like a very interesting example. I guess it depends on what you are going for.
That is a trivial example - it easily pops out of the data with a pure frequency co-occurrence measurement, but as I said, frequency is only a starting point. LSI and co generate far more interesting graphs that can be mined in many ways.
Unfortunately not. The information extraction aspect involves heavy-duty math, where the more computational resources you can throw at the problem the better. We have a bunch of servers where all the cores on the CPUs are running at max 24 hours a day. At the moment we're not using GPUs for our matrix math (the matrices tend to be very large). For that matrix math I'm a fan of the fast ojAlgo library: http://ojalgo.org
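(The comment above mentions ojAlgo, which is Java; purely as an illustration, here's the core LSI step - a truncated SVD over a tiny entity/program count matrix - in a few lines of NumPy. The real matrices are of course vastly larger and sparse, and this is not their code.)

```python
import numpy as np

# Toy entity x program count matrix (rows: entities, columns: programs).
entities = ["rihanna", "chris brown", "obama", "syria"]
counts = np.array([
    [3, 0, 1, 0],   # rihanna
    [2, 0, 1, 0],   # chris brown
    [0, 4, 0, 3],   # obama
    [0, 2, 0, 4],   # syria
], dtype=float)

# Truncated SVD: keep the top k singular values/vectors (the "LSI" space).
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
entity_vectors = U[:, :k] * s[:k]   # each row is an entity in the reduced space

def similarity(i, j):
    a, b = entity_vectors[i], entity_vectors[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "rihanna" and "chris brown" end up close together; "rihanna" and "obama" do not.
print(similarity(0, 1))
print(similarity(0, 2))
```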
(Boxfish dev here)
We do a lot of things with the raw subtitles to clean them up. We fix common misspellings using both a basic dictionary and statistical models. We also normalize the data between various sources.
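A minimal sketch of the dictionary-lookup piece of that cleanup (the corrections table and regexes here are made up, and the statistical side isn't shown):

```python
import re

# Hypothetical corrections table; in practice this would be built from
# a dictionary plus statistical models over the caption corpus.
CORRECTIONS = {
    "whitehouse": "white house",
    "presdient": "president",
    "rihana": "rihanna",
}

def clean_caption(raw: str) -> str:
    text = raw.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # strip caption control junk
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    words = [CORRECTIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(clean_caption(">> THE PRESDIENT SPOKE AT THE WHITEHOUSE TODAY"))
# -> "the president spoke at the white house today"
```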
One of the original incarnations of Google Video was somewhat similar to this (an index of closed-captioning data from a lot of different TV streams). What they chose to do with it was different, though: they allowed you to search closed-captioned content and it would show you a few thumbnails and the time of day when those words were said on air.
My memory is kind of hazy; ISTR it's from 2005 or so.
If we're waxing lyrical, there was a research paper out of Ireland from a small telecoms research outfit, circa 1996, describing pretty much this, but using teletext subtitles.
In addition to capturing and indexing the subtitles, it also captured the video, and so allowed the captions to be used as an index to the video.
I don't doubt someone can come up with an even earlier incarnation!
I personally think there's a great deal that can be done with this data.
A few years ago, someone documented how to use an Arduino + Video Experimenter Shield to easily log closed captioning data (http://blog.makezine.com/2011/08/16/enough-already-the-ardui...). Never got around to messing with it, but I can imagine 100 interesting things to do with that data.
Seems like scraping all closed captioning would be very valuable data indeed. Is there anyone else doing something like this that provides an API or data feed?
We're providing API access to some select partners for data experiments. The API is centered around trending, topics, search and metrics but does include some restricted transcript access. Ping kevin at boxfish (me) if you've got some ideas.
We use TVEyes.com in our news monitoring product and it's very close to real-time. It does make mistakes sometimes, but I'm not aware of any perfect transcription software.
That makes sense. Would it be legal to capture the data and present it similar to a search engine? I'm guessing there is some sort of precedent for that sort of thing?
Boxfish, Twitter, YouTube, Siri, and now Ray Kurzweil @ Google... thinkers are converging on doing for every other form of content what Google did for structured documents.
The NLP trend is going to be amusing to watch at least (Siri, Summly), and whether or not its time has come in the next 5 years, I'm not certain. But I know Ray Kurzweil knows this technology is inevitable.
--
As for Boxfish, I think this is a good example of a neatly executed, well-funded startup with experienced founders and a solid space. No drama, no demo day, no immediate fires to put out, a cool $3m in the bank, a Deutsche Telekom AG subsidiary negotiating their deals for them, and "Yahoo just bought a kids startup for 17m" - the topic is hotter than others.
This is the type of startup I, for one, daydream of holding stock in or working at. It has high potential to be worth $mmms or $bn in the future - you know, that all depends and what not. But the makings are clearly there. Excellent work, guys! Congratulations.
That's a really interesting question. Hopefully by opening up the API we'll enable more people to ask and answer those questions. How does social impact upon TV? How does TV impact upon social?
BTW we do trend identification across genres, channels and broader categories. I've often wondered what insights can be gained by looking at entities / things that are trending, but not as strongly as leading news or sports events. Or by looking at the rate of change of trends, i.e. identifying slowly emerging trends?
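To make the rate-of-change idea concrete, here's one hypothetical way to score "slowly emerging" entities against already-huge ones (toy numbers, not our trending math):

```python
# Hypothetical mention counts per hour for a few entities (oldest -> newest).
history = {
    "super bowl":   [900, 950, 940, 960],   # huge but flat
    "amanda berry": [2, 3, 10, 40],         # small but accelerating
    "weather":      [50, 48, 52, 49],       # steady background
}

def emergence_score(counts, current_window=1):
    """Ratio of the latest window to the average of the preceding baseline."""
    baseline = counts[:-current_window]
    current = sum(counts[-current_window:]) / current_window
    avg_baseline = sum(baseline) / len(baseline)
    return current / (avg_baseline + 1.0)   # +1 smooths tiny baselines

ranked = sorted(history, key=lambda e: emergence_score(history[e]), reverse=True)
print(ranked)  # "amanda berry" outranks "super bowl" despite far fewer mentions
```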
While the majority of TV is pre-recorded or repeated content (think of all the repeats of The Simpsons, Real Housewives of X, etc.), we know whether a show is recorded or live and the broad categories that a given show falls into. We also break our trending calculations into different groups (News, Sports, etc.) and treat the data differently (as seen in the apps).
Also, bear in mind that while a show might be pre-recorded, it can still yield useful data. For instance, The Colbert Report and The O'Reilly Factor are usually recorded shows, yet they can talk about drastically different things from show to show, and even between segments within a show.
I grant that useful trends are more difficult to extract from sitcoms and other things like that, but just because a show isn't live doesn't mean that no useful trending information can be extracted.
We look at trending data over various periods of time, from a minute in length to much longer, so we can gather sentence-level, show-level, series-level and even channel-level topics.
You might be able to use indexing to show potential bias. If you have access to the data, how often do the major news networks use the word "Obama" versus "President" over the course of the day? Say... Fox News, CNN, CNBC, MSNBC.
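That count would be easy to run once you had the transcripts; something like this toy sketch (the captions and networks here are invented, and the hard part is really the data access):

```python
from collections import defaultdict

# Hypothetical (network, caption) pairs pulled from a day's transcripts.
captions = [
    ("fox news", "obama said today that ..."),
    ("fox news", "the president will travel to ..."),
    ("cnn", "president obama announced ..."),
    ("msnbc", "the president signed ..."),
]

terms = ("obama", "president")
counts = defaultdict(lambda: {t: 0 for t in terms})

for network, text in captions:
    for term in terms:
        counts[network][term] += text.lower().count(term)

for network, c in counts.items():
    print(network, c)
```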
We actually had a very interesting page up for this in the run up to the 2012 election, comparing and graphing mentions of Obama and Romney on each network along with sentiment analysis of what they said about them.
The page has since been taken down but here are two of our blog posts about the analysis.
I have no tablet, so I cannot try it. Is the data archived? Could I, say, go back and search for "topic X" that might have been in "newsmagazine program Y" in the past month?
Reminds me a little of Bluefin Labs (acquired by Twitter). Just hook this data up with a Twitter sentiment engine and you could come up with some interesting correlations for how people react to television.
With the new Federal regulations stipulating that anything that originates on TV must be captioned when streamed over the internet, Boxfish will be able to get a fairly comprehensive picture of what's going on.
We currently only have US TV channels. We've experimented with others, but given the limited resources of a startup we haven't had the time to expand. We're definitely interested in it, though.