Mapping the demographics of language

I’ve been looking lately at using NLP and a couple of public resources (including the Twitter and Google Charts APIs) to map the demographics of language use across the US (broken up by age, gender and geography). I’m using language identification first to check that the text is in English, then generating the demographic data by disambiguating the geography and distributing each word over a gender and age likelihood.
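To make the "distributing each word over a gender and age likelihood" step concrete, here is a minimal sketch in Python. It is not the code behind the system. It assumes each tweet already carries a disambiguated US state plus gender and age probabilities from some upstream step; the dictionary keys, the age buckets, and the function names are all hypothetical.

```python
from collections import defaultdict


def distribute_words(tweets):
    """Accumulate fractional demographic counts for every word.

    Each tweet is a dict with hypothetical keys (assumptions, not the
    original system's schema):
      "text":   tweet text, assumed already identified as English
      "state":  US state code from geographic disambiguation
      "gender": {"female": p, "male": 1 - p}
      "age":    {age_bucket: probability}, summing to 1
    """
    profiles = defaultdict(lambda: {
        "gender": defaultdict(float),
        "age": defaultdict(float),
        "state": defaultdict(float),
        "total": 0.0,
    })
    for tweet in tweets:
        # Count each distinct word once per tweet.
        for word in set(tweet["text"].lower().split()):
            profile = profiles[word]
            profile["total"] += 1.0
            profile["state"][tweet["state"]] += 1.0
            # Spread the tweet's weight over its gender and age likelihoods.
            for gender, prob in tweet["gender"].items():
                profile["gender"][gender] += prob
            for bucket, prob in tweet["age"].items():
                profile["age"][bucket] += prob
    return profiles


def normalize(counts):
    """Turn accumulated counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}
```

Calling `normalize(profiles[word]["gender"])` then gives the estimated gender split for a word, and the same call on `"age"` or `"state"` gives the other two breakdowns. Fractional counts are one simple way to avoid forcing each tweet into a single hard demographic label.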
The full working system can be found here (try it out!). Pretty interesting results so far – it works on just about any word that’s used often enough to have a good sample, but I’m finding the best cases to be on trending topics (girls 12-24 seem to love Justin Bieber, especially in Arkansas and Maine) and slang. A couple of examples:


Definitely a southern word, mostly males 18-24. [Details]


Pretty even distribution across the US and steady in terms of trending (not getting more or less popular), but definitely favored by girls 12-17. “omfg” sees the same distribution, but slightly more males.

“lmao” vs “lmbo”

Much lower median age for “laughing my butt off” (also a southernism?), and it’s used much more heavily by females.

I could go on and on – I’ll stop with a WI shout-out to “brats,” “packers” and “Favre” and an East Coast shout-out to “dunkin” and “sox,” but try it out for yourself!

Posted in Research | Leave a comment

Twitter Visualization: Korean language tweets 3/5

A sample of all tweets in Korean on March 5 (local time – the time in the video is EST).


Twitter Visualizations: Chile 2/27, Olympics 2/21

I’ve been looking at Twitter data lately – here are a couple of visualizations of the Twitter stream plotted onto Google Earth after doing some natural language processing on it. All the tweets on the map have had some named entity disambiguation and georeferencing done (e.g., mapping “New York,” “NYC,” “The Big Apple” and “Home of that boy Biggie” to lat/long coordinates), along with some sentiment analysis to gauge whether each tweet was positive (happy/nice things to say), neutral, or negative (trash talking or sad).
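As a rough illustration of those two steps (and not the actual pipeline), the sketch below uses a tiny hand-built alias table for georeferencing and a toy word list for sentiment. Everything in it is an assumption made for the example: the `ALIASES`, `COORDS`, `POSITIVE` and `NEGATIVE` tables, the function names, and the approximate New York coordinates. Real disambiguation also has to resolve ambiguous place names, which a plain lookup like this does not attempt.

```python
import re

# Hypothetical mini-gazetteer: lowercase alias -> canonical place name.
ALIASES = {
    "new york": "New York, NY",
    "nyc": "New York, NY",
    "the big apple": "New York, NY",
}
# Approximate (latitude, longitude) for each canonical place.
COORDS = {"New York, NY": (40.7128, -74.0060)}

# Toy sentiment word lists, purely illustrative.
POSITIVE = {"pray", "love", "happy"}
NEGATIVE = {"scared", "sad", "hate"}


def georeference(text):
    """Return (place, (lat, lon)) for the longest alias found, else None."""
    lowered = text.lower()
    # Try longer aliases first so multi-word names win over short ones.
    for alias in sorted(ALIASES, key=len, reverse=True):
        if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
            place = ALIASES[alias]
            return place, COORDS[place]
    return None


def sentiment(text):
    """Label text positive, neutral, or negative by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With these toy tables, a tweet mentioning “The Big Apple” resolves to the same coordinates as one mentioning “NYC”, and each tweet gets one of the three sentiment labels described above.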

Chile 2/27

The first is all tweets mentioning Chile between Feb 26 and Feb 27 – not a whole lot going on until 1:34am Feb 27, when the whole world lights up. Note the sentiment starts out strongly negative (“chile has had an earthquake omggg this has me sooo scared”) but then moves positive as the day goes on (e.g., “Pray for the people in Chile!!”). Go full screen to see the timescale.

Olympics 2/21

The second is all tweets mentioning the Olympics between Feb 20 and Feb 21 – you can see them peak both days at about 9-10pm EST, especially on Sunday the 21st (figure skating?). Mostly green and yellow too, suggesting positive/neutral sentiment (e.g., go Bodie!).
