I’ve been looking lately at using NLP and a couple of public resources (including the Twitter and Google Charts APIs) to map the demographics of language use across the US (broken up by age, gender and geography). I’m using language identification first to check that the text is in English, then generating the demographic data by disambiguating the geography and distributing each word over a gender and age likelihood.
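To make the "distributing each word over a gender and age likelihood" step concrete, here is a minimal sketch of one way it could work. It is not the code behind the actual system. The record layout, the age buckets, the likelihood numbers and the function name `word_demographics` are all assumptions made up for illustration. It assumes each tweet has already passed the English check and been resolved to a US state.

```python
from collections import defaultdict

# Hypothetical records: each tweet is assumed to be English already and to
# have a resolved US state. The gender and age likelihoods are invented
# numbers for illustration, not real data.
TWEETS = [
    {"text": "bruh that game was wild", "state": "GA",
     "gender": {"male": 0.8, "female": 0.2},
     "age": {"12-17": 0.3, "18-24": 0.6, "25+": 0.1}},
    {"text": "omg bruh", "state": "TX",
     "gender": {"male": 0.5, "female": 0.5},
     "age": {"12-17": 0.5, "18-24": 0.4, "25+": 0.1}},
    {"text": "go sox", "state": "MA",
     "gender": {"male": 0.6, "female": 0.4},
     "age": {"12-17": 0.1, "18-24": 0.3, "25+": 0.6}},
]


def word_demographics(tweets, word):
    """Spread each occurrence of `word` across gender, age and state."""
    totals = {dim: defaultdict(float) for dim in ("gender", "age", "state")}
    for tweet in tweets:
        # Naive whitespace tokenization; a real system would need more care.
        count = tweet["text"].lower().split().count(word)
        if count == 0:
            continue
        # Geography is treated as known, so it gets the full count.
        totals["state"][tweet["state"]] += count
        # Gender and age are uncertain, so each occurrence is split
        # fractionally according to the likelihoods on the record.
        for dim in ("gender", "age"):
            for bucket, likelihood in tweet[dim].items():
                totals[dim][bucket] += count * likelihood
    # Normalize each dimension so its shares sum to 1.
    shares = {}
    for dim, buckets in totals.items():
        total = sum(buckets.values())
        shares[dim] = {b: v / total for b, v in buckets.items()}
    return shares
```

Calling `word_demographics(TWEETS, "bruh")` returns one share table per dimension. A word that never appears returns empty tables, which is a reminder that this only means something for words with a decent sample.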
The full working system can be found here (try it out!). Pretty interesting results so far – it works on just about any word that’s used often enough to have a good sample, but I’m finding the best cases to be on trending topics (girls 12-24 seem to love Justin Bieber, especially in Arkansas and Maine) and slang. A couple of examples:
“bruh”
Definitely a southern word, mostly males 18-24. [Details]
“omg”
Pretty even distribution across the US and steady in terms of trending (not getting more or less popular), but definitely favored by girls 12-17. “omfg” shows the same distribution, but skews slightly more male.
“lmao” vs “lmbo”
Much lower median age for “lmbo” (“laughing my butt off”, also a southernism?), and it’s used much more heavily by females.
I could go on and on, but I’ll stop with a WI shout-out to “brats,” “packers” and “Favre” and an East Coast shout-out to “dunkin” and “sox.” Try it out for yourself!