Galvanize Capstone Project: Content Analysis of Craigslist's Missed Connections

Honestly, I struggled a bit deciding what to work on for my capstone project at Galvanize. People were suggesting I reach out to potential employers and ask them for current problems they're facing and for access to datasets. And sure, I guess that's a valid strategy for getting on people's radars. But, I'm gonna have the rest of my career to work on those business problems... Why not spend these couple of weeks being free to explore things I'm less likely to be tasked with in a job?

A week before our project proposals were due, I had built a Missed Connections web crawler using the requests and BeautifulSoup Python packages. Why? For fun. I thought it would be cool data to have. There are so many projects out there analyzing tweets and other forms of social media already, but what about analyzing content that isn't as regulated? Texts that are linked to user accounts inherently have some form of self-regulation or social agenda. And while social media is quite pervasive, it doesn't reach as varied a set of people as Craigslist does. On Craigslist, people are free under anonymity to say some weird-ass things, and THAT is super interesting to me.

This project was a great way for me to showcase a variety of skills I had developed in my three short months at Galvanize:

  • Web scraping (and doing it for one of the most notoriously difficult websites ever, ugh)
  • Developing a database schema and scripting a fault-tolerant data collection process
  • Using unsupervised methods to find some sort of structure amidst the chaos of Craigslist
  • Designing a web app (Missed Connections Explorer) to share my findings via data viz and search (see screenshots below)
  • Using AWS to deploy my web app and run the scraper



Missed Connections Explorer Screenshots

Interactive Maps:


Search:

Most of my efforts in the 2-week timeframe were concentrated on topic modeling. Earlier, I had lazily thrown K-means at my corpus, but it didn't make much sense to say that each post solely belonged to one topic. Given my corpus size (~23,000 posts), I looked at three matrix factorization methods:

NMF (non-negative matrix factorization)
We decompose a TF or TF-IDF matrix using a non-negative least squares solver to say that each post is made up of different (positive) contributions of each topic.
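
As a rough illustration (a toy sketch, not the project's actual code), here is NMF written in plain NumPy. It swaps in the Lee-Seung multiplicative update rule for the non-negative least squares solver mentioned above, and the vocabulary, counts, and function name `nmf` are all made up for the example.

```python
import numpy as np

def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W @ H.

    Uses Lee-Seung multiplicative updates, which keep W and H
    non-negative as long as they start out non-negative.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps   # doc-topic weights
    H = rng.random((n_topics, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy term-frequency matrix: 4 posts x 6 terms.
vocab = ["gym", "workout", "sauna", "train", "commute", "station"]
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [1, 1, 0, 2, 2, 2],
], dtype=float)

W, H = nmf(V, n_topics=2)

# Each row of H is a topic; list its three heaviest terms.
for k, row in enumerate(H):
    top = [vocab[i] for i in np.argsort(row)[::-1][:3]]
    print(f"topic {k}: {top}")

# Each row of W says how much of each topic a post contains.
print(np.round(W, 2))
```

Every entry of W and H is zero or positive, which is what makes the "this post is part topic A, part topic B" reading possible.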

LSA (Latent Semantic Analysis)
We apply SVD to a TF matrix to find that each post is made up of positive AND negative contributions of each topic. This is less interpretable than NMF.
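
A small NumPy sketch of the same idea, assuming a made-up 4x5 term-frequency matrix and keeping k = 2 topics (both are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical toy term-frequency matrix: 4 posts x 5 terms.
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 0],
    [0, 0, 2, 1, 1],
    [0, 0, 1, 2, 0],
], dtype=float)

# Full SVD, then keep the k largest singular values (truncated SVD).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topic = U[:, :k] * s[:k]   # each post as a mix of k topics
topic_term = Vt[:k]            # each topic as weights over terms

print(np.round(doc_topic, 2))
print(np.round(topic_term, 2))

# Unlike NMF, nothing constrains these weights to be >= 0.
has_negative = bool((doc_topic < 0).any() or (topic_term < 0).any())
print("any negative weights:", has_negative)
```

The mixed signs are the interpretability problem: a post can contain a "negative amount" of a topic, which is hard to explain in words.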

LDA (Latent Dirichlet Allocation, not Linear Discriminant Analysis)
We can say that each post contains a distribution of topics (sums up to 1) and that each topic contains a distribution of words (also sums up to 1). It's formulated using integer counts, so TF is typically the matrix used for the factorization (though I believe there are some complicated ways to use TF-IDF instead).
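
To make the "distributions that sum to 1" part concrete, here is a sketch of the generative story behind LDA in NumPy. It only simulates how a post would be produced under the model; it does not do the inference step that a real LDA implementation performs. The Dirichlet parameters (0.5), the sizes, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_terms, n_words = 3, 8, 50

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(n_terms, 0.5), size=n_topics)
# Each post is a probability distribution over topics.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# Generate one post: pick a topic for each word slot, then a word from it.
topics = rng.choice(n_topics, size=n_words, p=doc_topic)
words = np.array([rng.choice(n_terms, p=topic_word[z]) for z in topics])

# The result is a vector of integer counts, i.e. one row of a TF matrix.
counts = np.bincount(words, minlength=n_terms)

print("topic mix for this post:", np.round(doc_topic, 2), "sum =", doc_topic.sum())
print("word counts:", counts)
```

Because the model generates whole words one at a time, the data it explains are integer counts, which is why TF is the natural input.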

LDA with my TF matrix didn't work well since my corpus contained a lot of similar wording across all documents, so things like "hit me up" or "I saw you" really dominated over what would have more "importance" in TF-IDF. I tried just using my TF-IDF matrix to see what would happen, and as expected, the results didn't make sense.
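
A quick way to see the effect, using a made-up four-post corpus in which every post opens with "saw you". The IDF formula below is one common smoothed variant, chosen here as an assumption:

```python
import math
from collections import Counter

# Hypothetical mini-corpus; every post shares the boilerplate "saw you".
posts = [
    "saw you at the gym",
    "saw you on the train",
    "saw you walking your dog",
    "saw you at the dog park",
]
docs = [p.split() for p in posts]
n_docs = len(docs)

# Document frequency: in how many posts does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times a smoothed inverse document frequency."""
    tf = Counter(doc)
    return {t: c * math.log((1 + n_docs) / (1 + df[t])) for t, c in tf.items()}

weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  tf=1  tfidf={w:.3f}")
```

In raw TF, every word in the first post counts equally. After IDF weighting, "saw" and "you" drop to zero because they appear in every post, and "gym" comes out on top.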

NMF and LDA use doc-term matrices built from the words alone. This means that words like "cat" and "dog" are treated as completely unrelated concepts even though they usually appear in similar situations as a "pet". This is why we need some way of looking at word context, which brings us to word2vec: you can learn "word vectors" that embed each word according to the contexts it appears in. I think this post does a great job of giving a quick overview.

Unfortunately, constructing your own word vectors (or document vectors for doc2vec!) requires way more text than what I was able to scrape (Craigslist only shows posts from the last 45 days). Instead, I downloaded Google's pre-trained word vectors for use with gensim to vectorize my corpus. Then, I represented each document as an aggregated vector of all of its word vectors. Since the resulting TF-like matrix had negative entries, NMF was not a possible factorization. I continued with an LSA approach, but the resulting topic vectors produced topics like "fluffy, pungent" and "potato, sorghum". I initially thought this approach would be super promising, but... maybe I need to learn my own vectorizations for better results. But hey, I tried :)
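
Here is a sketch of the aggregation step with a tiny hand-made vector table standing in for the pre-trained vectors. The numbers are invented, and taking the mean is just one way to aggregate (an assumption; a sum would work similarly):

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors"; real pre-trained vectors
# have hundreds of dimensions.
word_vectors = {
    "cat":   np.array([0.8, 0.1, -0.3]),
    "dog":   np.array([0.7, 0.2, -0.4]),
    "train": np.array([-0.5, 0.9, 0.1]),
    "ride":  np.array([-0.4, 0.8, 0.2]),
}

def doc_vector(tokens, vectors):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

docs = [["my", "cat", "and", "dog"], ["train", "ride", "home"]]
X = np.vstack([doc_vector(d, word_vectors) for d in docs])

print(np.round(X, 2))
print("has negative entries:", bool((X < 0).any()))
```

Word vectors live in a space where negative coordinates are normal, so the document matrix inherits them, and that is what rules out NMF.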

I chose NMF for topic modeling, and I found that my topics were more coherent when I segmented my corpus by preferences (e.g., m4m, which describes a man posting to connect with another man). I would've liked to try segmenting my corpus by city, but again, I didn't have enough posts by city to perform topic modeling. "Having enough posts" and the number of topics were determined by looking at the mean squared error (MSE) between the original doc-term matrix and the product of the W and H matrices obtained by NMF.
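
The MSE check can be sketched like this. The doc-term matrix is random Poisson counts standing in for real data, the compact `nmf` helper uses multiplicative updates (an assumption, not necessarily the solver used for the project), and the list of topic counts is arbitrary:

```python
import numpy as np

def nmf(V, k, n_iter=300, seed=0, eps=1e-9):
    """Compact NMF via multiplicative updates; returns W (docs x k), H (k x terms)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def reconstruction_mse(V, k):
    """Mean squared error between V and its rank-k NMF reconstruction."""
    W, H = nmf(V, k)
    return float(np.mean((V - W @ H) ** 2))

# Hypothetical stand-in for a doc-term matrix: 40 posts x 30 terms.
rng = np.random.default_rng(1)
V = rng.poisson(1.0, size=(40, 30)).astype(float)

# Sweep the number of topics and record the error for each.
errors = {k: reconstruction_mse(V, k) for k in (1, 2, 4, 8, 16)}
for k, err in errors.items():
    print(f"{k:>2} topics  MSE = {err:.4f}")
```

Plotting `errors` against the number of topics gives the kind of curve discussed next. The same function can be rerun on growing subsets of rows to see whether the error levels off as posts are added.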

Here are some example plots for the m4m category:

I wanted to use the "elbow method" to determine an appropriate number of topics (the optimal low-rank representation) hidden in our corpus. As we add in more topics (more ways to represent our corpus), the error between the original and reconstructed matrices decreases. Ideally, there'd be an "elbow" or a kink in our plot of topics vs. error that suggests that adding another topic doesn't add significant benefit. An example elbow in clustering looks like this:

Sadly, my plot had no clear elbow:

So, I ended up using trial and error to explore topics. I kept increasing the number of topics until I saw topics that repeated the same information. For example, in m4m, there was a consistently clear topic of "gym, workout, fitness". When I started seeing separate topics for separate gyms, like "LA fitness, 24 hr fitness" and "Equinox", I knew I was past the optimal number of topics.

Once I settled on a number of topics, I verified my corpus contained enough documents for topic modeling. I made sure the reconstruction error leveled off as I gradually increased the number of posts to my current supply.

You can check out my NLP scripts on GitHub at http://github.com/stong1108/


Results (WARNING: NSFW content)

Universally, people write about eye contact and seem hopeful as they acknowledge that their post is a long shot. Common situations that are described in Missed Connections involve dog walking and commuting. Nothing too surprising there.

I also looked at the Missed Connections that made the Best-of-Craigslist list. These posts get to live forever on Craigslist thanks to user nominations.

Topics of these popular stories:

Toilets

  • bathroom
  • toilet
  • stall
  • flush
  • shit
  • bowl


Poop, especially involving animals and possibly bags of it

  • poop
  • shit
  • bird
  • dog
  • bag


Those body parts

  • penis
  • touch
  • normal penis
  • penis size
  • penis's (huh?)
  • little
  • hand
  • small
  • vagina


Traffic and accidents

  • hit car
  • driver
  • hit
  • bike
  • traffic
  • road
  • stop


Angry people

  • fuck
  • shit
  • bitch
  • stupid
  • shouted
  • shoved


Bar stories

  • beer
  • drink
  • drunk
  • bar
  • girl


So, the topics in Best-of-Craigslist suggest that the most popular Missed Connections are silly poop/penis stories or posts serving revenge/sarcastic slights at others. :)

As for results by category, here are some of the results I found most interesting.

m4m (man for man)

i saw you at the gym (probably naked)

  • club
  • equinox
  • gym
  • locker room
  • steam room
  • sauna
  • la fitness
  • shower
  • workout
  • towel

(NSFW things)

  • load
  • mouth
  • hard
  • cum
  • fuck
  • sucked
  • big
  • want
  • dick
  • suck
  • cock

let's share pics

  • pic
  • pic for pic
  • send pic
  • good look

let's not use actual words

  • u
  • r
  • ur
  • n

descriptions

  • handsome
  • asian guy
  • black
  • black guy
  • asian
  • older
  • white guy
  • cute

m4w (man for woman)

reminiscing about past loves

  • miss love
  • lost
  • hope
  • miss
  • wish
  • life

descriptions

  • blonde
  • short
  • dress
  • white
  • blue
  • shirt
  • hair
  • wear
  • black

I thought it was interesting that white, blue, and black colors were mentioned the most (I thought red would be high on the list).

asian girls are cute

  • really
  • asian girl
  • super cute
  • super
  • asian
  • cute girl
  • cute
  • girl

women and their smiles are beautiful / gorgeous

  • gorgeous
  • lady
  • beautiful smile
  • beautiful woman
  • smile
  • woman
  • beautiful

w4m (woman for man)

heartache over past love

  • heart
  • hurt
  • pain
  • true
  • loved
  • forever
  • day
  • life
  • love

wishing you a happy birthday?

  • birthday
  • happy birthday
  • wish
  • day
  • hope
  • heart
  • tomorrow
  • best
  • year

asians are little

  • little asian
  • asian
  • little asian mouth
  • asian mouth
  • little

i hope you notice me

  • finger
  • hair
  • finger through hair
  • come classes
  • die
  • wait long
  • wait

descriptions

  • sexy
  • handsome
  • man
  • cute
  • blue
  • white
  • wear
  • shirt
  • hair
  • black

w4w (woman for woman)

softball (seriously)

  • softball
  • softball team
  • team
  • jersey

(NSFW things)

  • bi
  • want
  • asap
  • pic
  • pussy
  • play
  • fun
  • girl

being angry

  • hate
  • hate hate
  • horrible
  • awful
  • anger
  • karma

also wishing you a happy birthday?

  • birthday
  • happy birthday
  • hope
  • wish
  • happy
  • day

descriptions?

  • beautiful
  • neck luscious
  • angel
  • witch
  • young
  • woman

Unlabeled

a couple notices a woman / a man notices a couple

  • mw4w
  • m4mw
  • couple
  • fun

a transgender notices a man

  • t4m
  • pic
  • dick
  • real
  • send pic
  • send
  • big dick
  • sexy
  • big

a man notices a transgender?

  • hair
  • beautiful
  • black
  • short
  • tall
  • hot

This wraps up the work I accomplished during the 2-week time period. There's a lot more I'd like to study/implement in the future:

  • spelling corrector
  • impute missing categories / authorship attribution
  • perform topic modeling per city
  • build a larger corpus to construct my own vectors
  • look at trending topics (Pokemon Go came and went as a topic during the project period)

I have slides for a condensed, SFW presentation located on GitHub here: https://github.com/stong1108/CL_missedconn/blob/master/slides.pdf

My CL_missedconn repo contains all the code I used for scraping, storing posts to and querying PostgreSQL, running NLP, and building the web app.


Bonus: Favorite Find

I had a lot of fun with this project. I got to laugh every day and enjoyed reading pet names like muscle bear and daddy man. But, my favorite find came from running cosine similarity on my corpus one day to find a strong match between these two posts (Post A and Post B):

Post A:

I was lazily drinking beers and playing with your dog, but what I wanted to do was talk with you. *unfortunately* I was there with a girl who would've not been happy with it.

You know that feeling when you're with someone for the third time, but then you see someone else who catches your eye more, and you wish you could upgrade?

That's me right now.

You had a white bikini with splashes of color on it, and an amazing set of abs (hey, trill recognize trill). Fat chance you see this but... I was wearing a neon green pair of swim trunks, I'm also blond and have a beard.

(upgrade? Oh boy.)

Post B:

trill: true to ones self and real with all.

does this dude really think he can send out a missed connection in hopes the girl he wanted to "upgrade" doesn't see it? Didn't think that over?

I'm posting your missed connection comment below. Not to punish you... To save the girl you were with some time. Hopefully, you learn from your mistake, treat people with respect, and realize what trill means.

To actually girl he was with: others exist that will treat you how you should be treated. You should not be a secondary desire.

*****YOUR (un-edited) MISSED CONNECTION WAS:****

To the blonde with the merle border collie pup at the Greenbelt Sat - m4w (Gus Fruh)

I was lazily drinking beers and playing with your dog, but what I wanted to do was talk with you. *unfortunately* I was there with a girl who would've not been happy with it.

You know that feeling when you're with someone for the third time, but then you see someone else who catches your eye more, and you wish you could upgrade?

That's me right now.

You had a white bikini with splashes of color on it, and an amazing set of abs (hey, trill recognize trill). Fat chance you see this but... I was wearing a neon green pair of swim trunks, I'm also blond and have a beard.

*****END OF YOUR MISSED CONNECTION***** Attached screenshot for map and further record

Yup, girl found her guy posting on Missed Connections for someone else. Good for her, but why was she browsing Missed Connections? ... Why am I browsing Missed Connections? (No, those matched posts don't involve me.) Time to wrap up.
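
For the curious, the kind of cosine-similarity match that surfaced those two posts can be sketched with simple bag-of-words counts. This is a toy version under my own assumptions (whitespace tokens, raw counts), not the exact vectorization used on the corpus; `original` is an excerpt of Post A, and `reply` and `unrelated` are invented strings.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

original = "i was lazily drinking beers and playing with your dog"
reply = "your post was: i was lazily drinking beers and playing with your dog"
unrelated = "saw you on the train this morning"

print("original vs reply:    ", round(cosine_similarity(original, reply), 3))
print("original vs unrelated:", round(cosine_similarity(original, unrelated), 3))
```

A post that quotes another post in full shares almost all of its words with it, so the pair scores far higher than two random posts would.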

I'll close with a quote from a post that I think captures the beauty of Missed Connections really well:

... my point is, maybe it IS better to leave MC's! Not that I condone the slightly unstable folks who pine away at complete strangers for months and months on end, as hopeless as they are clueless. But maybe we're all hooked on MC's because we're in this fantasyland where people are better looking than they really are, we're not flat out rejected -- and we like this fantasyland.

Plus it spares us from the sweaty, nervous-laughter conversations that follow the awkward introductions, especially if one/both of the people involved are working. If by chance the person DOES see the MC, one is more likely to be successful at playing it a little cooler through emails.

But most of all, MC's give us this safety buffer for our emotional bullshit. No one really gets hurt if the wanted party never responds, because fuck it all -- it's just the internet.

Makes, sense, right?
