Galvanize Week 5: Web Scraping, NLP, Clustering, and a look at Craigslist Missed Connections

After deciding to focus on my Github portfolio, I made a mini-project to demonstrate the skills we learned for the week. This week, we covered web scraping (BeautifulSoup and MongoDB), natural language processing (NLP), time series, and clustering.

So, I decided to have some fun with Craigslist's Missed Connections to practice:

  • web scraping
  • database management
  • data visualization
  • NLP
  • clustering

I didn't do any time series analysis since that would require a pretty healthy dataset to play with, and my script is only a week old...

To start off, let's talk about scraping Craigslist. I didn't think Missed Connections would be too difficult to scrape because I assumed spam content would be relatively low compared to other Craigslist categories. This was my first time building a scraper, and I learned a ton in the process:

  1. Get used to iterating. I'm not sure how long I thought it would take to scrape and curate my own dataset, but it definitely took longer than I expected. My process seemed simple: build a scraping script, build a database to store the info, then add some functionality to the scraper to retrieve only new posts that aren't already in the database. Each step required lots of iterating, and of course, there's always room for improvement and more iterations...

  2. Building a "friendly" scraper. To avoid getting caught and banned for overloading Craigslist's server, I put a delay between each request to view the next page. I used the sleep function from python's time module to emulate a person spending 10-13 seconds on each page with sleep(10 + 3*np.random.random()). It might be a bit on the conservative side, but I've managed to store 1153 posts so far without getting kicked off!

  3. Figuring out your database needs. Even though we had learned about MongoDB this week, it felt like overkill for my needs. I wasn't going to be storing much, and my table schema was pretty straightforward, so I went with Postgres. My friend shared some helpful resources for making these decisions again in the future:

    When to Use MongoDB Rather than MySQL (or Other RDBMS): The Billing Example

    Visual Guide to NoSQL Systems

    I also learned that Python's sqlalchemy module is much nicer to use than psycopg2. Good god. Goodbye rollbacks and commits!
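The 10-13 second pacing from point 2 can be sketched like this. The function names (next_delay, fetch_pages) are illustrative, not from my actual scraper, and the fetch callable stands in for whatever does the real HTTP request:

```python
import time
import numpy as np

def next_delay(base=10, jitter=3):
    # uniform draw in [base, base + jitter) seconds, i.e. 10 + 3*np.random.random()
    return base + jitter * np.random.random()

def fetch_pages(urls, fetch, base=10, jitter=3):
    """Call `fetch` on each URL, pausing politely between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(next_delay(base, jitter))
    return pages
```

Keeping the delay logic in one place makes it easy to tune later if the scraper ever does get rate-limited.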

I'll give a small demonstration of my project, but you can check out my code here:

First, importing things:

from MissedConn import MissedConn
from maps import *
from manage_db import *

Next, we create a MissedConn object. I'm initializing it with just the 'sfc' subcategory of sfbay's missed connections for minimal scraping during this demonstration.

mc = MissedConn('')
df = mc.get_df()

Then, we can make an interactive Folium Map to visualize and read the Missed Connection posts that included latitude & longitude data (posts that contain maps).

I haven't decided if/how to represent the remaining posts: some include a neighborhood, and some either don't or provide a location that's not super helpful. There's another function, make_heat_map(df), for making heat maps that is pretty similar to make_pinned_map(df) for now. The pinned map is more fun to me (and I'm especially proud of the markers).

Click on the pins to read the posts!

m = make_pinned_map(df)

After playing with this Missed Connections set, we can upload it to our Postgres database with the update_db(df) function from manage_db. We can also grab a DataFrame of everything in our database to play with using db_to_df().

df_all = db_to_df()
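This is where sqlalchemy earns its keep: with pandas on top, helpers like these can be only a couple of lines each. This is a sketch, not my actual manage_db code; I'm using an in-memory SQLite URL so the snippet runs standalone (the project would use a postgresql:// URL), and the "posts" table name is a placeholder:

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite keeps this sketch self-contained; the real project would use
# something like create_engine("postgresql://localhost/missedconn")
engine = create_engine("sqlite:///:memory:")

def update_db(df, table="posts"):
    # pandas + sqlalchemy: no manual commits or rollbacks needed
    df.to_sql(table, engine, if_exists="append", index=False)

def db_to_df(table="posts"):
    return pd.read_sql_table(table, engine)
```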

Now we can move on to NLP and clustering!

First, we tokenize the words in our posts and vectorize each post based on its word content using TF-IDF. TF-IDF stands for Term Frequency-Inverse Document Frequency. I found this nomenclature a little annoying- "frequency" was misleading to me since it makes me think about dividing by some cycle length. It's more like Term Count - Inverse Document Count. You count how many times a word occurs in a specific document, then weight that count down based on how many documents contain the word (the more documents a word shows up in, the less it's worth). This tells you how important a word is within a specific post compared to all posts.

With vectorized posts, we can calculate which posts are most similar to each other by cosine similarity. Since each post is represented as a vector, we can compare them pairwise to find which two are most "aligned". The cosine similarities range from 0 to 1, since we have no negative values in our vectors (term frequency and document frequency can't be negative).
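Finding the most similar pair can be sketched like this (again with toy posts; my real code works on the vectors from the database corpus):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "I miss the walks on the beach",
    "You were wearing a green dress on the bus",
    "I miss you and hope all is well",
]

X = TfidfVectorizer(stop_words="english").fit_transform(posts)
sims = cosine_similarity(X)   # pairwise similarities in [0, 1]
np.fill_diagonal(sims, 0)     # a post is trivially similar to itself
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"Most similar pair: posts {i} and {j} ({sims[i, j]:.3f})")
```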


Posts with highest cosine similarity (0.611):

Post 11:
Until we meet again! I do miss you! I miss the walks on the beach. I hope all is well with you.        

Post 482:

Heh. Looks like my function works?

Lastly, I used k-means clustering to try to group similar posts together and looked at what words were the most representative of each cluster.

print_clustered_words(df_all, 5)

Cluster 0: wish, like, time, heart, love, just, want, know, miss, don
Cluster 1: contact, like, know, hey, guy, man, just, meet, love, looking
Cluster 2: looked, know, just, saw, work, hope, did, said, time, maybe
Cluster 3: color, like, shorts, tell, white, know, hair, shirt, black, wearing
Cluster 4: hope, talk, cute, friend, didn, like, say, really, wanted, hi

Meh. I looked at the actual posts too- this time for a smaller number of clusters.

print_clustered_posts(df_all, 3)

Cluster 0:
    We were talking until you were ordered to work. I was about to ask you for your phone number, but we were intercepted by a hater. You have two girls, live in Salinas, and I was mesmerized. Hit me up.
    Hello, I seen the corners of your mouth start and smile but couldn't tell if they were 100 percent directed at me? You have silver rimmed sunglasses on and were eating a bagel across from me. You were very pretty Asian girl and I a whiteguy with facial hair. If that smile was for me, I'd love to talk over coffee and a bagel. Just tell me what I was wearing and send a pic of you don't mind. Have a nice day.
    You sat next to me with your son Kevin. You were a beautiful woman having white wine and driving me crazy as I sat next to you. If you see this did you feel the same?

Cluster 1:
    Laura, we talked in Berlin and in SFO. It was really nice meeting you and I should have offered to meet again. Here I am. If you find this, let's meet again.
    we both got on the 38 R on geary and Leavenworth and we got off on filmore and geary.. let me tell you, you are so beautiful and hot ... i was gonna say hi , but we both runaway... I hope you see this , I'll love to see you again...
    I told you about a Raichu you already knew about, and you were wearing an awesome green dress. Beer, chat, and a lure on me if you like. I'll be at Cinebar tonight.

Cluster 2:
    My relationship has declined slowly from fun, passion and hope to mechanical without feelings. We never seem to have meaningful conversations anymore. I wouldn't let myself cross the line for a long time, I felt guilty at first but now, I know that releasing sexual tension would help clear my mind. I'm not looking for penetration, but rather being able to relax with a pretty woman and talk. I completely understand why someone would want to have an affair. We all need basic things: love, affection, attention, sex. I'm not looking for a "friends with benefits". I'd like to connect with someone who has the flexibility and understanding to nurture a relationship. Intimacy goes way beyond sex. I am most compatible with highly educated, ambitious, fit, tall, attractive women. No STDs, no smokers, drug users or heavy drinkers. I am seeking a fun connection not just sex.
    the thought of you soothes me, I am having a really horrible year, but I am sure you don't ever think of me. thank you for leaving me unscathed because if you hadn't I don't know what memory I would cling to now. I don't know if I should seek you out because that summer was fun and a long time ago and I really don't have anything that isn't tarnished. I am changed and it is not a good look. I miss myself with you: carefree and tan. thank you for being there in that period of my life and I am sorry for using your memory as solace today. merry christmas and happy birthday
    We met up a few times a few years ago, but then lost touch. I'd love to get together and start seeing you sometimes again, if you still check here. I miss filling you up. ;)

My clustering results weren't that interesting- the underlying themes for each cluster weren't clear to me. Looks like I need 1) a larger corpus to work with, and 2) a different vectorizing method that doesn't strictly compare individual words. I'll explore word2vec to look at word embeddings- it can capture relationships between words that are used in similar contexts.
