Instagram Landscapes: Building Network Graphs with Neo4J in Python

modeling connected communities

Social media is powerful not only because it preys on our simian instinctual drive towards shiny things, but because the content you create, share or comment on creates a ‘profile’ of how you express yourself online. It’s a beautiful phenomenon that enables communities to organically appear & grow and puts users in control of the content they care about.

I’m kidding. Companies use these profiles to advertise to you.
Judgments aside, all the more reason to understand what’s going on under the surface.

Last time we explored some Instagram data and determined that follower count doesn’t really lead to higher user engagement.
I briefly mentioned network graphs — a bunch of Node objects connected by Relationships (or Edges) — and how they can model an interconnected social media environment.

I have since been overcome by the pervasive urge to build one, so let’s boot up Neo4J.

Nodes of Creation

There are two easy ways I’ve found to jump into graphing:

  1. NetworkX
  2. Neo4J & Py2Neo

I’ve used both before; NetworkX is a streamlined Python library used to create and modify Graph objects through their nodes and edges. You can quite easily export the graph to a .gexf file and load it into Gephi, which is a sleek open-source graph visualization tool.

I’ve been leaning into Neo4J for a while due to its SQL-ish database capabilities, native clustering algorithms, and varied export/import options. The free desktop version comes with a neat GUI, and it’s quite easy to click around the tutorial and get a local server running.

I use Py2Neo to connect to my Neo database from Jupyter. After calling pip install py2neo you can connect in 3 lines:

from py2neo import Database, Graph, Node, Relationship
db = Database()  # default bolt port (py2neo v4; newer releases rename Database to GraphService)
g = Graph(host='localhost', auth=('neo4j', 'password'))

Called with no arguments, Database() falls back to Neo’s default local Bolt address (bolt://localhost:7687). This isn’t terribly cybersecure, but adding user authentication credentials is a few more clicks and we’re running local.

There are a couple of ways to get stuff done, but it all revolves around that Graph object I designated g. A simple way is to open a Transaction with g.begin(), create Nodes & Relationships within it, and commit the changes to the graph afterwards.

tx = g.begin()  # create new transaction
kevin = Node('Crab', name = 'Kevin')
tx.create(kevin) # create node: label/type = Crab, name = Kevin
margot = Node('Jellyfish', name = 'Margot', color = 'purple')
margot['color'] = 'green' # update property, dictionary style
rel = Relationship(kevin, 'FRIENDS_WITH', margot)
tx.create(rel) # create friendship between marine life
tx.commit() # push changes to connected graph, close transaction

After calling tx.commit() you’ll notice the changes updated in your Neo database.

Filling Out Your Graph

The sample Instagram data from last time is perfect for network graphing. Let’s jump into a quick populate_graph() method. This assumes you’ve got a Pandas dataframe called df which has an Instagram post as each row and columns including ‘username’, ‘followers’, ‘max_likes’ and such.

[Image caption: much more readable with color]

The logic is simple (if messily executed): Loop through each row using df.iterrows() and create Node objects with column data stored as properties.
The only complexity comes from creating User and Post nodes at the same time; I use a list to keep track of already-created users, and a method called add_post_to_user() which handles property updates & relationship creation:


This one looks rough, but it’s just a series of building query strings and executing them with g.evaluate(). The queries are written in Cypher, Neo4J’s own SQL-ish language. Multi-line f-strings are awkward for Cypher, however, because its curly-brace property maps collide with f-string placeholders, so passing the many values in the middle g.evaluate() call as query parameters is the best way I’ve found to handle variable substitution.

Given this row’s Post has an existing User, we want to add the post’s metrics to the user’s Totals, combining & dividing by total_posts where necessary. I presorted the dataframe by username & date, so the last Post for each User will have their most recent follower counts.
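As a rough illustration of that totals step, here is the arithmetic on its own in plain Python, separate from any Cypher. Apart from total_posts and followers, the property names (likes, total_likes, avg_likes) are hypothetical stand-ins rather than the real columns, and the function is a sketch under those assumptions.

```python
def update_user_totals(user, post):
    """Fold one post's metrics into a user's running totals.

    `user` and `post` are plain dicts; the keys are illustrative
    assumptions, not the actual node properties.
    """
    total_posts = user["total_posts"] + 1
    total_likes = user["total_likes"] + post["likes"]
    return {
        **user,
        "total_posts": total_posts,
        "total_likes": total_likes,
        # averages are recomputed by dividing the combined total by total_posts
        "avg_likes": total_likes / total_posts,
        # rows are presorted by username & date, so the latest post
        # carries the most recent follower count
        "followers": post["followers"],
    }
```

Returning a new dict instead of mutating the old one keeps the update easy to check before it gets written back to the graph.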

At the end, I MATCH two nodes (the User with ‘username’ and the Post with ‘post_url’) and CREATE a directional “POSTED” relationship between them. Running a query with evaluate() also auto-commits to the server’s graph.
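A minimal sketch of what that final MATCH and CREATE step might look like, with the values passed as Cypher parameters. The function name link_user_to_post and the RETURN clause are assumptions for illustration; the only py2neo call it relies on is Graph.evaluate(cypher, **parameters).

```python
LINK_QUERY = """
MATCH (u:User {username: $username})
MATCH (p:Post {post_url: $post_url})
CREATE (u)-[:POSTED]->(p)
RETURN count(*)
"""


def link_user_to_post(graph, username, post_url):
    """Create a POSTED relationship between an existing User and Post.

    `graph` is assumed to be a py2neo Graph (or anything exposing a
    compatible evaluate() method). Hypothetical helper, not the original code.
    """
    # Values travel as query parameters ($username, $post_url), so no
    # string formatting ever touches the Cypher text itself.
    return graph.evaluate(LINK_QUERY, username=username, post_url=post_url)
```

Because the query text is a constant, the curly braces in the property maps never need escaping, and quoting of usernames or URLs is left to the driver.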

This could be done in Py2Neo with NodeMatcher & Graph.merge() as well, but those are ultimately running Cypher queries, and writing them yourself gives complete control of updates.
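To tie this section together, here is one way the outer loop described earlier could be sketched: walk the rows, create a Post for each, create a User only the first time a username appears, then hand off to add_post_to_user(). This is an approximation under several assumptions: it uses a set instead of a list, sends Cypher through Graph.run() rather than building Node objects, takes add_post_to_user as an argument, and uses hypothetical property names (likes, post_url).

```python
def populate_graph(graph, rows, add_post_to_user):
    """Create User and Post nodes from (index, row) pairs such as df.iterrows().

    `graph` is assumed to expose a py2neo-style run(cypher, **params);
    `add_post_to_user(graph, row)` is assumed to handle the property
    updates and the POSTED relationship.
    """
    seen_users = set()  # usernames that already have a User node
    for _, row in rows:
        graph.run(
            "CREATE (:Post {post_url: $post_url, likes: $likes})",
            post_url=row["post_url"],
            likes=row["likes"],
        )
        username = row["username"]
        if username not in seen_users:
            graph.run(
                "CREATE (:User {username: $username, followers: $followers})",
                username=username,
                followers=row["followers"],
            )
            seen_users.add(username)
        add_post_to_user(graph, row)
    return seen_users
```

With a real dataframe, pandas may hand back NumPy scalars, which might need converting with int() or str() before being sent as parameters.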

Feature Engineering

However, we’re still far from a comprehensive “social media ecosystem”. We’ve got things like post_caption, all_comments_on_post, and growth_over_time, but the data lacks one crucial thing: the username of each commenter isn’t recorded, just the comment itself.
Ideally we’d want to draw comments as edges running from the commenting User to the Post.


This would let us understand who comments on whose content, and by extension “who associates with whom?” which, as we mentioned at the beginning, is a lucrative question.

However, the comment-string for each post does contain @username tags written in the comment body. We can use some quick RegEx with re.findall() to identify each tagged_user for a post’s comments and create
(User)-[:TAGGED_IN]->(Post) relationships.

[Image caption: I don’t understand this, but my graph might]
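The tag-extraction step can be sketched with re.findall() roughly like this. The pattern is an assumption about what a username looks like (letters, digits, underscores and periods, not ending in a period), and find_tagged_users is a hypothetical helper name.

```python
import re

# Assumed username shape: letters, digits, underscores and periods, with the
# last character not a period so sentence punctuation is left out.
# The lookbehind skips things like email addresses (crab@sea.org).
TAG_PATTERN = re.compile(r"(?<!\w)@([A-Za-z0-9_](?:[A-Za-z0-9_.]*[A-Za-z0-9_])?)")


def find_tagged_users(comments_text):
    """Return unique @-tagged usernames from a post's comment text, in order seen."""
    tagged = []
    for name in TAG_PATTERN.findall(comments_text):
        if name not in tagged:
            tagged.append(name)
    return tagged
```

Each name returned for a post could then feed one (User)-[:TAGGED_IN]->(Post) relationship.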

One thing to note is that @username appearing in a comment could mean one of two things:

  1. Someone tagged their friend in the comment
  2. The comment is replying to another comment (Instagram’s storage format)

A reply does tell us that the person being replied to commented on the original post, which is somewhat useful. We could try to differentiate between the two by looking at each comment and RegExing ‘which tag is at the very beginning of each comment’. But a user tagging their friend at the beginning will be stored in the same format.
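For what it’s worth, the “which tag is at the very beginning” check might look like the sketch below. It assumes the comments have already been split into individual strings, and split_leading_tag is a hypothetical name. As noted above, a leading tag only marks a candidate reply; it can’t rule out a friend tagged first.

```python
import re

# Same assumed username shape as a typical Instagram handle.
NAME = r"([A-Za-z0-9_](?:[A-Za-z0-9_.]*[A-Za-z0-9_])?)"
LEADING_TAG = re.compile(r"\s*@" + NAME)
ANY_TAG = re.compile(r"(?<!\w)@" + NAME)


def split_leading_tag(comment):
    """Return (leading_tag_or_None, other_tags) for a single comment string."""
    match = LEADING_TAG.match(comment)  # match() only looks at the start
    if match:
        leading = match.group(1)
        rest = comment[match.end():]
    else:
        leading = None
        rest = comment
    return leading, ANY_TAG.findall(rest)
```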

It would be better overall to alter the data gathering pipeline to record the username of each posted comment. But for now, we’ll make do with TAGGED_IN meaning an ambiguous connection to the post.

Another possible Node to add is Hashtags:

(What’s the verb here? Posted overlaps with user-post relationships. Someone “used” a hashtag?)

Hashtag nodes would provide an interesting bridge between posts, users and comments. The ultimate goal is to model ‘connections’ between people in Instagram’s digital landscape, so it’s important to ask questions like “Should the number of times they used a hashtag increase the weight of the relationship?”.
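If the answer to that question turns out to be yes, the counting side is simple enough. Here’s a small sketch that tallies how often each user used each hashtag; the input shape (username, caption pairs) and the function name hashtag_weights are assumptions for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_weights(posts):
    """Count hashtag uses per user from an iterable of (username, caption) pairs."""
    weights = Counter()
    for username, caption in posts:
        for tag in HASHTAG.findall(caption):
            # lowercase so #Reef and #reef count as the same hashtag
            weights[(username, tag.lower())] += 1
    return weights
```

Each (username, hashtag) count could then be stored as a weight property on whatever relationship ends up connecting User and Hashtag nodes.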

Next steps in graphing

Here’s our very rudimentary Insta-graph:

[Image caption: tell me that doesn’t look like jellyfish.]

Red user nodes are connected to pink Post nodes. Cnidarian resemblances aside, it’s not terrifically informative yet. We don’t have enough data visualized to show relationships between hashtags (which are off floating somewhere) due to my computer’s hardware limitations; regular Neo4J isn’t built for massive-quantity viz the way Gephi is.

However, the dev team recently released Neo4J Bloom for free use with local desktop servers. This is wicked exciting for me, because I can start to build larger, more complex graph visualizations with more control over aesthetics & arrangement.

Next time we’ll build out more relationships & explore Bloom visualizations.

Written by

data scientist, machine learning engineer. passionate about ecology, biotech and AI.
