Social media is powerful not only because it preys on our simian instinctual drive towards shiny things, but because it creates connections. The content you create, share or comment on creates a ‘profile’ of how you express yourself online. It’s a beautiful phenomenon that enables communities to organically appear & grow and puts users in control of the content they care about.
I’m kidding. Companies use these profiles to advertise to you even harder.
Judgments aside, all the more reason to understand what’s going on under the surface.
Last time we explored some Instagram data and determined that follower count doesn’t really lead to higher user engagement.
I briefly mentioned network graphs — a bunch of Node objects connected by Relationships (or Edges) — and how they can model an interconnected social media environment.
I have since been overcome by the pervasive urge to graph, so let’s boot up Neo4J.
Nodes of Creation
There are two easy ways I’ve found to jump into graphing:
- NetworkX & Gephi
- Neo4J & Py2Neo
I’ve used both before; NetworkX is a streamlined Python library used to create and modify Graph objects through their composite Node and Edge objects. You can quite easily export the graph to a .gexf file and load it into Gephi, which is a sleek open-source graph visualization tool.
I’ve been leaning into Neo4J for a while due to its SQL-ish database capabilities, native clustering algorithms, and varied export/import options. The free desktop version comes with a neat GUI, and it’s quite easy to click around the tutorial and get a local server running.
I use Py2Neo to connect to my Neo database from Jupyter. After calling pip install py2neo you can connect in 3 lines:
from py2neo import Database, Graph, Node, Relationship
db = Database() # instantiate using default bolt port
g = Graph(host='localhost', auth=('neo4j', 'password'))
The Database() call connects to Neo’s default local Bolt port automatically. This isn’t terribly cybersecure, but adding user authentication credentials is just a few more clicks, and we’re running locally anyway.
There are a couple of ways to get stuff done, but it all revolves around that py2neo.Graph object I designated g. A simple way is creating a Transaction object with g.begin(), creating Nodes & Relationships within the transaction, and committing the changes to the graph afterwards.
tx = g.begin() # create new transaction
kevin = Node('Crab', name = 'Kevin')
tx.create(kevin) # create node: label/type = Crab, name = Kevin
margot = Node('Jellyfish', name = 'Margot', color = 'purple')
margot['color'] = 'green' # update property, dictionary style
tx.create(margot)
rel = Relationship(kevin, 'FRIENDS_WITH', margot)
tx.create(rel) # create friendship between marine life
tx.commit() # push changes to connected graph, close transaction
After tx.commit() you’ll notice the changes updated in your Neo database.
Filling Out Your Graph
The sample Instagram data from last time is perfect for network graphing. Let’s jump into a quick populate_graph() method. This assumes you’ve got a Pandas dataframe called df which has an Instagram post as each row and columns including ‘username’, ‘followers’, ‘max_likes’ and such.
The logic is simple (if messily executed): loop through each row using df.iterrows() and create Node objects with column data stored as properties.
The only complexity comes from creating User and Post nodes at the same time; I use a list to keep track of already-created users, and a method called add_post_to_user() which handles property updates & relationship creation.
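A minimal sketch of that bookkeeping, not the original listing. It assumes rows arrive as dict-like records (for example the second item of each df.iterrows() pair) and swaps the list for a set. The create_user and create_post callables are hypothetical stand-ins for whatever writes nodes to the graph, and calling add_post_to_user on every row, including a user's first, is also an assumption.

```python
def populate_graph(rows, create_user, create_post, add_post_to_user):
    """Walk post rows once, creating each User node a single time.

    rows: iterable of dict-like post records.
    The three callables are hypothetical hooks that write to the graph.
    """
    seen_users = set()  # usernames that already have a User node
    for row in rows:
        username = row["username"]
        if username not in seen_users:
            create_user(row)           # first post seen from this user
            seen_users.add(username)
        create_post(row)
        add_post_to_user(row)          # update totals, link User -> Post
    return seen_users


# Usage with in-memory stand-ins instead of a live Neo4J server.
calls = []
sample_rows = [
    {"username": "kevin", "post_url": "p/1", "followers": 10, "max_likes": 3},
    {"username": "kevin", "post_url": "p/2", "followers": 12, "max_likes": 5},
    {"username": "margot", "post_url": "p/3", "followers": 7, "max_likes": 1},
]
users = populate_graph(
    sample_rows,
    create_user=lambda r: calls.append(("user", r["username"])),
    create_post=lambda r: calls.append(("post", r["post_url"])),
    add_post_to_user=lambda r: calls.append(("link", r["post_url"])),
)
print(sorted(users))
```

Passing the writers in as arguments keeps the loop testable without a database; in a notebook they would wrap calls on g.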
add_post_to_user() looks rough, but it just builds a series of query strings and executes them with g.evaluate(). The queries are written in Cypher, Neo4J’s own SQL-ish language. Regular multi-line f-strings don’t play well with Cypher, however, so the many parameters passed to g.evaluate() are the best way I’ve found to handle variable substitution.
Given this row’s Post has an existing User, we want to add the post’s metrics to the user’s Totals, combining & dividing by total_posts where necessary. I presorted the dataframe by username & date, so the last Post for each User will have their most recent follower counts.
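The arithmetic can be sketched in plain Python before it gets translated into Cypher. This is an illustration under assumptions: total_likes and avg_likes are hypothetical property names, and only the ‘max_likes’ and ‘followers’ columns come from the dataframe described above.

```python
def update_totals(totals, post):
    """Return a user's totals with one more post folded in."""
    total_posts = totals["total_posts"] + 1
    total_likes = totals["total_likes"] + post["max_likes"]
    return {
        "total_posts": total_posts,
        "total_likes": total_likes,              # combined metric
        "avg_likes": total_likes / total_posts,  # divided by total_posts
        "followers": post["followers"],          # latest post wins
    }


# Posts must arrive oldest-first for the follower count to end up current.
totals = {"total_posts": 0, "total_likes": 0, "avg_likes": 0.0, "followers": 0}
for post in [
    {"max_likes": 10, "followers": 100},
    {"max_likes": 20, "followers": 110},
]:
    totals = update_totals(totals, post)

print(totals)
```

Overwriting followers on every post only gives the right answer because of the presort by username and date.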
At the end, I MATCH two nodes (the User with ‘username’ and the Post with ‘post_url’) and CREATE a “POSTED” directional relationship between them. Running a query with evaluate also auto-commits to the server’s Graph.
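Here is a hedged sketch of that last query, with values passed as named Cypher parameters ($username, $post_url) rather than formatted into the string. The helper name link_post_to_user and the run argument are hypothetical; run is assumed to be any callable shaped like Graph.evaluate, which accepts parameters as keyword arguments.

```python
LINK_QUERY = """
MATCH (u:User {username: $username}), (p:Post {post_url: $post_url})
CREATE (u)-[:POSTED]->(p)
"""


def link_post_to_user(run, username, post_url):
    """Run the MATCH/CREATE query, sending values as named parameters.

    run: a callable taking (cypher, **params), e.g. g.evaluate.
    """
    return run(LINK_QUERY, username=username, post_url=post_url)


# Stand-in runner so the sketch works without a server; with a live
# connection the call would be link_post_to_user(g.evaluate, ...).
sent = []
link_post_to_user(
    lambda cypher, **params: sent.append((cypher, params)),
    "kevin",
    "p/1",
)
print(sent[0][1])
```

Because the values travel separately from the query text, there is no quoting or brace-escaping to worry about in the string itself.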
This could be done in Py2Neo with Graph.merge() as well, but those calls are ultimately running Cypher queries, and writing them yourself gives complete control of updates.
However, we’re still far from a comprehensive “social media ecosystem”. We’ve got things like growth_over_time, but the data lacks one crucial thing: the username of each commenter isn’t recorded, just the comment itself.
Ideally we’d want to draw comments as edges, so we’d get something like:
This would let us understand who comments on whose content, and by extension “who associates with whom?” which, as we mentioned at the beginning, is a very lucrative question.
However, the comment-string for each post does contain @username tags written in the comment body. We can use some quick RegEx with re.findall() to identify each tagged_user for a post’s comments and create a “Tagged_In” relationship to the post.
One thing to note is that @username appearing in a comment could mean one of two things:
- Someone tagged their friend in the comment
- The comment is replying to another comment (Instagram’s storage format)
A reply does tell us that the person being replied to commented on the original post, which is somewhat useful. We could try to differentiate between the two by looking at each comment and RegExing ‘which tag is at the very beginning of each comment’. But a user tagging their friend at the beginning will be stored in the same format.
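The ambiguity is easy to see in a small sketch. It assumes the comments have already been split into individual strings, which the stored comment-string may not allow directly; the pattern and the leading_tag helper are illustrative guesses at Instagram handle syntax (letters, digits, underscores, periods).

```python
import re

# A handle at the very start of a comment, ignoring leading whitespace.
# A trailing period is treated as punctuation, not part of the handle.
LEADING_TAG = re.compile(
    r"\s*@([A-Za-z0-9_](?:[A-Za-z0-9_.]*[A-Za-z0-9_])?)"
)


def leading_tag(comment):
    """Return the handle a comment opens with, or None."""
    match = LEADING_TAG.match(comment)
    return match.group(1) if match else None


reply = leading_tag("@margot totally agree")            # a reply
friend_tag = leading_tag("@kevin look at this crab")    # a friend tag
inline = leading_tag("love this @kevin")                # tag mid-comment

print(reply, friend_tag, inline)
```

The first two comments produce the same kind of result even though one is a reply and the other a friend tag, which is exactly the problem.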
It would be better overall to alter the data gathering pipeline to record the username of each posted comment. But for now, we’ll make do with “Tagged_In” meaning an ambiguous connection to the post.
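For the ambiguous version, something along these lines would turn a post's comment-string into (tagged_user, post_url) pairs ready to become “Tagged_In” relationships. The pattern, the tagged_in_pairs name, and lowercasing handles (on the assumption that they are case-insensitive) are all my own choices for illustration.

```python
import re

# @handle not preceded by a word character, so email addresses are skipped.
TAG_PATTERN = re.compile(
    r"(?<!\w)@([A-Za-z0-9_](?:[A-Za-z0-9_.]*[A-Za-z0-9_])?)"
)


def tagged_in_pairs(post_url, comment_text):
    """List unique (tagged_user, post_url) pairs for one post's comments."""
    tags = TAG_PATTERN.findall(comment_text)
    unique_users = sorted({tag.lower() for tag in tags})
    return [(user, post_url) for user in unique_users]


pairs = tagged_in_pairs(
    "p/1",
    "@Kevin nice! thanks @margot. mail me at crab@sea.com @kevin again",
)
print(pairs)
```

Deduplicating per post keeps one relationship per user and post; counting repeats instead would be a reasonable alternative if tag frequency matters.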
Another possible Node to add is Hashtags:
(What’s the verb here? Posted overlaps with user-post relationships. Someone “used” a hashtag?)
Hashtag nodes would provide an interesting bridge between posts, users and comments. The ultimate goal is to model ‘connections’ between people in Instagram’s digital landscape, so it’s important to ask questions like “Should the number of times they used a hashtag increase the weight of the relationship?”.
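If the answer to that question is yes, the weights could be counted before anything is written to the graph. This is only a sketch: the ‘caption’ field is a hypothetical column, and counting a hashtag once per post (rather than once per occurrence) is one possible weighting among several.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_weights(posts):
    """Count, per (username, hashtag), how many posts used the hashtag."""
    weights = Counter()
    for post in posts:
        # set() so a hashtag repeated within one caption counts once
        for tag in set(HASHTAG.findall(post["caption"].lower())):
            weights[(post["username"], tag)] += 1
    return weights


weights = hashtag_weights([
    {"username": "kevin", "caption": "#Crab life at the #beach"},
    {"username": "kevin", "caption": "more #crab #crab"},
    {"username": "margot", "caption": "floating by the #beach"},
])
print(weights[("kevin", "crab")])
```

Each count could then be stored as a property on the user-to-hashtag relationship, whatever verb it ends up with.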
Next steps in graphing
Here’s our very rudimentary Insta-graph:
Red user nodes are connected to pink Post nodes. Cnidarian resemblances aside, it’s not terrifically informative yet. We don’t have enough data visualized to show relationships between hashtags (which are off floating somewhere) due to my computer’s hardware limitations; regular Neo4J isn’t built for massive-quantity viz the way Gephi is.
However, the dev team recently released Neo4J Bloom for free use with local desktop servers. This is wicked exciting for me, because I can start to build larger, more complex graph visualizations with more control over aesthetics & arrangement.
Next time we’ll build out more relationships & explore Bloom visualizations.