In my last article, I noted that a big messy visualization is great, but not incredibly useful for asking specific questions. We need a more organized way to explore our graph.
Enter Neo4j, a database designed specifically for graphs. It’s got loads of neat things like relational data imports and native graph algorithms, and I’m only using a fraction of its power today. It’s also very aesthetic, which is quite underrated in tech. They’ve got a Medium page if you want to see more.
From Python to Database
A lot of data gets imported to Neo from CSV files, and there are interesting ways to convert a conventional SQL-ish relational database into a graph. Since I’ve already got a NetworkX graph in Python, I found a specialized import library called ‘NeoNX’ and rejoiced in my quick path to success.
However, NeoNX lacked some critical functionality I needed, so I talked graphs with Dan, who used Py2Neo for his neat tweet sentiment analysis graph. It’s quite powerful and supports database querying in Cypher, Neo’s SQL analog. I rewrote the graphing functions in Jupyter and uploaded the bribes to Neo, totaling 617 nodes (82 industries, 535 members) and 5,355 donations. Here’s a visualization of a smaller (300-node) subsection of the graph.
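For a rough idea of what such an upload involves, here is a minimal sketch that turns donation rows into a parameterized Cypher statement. It is not the code from my notebook: the `Industry` and `Member` labels, the `DONATED_TO` relationship type, the `amount` property and the rows themselves are all assumptions made up for illustration. It deliberately stops short of calling any driver, so each statement and parameter dict would still have to be handed to whatever client you use.

```python
# Made-up (industry, member, amount) rows, purely illustrative.
donations = [
    ("Oil & Gas", "Member A", 50000),
    ("Oil & Gas", "Member B", 12000),
    ("Retired", "Member A", 30000),
]

# Assumed schema: (:Industry)-[:DONATED_TO {amount}]->(:Member).
# MERGE creates a node or relationship only if it does not exist yet,
# so re-running the import does not duplicate anything.
CYPHER = (
    "MERGE (i:Industry {name: $industry}) "
    "MERGE (m:Member {name: $member}) "
    "MERGE (i)-[d:DONATED_TO]->(m) "
    "SET d.amount = $amount"
)


def to_cypher_params(rows):
    """Turn (industry, member, amount) rows into parameter dicts for CYPHER."""
    return [
        {"industry": industry, "member": member, "amount": amount}
        for industry, member, amount in rows
    ]


params = to_cypher_params(donations)
for p in params:
    # In a real import, CYPHER and p would be passed to the database client here.
    print(CYPHER, p)
```

Passing values as parameters instead of pasting them into the query string keeps names with quotes or ampersands from breaking the statement.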
Let’s get an overview of some important metrics. Neo’s even got an example suite, which is bloody nice of them, let me tell you.
Nodes are labeled based on Party, or Industry if they’re not congressmembers. At first it looks like the average number of relationships isn’t that different between the parties, but that’s partly an artifact: CRP’s API only returns the top 10 industries per candidate, which caps how many donations any member can have. The connectedness of industries, on the other hand, varies greatly.
Each member’s money received in descending order:
There’s some popular names up there. We’ll look into Warren and Bernie because they’ve been in the news lately, and then Cruz because he’s a big beneficiary.
Here’s Bernie’s donations (total $2,271,338), ordered by the number of donations the industry’s made in total (c) — the industry’s “popularity”.
Here’s Ted ($18,758,299) and Warren ($10,886,802):
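The tables in this section boil down to two aggregations: total money per member, and one member’s donations ordered by how many donations each industry has made overall (the `c` column). Here is a small sketch of both in plain Python rather than Cypher, assuming a flat list of (industry, member, amount) rows. The rows and function names are invented for illustration and are not the real figures.

```python
from collections import Counter, defaultdict

# Made-up (industry, member, amount) rows, purely illustrative.
donations = [
    ("Oil & Gas", "Member A", 50000),
    ("Oil & Gas", "Member B", 12000),
    ("Retired", "Member A", 30000),
    ("Retired", "Member B", 8000),
    ("Retired", "Member C", 5000),
    ("Insurance", "Member A", 20000),
]


def totals_by_member(rows):
    """Total money received per member, largest first."""
    totals = defaultdict(int)
    for _, member, amount in rows:
        totals[member] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def member_donations_by_popularity(rows, member):
    """One member's donations as (industry, amount, c), where c is the number
    of donations that industry made across all members, largest c first."""
    popularity = Counter(industry for industry, _, _ in rows)
    own = [(ind, amt, popularity[ind]) for ind, m, amt in rows if m == member]
    return sorted(own, key=lambda row: row[2], reverse=True)


print(totals_by_member(donations))
print(member_donations_by_popularity(donations, "Member A"))
```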
That was a lot of rather dry tables, so let’s look at some graphs. Here’s the aforementioned three and all their connected industries. This is where you can begin to see the real value of graphs: seeing connections beyond the first degree, and it only gets more interesting from there.
The top 30 biggest receiving members and their connected industries. We can get a rough picture of which industries are more partisan and which are more central.
However, Neo clusters them pretty tight, and I can’t find any edge tuning mechanisms, so we’ll have to look at smaller pieces of the picture.
I queried the top 20 members and manually fixed their positions farther out from each other, so the individual connections are more visible. The same structure from the last query is visible.
Hey big spender
Next, we’ll look at the top industries and their second-degree connections. The columns are: “ind_money_spent”, the total money spent by the industry; “ind_num_bribes”, the number of donations it made to members; “mem_money”, the total money received by the members that industry has donated to; and “mem_num_bribes”, the number of donations received by all members the industry has donated to.
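To make the four columns concrete, here is a sketch that computes them from a flat list of (industry, member, amount) rows. The rows are invented, and this is plain Python standing in for the actual Cypher query, so treat it as a reading aid for the definitions and not as the query I ran.

```python
from collections import defaultdict

# Made-up (industry, member, amount) rows, purely illustrative.
donations = [
    ("Oil & Gas", "Member A", 50000),
    ("Oil & Gas", "Member B", 12000),
    ("Retired", "Member A", 30000),
    ("Retired", "Member B", 8000),
    ("Retired", "Member C", 5000),
    ("Insurance", "Member A", 20000),
]


def industry_metrics(rows):
    """Per-industry first-degree and second-degree totals."""
    ind_money = defaultdict(int)    # money each industry spent
    ind_count = defaultdict(int)    # donations each industry made
    mem_money = defaultdict(int)    # money each member received from anyone
    mem_count = defaultdict(int)    # donations each member received from anyone
    members_of = defaultdict(set)   # members each industry donated to

    for industry, member, amount in rows:
        ind_money[industry] += amount
        ind_count[industry] += 1
        mem_money[member] += amount
        mem_count[member] += 1
        members_of[industry].add(member)

    return {
        industry: {
            "ind_money_spent": ind_money[industry],
            "ind_num_bribes": ind_count[industry],
            # Second degree: sum over the members this industry donated to.
            "mem_money": sum(mem_money[m] for m in members),
            "mem_num_bribes": sum(mem_count[m] for m in members),
        }
        for industry, members in members_of.items()
    }


metrics = industry_metrics(donations)
for industry, row in metrics.items():
    print(industry, row)
```

Note that the two “mem_” columns count everything those members received, not just what came from the industry in question, which is what makes them second-degree.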
I displayed the top 10 industries and all the members they’ve donated to, but it’s an undifferentiated blob. There’s not much value in seeing ‘where they end up’ unless I include the whole graph, and the layout for that might never finish converging in Neo.
Some takeaways: displaying only the top 10 (of 82) Industry nodes + their connections showed us 534/535 congressmembers, and 2391/5355 donations. The big spenders really go all in.
But this is still quite a mess. Let’s get algorithmic.
Once you’ve got a graph, there’s loads of interesting algorithms that can derive fascinating structure about the data. The structure has always been there — the human brain just isn’t poised to see it naturally, so we made computers to help us see it; that’s what makes data science so beautiful. Here’s another table.
The Betweenness Centrality algorithm finds the shortest weighted paths between every pair of nodes. For each node, it then counts how many of those paths pass through it: the more shortest paths run through a node, the higher its score.
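Here is a small, self-contained sketch of that idea using Brandes’ algorithm. Two loud caveats: it treats every edge as equal (unweighted), which is a simplification of the weighted version described above, and it is not Neo’s implementation. The graph and node names are made up.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                  # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:       # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing each node's share of
        # shortest paths to its predecessors.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was visited from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}


# Made-up chain: Member A - Oil & Gas - Member B - Retired - Member C
edges = [
    ("Oil & Gas", "Member A"),
    ("Oil & Gas", "Member B"),
    ("Retired", "Member B"),
    ("Retired", "Member C"),
]
scores = betweenness(build_adjacency(edges))
for node, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(node, value)
```

In this toy chain the member sitting between the two industries scores highest, because every path from one side to the other has to go through them.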
If we simplify lobbying donations as the transfer of money, emails and handshakes, then Betweenness is influence: ‘who knows a lot of people’, ‘who can network/exert control over others’. This is certainly a simplification — graphs are models of real-world networks, and models don’t get all the details right. But for our purposes, it does the job at indicating patterns hidden beneath the surface.
Some of these high-influence industries are more concerning than others. Take the Oil/Gas and Pharmaceutical lobbies; they’ve got a nasty combination of several factors: high overall money spent, many connections, and connections to diverse members (indicated by their high Betweenness score). So do the Retired and Insurance lobbies, but they’re not killing people by pushing pollution and painkillers — another reason that it’s important to interpret data through a human mind, because we pick up on the sort of stuff computers might miss.
We’ve now got a very solid base to work with. Neo allows in-depth querying of specific nodes and patterns, and offers some powerful algorithms for further analysis. It’s harder to get a visual overview of the whole huge graph, but there’s other tools for that (like Gephi). Still, the developments so far are promising; what tends to get lost in bureaucracy will be dragged into the light by the unflinching power of math.
Next week, we’ll use community detection algos to see if there’s groups of industries & members who tend to flock together. In the end, I’d like to host my Neo database remotely so anyone can query/download it, and design a decent frontend for non-tech users to easily investigate congresspeople of interest. In the meantime, I uploaded some of the queried results as CSVs in my GoogleDrive if you’re eager to check them out.
You can follow my project on Github; stay tuned for more inquisitive investigations into iniquitous inducements.