Algorithmic Detection in Congressional Lobbying
Which birds flock together?
Welcome back to another gripping investigation into legal bribery. Last time, we put the 2018 US Congress graph into Neo4j and ran some basic queries. We already have a way of discovering who’s in which pocket, so now it’s time to expand the definition of ‘pocket’ to look at communities. But first, a brief note on ‘influence’.
Ranking Algorithms
I ran Google’s PageRank algorithm to see which members are more ‘involved’. The famous algo works by treating webpages as nodes and the links between them as relationships. Each page then gets a score based on the number and quality of the links pointing to it, where a link from a high-scoring page counts for more; Google tends to show you higher-scoring sites over lesser-linked alternatives.
Here are the top 20 members by score. The minimum score is 0.15, and only 27 members scored above 0.20. We can infer that ‘involvement’ is rather heavily concentrated in a wealthy few (sound familiar?).
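For the curious, here’s roughly what producing such a ranking looks like in Cypher, using Neo4j’s Graph Data Science library. Treat it as a sketch: the Member/Industry labels, the DONATED_TO relationship, and the amount property are stand-ins for whatever the actual schema calls them.

```cypher
// Project the donation graph into GDS's in-memory catalog.
// Labels, relationship type, and the 'amount' weight property
// are assumed names -- substitute your own schema's.
CALL gds.graph.project(
  'lobbying',
  ['Member', 'Industry'],
  { DONATED_TO: { properties: 'amount' } }
);

// Weighted PageRank: each donation edge contributes in
// proportion to its dollar amount.
CALL gds.pageRank.stream('lobbying', {
  relationshipWeightProperty: 'amount',
  dampingFactor: 0.85
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, score
WHERE node:Member
RETURN node.name AS member, score
ORDER BY score DESC
LIMIT 20;
```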
Some things look a bit off, though. Even though Bernie Sanders received only $2m to Ted Cruz’s $19m, Sanders is ahead in score. PageRank has some nuances that are important to consider:
I factored money into the score, which is critical given this graph’s nonstandard structure (most graphs are more interconnected, but here Members never link to Members, nor Industries to Industries: the graph is bipartite).
So each candidate’s score is a bit complex. It’s affected by the number and size of incoming donations, but the ‘weight’ of those donations is also affected by the ‘weight’ of the industry they came from. An industry’s weight is determined by its total money spent, but also by how spread out its donations are.
The logic is that a weblink from Wikipedia is worth more ‘influence’ than a link from your cousin’s amateur ska review blog, and that a dollar from Pharmaceuticals is more important than a dollar from Poultry. So Bernie may have received money from more diverse industries than Ted, from more ‘influential’ industries, or both. PageRank seems to prioritize breadth of connectedness over depth.
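To make that concrete, here’s one common formulation of weighted PageRank. I won’t claim it’s the exact variant running under the hood, but it captures the mechanics described above:

$$\mathrm{PR}(n) = \frac{1-d}{N} + d \sum_{m \in \mathrm{In}(n)} \frac{w(m,n)}{\sum_{k \in \mathrm{Out}(m)} w(m,k)}\,\mathrm{PR}(m)$$

Here d is the damping factor (commonly 0.85), N is the number of nodes, and w(m, n) is the donation amount on the edge from m to n. Each donation is normalized against the donor’s total giving and scaled by the donor’s own score, which is exactly why raw dollar totals and PageRank rankings can disagree.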
Community Detection
I decided to run Louvain modularity, as it’s quite robust even on larger network graphs and runs in O(n·log²(n)), which is not bad for something so complex. It works by measuring and comparing edge density inside and outside candidate communities, and tends to yield a nuanced result regardless of how tangled the data is. After experimenting with different iteration parameters, I found that it consistently divided all 617 nodes into 5 communities. Here’s the spreadsheet of each node and its assigned community number.
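Here’s a sketch of the Louvain call, again via GDS and the same hypothetical schema. Louvain wants an undirected view of the graph, so the projection drops the donation edges’ direction; the ‘iteration parameters’ mentioned above correspond to settings like maxLevels here.

```cypher
// Louvain treats the graph as undirected, so project the donation
// edges without orientation. Names are assumed, as before.
CALL gds.graph.project(
  'lobbying-communities',
  ['Member', 'Industry'],
  { DONATED_TO: { orientation: 'UNDIRECTED', properties: 'amount' } }
);

// Stream each node's community assignment, weighting edges by dollars.
CALL gds.louvain.stream('lobbying-communities', {
  relationshipWeightProperty: 'amount',
  maxLevels: 10
})
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS name, communityId
ORDER BY communityId, name;
```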
I colored the nodes accordingly for easy visualization.
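Persisting the assignment as a node property makes it available to whatever tool does the coloring (Gephi can read it directly; Browser is more limited). In write mode, that’s one call:

```cypher
// Write each node's community id back onto the node so downstream
// tools can color by it; also report how many communities were found.
CALL gds.louvain.write('lobbying-communities', {
  relationshipWeightProperty: 'amount',
  writeProperty: 'community'
})
YIELD communityCount, modularity;
```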
Neo4j’s Browser has some limitations when dealing with larger, denser graphs, although the limits also exist in our minds: it’s hard to discern any useful signal in a tangled, noisy graph. When I limited the result to display only 1/6th of the graph, however, things became clearer.
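One crude way to pull such a slice, under the same assumed schema (LIMIT caps returned paths rather than nodes, so the fraction is approximate and worth tuning by eye):

```cypher
// Fetch a limited slice of member-industry connections for display.
// The LIMIT is a blunt instrument: raise or lower it until legible.
MATCH (m:Member)-[r:DONATED_TO]-(i:Industry)
RETURN m, r, i
LIMIT 200;
```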
Machines aren’t visual, but we are. At a smaller scale, we begin to see some of the structure the Louvain algorithm saw. The Health Services and Oil/Gas lobbies (yellow) seem to have a member base that is “geographically” distinct from the unions’ (orange), and Insurance + Securities (pink) has a large, diverse member base (as indicated by its central position in my rough earlier graphs).
Increasing the subset to 2/6ths of the graph made things a bit more chaotic. The dispersal of the tighter groups likely indicates that the picture isn’t complete: the missing 4/6ths would pull it back into “shape”. Additionally, the layout algorithms Neo’s Browser uses may differ from the ones Gephi uses.
It’s important to note that these visual limitations stem purely from computational constraints on my end: Neo’s Browser isn’t designed for this sort of load, and there are a couple of visualization tools that are. In the next installment of this series, we’ll sort the communities in Gephi, comparing several detection algorithms across different layout algorithms.
Although most of the recent work has been done locally in Neo, you can follow my project on GitHub.
I’ve been working on hosting the graph remotely and deploying an interactive, queryable app (Heroku, AWS, and such), but these services all seem to charge a hefty monthly fee (graph data is trickier to host than regular SQL), so I’ll continue investigating an economical way to simplify access to the data. In the meantime, you can always grab the relevant CSVs from my drive.