A Python ETL Pipeline for NASA’s Sulfur Dioxide Data

Effortlessly stream pollution to your SQL server

While exploring the SO² data I’ve been writing about, I wanted to formalize my Extract-Transform-Load process into an easily executable file so that interested users can download their own data without having to muck through Jupyter notebooks.

Pipelines are a crucial part of any data-powered business; smart ones will gather information, clean it and store it all in real-time. Companies have different approaches to storage depending on size and software requirements, but the philosophy is the same: get the data from A to B in a robust and hopefully simple manner.

First, you’ll need some HDF5 files from NASA, which is easier than it sounds. They basically just want you to register an account before downloading, and it’s easy to specify how much & which data you want. Once you’ve got some .h5 files to unpack, move them to a folder called ‘data’ within your working directory.

This script is compatible with any server that connects to SQLAlchemy; all you need to do is substitute your own values for username, password, host and database name, so your config.py file is simply:

user = 'drake'
pw = 'orange'
host = 'rhymes.with.us-east-1.rds.amazonaws.com'
db = 'doorhinge'

If you’re not using MySQL, you’ll also have to edit the engine string:

def make_engine():
user = config.user # substitute your own config variables
pw = config.pw
host = config.host
db = config.db
connst = f'mysql+pymysql://{user}:{pw}@{host}/{db}'
engine = create_engine(connst, echo=False)
return engine

Change that “mysql+pymysql” to the proper SQLAlchemy configuration for your server (Oracle, Postgres, etc).

Once that’s all set, just run “python pipeline.py” from your terminal and data will flow freely into your database. After imports and sub-methods, the entire process is one method:

def process_h5s():
files = get_files_list() # get filepath strings
engine = make_engine() # create server engine
print(f'processing {len(files)} files...')
for f in tqdm(files):
df = process_orbit(f) # extract h5 data into df
df.to_sql(config.db, con=engine, if_exists='append') # upload
return

SQLAlchemy and Pandas are super compatible, and together they can save you a lot of trouble customizing table variables. The dataframe.to_sql() method uses your SQLAlchemy engine to create a new table in the database, and setting it to ‘append’ allows us to iteratively push new .h5s to the same table. TQDM is a neat wrapper which gives you a % progress bar for loops. A clean pipe is a good pipe.

This is a pipeline, but it’s certainly not how every pipeline should look. If we need daily updates to feed to an atmosphere-monitoring app, we’d have to run a headless browser to access GESDISC with an existing account (I can’t find an API for it) and download new links each day, then curl the links into our ~/data folder before resuming the piping. If we wanted rainfall instead, we’d have to nab different links and change the .h5-to-dataframe method accordingly.

But this is just a basic ETL process: when my SO² analysis gets predictive, this pipeline becomes much more interesting. A “complete” pipeline will often have modeling & analysis steps after loading the data to storage. For example, when you speak to your friend about something and your phone shows you an ad for it ten minutes later, you can be sure there’s a very efficient pipeline at work.

You can follow this project, and grab the pipeline, on my GitHub.

Written by

data scientist, machine learning engineer. passionate about ecology, biotech and AI. https://www.linkedin.com/in/mark-s-cleverley/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store