Wednesday, March 12, 2014

Importing GTFS files into SQLite

A great comment in my previous post about demystifying GTFS transit schedule data pointed out that the various files in a GTFS file are simply database tables. Each file can be imported into a relational database as a separate table, and queried using SQL instead of the custom scripts I used.

In fact, I found SQL to be faster and easier to maintain than the Python script...for a while. Eventually I rewrote my Python code, added a preprocessor, and found just as fast and easier to maintain than mixed Python/SQLite.
Nevertheless, thanks to Stefan, for the tip!

Here's a little more detail about exactly how to do it.

We will use the very simple, fast application SQLite for this, since our tables and queries will be rather simple and straightforward. Other possible databases include MongoDB and CouchDB. Indeed, for the very simple queries we used before, a series of good-old gdbm key-value databases could work.


Setup and importing GTFS tables into SQLite


In Ubuntu, installing SQLite3 is very simple:

sudo apt-get install sqlite3


Next, let's manually download the GTFS file for the Milwaukee County Transit System, uncompress it, create a new database, add a table to the database for the stops file, import stops file into the database, and save the database.


$ mkdir /home/me/GTFS                               # Create a working directory 
$ wget -O /home/me/GTFS/mcts.gtfs http://kamino.mcts.org/gtfs/google_transit.zip
                                                    # Download the GTFS file
$ unzip -d /home/me/GTFS /home/me/GTFS/mcts.gtfs    # Unzip the GTFS file
$ sqlite3

sqlite> attach /home/me/GTFS/mcts.db as mcts        # Create a new database
sqlite> create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                           stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                           zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
sqlite> .separator ","                              # Tell SQLite that it's a CSV file
sqlite> .import /home/me/GTFS/stops.txt stops       # Import the file into a db table
sqlite> .dump                                       # Test the import
sqlite> delete from main.stops where stop_id like 'stop_id';  # Delete the header line
sqlite> select * from mcts.stops where stop_id == 5505;       # Test the import
sqlite> .backup mcts /home/me/GTFS/mcts.db          # Save the database
sqlite> .quit


Scripting imports

We can also script it. Here's a more robust script that creates multiple tables. The column names are explained on the first line of each table (which is why we must delete that line). The data types -TEXT, REAL, and NUMERIC-, and conversions from various Java, Python, C, and other datatypes are clearly explained in the SQLite documentation. The explanation of the field names, and expected datatype, is explained in the GTFS documentation. Each provider's GTFS file can include many optional fields, and may use different optional fields over time, so you are *likely* to need to tweak this script a bit to get it to work:

create table agency(agency_id TEXT,agency_name TEXT,agency_url TEXT,
                    agency_timezone TEXT,agency_lang TEXT, agency_phone TEXT);
create table calendar_dates(service_id TEXT,date NUMERIC,exception_type NUMERIC);
create table routes(route_id TEXT,agency_id TEXT,route_short_name TEXT,
                    route_long_name TEXT,route_desc TEXT,route_type NUMERIC,
                    route_url TEXT,route_color TEXT,route_text_color TEXT);
create table shapes(shape_id TEXT,shape_pt_lat REAL,shape_pt_lon REAL,
                    shape_pt_sequence NUMERIC);
create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                   stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                   zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
create table stop_times(trip_id TEXT,arrival_time TEXT,departure_time TEXT,
                        stop_id TEXT,stop_sequence NUMERIC,stop_headsign TEXT,
                        pickup_type NUMERIC,drop_off_type NUMERIC);
create table trips(route_id TEXT,service_id TEXT,trip_id TEXT,
                   trip_headsign TEXT,direction_id NUMERIC,
                   block_id TEXT,shape_id TEXT);
.separator ','
.import /home/me/GTFS/agency.txt agency
.import /home/me/GTFS/calendar_dates.txt calendar_dates
.import /home/me/GTFS/routes.txt routes
.import /home/me/GTFS/shapes.txt shapes
.import /home/me/GTFS/stops.txt stops
.import /home/me/GTFS/stop_times.txt stop_times
.import /home/me/GTFS/trips.txt trips
delete from agency where agency_id like 'agency_id';
delete from calendar_dates where service_id like 'service_id';
delete from routes where route_id like 'route_id';
delete from shapes where shape_id like 'shape_id';
delete from stops where stop_id like 'stop_id';
delete from stop_times where trip_id like 'trip_id';
delete from trips where route_id like 'route_id';
select * from stops where stop_id == 5505;

And run that script using:

$ sqlite3 mcts.db < mcts_creator_script



Reading GTFS data from SQLite


Now, like in the previous GTFS post, let's find the next buses at the intersection of Howell and Oklahoma. There are four stops at that location: 658, 709, 5068, and 5152.


First, let's find the appropriate service codes for today's date:

# Query: The list of all service_ids for one date.
sqlite> SELECT service_id FROM calendar_dates WHERE date == 20140310;
14-MAR_CY-AON_0
[...long list...]
14-MAR_WN-PON_0


Scripting reads

These queries can also be scripted. Here's an example script that looks up the four stops we care about for a two-hour window:

-- Usage:  $ sqlite3 GTFS/mcts.db < GTFS/mcts_lookup.sh
-- Usage:  sqlite> .read GTFS/mcts_lookup.sh 

-- List of the valid service_id codes for the current date
CREATE VIEW valid_service_ids AS
   SELECT service_id 
   FROM calendar_dates 
   WHERE date == strftime('%Y%m%d', 'now', 'localtime')
   ;

SELECT stop_times.arrival_time, trips.route_id, trips.trip_headsign
   FROM trips, stop_times

   -- Match the trip_id field between the two tables
   WHERE stop_times.trip_id == trips.trip_id

   -- Limit selection to the stops we care about 
   AND stop_times.stop_id IN (658,709,5068,5152)

   -- Limit selection to service_ids for the correct day
   AND trips.service_id IN valid_service_ids

   -- Limit selection to the next hour from now
   AND stop_times.arrival_time > strftime(
                                 '%H:%M:%S', 'now', 'localtime', '-5 minutes')
   AND stop_times.arrival_time < strftime(
                                 '%H:%M:%S', 'now', 'localtime', '+1 hour')
   ORDER BY stop_times.arrival_time
   ;

-- Clean Up
DROP VIEW valid_service_ids;

And there are two ways to run the script:

sqlite&gt> .read lookup_script         # Within sqlite
$ sqlite3 mcts.db < lookup_script      # Shell script



Results

I found that importing GTFS files into SQLite requires a lot of memory and CPU...but is faster than Python, and the scripts are smaller and easier to maintain. SQLite is a good, fast processor or pre-processor.

File sizes:

  • mcts.gtfs: 5.6M
  • mcts.db: 79M
  • mtcs.gtfs unzipped files: 86M
I think that preprocessing or shrinking those files will be important for low-power or low-bandwidth applications.

Query times:

Here are the query times for buses within a two-hour window at Howell and Oklahoma:
  • Original Python: 13 sec
  • SQLite: 2.5 sec
  • Rewritten Python w/Preprocessor: 0.2 sec

Code size:

To do those queries,
The original Python 3 script: 206 lines.
SQLite script to import the GTFS file: 33 lines.
SQLite script to lookup the stops: 32 lines.

No comments: