
Friday, June 13, 2014

Intermediate GTFS: How to prevent stop_times.txt from hogging all your RAM

In a GTFS file for a large provider (example: Chicago), the stop_times.txt file can be huge.

Here's an example:

$ unzip -v 20140611.cta.gtfs

 Archive:  20140611.cta.gtfs
  Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
 --------  ------  ------- ---- ---------- ----- --------  ----
      235  Defl:N      153  35% 2014-06-10 22:49 6dfdcbd7  agency.txt
     4889  Defl:N      554  89% 2014-06-10 22:49 38544274  calendar.txt
     2139  Defl:N      335  84% 2014-06-10 22:49 f69f2cb4  calendar_dates.txt
    13000  Defl:N     2279  83% 2014-06-10 22:49 59bf453e  routes.txt
 29896363  Defl:N  7158190  76% 2014-06-10 22:54 21c7d003  shapes.txt
  1332298  Defl:N   264714  80% 2014-06-10 23:00 48a736df  stops.txt
355900629  Defl:N 57437701  84% 2014-06-10 23:24 e13831ff  stop_times.txt
     2514  Defl:N      575  77% 2014-06-10 23:24 51dca8cf  transfers.txt
  5654714  Defl:N   656156  88% 2014-06-10 23:25 7056aa15  trips.txt
       42  Defl:N       39   7% 2014-06-10 22:49 87676593  frequencies.txt
    18959  Defl:N     6180  67% 2011-06-20 11:07 bcf138d1  developers_license_agreement.htm
 --------          -------  ---                            -------
392825782         65526876  83%                            11 files

Uncompressed, the stop_times.txt file is 356 MB, 91% of the entire archive.

You can see right away that loading that behemoth into memory will cause a pause (30 seconds, on my hardware) that a user will probably notice. And because of the way Python dicts use memory, 356 MB of CSV data can easily explode into 3 or 4 GB of an equivalent Python dict. That leaves us with a problem: we can't load the file on demand because it takes too darn long, and we can't keep it in an easily-accessible form in memory because it's too darn big. While we could keep it in CSV format in RAM, that's not a format we can do anything with - every query would need to iterate through the entire file, which is sort of the worst of both worlds.




So what can we do?


First, we can filter data immediately upon load, and discard big chunks of it right away. This means the structure of the program must change from:

cal = load_table('calendar.txt')
rou = load_table('routes.txt')
tri = load_table('trips.txt')
sto = load_table('stops.txt')
tim = load_table('stop_times.txt')
svc_ids = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_ids, tim)

to something more like:

cal = load_table('calendar.txt')
rou = load_table('routes.txt')
tri = load_table('trips.txt')
sto = load_table('stops.txt')
tim = load_and_filter_stop_times_table('stop_times.txt', filter_criteria)
svc_ids = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_ids, tim)

Instead of loading the entire table and holding it all in memory, a customized loader filters the data right away and chucks out the 90% we don't actually want.
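
For example, a filtering loader might look something like this (a minimal sketch, assuming we filter on a set of stop_ids; the names and structure here are illustrative, not the working sample code linked below):

import csv
import io
import zipfile

def load_and_filter_stop_times_table(gtfs_file, wanted_stop_ids):
    # gtfs_file is an open zipfile.ZipFile; wanted_stop_ids is a set of strings
    output = {}
    with gtfs_file.open('stop_times.txt', mode='r') as infile:
        # Stream the CSV one row at a time instead of slurping the whole member
        reader = csv.DictReader(io.TextIOWrapper(infile))
        for row_number, row in enumerate(reader):
            if row['stop_id'].strip() in wanted_stop_ids:
                # Keep the ~10% we care about; the rest is never stored
                output[row_number] = row
    return output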


Second, we can use the size hint parameter of Python's readlines() to prevent a spike in RAM usage (and the consequent slowdown of the entire system).

For example, normally you read data with something like:

with gtfs_file.open('stop_times.txt', mode='r') as infile:
    data = infile.read()
output = do_stuff_to(data.decode())
return output


And we can rearrange it to something like:

eof    = False
output = {}
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    while not eof:
        input_block = infile.readlines(4096)
        if len(input_block) == 0:
            eof = True
        else:
            output_block = do_stuff_to(input_block)
            output.update(output_block)

return output

The EOF flag controls the process, and an empty read triggers the flag.
The input RAM requirements are tiny, since each block re-uses the same memory over and over. Only the filtered output grows.

The performance penalty of this kind of loading seems to be about 30% (10 seconds) longer than a single read(), but now the process can run happily in the background without slowing other processes. Filtering the huge file down to a manageable size and searchable format also makes subsequent use of the data _much_ faster.
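
If you want to measure that penalty on your own hardware, a crude timing harness like the following will do (the path is the CTA file from above; the block sizes are just starting points to experiment with):

import time
import zipfile

def time_block_read(gtfs_path, member='stop_times.txt', hint=4096):
    # Read the member in blocks, timing one full pass
    start = time.perf_counter()
    line_count = 0
    with zipfile.ZipFile(gtfs_path) as z:
        with z.open(member, mode='r') as infile:
            while True:
                block = infile.readlines(hint)
                if not block:
                    break
                line_count += len(block)   # stand-in for real filtering work
    return time.perf_counter() - start, line_count

for hint in (4096, 65536, 1048576):
    elapsed, lines = time_block_read('/home/ian/gtfs/20140611.cta.gtfs', hint=hint)
    print(hint, round(elapsed, 1), lines)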




Working examples

Here's working sample code of a generic load_table(), which I called map_gtfs_table_to_dict()

Here's working sample code of a load_stop_times(), which I called map_stop_times_to_dict()
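
In outline, the generic loader maps each CSV line to a dict, keyed on that table's unique ID column. Here's a minimal sketch of the idea (my reconstruction, not the linked sample itself; the TABLE_KEYS mapping is an assumption):

import csv
import io

# Unique key column per table (an assumption for this sketch)
TABLE_KEYS = {'trips.txt': 'trip_id', 'routes.txt': 'route_id',
              'stops.txt': 'stop_id', 'agency.txt': 'agency_id'}

def map_gtfs_table_to_dict(gtfs_file, table_name):
    # gtfs_file is an open zipfile.ZipFile
    key_field = TABLE_KEYS.get(table_name)
    table = {}
    with gtfs_file.open(table_name, mode='r') as infile:
        reader = csv.DictReader(io.TextIOWrapper(infile))
        for row_number, row in enumerate(reader):
            row = {k.strip(): v.strip('" ') for k, v in row.items()}
            # Key on the table's unique column, else fall back to the row number
            key = row.pop(key_field) if key_field else row_number
            table[key] = row
    return table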


Let's use the generic sample code to load trips.txt from a CTA GTFS file:

>>> from map_gtfs_table_to_dict import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_trips = load_table(gtfs_file, 'trips.txt')
>>>
>>> len(all_trips)
90428
>>>
>>> for trip_id in list(all_trips)[0:5]:
...      print(trip_id, all_trips[trip_id])
... 
45072053995 {'route_id': 'Y', 'direction_id': '0', 'wheelchair_accessible': '1',
'service_id': '104509', 'direction': '0', 'shape_id': '304500033',
'block_id': '45007747580', 'schd_trip_id': 'R501'}
437076941189 {'route_id': '147', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'North', 'shape_id': '4374519',
'block_id': '437008159276', 'schd_trip_id': '76941189'}
437076941185 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158984', 'schd_trip_id': '76941185'}
437076941184 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008159008', 'schd_trip_id': '76941184'}
437076941186 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1',
'service_id': '43701', 'direction': 'East', 'shape_id': '4374369',
'block_id': '437008158986', 'schd_trip_id': '76941186'}
>>> 

There we are, all 90,000 trips loaded in a few seconds, each line of the GTFS table broken into a dict for easy access.



Now, let's try using the same generic load function to load stop_times.txt

>>> from map_gtfs_table_to_dict import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_stops = load_table(gtfs_file, 'stop_times.txt')
Traceback (most recent call last):
  File "", line 1, in 
  File "./test.py", line 56, in map_gtfs_table_to_dict
    line_dict[column] = line.split(',')[columns[column]].strip('" ')
MemoryError
>>> 

It took a few minutes to consume all that memory and fail, and everything else on the system slowed to a crawl.

Finally, let's use the custom load_stop_times() to load-and-filter stop_times.txt without reducing the rest of the system to a memory-starved crawl:

>>> from map_stop_times_to_dict import map_stop_times_to_dict as load_table
>>> import zipfile
>>>
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> stop_list = ['3661', '3698', '15026', '17433']
>>> filtered_stops = load_table(gtfs_file, stop_list)
>>> len(filtered_stops)
4379
>>> for line in list(filtered_stops)[0:5]:
...     print(line, filtered_stops[line])
... 
4521987 {'stop_headsign': '79th Red Line', 'departure_time': '13:53:57',
'shape_dist_traveled': '22219', 'arrival_time': '13:53:57',
'pickup_type': '0', 'stop_id': '3661', 'stop_sequence': '35',
'trip_id': '437077075442'}
4538374 {'stop_headsign': '79th Red Line', 'departure_time': '15:04:01',
'shape_dist_traveled': '37847', 'arrival_time': '15:04:01',
'pickup_type': '0', 'stop_id': '3680', 'stop_sequence': '58',
'trip_id': '437077077228'}
4325384 {'stop_headsign': '127th/Lowe', 'departure_time': '17:26:00',
'shape_dist_traveled': '27644', 'arrival_time': '17:26:00',
'pickup_type': '0', 'stop_id': '5992', 'stop_sequence': '42',
'trip_id': '437077000628'}
475149 {'stop_headsign': '95th Red Line', 'departure_time': '18:01:18',
'shape_dist_traveled': '14028', 'arrival_time': '18:01:18',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '20',
'trip_id': '436073915060'}
4112403 {'stop_headsign': '95th Red Line', 'departure_time': '22:08:53',
'shape_dist_traveled': '29370', 'arrival_time': '22:08:53',
'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '42',
'trip_id': '437076995010'}
>>> 

It's still slow to load the entire file. On my equipment, I can load stop_times.txt in about 30 seconds this way. Varying the block size has some benefit, but you can play with that yourself.

In the final example, the 'line' variable -- the key to each subdict describing one trip serving a stop -- is just a counter, the (approximate) CSV row number. stop_times.txt has no field on each line that is unique and usable as a dict key.


Wednesday, March 12, 2014

Importing GTFS files into SQLite

A great comment in my previous post about demystifying GTFS transit schedule data pointed out that the various files in a GTFS file are simply database tables. Each file can be imported into a relational database as a separate table, and queried using SQL instead of the custom scripts I used.

In fact, I found SQL to be faster and easier to maintain than the Python script...for a while. Eventually I rewrote my Python code, added a preprocessor, and found it just as fast and easier to maintain than the mixed Python/SQLite approach.
Nevertheless, thanks to Stefan for the tip!

Here's a little more detail about exactly how to do it.

We will use the very simple, fast application SQLite for this, since our tables and queries will be rather simple and straightforward. Other possible databases include MongoDB and CouchDB. Indeed, for the very simple queries we used before, a series of good-old gdbm key-value databases could work.


Setup and importing GTFS tables into SQLite


In Ubuntu, installing SQLite3 is very simple:

sudo apt-get install sqlite3


Next, let's manually download the GTFS file for the Milwaukee County Transit System, uncompress it, create a new database, add a table to the database for the stops file, and import the stops file into the database.


$ mkdir /home/me/GTFS                               # Create a working directory 
$ wget -O /home/me/GTFS/mcts.gtfs http://kamino.mcts.org/gtfs/google_transit.zip
                                                    # Download the GTFS file
$ unzip -d /home/me/GTFS /home/me/GTFS/mcts.gtfs    # Unzip the GTFS file
$ sqlite3 /home/me/GTFS/mcts.db                     # Create a new database file

sqlite> create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                           stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                           zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
sqlite> .separator ","                              # Tell SQLite that it's a CSV file
sqlite> .import /home/me/GTFS/stops.txt stops       # Import the file into a db table
sqlite> .dump                                       # Test the import
sqlite> delete from stops where stop_id like 'stop_id';       # Delete the header line
sqlite> select * from stops where stop_id == 5505;            # Test the import
sqlite> .quit                                       # Changes are saved automatically


Scripting imports

We can also script it. Here's a more robust script that creates multiple tables. The column names are given on the first line of each table's file (which is why we must delete that line after importing). The data types (TEXT, REAL, and NUMERIC) and the conversions from various Java, Python, C, and other datatypes are clearly explained in the SQLite documentation. The field names and their expected datatypes are explained in the GTFS documentation. Each provider's GTFS file can include many optional fields, and may use different optional fields over time, so you are *likely* to need to tweak this script a bit to get it to work:

create table agency(agency_id TEXT,agency_name TEXT,agency_url TEXT,
                    agency_timezone TEXT,agency_lang TEXT, agency_phone TEXT);
create table calendar_dates(service_id TEXT,date NUMERIC,exception_type NUMERIC);
create table routes(route_id TEXT,agency_id TEXT,route_short_name TEXT,
                    route_long_name TEXT,route_desc TEXT,route_type NUMERIC,
                    route_url TEXT,route_color TEXT,route_text_color TEXT);
create table shapes(shape_id TEXT,shape_pt_lat REAL,shape_pt_lon REAL,
                    shape_pt_sequence NUMERIC);
create table stops(stop_id TEXT,stop_code TEXT,stop_name TEXT,
                   stop_desc TEXT,stop_lat REAL,stop_lon REAL,
                   zone_id NUMERIC,stop_url TEXT,timepoint NUMERIC);
create table stop_times(trip_id TEXT,arrival_time TEXT,departure_time TEXT,
                        stop_id TEXT,stop_sequence NUMERIC,stop_headsign TEXT,
                        pickup_type NUMERIC,drop_off_type NUMERIC);
create table trips(route_id TEXT,service_id TEXT,trip_id TEXT,
                   trip_headsign TEXT,direction_id NUMERIC,
                   block_id TEXT,shape_id TEXT);
.separator ','
.import /home/me/GTFS/agency.txt agency
.import /home/me/GTFS/calendar_dates.txt calendar_dates
.import /home/me/GTFS/routes.txt routes
.import /home/me/GTFS/shapes.txt shapes
.import /home/me/GTFS/stops.txt stops
.import /home/me/GTFS/stop_times.txt stop_times
.import /home/me/GTFS/trips.txt trips
delete from agency where agency_id like 'agency_id';
delete from calendar_dates where service_id like 'service_id';
delete from routes where route_id like 'route_id';
delete from shapes where shape_id like 'shape_id';
delete from stops where stop_id like 'stop_id';
delete from stop_times where trip_id like 'trip_id';
delete from trips where route_id like 'route_id';
select * from stops where stop_id == 5505;

And run that script using:

$ sqlite3 mcts.db < mcts_creator_script



Reading GTFS data from SQLite


Now, like in the previous GTFS post, let's find the next buses at the intersection of Howell and Oklahoma. There are four stops at that location: 658, 709, 5068, and 5152.


First, let's find the appropriate service codes for today's date:

# Query: The list of all service_ids for one date.
sqlite> SELECT service_id FROM calendar_dates WHERE date == 20140310;
14-MAR_CY-AON_0
[...long list...]
14-MAR_WN-PON_0
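
The same lookup also works from Python through the standard sqlite3 module, which is handy if you end up mixing Python and SQLite as I did (a small sketch, using the database path and date from above):

import sqlite3

conn = sqlite3.connect('/home/me/GTFS/mcts.db')
cursor = conn.execute(
    "SELECT service_id FROM calendar_dates WHERE date == ?", (20140310,))
service_ids = [row[0] for row in cursor]   # e.g. ['14-MAR_CY-AON_0', ...]
print(len(service_ids))
conn.close()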


Scripting reads

These queries can also be scripted. Here's an example script that looks up the four stops we care about over roughly the next hour:

-- Usage:  $ sqlite3 GTFS/mcts.db < GTFS/mcts_lookup.sh
-- Usage:  sqlite> .read GTFS/mcts_lookup.sh 

-- List of the valid service_id codes for the current date
CREATE VIEW valid_service_ids AS
   SELECT service_id 
   FROM calendar_dates 
   WHERE date == strftime('%Y%m%d', 'now', 'localtime')
   ;

SELECT stop_times.arrival_time, trips.route_id, trips.trip_headsign
   FROM trips, stop_times

   -- Match the trip_id field between the two tables
   WHERE stop_times.trip_id == trips.trip_id

   -- Limit selection to the stops we care about 
   AND stop_times.stop_id IN (658,709,5068,5152)

   -- Limit selection to service_ids for the correct day
   AND trips.service_id IN valid_service_ids

   -- Limit selection to the next hour from now
   AND stop_times.arrival_time > strftime(
                                 '%H:%M:%S', 'now', 'localtime', '-5 minutes')
   AND stop_times.arrival_time < strftime(
                                 '%H:%M:%S', 'now', 'localtime', '+1 hour')
   ORDER BY stop_times.arrival_time
   ;

-- Clean Up
DROP VIEW valid_service_ids;

And there are two ways to run the script:

sqlite&gt> .read lookup_script         # Within sqlite
$ sqlite3 mcts.db < lookup_script      # Shell script



Results

I found that importing GTFS files into SQLite requires a lot of memory and CPU...but querying is faster than my original Python, and the scripts are smaller and easier to maintain. SQLite is a good, fast processor or pre-processor.

File sizes:

  • mcts.gtfs: 5.6M
  • mcts.db: 79M
  • mcts.gtfs unzipped files: 86M

I think that preprocessing or shrinking those files will be important for low-power or low-bandwidth applications.

Query times:

Here are the query times for buses within a two-hour window at Howell and Oklahoma:
  • Original Python: 13 sec
  • SQLite: 2.5 sec
  • Rewritten Python w/Preprocessor: 0.2 sec

Code size:

To do those queries:
  • Original Python 3 script: 206 lines
  • SQLite script to import the GTFS file: 33 lines
  • SQLite script to look up the stops: 32 lines

Friday, January 24, 2014

Transit schedule data demystified - using GTFS

General Transit Feed Specification (GTFS) is the Google-originated standard format for transit route, stop, trip, schedule, map, and fare data. Everything except real-time data.

It's called a feed because it (usually) includes an RSS update for changes.
There are lists of feeds on the Google wiki, and on the separate GTFS data website.

Each organization's GTFS file includes all their services, so some agency files can get pretty big, and they get updated often. Any schedule change or route adjustment means a new release of the entire GTFS file. The file itself is merely a big zipfile containing several CSV files that are, strangely, required to be mislabelled as .txt.

Here's the contents of Milwaukee County Transit System's GTFS file:

$ unzip -l mcts.zip 
Archive:  mcts.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      169  2014-01-10 05:01   agency.txt
    40136  2014-01-10 05:00   calendar_dates.txt
     5746  2014-01-10 05:01   routes.txt
   307300  2014-01-10 05:00   stops.txt
 35198135  2014-01-10 05:00   stop_times.txt
   650622  2014-01-10 05:01   trips.txt
  8369736  2014-01-10 05:01   shapes.txt
     3490  2014-01-10 05:01   terms_of_use.txt
---------                     -------
 44575334                     8 files

Yeah, 44MB unzipped.
But only 5MB zipped. Still not something you want to download every day to your phone.

Let's find a stop at Mitchell International Airport:

$ cat stops.txt | grep AIRPORT
7168,7168,AIRPORT,,  42.9460473, -87.9037345,,,1
7162,7162,AIRPORT & ARRIVALS TERMINAL,,  42.9469597, -87.9030569,,,0

That's right, there are two stops at the airport. Each stop has a latitude and longitude, a unique ID number, and a descriptive name. The final field designates a timepoint (1 = timepoint, 0 = not).

Let's try an intersection where two routes cross:

$ cat stops.txt | grep "HOWELL & OKLAHOMA"
709,709,HOWELL & OKLAHOMA,,  42.9882051, -87.9043319,,,1
658,658,HOWELL & OKLAHOMA,,  42.9885464, -87.9045333,,,1
$ cat stops.txt | grep "OKLAHOMA & HOWELL"
5152,5152,OKLAHOMA & HOWELL,,  42.9881561, -87.9046550,,,1
5068,5068,OKLAHOMA & HOWELL,,  42.9883466, -87.9041176,,,1

Here's a problem that will require some logic to solve. I consider the intersection to be one place (not a GTFS term). Many trips and routes can use the same stop. Multiple stops (GTFS terms) can exist at the same place. In this case, northbound, southbound, eastbound, and westbound buses each have a different stop at the same place.

This might make your job easier...or harder.

GTFS cares about trips and stops. It doesn't care that Stops #709 and #5152 are twenty meters apart, and serve different routes - that it's a transfer point. Nothing in GTFS explicitly links the two stops. Generally, you must figure out the logic to do that - you have the lat/lon and the name to work with.
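
For example, one crude way to group stops into a 'place' is a simple distance test on the lat/lon pairs (a sketch; the equirectangular shortcut is fine at intersection scale, and any threshold you pick is your own rule, not a GTFS one):

import math

def stop_distance_m(lat1, lon1, lat2, lon2):
    # Equirectangular approximation: good enough over tens of meters
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * 111320.0                       # meters per degree of latitude
    dx = (lon2 - lon1) * 111320.0 * math.cos(mean_lat)  # longitude shrinks with latitude
    return math.hypot(dx, dy)

# Stops #709 and #5152 at Howell & Oklahoma:
print(round(stop_distance_m(42.9882051, -87.9043319,
                            42.9881561, -87.9046550)))   # ~27 m: call it one place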

GTFS does have an optional transfers.txt file that fills in the preferred transfer locations for you. But that's for a more advanced exercise.


Let's see what stops at #709:

$ grep -m 5 ,709, stop_times.txt 
4819177_1560,06:21:00,06:21:00,709,         14,,0,0
4819179_1562,06:49:00,06:49:00,709,         14,,0,0
4819180_1563,07:02:00,07:02:00,709,         14,,0,0
4819181_1564,07:15:00,07:15:00,709,         14,,0,0
4819182_1565,07:28:00,07:28:00,709,         14,,0,0


These fields are trip_id, arrival_time, departure_time, stop_id, and stop_sequence (#709 is the 14th stop of each of these trips), followed by an empty stop_headsign and the pickup_type and drop_off_type flags.

Let's see the entire run of trip 4819177_1560:

$ grep 4819177_1560 stop_times.txt 
4819177_1560,06:09:00,06:09:00,7162,          2,,0,0  # Hey, look - stops out of sequence in the file
4819177_1560,06:09:00,06:09:00,7168,          1,,0,0  # Begin Trip
4819177_1560,06:11:00,06:11:00,7178,          3,,0,0
[...]
4819177_1560,06:20:00,06:20:00,8517,         13,,0,0
4819177_1560,06:21:00,06:21:00,709,         14,,0,0   # Howell & Oklahoma
4819177_1560,06:22:00,06:22:00,711,         15,,0,0
[...]
4819177_1560,07:17:00,07:17:00,1371,         66,,0,0
4819177_1560,07:19:00,07:19:00,6173,         67,,0,0
4819177_1560,07:20:00,07:20:00,7754,         68,,0,0  # End of trip 

We can also look up more information about trip 4819177_1560:

$ grep 4819177_1560 trips.txt 
  GRE,13-DEC_WK,4819177_1560,N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS,0,515111,13-DEC_GRE_0_12

This needs a little more explanation:
  • route_id: Green Line (bus)
  • service_id (weekday/days-of-service): 13-DEC_WK
  • headsign: N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
  • direction_id (binary, 0 or 1): 0
  • block_id (useful only if the same bus changes routes): 515111
  • shape_id (useful for route maps): 13-DEC_GRE_0_12

Let's look up the route_id:

$ grep GRE routes.txt
  GRE,MCTS,  GRE,MetroEXpress GreenLine,,3,http://www.ridemcts.com/Routes-Schedules/Routes/GRE/,,

The full route name is MetroEXpress GreenLine, it's a bus route (route_type 3 means bus), and we have the operator's website for it.

Let's look up the service_id:

$ grep -m 10 13-DEC_WK calendar_dates.txt 
13-DEC_WK,20140113,1
13-DEC_WK,20140114,1
13-DEC_WK,20140115,1
13-DEC_WK,20140116,1
13-DEC_WK,20140117,1
13-DEC_WK,20140120,1
13-DEC_WK,20140121,1
13-DEC_WK,20140122,1
13-DEC_WK,20140123,1
13-DEC_WK,20140124,1

Ah, this specific trip is a weekday (Monday-Friday) only trip.


Let's look up the route map shapefile for the trip:

$ grep 13-DEC_GRE_0_12 shapes.txt 
13-DEC_GRE_0_12,  42.946054, -87.903810,10001
13-DEC_GRE_0_12,  42.946828, -87.903659,10002
13-DEC_GRE_0_12,  42.946824, -87.903588,10003
13-DEC_GRE_0_12,  42.946830, -87.903472,10004
[...]
13-DEC_GRE_0_12,  43.123137, -87.915431,670004
13-DEC_GRE_0_12,  43.123359, -87.915228,670005
13-DEC_GRE_0_12,  43.124016, -87.914535,670006
13-DEC_GRE_0_12,  43.124117, -87.914440,670007

The line for this trip has 520 points. That's pretty detailed.
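
If you want that shape in a usable form, collecting and ordering its points is straightforward (a sketch; the path assumes the working directory of the shell examples above):

import csv

def shape_points(shapes_path, shape_id):
    # Collect one shape's points, ordered by shape_pt_sequence
    points = []
    with open(shapes_path, newline='') as f:
        for row in csv.DictReader(f):
            if row['shape_id'].strip() == shape_id:
                points.append((int(row['shape_pt_sequence']),
                               float(row['shape_pt_lat']),
                               float(row['shape_pt_lon'])))
    points.sort()
    return [(lat, lon) for _, lat, lon in points]

route_map = shape_points('shapes.txt', '13-DEC_GRE_0_12')
print(len(route_map))   # 520 points for this trip's shape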



So what do we know?

We know that Stop #709 is served by the GreenLine route, it's the 14th stop in direction 0, it's a bus line, we have all the times the stop is served, and we have the route website. We know the route map and all the other stops of any trip serving that stop.

How can we find the next scheduled bus at stop #709?

One way is to start with all trips that stop at #709 from stop_times.txt.

Since we probably know what time it is, we can filter out all the past times, and most of the future times. This leaves us with a nice, small list of, say, 10 possibilities - which may still include trips that don't run today at all (we must delve deeper to determine which).

We can look up each of those trips in trips.txt, and get the route.

Each trip also includes a service_id code. The calendar_dates.txt file tells us which dates each service_id code is valid.

Right, we need to do three lookups.
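
Put together, the three lookups look something like this in Python (a sketch with simplified data layouts of my own; times compare correctly as zero-padded HH:MM:SS strings):

def next_buses(stop_ids, today, window_start, window_end,
               stop_times, trips, calendar_dates):
    # stop_times:     iterable of (trip_id, arrival_time, stop_id) tuples
    # trips:          {trip_id: (route_id, service_id, headsign)}
    # calendar_dates: set of (service_id, date) pairs that are in service
    results = []
    for trip_id, arrival, stop_id in stop_times:
        # Lookup 1: arrivals at our stops inside the time window
        if stop_id in stop_ids and window_start <= arrival <= window_end:
            # Lookup 2: the trip gives us the route, service_id, and headsign
            route_id, service_id, headsign = trips[trip_id]
            # Lookup 3: calendar_dates says whether that service runs today
            if (service_id, today) in calendar_dates:
                results.append((arrival, route_id, headsign))
    return sorted(results)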

The shell code gets a bit complex with three lookups, so I shifted to Python and wrote a basic next-vehicle-at-stop lookup in about 160 lines. Python lists are handy, since they handle all the stops at a location just as easily as a single stop. Python's zipfile module is also handy, so I can read data directly from the zipfile. But at 13 seconds, Python is probably too slow for this kind of application:

$ time ./next_bus.py 

Next departures from Howell & Oklahoma
16:16   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:22   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:26    51 OKLAHOMA - TO LAKE DRIVE
16:28    51 TO 124TH ST. - VIA OKLAHOMA
16:30   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:35   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:43    51 TO 124TH ST. - VIA OKLAHOMA
16:44   GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:45    51 TO NEW YORK
16:45   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:56   GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS

real 0m13.171s   # Ugh. If I had started 13 seconds sooner, I wouldn't be bored now.
user 0m10.740s
sys 0m0.260s



All that time crunching the GTFS file has not gone unnoticed.

Trip planners (like Google) pre-process the data, mapping out and caching link-node and transfer relationships, limiting the trip data to the next hour or two (as appropriate), and using rather fancy algorithms to prune the link-node map to a likely set of possibilities before looking at trips along those links.

That's one reason Google Transit is much faster than 13 seconds.

But that's all advanced stuff.

Also advanced is how to integrate real-time data, which uses one of several different formats. Next time...