Here's an example:
$ unzip -v 20140611.cta.gtfs
Archive:  20140611.cta.gtfs
  Length   Method     Size  Cmpr    Date    Time   CRC-32   Name
---------  ------  -------- ---- ---------- ----- --------  ----
      235  Defl:N       153  35% 2014-06-10 22:49 6dfdcbd7  agency.txt
     4889  Defl:N       554  89% 2014-06-10 22:49 38544274  calendar.txt
     2139  Defl:N       335  84% 2014-06-10 22:49 f69f2cb4  calendar_dates.txt
    13000  Defl:N      2279  83% 2014-06-10 22:49 59bf453e  routes.txt
 29896363  Defl:N   7158190  76% 2014-06-10 22:54 21c7d003  shapes.txt
  1332298  Defl:N    264714  80% 2014-06-10 23:00 48a736df  stops.txt
355900629  Defl:N  57437701  84% 2014-06-10 23:24 e13831ff  stop_times.txt
     2514  Defl:N       575  77% 2014-06-10 23:24 51dca8cf  transfers.txt
  5654714  Defl:N    656156  88% 2014-06-10 23:25 7056aa15  trips.txt
       42  Defl:N        39   7% 2014-06-10 22:49 87676593  frequencies.txt
    18959  Defl:N      6180  67% 2011-06-20 11:07 bcf138d1  developers_license_agreement.htm
---------          --------  ---                            -------
392825782          65526876  83%                            11 files
Uncompressed, the stop_times.txt file is 356 MB, 91% of the entire archive.
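You don't need the unzip tool to see this, by the way; Python's zipfile module reports the same sizes. A minimal sketch (the path is just my local copy of the feed):

import zipfile

gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"  # your own copy of the feed

with zipfile.ZipFile(gtfs_path, mode='r') as gtfs_file:
    infos = gtfs_file.infolist()
    total = sum(info.file_size for info in infos)
    for info in infos:
        # file_size is the uncompressed size; compress_size is the size in the archive
        print("{:35s} {:>12,d} bytes ({:.0%} of the feed)".format(
            info.filename, info.file_size, info.file_size / total))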
You can see right away that loading that behemoth into memory will cause a pause (30 seconds, on my hardware) that a user will probably notice. And given the way Python dicts use memory, 356 MB of CSV data can easily explode into 3 or 4 GB once it's converted into a comparable Python dict. That leaves us with a problem: we can't load the file on demand because it takes too darn long, and we can't keep it in an easily accessible form in memory because it's too darn big. We could keep the raw CSV in RAM, but that's not a format we can do anything with - every query would have to iterate through the entire file, which is sort of the worst of both worlds.
So what can we do?
First, we can filter data immediately upon load, and discard big chunks of it right away. This means the structure of the program must change from:
cal = load_table('calendar.txt')
rou = load_table('routes.txt')
tri = load_table('trips.txt')
sto = load_table('stops.txt')
tim = load_table('stop_times.txt')

svc_id = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_id, tim)
to something more like:
cal = load_table('calendar.txt')
rou = load_table('routes.txt')
tri = load_table('trips.txt')
sto = load_table('stops.txt')
tim = load_and_filter_stop_times_table('stop_times.txt', filter_criteria)

svc_id = valid_service_ids_for_today(cal)
trips = trips_that_stop_there_today(tri, svc_id, tim)
Instead of loading the entire table and holding it in memory, a customized loader filters the data as it's read and chucks out the 90% we don't actually want.
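As a sketch of that idea (the names here are illustrative, not the actual sample code linked below), a filtering loader for stop_times.txt might look something like this, keeping only the rows for a handful of stop IDs:

import csv
import io

def load_and_filter_stop_times_table(gtfs_file, wanted_stop_ids):
    """Read stop_times.txt from an open ZipFile, keeping only the wanted stops."""
    filtered = {}
    with gtfs_file.open('stop_times.txt', mode='r') as infile:
        # The zip member is opened in binary mode, so wrap it for text/CSV reading
        reader = csv.DictReader(io.TextIOWrapper(infile, encoding='utf-8'))
        for row_number, row in enumerate(reader):
            if row['stop_id'] in wanted_stop_ids:
                filtered[row_number] = row
    return filtered

Because rows that don't match are dropped the moment they're read, the dict never holds more than the small slice we care about.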
Second, we can read the file in blocks, using the size hint that Python's readlines() accepts, to prevent a spike in RAM usage (and the consequent slowdown of the entire system).
For example, normally you read data with something like:
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    input = infile.read()
output = do_stuff_to(input.decode())
return output
And we can rearrange it to something like:
eof = False
output = {}
with gtfs_file.open('stop_times.txt', mode='r') as infile:
    while not eof:
        input_block = infile.readlines(4096)
        if len(input_block) == 0:
            eof = True
        output_block = do_stuff_to(input_block)
        output.update(output_block)
return output
The EOF flag controls the process, and an empty read triggers the flag.
The input RAM requirements are tiny, since each block re-uses the same memory over and over. Only the filtered output grows.
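What do_stuff_to() actually does is the filtering from the first step. Here's a hedged sketch of one possible version (the extra arguments are my own additions for illustration; the loop above would need to pass them in), assuming the header row has already been parsed into a column-name-to-index mapping:

def do_stuff_to(input_block, columns, wanted_stop_ids, start_row=0):
    """Turn one block of raw CSV lines into {row_number: field_dict},
    keeping only rows whose stop_id is in wanted_stop_ids."""
    output_block = {}
    for offset, raw_line in enumerate(input_block):
        fields = raw_line.decode().strip().split(',')
        if fields[columns['stop_id']].strip('" ') in wanted_stop_ids:
            row = {name: fields[index].strip('" ') for name, index in columns.items()}
            output_block[start_row + offset] = row
    return output_block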
The performance penalty of this kind of loading seems to be about 30% (10 seconds) longer than a single read(), but now the process can run happily in the background without slowing other processes. Filtering the huge file down to a manageable size and searchable format also makes subsequent use of the data _much_ faster.
Working examples
Here's working sample code of a generic load_table(), which I called map_gtfs_table_to_dict().

Here's working sample code of a load_stop_times(), which I called map_stop_times_to_dict().
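If you just want the gist without opening the samples, here's a rough approximation of the generic loader (my own sketch, not the linked code). It assumes you tell it which column to use as the dict key; trip_id happens to be the right choice for trips.txt:

def map_gtfs_table_to_dict(gtfs_file, table_name, key_column='trip_id'):
    """Load one GTFS table from an open ZipFile into {key value: row_dict}."""
    table = {}
    with gtfs_file.open(table_name, mode='r') as infile:
        lines = infile.read().decode().splitlines()
    header = [name.strip('" ') for name in lines[0].split(',')]
    for line in lines[1:]:
        fields = [field.strip('" ') for field in line.split(',')]
        row = dict(zip(header, fields))
        table[row.pop(key_column)] = row
    return table

Note the single infile.read(): the whole table is slurped into memory at once, which is fine for trips.txt and fatal for stop_times.txt, as the sessions below show.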
Let's use the generic sample code to load trips.txt from a CTA GTFS file:
>>> import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_trips = load_table(gtfs_file, 'trips.txt')
>>>
>>> len(all_trips)
90428
>>>
>>> for trip_id in list(all_trips)[0:5]:
...     print(trip_id, all_trips[trip_id])
...
45072053995 {'route_id': 'Y', 'direction_id': '0', 'wheelchair_accessible': '1', 'service_id': '104509', 'direction': '0', 'shape_id': '304500033', 'block_id': '45007747580', 'schd_trip_id': 'R501'}
437076941189 {'route_id': '147', 'direction_id': '1', 'wheelchair_accessible': '1', 'service_id': '43701', 'direction': 'North', 'shape_id': '4374519', 'block_id': '437008159276', 'schd_trip_id': '76941189'}
437076941185 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1', 'service_id': '43701', 'direction': 'East', 'shape_id': '4374369', 'block_id': '437008158984', 'schd_trip_id': '76941185'}
437076941184 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1', 'service_id': '43701', 'direction': 'East', 'shape_id': '4374369', 'block_id': '437008159008', 'schd_trip_id': '76941184'}
437076941186 {'route_id': '97', 'direction_id': '1', 'wheelchair_accessible': '1', 'service_id': '43701', 'direction': 'East', 'shape_id': '4374369', 'block_id': '437008158986', 'schd_trip_id': '76941186'}
>>>
There we are: all 90,000 trips loaded in a few seconds, each line of the GTFS table broken out into a dict for easy access.
Now, let's try using the same generic load function to load stop_times.txt:
>>> import map_gtfs_table_to_dict as load_table
>>> import zipfile
>>>
>>> # You must provide your own GTFS file, of course!
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> all_stops = load_table(gtfs_file, 'stop_times.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./test.py", line 56, in map_gtfs_table_to_dict
    line_dict[column] = line.split(',')[columns[column]].strip('" ')
MemoryError
>>>
It took a few minutes to consume all that memory and fail, and everything else on the system slowed to a crawl.
Finally, let's use the custom load_stop_times() to load and filter stop_times.txt without reducing the rest of the system to a memory-starved crawl:
>>> import map_stop_times_to_dict as load_table
>>> import zipfile
>>>
>>> gtfs_path = "/home/ian/gtfs/20140611.cta.gtfs"
>>> gtfs_file = zipfile.ZipFile(gtfs_path, mode='r')
>>> stop_list = ['3661', '3698', '15026', '17433']
>>> filtered_stops = load_table(gtfs_file, stop_list)
4379
>>> for line in list(filtered_stops)[0:5]:
...     print(line, filtered_stops[line])
...
4521987 {'stop_headsign': '79th Red Line', 'departure_time': '13:53:57', 'shape_dist_traveled': '22219', 'arrival_time': '13:53:57', 'pickup_type': '0', 'stop_id': '3661', 'stop_sequence': '35', 'trip_id': '437077075442'}
4538374 {'stop_headsign': '79th Red Line', 'departure_time': '15:04:01', 'shape_dist_traveled': '37847', 'arrival_time': '15:04:01', 'pickup_type': '0', 'stop_id': '3680', 'stop_sequence': '58', 'trip_id': '437077077228'}
4325384 {'stop_headsign': '127th/Lowe', 'departure_time': '17:26:00', 'shape_dist_traveled': '27644', 'arrival_time': '17:26:00', 'pickup_type': '0', 'stop_id': '5992', 'stop_sequence': '42', 'trip_id': '437077000628'}
475149 {'stop_headsign': '95th Red Line', 'departure_time': '18:01:18', 'shape_dist_traveled': '14028', 'arrival_time': '18:01:18', 'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '20', 'trip_id': '436073915060'}
4112403 {'stop_headsign': '95th Red Line', 'departure_time': '22:08:53', 'shape_dist_traveled': '29370', 'arrival_time': '22:08:53', 'pickup_type': '0', 'stop_id': '15026', 'stop_sequence': '42', 'trip_id': '437076995010'}
>>>
It's still slow to load the entire file. On my equipment, I can load stop_times.txt in about 30 seconds this way. Varying the block size has some benefit, but you can play with that yourself.
In the final example, the 'line' variable -- the key for each subdict describing one trip serving a stop -- is just a counter, the (approximate) CSV row number. stop_times.txt has no unique field on each line that can be used as a dict key.
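In practice the counter keys rarely matter; what you usually want next is the set of trips serving a given stop. For example, picking the filtered rows apart by stop_id (a small sketch using the filtered_stops dict from the session above):

from collections import defaultdict

# Group the filtered stop_times rows: stop_id -> set of trip_ids serving it
trips_by_stop = defaultdict(set)
for row in filtered_stops.values():
    trips_by_stop[row['stop_id']].add(row['trip_id'])

print(sorted(trips_by_stop['15026'])[:5])  # a few of the trips serving stop 15026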