Friday, January 24, 2014

Transit schedule data demystified - using GTFS

General Transit Feed Specification (GTFS) is the Google-originated standard format for transit route, stop, trip, schedule, map, and fare data. Everything except realtime.

It's called a feed because it (usually) includes an RSS update for changes.
There are lists of feeds on the Google wiki, and on the separate GTFS data website.

Each organization's GTFS file includes all their services, so some agency files can get pretty big, and get updated often. Any schedule change or route adjustment means a new release of the entire GTFS file. The file itself is merely a big zipfile, containing several csv files that are strangely required to be mislabelled as .txt.
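
Since the file is just a zip of CSV tables, you can read a table without unpacking anything. Here is a minimal Python sketch of that idea. The function name read_gtfs_table is my own invention, and it assumes the tables are UTF-8 text with a header row:

```python
import csv
import io
import zipfile


def read_gtfs_table(zip_path, member):
    """Yield each row of one GTFS table as a dict keyed by its header row."""
    with zipfile.ZipFile(zip_path) as feed:
        with feed.open(member) as raw:
            # GTFS tables are CSV despite the .txt suffix; utf-8-sig
            # also strips a byte-order mark if the file starts with one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            # skipinitialspace drops the padding some feeds put after commas.
            for row in csv.DictReader(text, skipinitialspace=True):
                yield row
```

Something like read_gtfs_table("feed.zip", "stops.txt") (with a hypothetical feed.zip) would then hand you one dict per stop.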

Here are the contents of Milwaukee County Transit System's GTFS file:

$ unzip -l 
  Length      Date    Time    Name
---------  ---------- -----   ----
      169  2014-01-10 05:01   agency.txt
    40136  2014-01-10 05:00   calendar_dates.txt
     5746  2014-01-10 05:01   routes.txt
   307300  2014-01-10 05:00   stops.txt
 35198135  2014-01-10 05:00   stop_times.txt
   650622  2014-01-10 05:01   trips.txt
  8369736  2014-01-10 05:01   shapes.txt
     3490  2014-01-10 05:01   terms_of_use.txt
---------                     -------
 44575334                     8 files

Yeah, 44MB unzipped.
But only 5MB zipped. Still not something you want to download every day to your phone.

Let's find a stop at Mitchell International Airport:

$ cat stops.txt | grep AIRPORT
7168,7168,AIRPORT,,  42.9460473, -87.9037345,,,1
7162,7162,AIRPORT & ARRIVALS TERMINAL,,  42.9469597, -87.9030569,,,0

It's right, there are two stops at the airport. Each stop has a latitude and longitude, a unique ID number, and a descriptive name. The final field designates a timepoint (1=Timepoint, 0=Not).

Let's try an intersection where two routes cross:

$ cat stops.txt | grep "HOWELL & OKLAHOMA"
709,709,HOWELL & OKLAHOMA,,  42.9882051, -87.9043319,,,1
658,658,HOWELL & OKLAHOMA,,  42.9885464, -87.9045333,,,1
$ cat stops.txt | grep "OKLAHOMA & HOWELL"
5152,5152,OKLAHOMA & HOWELL,,  42.9881561, -87.9046550,,,1
5068,5068,OKLAHOMA & HOWELL,,  42.9883466, -87.9041176,,,1

Here's a problem that will require some logic to solve. I consider the intersection to be one place (not a GTFS term). Many trips and routes can use the same stop. Multiple stops (GTFS terms) can exist at the same place. In this case, northbound, southbound, eastbound, and westbound buses each have a different stop at the same place.

This might make your job easier...or harder.

GTFS cares about trips and stops. It doesn't care that Stops #709 and #5152 are twenty meters apart, and serve different routes - that it's a transfer point. Nothing in GTFS explicitly links the two stops. Generally, you must figure out the logic to do that - you have the lat/lon and the name to work with.
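
One possible way to link stops into places is by distance. This Python sketch uses the haversine formula and a greedy grouping. The 60-meter radius and the function names are my assumptions, not anything GTFS defines, and a real feed may need a smarter rule (names, or transfers.txt):

```python
import math

EARTH_RADIUS_M = 6371000.0


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def group_into_places(stops, radius_m=60.0):
    """Greedily group (stop_id, lat, lon) tuples: a stop joins the first
    place whose first member is within radius_m, else starts a new place."""
    places = []
    for stop in stops:
        for place in places:
            _, lat, lon = place[0]
            if distance_m(lat, lon, stop[1], stop[2]) <= radius_m:
                place.append(stop)
                break
        else:
            places.append([stop])
    return places


# The four intersection stops and one airport stop from stops.txt above.
stops = [
    ("709", 42.9882051, -87.9043319),
    ("658", 42.9885464, -87.9045333),
    ("5152", 42.9881561, -87.9046550),
    ("5068", 42.9883466, -87.9041176),
    ("7168", 42.9460473, -87.9037345),
]
places = group_into_places(stops)
```

With that radius, the four intersection stops end up in one place and the airport stop in another.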

GTFS does have an optional transfers.txt file that fills in the preferred transfer locations for you. But that's for a more advanced exercise.

Let's see what stops at #709:

$ grep -m 5 ,709, stop_times.txt 
4819177_1560,06:21:00,06:21:00,709,         14,,0,0
4819179_1562,06:49:00,06:49:00,709,         14,,0,0
4819180_1563,07:02:00,07:02:00,709,         14,,0,0
4819181_1564,07:15:00,07:15:00,709,         14,,0,0
4819182_1565,07:28:00,07:28:00,709,         14,,0,0

The first five fields are trip_id, arrival_time, departure_time, stop_id, and stop_sequence (this is the 14th stop of the trip).

Let's see the entire run of trip 4819177_1560:

$ grep 4819177_1560 stop_times.txt 
4819177_1560,06:09:00,06:09:00,7162,          2,,0,0  # Hey, look - stops out of sequence in the file
4819177_1560,06:09:00,06:09:00,7168,          1,,0,0  # Begin Trip
4819177_1560,06:11:00,06:11:00,7178,          3,,0,0
4819177_1560,06:20:00,06:20:00,8517,         13,,0,0
4819177_1560,06:21:00,06:21:00,709,         14,,0,0   # Howell & Oklahoma
4819177_1560,06:22:00,06:22:00,711,         15,,0,0
4819177_1560,07:17:00,07:17:00,1371,         66,,0,0
4819177_1560,07:19:00,07:19:00,6173,         67,,0,0
4819177_1560,07:20:00,07:20:00,7754,         68,,0,0  # End of trip 
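
Because rows can appear out of sequence, it's safer to sort on the sequence field than to trust file order. A small Python sketch, assuming the field order described above (the function name is made up):

```python
import csv

# The first three rows of the trip, in the order they appear in the file.
rows_in_file_order = """\
4819177_1560,06:09:00,06:09:00,7162,          2,,0,0
4819177_1560,06:09:00,06:09:00,7168,          1,,0,0
4819177_1560,06:11:00,06:11:00,7178,          3,,0,0
"""


def stops_in_order(lines):
    """Return (stop_sequence, stop_id, departure_time) tuples sorted by sequence."""
    parsed = []
    for fields in csv.reader(lines, skipinitialspace=True):
        trip_id, arrival, departure, stop_id, sequence = fields[:5]
        # Sort numerically: as text, "14" would sort before "2".
        parsed.append((int(sequence), stop_id, departure))
    return sorted(parsed)


ordered = stops_in_order(rows_in_file_order.splitlines())
```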

We can also look up more information about trip 4819177_1560:

$ grep 4819177_1560 trips.txt 

This needs a little more explanation:
  • route_id: Green Line (bus)
  • service_id (weekday/days-of-service): 13-DEC_WK
  • direction_id (binary, 0 or 1): 0
  • block_id (useful only if the same bus changes routes): 515111
  • shape_id (useful for route maps): 13-DEC_GRE_0_12

Let's look up the route_id:

$ grep GRE routes.txt
  GRE,MCTS,  GRE,MetroEXpress GreenLine,,3,,,

The full route name is MetroEXpress GreenLine, it's a bus (type-3 = bus) route, and we have the operator website for it.

Let's look up the service_id:

$ grep -m 10 13-DEC_WK calendar_dates.txt 

Ah, this specific trip is a weekday (Monday-Friday) only trip.

Let's look up the route map shape for the trip:

$ grep 13-DEC_GRE_0_12 shapes.txt 
13-DEC_GRE_0_12,  42.946054, -87.903810,10001
13-DEC_GRE_0_12,  42.946828, -87.903659,10002
13-DEC_GRE_0_12,  42.946824, -87.903588,10003
13-DEC_GRE_0_12,  42.946830, -87.903472,10004
13-DEC_GRE_0_12,  43.123137, -87.915431,670004
13-DEC_GRE_0_12,  43.123359, -87.915228,670005
13-DEC_GRE_0_12,  43.124016, -87.914535,670006
13-DEC_GRE_0_12,  43.124117, -87.914440,670007

The line for this trip has 520 points. That's pretty detailed.

So what do we know?

We know that Stop #709 is served by the GreenLine route, it's the 14th stop in direction 0, it's a bus line, we have all the times the stop is served, and we have the route website. We know the route map and all the other stops of any trip serving that stop.

How can we find the next scheduled bus at stop #709?

One way is to start with all trips that stop at #709 from stop_times.txt.

Since we probably know what time it is, we can filter out all the past times, and most of the future times. This leaves us with a nice, small list of, say, 10 candidates. Some of those trips may not run today at all, so we must dig deeper to find out.

We can look up each of those trips in trips.txt, and get the route.

Each trip also includes a service_id code. The calendar_dates.txt file tells us on which dates each service_id code is valid.

Right, we need to do three lookups.
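
Here is a rough Python sketch of those three lookups, working on rows already loaded as dicts. It is not the script I describe next, just the bare logic. It assumes the standard GTFS column names, and a feed (like this one) that lists its service dates in calendar_dates.txt with exception_type 1 meaning "service runs on this date":

```python
def to_seconds(hms):
    """Convert a GTFS HH:MM:SS time to seconds; hours may exceed 23."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def next_departures(stop_times, trips, calendar_dates, stop_ids, today, now, limit=3):
    """Return up to `limit` (departure_time, route_id, trip_id) tuples."""
    # Lookup 3, prepared up front: service_ids that run today.
    running = {row["service_id"] for row in calendar_dates
               if row["date"] == today and row["exception_type"] == "1"}
    # Lookup 2, prepared up front: trip_id -> trip row.
    trip_by_id = {row["trip_id"]: row for row in trips}

    found = []
    # Lookup 1: every stop_time at one of our stops, no earlier than now.
    for row in stop_times:
        if row["stop_id"] not in stop_ids:
            continue
        if to_seconds(row["departure_time"]) < to_seconds(now):
            continue
        trip = trip_by_id.get(row["trip_id"])
        if trip is None or trip["service_id"] not in running:
            continue
        found.append((row["departure_time"], trip["route_id"], row["trip_id"]))
    found.sort(key=lambda item: to_seconds(item[0]))
    return found[:limit]
```

Passing a set of stop_ids rather than a single stop is what lets one call cover every stop at a place.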

The shell code gets a bit complex with three lookups, so I shifted to Python and wrote a basic next-vehicle-at-stop-lookup in about 160 lines. Python lists are handy, since they can handle all the stops at a location just as easily as a single stop. Python's zipfile module is also handy, so I can read data directly from the zipfile. But at 13 seconds, Python is probably too slow for this kind of application:

$ time ./ 

Next departures from Howell & Okahoma
16:28    51 TO 124TH ST. - VIA OKLAHOMA
16:43    51 TO 124TH ST. - VIA OKLAHOMA
16:45    51 TO NEW YORK

real 0m13.171s   # Ugh. If I had started 13 seconds sooner, I wouldn't be bored now.
user 0m10.740s
sys 0m0.260s

All that time crunching the GTFS file has not gone unnoticed.

Trip planners (like Google) pre-process the data, mapping out and caching link-node and transfer relationships, limiting the trip data to the next hour or two (as appropriate), and using rather fancy algorithms to prune the link-node map to a likely set of possibilities before looking at trips along those links.

That's one reason Google Transit is much faster than 13 seconds.
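
To be clear, I don't know what Google actually does internally. But even a very simple form of pre-processing helps: build an index once, then answer each query with a binary search instead of rescanning a 35MB table. A Python sketch of that idea, with made-up function names:

```python
from bisect import bisect_left
from collections import defaultdict


def build_stop_index(stop_times):
    """Map stop_id -> sorted list of (departure_seconds, trip_id)."""
    index = defaultdict(list)
    for row in stop_times:
        hours, minutes, seconds = (int(p) for p in row["departure_time"].split(":"))
        index[row["stop_id"]].append((hours * 3600 + minutes * 60 + seconds, row["trip_id"]))
    for departures in index.values():
        departures.sort()
    return index


def departures_after(index, stop_id, now_seconds, limit=3):
    """Binary-search the presorted list instead of rescanning the table."""
    departures = index.get(stop_id, [])
    # (now_seconds, "") sorts before any real trip_id at the same second,
    # so a departure exactly at now_seconds is still included.
    start = bisect_left(departures, (now_seconds, ""))
    return departures[start:start + limit]
```

The index could be built once per GTFS release and saved, so the slow part happens only when the feed changes.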

But that's all advanced stuff.

Also advanced is how to integrate real-time data, which uses one of several different formats. Next time...

Sunday, January 5, 2014

Upstart Jobs at login

Login is not the same as startup. Let's just get that out of the way first.
  • Startup is the time between boot and the login screen. It's the habitat of system jobs.
  • Login is the time after you enter your password. It's the habitat of user jobs.

The easy way to run a task at login is to run a script from your .bashrc.
And the (deceptively not-) easy way to run a task at logout is to run a script from your .bash_logout.

But today we're not doing it the easy way. Today we're going to use dbus and Upstart.

Emitting Upstart Signals from your .bashrc

It's terribly easy.

1) Emit a user-level Upstart signal by adding a line to .bashrc:

# Upstart signal that .bashrc is running
initctl emit "I_AM_LOGGING_IN"

2) Add a user-level Upstart job to ~/.config/upstart/ for one user, or to /usr/share/upstart/sessions/ for all users:

# /home/$USER/.config/upstart/login_test.conf
description "login test"
start on I_AM_LOGGING_IN            # Start criteria
exec /bin/date > /tmp/login_test    # Do something

3) Open a new terminal window (to load the new .bashrc). When you open the window, the Upstart job creates the tempfile at /tmp/login_test.

Clean up: Restore your bashrc, and delete the sample Upstart job.

Can I emit system-level Upstart signals from .bashrc?

Not directly. The script runs as a user, not as root.

You can use a secondary method of triggering system-level Upstart signals, like sending a Dbus signal, or manipulating a file, or connecting to a socket.
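
As one illustration of the socket approach, here is a tiny Python sketch that a user-level script could call. It only opens and closes a local TCP connection. It assumes you have separately set up a system-level job that starts on a socket event for that port; the port number, the function name, and that job are all hypothetical:

```python
import socket


def poke_local_port(port, host="127.0.0.1", timeout=2.0):
    """Open and immediately close a TCP connection; return True on success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Nothing is listening, or the connection timed out.
        return False
```

Calling poke_local_port(34567) from a login script would then be the whole trigger, provided something is listening on that port.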

Can I emit Upstart signals from .bash_logout?


Using initctl emit in .bash_logout will merely result in an error. The user-level Upstart daemon seems to be terminated before .bash_logout is run. The command will return a cryptic "Rejected send message" error from PID 1 (system Upstart). Since .bash_logout is not running as root, it cannot emit system-level signals.

Also, GUI terminal programs do not run .bash_logout, unless you specify compatibility (with a flag) when you start them.

That easy way of doing login actions is still too hard

Boy, are you difficult to please.

Okay, there is an even easier way, but it's more complicated to explain: Instead of .bashrc emitting an Upstart event, let Upstart listen for a dbus signal.

Here is an example of the dbus message that occurs when I log in via SSH to a new session. This signal is emitted by systemd-logind every time a new TTY, SSH, or X-based GUI login occurs.

The signal is not emitted when you are in a GUI environment and simply open a terminal window - that's not a login, that's a spawn of your already-existing GUI environment:

signal sender=:1.3 -> dest=(null destination) serial=497 
  path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager;
    string "4"
    object path "/org/freedesktop/login1/session/_34"

The important elements are the source, the "SessionNew" signal, and the path of the new session.

As an aside, let's query systemd-logind to find out whether the login is to a TTY, X session, or SSH. logind has lots of useful information about each session:

# The object path argument is the one from the signal above
$ dbus-send --system                              \
            --dest=org.freedesktop.login1         \
            --print-reply                         \
            --type=method_call                    \
            /org/freedesktop/login1/session/_34   \
            org.freedesktop.DBus.Properties.Get   \
            string:org.freedesktop.login1.Session \
method return sender=:1.3 -> dest=:1.211 reply_serial=2
   variant       string "sshd"

It's right. I did connect using ssh.

Now let's construct an Upstart job that runs when I log in via a TTY, X Session, or SSH. We will use Upstart's built-in dbus listener.

# /home/$USER/.config/upstart/login_test.conf
description "login test"
start on dbus SIGNAL=SessionNew     # Listen for the dbus Signal
exec /bin/date > /tmp/login_test    # Do something

  • Now, whenever you log in to a TTY, X session, or SSH session, the job will run.
  • If your job needs to tell the difference between those sessions, you know how to find out using dbus.
  • If *everybody* needs the job, place it in /usr/share/upstart/sessions/ instead of each user's .config/upstart/
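
If the job is a script, the dbus query from earlier can be wrapped in a few lines of Python. This is only a sketch: the function names are mine, and it assumes the property you want is logind's Service property (which reports values like "sshd") and that dbus-send is installed:

```python
import re
import subprocess


def session_property_command(session_path, prop):
    """Build the dbus-send argument list for one Session property."""
    return [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.login1",
        session_path,
        "org.freedesktop.DBus.Properties.Get",
        "string:org.freedesktop.login1.Session",
        "string:" + prop,
    ]


def parse_string_reply(reply):
    """Extract the value from a 'variant string "..."' reply, or None."""
    match = re.search(r'variant\s+string\s+"([^"]*)"', reply)
    return match.group(1) if match else None


def session_service(session_path):
    """Run the query; needs dbus-send and a running systemd-logind."""
    reply = subprocess.run(session_property_command(session_path, "Service"),
                           capture_output=True, text=True, check=True).stdout
    return parse_string_reply(reply)
```

The session path to pass in is the object path carried by the SessionNew signal.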

What about super-easy logout jobs?

Logout jobs are harder, and generally not recommended. Not super-easy. They are hard because you can't guarantee they will run. Maybe the user will hold down the power button. Or use the "shutdown -h now" command. Or the power supply sent a message that the battery only has 60 seconds of life left. Or the user absolutely cannot miss that bus....

Here's the dbus signal that systemd-logind emits when a TTY, X, or SSH user session ends:

signal sender=:1.21 -> dest=(null destination) serial=286 
  interface=org.freedesktop.Accounts.User; member=Changed

All this tells me is that User1000 now has a different number of sessions running. Maybe it's a login (yes, it emits the same signal upon login). Maybe it's a logout.

Sure, we can do a login-and-logout Upstart job...

# /home/$USER/.config/upstart/login_test.conf
description "login and logout test"
start on dbus SIGNAL=Changed INTERFACE=org.freedesktop.Accounts.User
exec /bin/date > /tmp/login_test

...but then you need logic to figure out who logged in or logged out, and whether it's an event you care about. Certainly doable, but probably not worthwhile for most users.
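
The core of that logic can be small. Assuming your job keeps a record of the session IDs it saw last time (how you collect them is up to you; that part is not shown), a Python sketch of the comparison looks like this:

```python
def classify_change(before, after):
    """Compare two sets of session IDs taken before and after the signal."""
    return {
        "logged_in": sorted(after - before),    # sessions that are new
        "logged_out": sorted(before - after),   # sessions that disappeared
    }
```

If both lists come back empty, the signal was about something other than a login or logout.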

In other words, if you want to backup-at-logout, you need to structure it as a backup-then-logout sequence. Logout is not an appropriate trigger to start the sequence...from the system's point of view.

But I really want to do a job at logout!

Okay, here's how to do a job when you log out of the GUI environment. Logging out is the trigger. This won't work for SSH or TTY sessions.

The Upstart jobs /etc/init/lightdm.conf and /etc/init/gdm.conf emit a system-level "desktop-shutdown" signal when the X server is stopped. You can use that job as your start criteria.

# /etc/init/logoff_test.conf
description "logout test"
setuid some_username        # Your script probably doesn't need to run as root
start on stopping lightdm   # Run *before* it is stopped
exec /bin/date > /tmp/logoff_test

Friday, January 3, 2014

Searching for the right Upstart signal or job

If you want to use Upstart to start/stop a job on any of the not-obvious triggers (like "startup"), then you need to do some digging to find the right trigger.

initctl show-config

Be careful, there are TWO sets of Upstart jobs: System jobs and user jobs. Use sudo to distinguish between them.

$ sudo initctl show-config dbus   # Use sudo for system-level jobs
  start on local-filesystems
  stop on deconfiguring-networking

$ initctl show-config dbus        # Omit sudo for user-level jobs
  start on starting xsession-init 

Searching for a job or a signal using grep

The initctl show-config command without any job name prints all the jobs. That means you can use grep on the full list. Here is an example of using grep to look for all root jobs that care about the system "startup" signal:

$ sudo initctl show-config | grep -B8 startup
  start on (starting mountall or (runlevel [016] and ((desktop-shutdown or stopped xdm) or stopped uxlaunch)))
  start on mounted MOUNTPOINT=/run
  stop on runlevel [06]
  start on runlevel [2345]
  stop on runlevel [!2345]
  start on (startup and (((graphics-device-added PRIMARY_DEVICE_FOR_DISPLAY=1 or drm-device-added PRIMARY_DEVICE_FOR_DISPLAY=1) or stopped udevtrigger) or container))
  emits virtual-filesystems
  emits local-filesystems
  emits remote-filesystems
  emits all-swaps
  emits filesystem
  emits mounting
  emits mounted
  start on startup
  start on runlevel [2345]
  stop on runlevel [!2345]
  start on mounted MOUNTPOINT=/
  start on mounted MOUNTPOINT=/
  start on (startup and started udev)
  start on runlevel S
  stop on runlevel [!S]
  stop on (started $WAIT_FOR or stopped $WAIT_FOR)
  start on filesystem
  emits recovery
  emits startup
  start on runlevel [2345]
  stop on runlevel [!2345]
  start on socket PROTO=inet PORT=34567 ADDR=
  start on (runlevel [23] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!23]
  start on ((startup and started udev) and not-container)
  emits not-container
  start on mounted MOUNTPOINT=/run
  start on mounted MOUNTPOINT=/dev
  start on (runlevel [23] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!23]
  start on ((((startup and filesystem) and started udev) and stopped udevtrigger) and stopped udevmonitor)
  start on runlevel [2345]
  start on block-device-added ID_FS_USAGE=crypto
  start on startup
  emits net-device-up
  emits net-device-down
  emits static-network-up
  start on net-device-added
  stop on net-device-removed INTERFACE=$INTERFACE
  emits plymouth-ready
  start on (startup or started plymouth-splash)
  start on (started plymouth and ((graphics-device-added PRIMARY_DEVICE_FOR_DISPLAY=1 or drm-device-added PRIMARY_DEVICE_FOR_DISPLAY=1) or stopped udev-fallback-graphics))
  start on (started dbus or runlevel [06])
  stop on stopping plymouth
  start on (stopped rc RUNLEVEL=[2345] and ((not-container or container CONTAINER=lxc) or container CONTAINER=lxc-libvirt))
  stop on runlevel [!2345]
  start on (startup and starting udevtrigger)

We found one job that emits startup (friendly-recovery).
We found seven jobs that listen for it: udev-fallback-graphics, mountall, kmod, udevtrigger, hostname, plymouth-ready, and udevmonitor
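
Reading grep context by eye gets tedious. Here is a Python sketch that does the same sorting for you. It assumes the show-config layout of an unindented job name followed by indented "start on", "stop on", and "emits" lines; the function names are my own:

```python
def parse_show_config(text):
    """Map job name -> list of its indented condition lines."""
    jobs = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():      # unindented line: a job name
            current = line.strip()
            jobs[current] = []
        elif current is not None:      # indented line: a stanza of that job
            jobs[current].append(line.strip())
    return jobs


def jobs_for_signal(jobs, signal):
    """Split jobs into those that emit `signal` and those that start or stop on it."""
    emitters, listeners = [], []
    for name, stanzas in jobs.items():
        for stanza in stanzas:
            # Match whole words only, so "startup" does not match "started".
            words = stanza.replace("(", " ").replace(")", " ").split()
            if signal not in words[1:]:
                continue
            if words[0] == "emits":
                emitters.append(name)
            elif words[0] in ("start", "stop"):
                listeners.append(name)
    return sorted(set(emitters)), sorted(set(listeners))
```

Feed parse_show_config the text printed by sudo initctl show-config, then ask jobs_for_signal about "startup".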

Searching for a signal using upstart-monitor

The upstart-monitor application is a handy GUI and command-line tool to listen to all the signal chatter in Upstart. The application is provided by the upstart-monitor package in the Ubuntu repositories. A bug in 13.10 prevents it from running on a non-GUI system like Ubuntu Server, but it's also easy to fix the bug yourself...

Here are the signals emitted by Upstart when I switch over to a TTY, login, wait ten seconds, and then logout. This isn't an example of monitoring logins (do that using consolekit or logind) - this is an example of monitoring the Upstart signals emitted by a change in tty2.

$ upstart-monitor --no-gui --destination=system-bus
# Upstart Event Monitor (console mode)
# Connected to D-Bus system bus
# Columns: time, event and environment

2014-01-03 23:23:43.013436 stopping JOB='tty2' INSTANCE='' RESULT='ok'
2014-01-03 23:23:43.020309 starting JOB='tty2' INSTANCE=''
2014-01-03 23:23:43.031193 starting JOB='startpar-bridge' INSTANCE='tty2--started'
2014-01-03 23:23:43.033055 started JOB='startpar-bridge' INSTANCE='tty2--started'
2014-01-03 23:23:43.040671 stopping JOB='startpar-bridge' INSTANCE='tty2--started' RESULT='ok'
2014-01-03 23:23:43.042496 stopped JOB='startpar-bridge' INSTANCE='tty2--started' RESULT='ok'
2014-01-03 23:23:43.044271 started JOB='tty2' INSTANCE=''

You can see the progression of signals: starting, started, stopping, stopped.
You can also see the nesting of jobs. startpar-bridge starts on starting tty2, and runs its entire cycle of starting-started-stopping-stopped before tty2 transitions from starting to started.

If you want to trigger a job when tty2 is starting or started, you now know the signals that get emitted. Your job can listen for those signals.
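
If you'd rather filter that chatter with a script, the console lines are easy to pick apart. A Python sketch, assuming the three-column layout shown above (the function name is made up):

```python
import shlex


def parse_monitor_line(line):
    """Split one console-mode line into (timestamp, event, environment dict)."""
    # shlex handles the shell-style quoting, so JOB='tty2' becomes JOB=tty2.
    date, time, event, *pairs = shlex.split(line)
    environment = dict(pair.split("=", 1) for pair in pairs)
    return date + " " + time, event, environment


line = "2014-01-03 23:23:43.013436 stopping JOB='tty2' INSTANCE='' RESULT='ok'"
timestamp, event, env = parse_monitor_line(line)
```

From there, picking out every event for one job is a simple test on env["JOB"].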

Drawing out relationships using dotfiles

[Image: Dot diagram of Upstart user jobs]

The initctl2dot application creates dotfile graphics of Upstart jobs, which you can view with the xdot application. initctl2dot is included with the upstart package, part of all Ubuntu installations (even ubuntu-minimal). xdot is a separate package available in the Ubuntu repositories (Software Center).

As the name implies, initctl2dot's input is initctl's output. You can manually trim an initctl show-config output, and input that to initctl2dot if you really want a specific diagram.

You can easily diagram and display the entire system job tree...though it's perhaps less useful than you may expect:

$ initctl2dot --system --outfile /tmp/
$ xdot /tmp/

You can also diagram the user job tree:

$ initctl2dot --user --outfile /tmp/
$ xdot /tmp/

Limiting the dotfile size

The initctl2dot manpage includes options for showing/hiding various relationship types (emit, start on, stop on, etc) for clarity.

Another handy option is the --restrict-to-jobs flag, to draw much smaller charts.

For example, let's diagram the system "startup" signal relationships we already discovered using grep:

$ initctl2dot --system --outfile /tmp/ \
$ xdot /tmp/ 

And there you have it. How to search system jobs and user jobs for useful signals, and how to easily diagram out the relationships among signals and jobs.