CityRide

TypeScript · NodeJS · Docker · GitHub Actions · Heroku · GTFS · AWS · PostgreSQL · React · Bootstrap · Netlify

Overview

Every project should have a clear purpose and, of course, be useful. That is how I wanted my bachelor thesis to be: an approach to a problem that many people encounter and that I felt could be improved.

In the city where I studied for my bachelor’s degree, people encounter daily problems with public transport. One never knows when a certain bus or tram will arrive at a station. Most of the time, the information displayed in the stations or in the existing applications was inaccurate and not at all reliable. I lost count of how many times I had to take a taxi to the train station because the tram was so late I almost missed my train home, or chose to walk instead of waiting at the station.

People cannot rely on public services, yet they have to use them every day to get to work, school, or anywhere else, and this causes them a lot of stress. However, there are also more developed countries with great public transport systems that can serve as an example. Finland deserves a mention here, having one of the best and most reliable public transport services. The most used stations even have big displays that track upcoming buses in real time so that citizens can see them.

So this is how I came up with my idea: I wanted to bring a little of that Finnish experience to Romania by building an application that would give people a better experience.

CityRide is a public transport application that tracks the public transport vehicles in a city and provides users with information about schedules, routes, and the time at which a vehicle will arrive at a certain station.

It is a progressive web application, meaning it can be installed and accessed from any device, at any time.

The project contains public transport data from Timisoara, Romania, but the implementation is designed to scale easily to other networks.

Technologies

PostgreSQL for creating the database

TypeScript + NodeJS for writing the backend and creating the REST API.

GitHub Actions for automating the deployment of the server. Each time new code was pushed to the master branch of the project, it was also deployed to Heroku.

Docker for isolating the server environment.

Heroku for hosting the server.

AWS for hosting the database using AWS RDS (Relational Database Service) and for creating an automated process that inserts/updates the information in the database.

TypeScript + ReactJS + Bootstrap for building the client application.

Challenges

Lack of data and support

Sometimes the problems with developing an application and putting your own ideas into practice arise outside of the coding process. I wanted to create a real-time tracking application, where people could follow the route of a bus in real time, even when it stops at an intersection, and which could realistically predict the time it takes to reach the upcoming stations. An application that could really improve the quality of the service offered to citizens. Unfortunately, from the beginning, I encountered barriers in communicating with the public transport agency.

After trying for a few weeks to discuss with the agency in order to get the GTFS data I needed, I understood why everything was so chaotic in the city’s public transport network. The agency didn’t even have the GTFS data; instead, they sent me to other developers who had been trying for a few years to build a fully working, reliable application.

On top of that, the agency didn’t have real-time GTFS data, which meant that my project had to give up its initial purpose and adapt to the data I had.

Sometimes an idea is held back by outside factors. Still, I learned a lot and grew as a developer by working on this project. I had to adapt to different situations, and collaborating with other developers gave me new perspectives on how to confront problems.

Creating an efficient database structure

One of the biggest and most challenging parts of developing this application was creating an efficient and clear database structure. Public transport databases are a lot of work, as they require developers to take many use cases into consideration. For example, a bus doesn’t always have the same route or schedule; these can differ based on the day of the week or on unexpected events. During weekends or holidays, some routes might not operate, or their schedule may be different. Moreover, what happens if a route is closed for maintenance? This information has to be kept up to date in the application, so the user knows what is happening. The database must be architected in such a way as to solve these problems.

Thus, this is where GTFS comes into the discussion. GTFS (General Transit Feed Specification) is a format used internationally by public transit agencies to publish their transit data in a form that is as accessible as possible, for easier software integration.

The GTFS format is divided into two main components: static data and real-time data. Most cities with a public transport network have the GTFS static data, which contains information about schedules, routes, stations, fares, and transit geography. However, the real-time component is the one that makes the biggest difference, as it contains arrival predictions, vehicle positions, and service alerts, powerful information for developing a real-time tracking application.
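As an illustration, here is roughly what consuming such a real-time feed could look like in TypeScript. This is only a sketch, assuming Node 18+ and the gtfs-realtime-bindings package, with a hypothetical feed URL (remember that no such feed was available for Timisoara):

    // Illustrative sketch: reading a GTFS-realtime vehicle positions feed.
    // Assumes Node 18+ (built-in fetch) and the gtfs-realtime-bindings package;
    // the feed URL is hypothetical.
    import GtfsRealtimeBindings from "gtfs-realtime-bindings";

    async function readVehiclePositions(url: string): Promise<void> {
      const res = await fetch(url);
      const buffer = new Uint8Array(await res.arrayBuffer());

      // GTFS-realtime feeds are Protocol Buffers messages
      const feed = GtfsRealtimeBindings.transit_realtime.FeedMessage.decode(buffer);

      for (const entity of feed.entity) {
        const vehicle = entity.vehicle;
        if (vehicle?.position) {
          console.log(vehicle.vehicle?.id, vehicle.position.latitude, vehicle.position.longitude);
        }
      }
    }

    readVehiclePositions("https://transit.example.com/gtfs-rt/vehicle-positions");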

For more information about GTFS, check out gtfs.org.

Moving on, after studying the format and discussing with other software developers who have experience using it, I ended up with the following database structure.

Database model explained

At first, the database model can look overwhelming, but in essence, when you take a closer look, everything is simple and logical. So, let’s take it from the beginning:

Feed is the first table in the database model and the one that makes everything stick together. Every update of the data from a public transit agency is represented by a new row in this table. This table is essential when the application is scaled and holds more than one transit network.

For example, what happens if we have data from two public agencies, from two different cities, that happen to have a station with the same name? How will we know which station belongs to which city? Without this table, duplication of data would be unavoidable, the information provided to users would become invalid, and the application could not support more than one transport network.

Agency represents the table that holds info about the public transit agency used.

Stops table is populated with all the stations where the passengers are picked up or dropped off.

Routes table has info on groups of trips that are displayed as a single service.

Services is a link table between the Feed table and the schedule-related tables (Calendar and Calendar-Dates), identifying the days on which each trip runs.

Trips contains sequences of two or more stops that occur during a specific period of time.

Stop-times retains the times when a vehicle arrives and leaves a stop for every single trip.

Calendar is the table that holds the schedules for every trip.

Calendar-Dates table contains the exceptions from the schedule, for example, free days, maintenance and so on.

Shapes contains rules for mapping a vehicle’s route.

Shape-Points contains all the points (geographic coordinates) necessary for mapping a route.

Transfers allows us to make connections between routes.
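To make the model more concrete, here is a minimal sketch of how the first two tables could be defined as a Knex migration. The column names follow the GTFS field names and are illustrative, not the exact schema used in the thesis:

    // Minimal sketch, not the exact thesis schema: the Feed and Stops tables
    // as a Knex migration for PostgreSQL. Column names follow GTFS fields.
    import type { Knex } from "knex";

    export async function up(knex: Knex): Promise<void> {
      await knex.schema.createTable("feed", (t) => {
        t.increments("id").primary();
        t.string("name").notNullable(); // e.g. "Timisoara"
        t.timestamp("imported_at").defaultTo(knex.fn.now());
      });

      await knex.schema.createTable("stops", (t) => {
        t.increments("id").primary();
        // The feed id scopes every stop to one transit network, so two cities
        // can have stations with the same name without colliding
        t.integer("feed_id").notNullable().references("id").inTable("feed");
        t.string("stop_id").notNullable(); // GTFS identifier, unique per feed
        t.string("stop_name").notNullable();
        t.decimal("stop_lat", 9, 6).notNullable();
        t.decimal("stop_lon", 9, 6).notNullable();
        t.unique(["feed_id", "stop_id"]);
      });
    }

    export async function down(knex: Knex): Promise<void> {
      await knex.schema.dropTable("stops");
      await knex.schema.dropTable("feed");
    }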

Inserting/Updating GTFS data

A city’s public transport network involves thousands upon thousands of data points that need to stay up to date at all times. For example, the data set for Timisoara amounts to approximately 132,270 rows in the database that need to be updated whenever something changes. And Timisoara is not a big city; it has no metro or urban train lines used for public transport within the city, so for a bigger city the amount of data grows considerably.

It is obvious that these updates cannot be done manually; in fact, this is not even an option on such large data sets, so an automated process must be used to solve the problem effectively. Thus, using Amazon Web Services, I created a worker that is triggered by uploading an archive to AWS S3. The worker’s job is to insert or update the transit data in the RDS database.

The worker, step by step

The data received from public transit agencies comes as an archive containing multiple GTFS files in CSV format, named according to the GTFS format rules. As I mentioned before, uploading the archive to S3 triggers the worker, starting with the first lambda function, which processes the files, and continuing with several other functions that handle each file individually. Uploading the archive to the storage service is the only manual step.

Now let’s talk about how everything is organized. In AWS S3, there is a bucket called cityride-bucket-project and, inside it, a folder called zip, where the archives are uploaded. The remaining folders are created automatically when the first lambda function is called: one folder for each GTFS file, where that file ends up.

Processing the ZIP file

The first lambda function deals with processing the uploaded archive and making sure that the files are organized accordingly, so that the data can be extracted.

The steps that the unzipper lambda function follows are (a short sketch in TypeScript follows the list):

  • Unzips the ZIP file
  • Creates a temporary copy of the directory containing the unzipped files and maps it
  • Uploads each file from the temporary directory to S3 (each file goes to the subdirectory named after it)
  • Deletes the ZIP file to keep only the application-relevant data in the AWS S3 service. The ZIP file no longer matters now that the information it holds has been saved independently.
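In TypeScript, the core of such a function might look like the sketch below, assuming the AWS SDK v3 and the adm-zip package (the real implementation works through a temporary directory, which this in-memory sketch skips):

    // Sketch of the unzip lambda, assuming AWS SDK v3 and the adm-zip package.
    import { S3Client, GetObjectCommand, PutObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";
    import type { S3Event } from "aws-lambda";
    import AdmZip from "adm-zip";

    const s3 = new S3Client({});

    export const handler = async (event: S3Event): Promise<void> => {
      const bucket = event.Records[0].s3.bucket.name;
      const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

      // Download the archive that was uploaded to the zip/ folder
      const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
      const zip = new AdmZip(Buffer.from(await obj.Body!.transformToByteArray()));

      // Upload each GTFS file to a subdirectory named after the file
      for (const entry of zip.getEntries()) {
        if (entry.isDirectory) continue;
        const folder = entry.name.replace(/\.txt$/, ""); // e.g. stops.txt -> stops/
        await s3.send(new PutObjectCommand({
          Bucket: bucket,
          Key: `${folder}/${entry.name}`,
          Body: entry.getData(),
        }));
      }

      // The archive is no longer needed once its contents are stored individually
      await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
    };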

Activating the state machine

When the first lambda function has successfully finished its job, it triggers the next one. It is worth mentioning at this point that the following 9 lambda functions are the ones that add the data into the database, step by step. Each lambda function handles one of the files stored in the S3 folders, and the names of the functions indicate the tables they populate.

Why 9 lambda functions and not just one? Because a lambda function cannot run for more than 15 minutes. For small cities, with a relatively small amount of data, this is not a problem, but for the public transport network of a bigger city it is much safer to allocate 15 minutes to the data from each GTFS file than 15 minutes to all the files together.

That being said, the functions will run sequentially, as presented in the schema below.

The files are added in a specific order, and changing it would cause the functions to throw errors and immediately stop the process. The order is decided by the relations between the tables. For example, a Route has many Trips, so the route must be added first, since it is required as a foreign key in the Trips table.

Moreover, between these two runs another lambda function called ShapeShapePoints. This function performs two actions: it inserts all the geographic coordinates necessary to draw the shape of a trip (the stops themselves are not included, since those are saved in the Stops table) and it groups them into shapes using the additional Shape table. The Shape table is linked to the Trips table, meaning that the Trips table holds a foreign key to the Shape table.
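To illustrate the sequencing, here is a hypothetical sketch of such a chain defined with AWS Step Functions through the AWS CDK. The construct names are illustrative, and the order shown is just one possible ordering consistent with the relations described above:

    // Hypothetical CDK sketch of the sequential import chain; the order
    // mirrors the foreign-key dependencies between the tables.
    import * as sfn from "aws-cdk-lib/aws-stepfunctions";
    import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
    import * as lambda from "aws-cdk-lib/aws-lambda";
    import { Construct } from "constructs";

    export function buildImportStateMachine(
      scope: Construct,
      fns: Record<string, lambda.IFunction>,
    ): sfn.StateMachine {
      // Wrap each importer lambda in a Step Functions task
      const task = (name: string) =>
        new tasks.LambdaInvoke(scope, name, { lambdaFunction: fns[name] });

      // Parents before children: e.g. Routes before Trips,
      // with ShapeShapePoints in between
      const chain = task("Agency")
        .next(task("Stops"))
        .next(task("Routes"))
        .next(task("Calendar"))
        .next(task("CalendarDates"))
        .next(task("ShapeShapePoints"))
        .next(task("Trips"))
        .next(task("StopTimes"))
        .next(task("Transfers"));

      return new sfn.StateMachine(scope, "GtfsImport", {
        definitionBody: sfn.DefinitionBody.fromChainable(chain),
      });
    }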

There are two methods by which the data is added to the database:

  1. If the data doesn’t require any changes, the insertion is done by passing an array of objects to the insert function provided by Knex, each object representing one line from the file. This speeds up the process of adding a large number of rows to the table (a sketch follows this list).
  2. The second method inserts the data line by line, and is used when the file requires changes or when the database table contains additional columns beyond the GTFS ones (most of the difference comes from the unique identifier of a feed, found in almost all tables).
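A minimal sketch of the first method, assuming the csv-parse package and the illustrative stops table from the migration sketch above:

    // Sketch of the bulk method, assuming the csv-parse package;
    // table and column names are illustrative.
    import knexFactory from "knex";
    import { parse } from "csv-parse/sync";

    const knex = knexFactory({ client: "pg", connection: process.env.DATABASE_URL as string });

    export async function insertStops(csv: string, feedId: number): Promise<void> {
      // One object per CSV record, one row per object
      const records = parse(csv, { columns: true, skip_empty_lines: true }) as Record<string, string>[];

      const rows = records.map((r) => ({
        feed_id: feedId, // the extra column beyond the GTFS ones
        stop_id: r.stop_id,
        stop_name: r.stop_name,
        stop_lat: Number(r.stop_lat),
        stop_lon: Number(r.stop_lon),
      }));

      // batchInsert chunks the array so even very large files go in quickly
      await knex.batchInsert("stops", rows, 1000);
    }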

Of course, this worker can also be improved with more functionality to enhance the process. We could perform sanity checks, integrity checks, or clean-ups on each file. For example, a GTFS feed may contain data for 20 stations of which 5 are unused, so a cleanup should remove those stations. Or a feed may contain the schedule for the whole year when we are already in March, so again, cleanup and integrity validations should be performed.

Moreover, most of the GTFS files used in this implementation are required or conditionally required. Looking at the GTFS website, we can see that there are many more files holding other kinds of data for other features, such as online payment integration. These can and should be integrated into the application whenever possible, since they improve the user experience. As for the worker, adding another GTFS file to the state machine is easy: all that needs to be done is to create a lambda function that handles the additional file.

The system is also capable of overriding existing data whose values have changed.
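With Knex and PostgreSQL, such an override could be sketched as an upsert on the feed-scoped unique key (again using the illustrative stops table):

    // Sketch of overriding changed rows, assuming the unique
    // (feed_id, stop_id) constraint from the migration sketch above.
    import knexFactory from "knex";

    const knex = knexFactory({ client: "pg", connection: process.env.DATABASE_URL as string });

    export async function upsertStops(rows: object[]): Promise<void> {
      // On a key collision, update the existing row with the new values
      await knex("stops").insert(rows).onConflict(["feed_id", "stop_id"]).merge();
    }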

On the available data set, which comes from a relatively small public transport network, the data entry system takes approximately 2 minutes to add and/or update the data, the files totalling approximately 132,270 lines.