Kaggle 2015 Flight Delays and Cancellations dataset (with this fix for some airport codes). It has information about flights for major US airlines in 2015 (arrival/departure times, delay, flight ID, distance, ...) and information about airport cities/coordinates, airlines.
The data was retrieved via SQL queries in PostgresSQL database, using ROW_TO_JSON
for export.
SELECT month, count(*)
FROM flights
GROUP BY month
SELECT make_date(year, month, day) as date, count(*)
FROM flights
GROUP BY date
As we can see, the most busiest months are in the summer (especially July-August), probably because of holidays, also there are some increases in the beginning of Spring (Easter, Spring break?), and the least active period is from middle of January to middle of February.
Also it depends on day of the week: there is a huge drop of activity on Saturdays, probably related to business trips.
SELECT origin_airport, destination_airport, count(*)
FROM flights
GROUP BY origin_airport, destination_airport
Most of the biggest airports (and flight destinations) are located near the east coast and in the south part of the west coast (LA, SF), the same as US population density.
Circle size represents airport size (number of flights) and opacity shows its' delay rate (counting only delays that were longer than 10 minutes).
This map uses bigger scale to reduce circle sizes and overlapping, however it's still present in some cases.
SELECT origin_airport,
ROUND(count(*) / (SELECT count(*) FROM flights WHERE origin_airport=f.origin_airport)::decimal, 4) as delay_rate,
(SELECT count(*) FROM flights WHERE origin_airport=f.origin_airport) as flights_count
FROM flights as f
WHERE departure_delay > 10
GROUP BY origin_airport
From this map we can see that most airports have at least 10% of delayed flights.
Also it looks like delays are more likely in smaller airports, such as 2 airports on Alaska with 45+%. All of the bigger airports do not have such high delay rate.
Circle size represents airline size (number of flights) and Y axis shows its' delay rate (counting only delays that were longer than 10 minutes).
SELECT airline,
ROUND(count(*) / (SELECT count(*) FROM flights WHERE f.airline=f.airline)::decimal, 4) as delay_rate,
(SELECT count(*) FROM flights WHERE airline=f.airline) as flights_count
FROM flights as f
WHERE departure_delay > 10
GROUP BY airline
The chart shows that most of the delays occur on flights from big airlines, smaller airlines seem to have less delays.