Kaggle 2015 Flight Delays and Cancellations dataset (with this fix for some airport codes). It has information about flights for major US airlines in 2015 (arrival/departure times, delay, flight ID, distance, ...) and information about airport cities/coordinates, airlines.
The data was retrieved via SQL queries in PostgresSQL database, using ROW_TO_JSON
for export.
The dataset contains 5 819 079 flights, 322 airports and 14 airlines.
Attribute | Description | Type | Min | Max | Average | Median |
---|
Attribute types according to these definitions: isoconsultantpune.com/statistics-in-quality/
SELECT
MIN(distance) as min_distance, MAX(distance) as max_distance, ROUND(AVG(distance), 2) as avg_distance, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by distance) as median_distance,
MIN(departure_delay) as min_departure_delay, MAX(departure_delay) as max_departure_delay, ROUND(AVG(departure_delay), 2) as avg_departure_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by departure_delay) as median_departure_delay,
MIN(arrival_delay) as min_arrival_delay, MAX(arrival_delay) as max_arrival_delay, ROUND(AVG(arrival_delay), 2) as avg_arrival_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by arrival_delay) as median_arrival_delay,
MIN(air_time) as min_air_time, MAX(air_time) as max_air_time, ROUND(AVG(air_time), 2) as avg_air_time, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by air_time) as median_air_time,
MIN(scheduled_time) as min_scheduled_time, MAX(scheduled_time) as max_scheduled_time, ROUND(AVG(scheduled_time), 2) as avg_scheduled_time, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by scheduled_time) as median_scheduled_time,
MIN(elapsed_time) as min_elapsed_time, MAX(elapsed_time) as max_elapsed_time, ROUND(AVG(elapsed_time), 2) as avg_elapsed_time, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by elapsed_time) as median_elapsed_time,
MIN(air_system_delay) as min_air_system_delay, MAX(air_system_delay) as max_air_system_delay, ROUND(AVG(air_system_delay), 2) as avg_air_system_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by air_system_delay) as median_air_system_delay,
MIN(security_delay) as min_security_delay, MAX(security_delay) as max_security_delay, ROUND(AVG(security_delay), 2) as avg_security_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by security_delay) as median_security_delay,
MIN(airline_delay) as min_airline_delay, MAX(airline_delay) as max_airline_delay, ROUND(AVG(airline_delay), 2) as avg_airline_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by airline_delay) as median_airline_delay,
MIN(late_aircraft_delay) as min_late_aircraft_delay, MAX(late_aircraft_delay) as max_late_aircraft_delay, ROUND(AVG(late_aircraft_delay), 2) as avg_late_aircraft_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by late_aircraft_delay) as median_late_aircraft_delay,
MIN(weather_delay) as min_weather_delay, MAX(weather_delay) as max_weather_delay, ROUND(AVG(weather_delay), 2) as avg_weather_delay, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by weather_delay) as median_weather_delay,
MIN(taxi_in) as min_taxi_in, MAX(taxi_in) as max_taxi_in, ROUND(AVG(taxi_in), 2) as avg_taxi_in, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by taxi_in) as median_taxi_in,
MIN(taxi_out) as min_taxi_out, MAX(taxi_out) as max_taxi_out, ROUND(AVG(taxi_out), 2) as avg_taxi_out, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by taxi_out) as median_taxi_out
FROM flights
There are less than 2% of cancelled or diverted flights, and most cancellations are caused by weather.
SELECT
COUNT(*) filter (WHERE cancelled = 1) as cancelled_count,
COUNT(*) filter (WHERE diverted = 1) as diverted_count,
COUNT(*) filter (WHERE cancelled = 0 AND diverted = 0) as normal_count
FROM flights
SELECT cancellation_reason, COUNT(*)
FROM flights
WHERE cancellation_reason is not null
GROUP BY cancellation_reason
There is a huge drop of activity on Saturdays, probably related to business trips.
Also it seems like there are more flights in the summer (especially July-August), probably because of holidays, as well as some increases in the beginning of Spring (Easter, Spring break?), and the least active period is from middle of January to middle of February.
SELECT make_date(year, month, day) as date, count(*)
FROM flights
GROUP BY date