DFL Exploratory Data Analysis and Video Processing

Part I: install packages for video data processing and analysis

  • packages for video data analysis: the imageio package and the moviepy package
  • imageio package
    • imageio itself does not require additional installation. Kaggle notebooks currently ship with imageio==2.19.3
    • however, imageio-ffmpeg, a required dependency for video data processing, needs to be installed
    • follow the steps below to install imageio-ffmpeg offline
      • click the + Add data button in the upper right corner of your notebook
      • click on Notebook Output Files
      • enter imageio-ffmpeg in the search box and click search
      • the imageio-ffmpeg notebook appears at the top of the list; click Add
      • copy !conda install /kaggle/input/imageio-ffmpeg/*.tar.bz2 to a cell and run it
  • moviepy package
    • as with imageio-ffmpeg, extra steps are required to install the package offline
    • follow the steps below to install it
      • click the + Add data button in the upper right corner of your notebook
      • click on Notebook Output Files
      • enter moviepy in the search box and click search
      • the moviepy notebook appears at the top of the list; click Add
      • copy !conda install /kaggle/input/moviepy/*.tar.bz2 to a cell and run it
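
After running the two install cells, a quick importability check can catch a failed offline install before the video code runs. The sketch below is not part of the original notebook: the helper name check_packages is hypothetical, and it relies only on the standard-library importlib.util.find_spec. imageio_ffmpeg and moviepy are the import names of the two packages installed above.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return {import_name: True/False} depending on whether each module can be found."""
    return {name: find_spec(name) is not None for name in names}

# Import names of the packages this notebook depends on for video processing.
status = check_packages(["imageio", "imageio_ffmpeg", "moviepy"])
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'MISSING'}")
```

If any entry is reported as MISSING, re-check that the corresponding notebook output was added under + Add data and rerun the install cell.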

Part II: exploratory data analysis for video data

Understand the train data

  • task description: detect three kinds of player events, both the time of occurrence and the type, within these videos.
  • three kinds of player events:

Plays: A Play describes a player's attempt to switch ball control to another member of his team. A play event may be executed as a Pass or as a Cross.

Throw-Ins: A Throw-In refers to a situation where the game is restarted after the ball went out of play over the sideline following the touch of the opposite team. The ball must be thrown with hands, from behind and over the head of executing player.

Challenge: A Challenge is a player action during which two players of opposing teams are physically capable of either gaining or receiving ball control and attempt to do so. A Challenge requires one of the two players to touch the ball or to foul the opposing player.

  • Training Data
    • train/ - Folder containing videos to be used as training data, comprising video recordings from eight games.
    • train.csv - Event annotations for videos in the train/ folder.
      • video_id - Identifies which video the event occurred in.
      • event - The type of event occurrence, one of challenge, play, or throwin. Also present are labels start and end indicating the scoring intervals of the video.
      • event_attributes - Additional descriptive attributes for the event.
      • time - The time, in seconds, the event occurred within the video.
  • Observations from the train data
    • the train data contains 4382 event samples (play, throwin, challenge), plus 3418 start markers and 3418 end markers that delimit the scoring intervals
    • among the 4382 event samples, ~82% are play events, ~14% are challenge events, and only ~4% are throwin events
    • the gap between the start timestamp and the event timestamp ranges from about 0.33 to 2.1 seconds
    • the gap between the event timestamp and the end timestamp ranges from about 0.4 to 2.16 seconds
    • when both a start and an end timestamp are matched to an event, the gap between them is always 2.5 seconds

Explore the video data

references:

  • https://imageio.readthedocs.io/en/stable/examples.html#iterate-over-frames-in-a-movie
  • https://imageio.readthedocs.io/en/latest/reference/userapi.html?highlight=get_data#imageio.core.format.Reader.get_meta_data
  • https://stackoverflow.com/questions/72773615/how-to-seek-a-frame-in-video-using-imageio
  • https://stackoverflow.com/questions/52257731/extract-part-of-a-video-using-ffmpeg-extract-subclip-black-frames
  • https://stackoverflow.com/questions/29718238/how-to-read-mp4-video-to-be-processed-by-scikit-image
  • https://zulko.github.io/moviepy/ref/ffmpeg.html?highlight=ffmpeg_extract_subclip#moviepy.video.io.ffmpeg_tools.ffmpeg_extract_subclip

install packages for video data processing and analysis

!conda install /kaggle/input/imageio-ffmpeg/*.tar.bz2
Downloading and Extracting Packages
######################################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
!conda install /kaggle/input/moviepy/*.tar.bz2
Downloading and Extracting Packages
######################################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

load packages

#basic libs

import pandas as pd
import numpy as np
import os
from pathlib import Path

from datetime import datetime, timedelta
import time
from dateutil.relativedelta import relativedelta

import gc
import copy


#additional data processing


from sklearn.preprocessing import StandardScaler, MinMaxScaler


#visualization
import seaborn as sns
import matplotlib.pyplot as plt

#load images
import matplotlib.image as mpimg
import PIL
from PIL import Image

#for loading videos
import imageio
import imageio.v2 as iio
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip  #to extract a sub clip from a video
from IPython.display import Video #to play video in notebook


#settings

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

Image.MAX_IMAGE_PIXELS = None

import warnings
warnings.filterwarnings("ignore")

import pytorch_lightning as pl
random_seed=1234
pl.seed_everything(random_seed)


1234

Load and understand train data

df_train = pd.read_csv("/kaggle/input/dfl-bundesliga-data-shootout/train.csv")
df_train.shape
(11218, 4)
df_train.head(2)
video_id time event event_attributes
0 1606b0e6_0 200.265822 start NaN
1 1606b0e6_0 201.150000 challenge ['ball_action_forced']
df_train['event'].value_counts()
play         3586
start        3418
end          3418
challenge     624
throwin       172
Name: event, dtype: int64
df_train['event_attributes'].value_counts()
['pass', 'openplay']                  3337
['ball_action_forced']                 239
['pass']                               154
['opponent_dispossessed']              138
['pass', 'freekick']                   127
['fouled']                             111
['cross', 'openplay']                   80
['challenge_during_ball_transfer']      53
['possession_retained']                 44
['opponent_rounded']                    39
['cross', 'corner']                     33
['cross']                               18
['cross', 'freekick']                    5
['pass', 'corner']                       4
Name: event_attributes, dtype: int64
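
Because train.csv is read with pd.read_csv, each event_attributes value is a string that merely looks like a Python list, so the counts above are per combination rather than per attribute. A minimal sketch of one way to count individual attributes follows; it uses ast.literal_eval and DataFrame.explode on a toy frame. The helper name parse_attributes and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import ast
import pandas as pd

def parse_attributes(value):
    """Turn "['pass', 'openplay']" into ['pass', 'openplay']; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)

# Toy rows shaped like train.csv (start/end rows carry no attributes).
toy = pd.DataFrame({
    "event": ["start", "play", "play", "end"],
    "event_attributes": [float("nan"), "['pass', 'openplay']", "['cross', 'corner']", float("nan")],
})
toy["attrs"] = toy["event_attributes"].apply(parse_attributes)

# One row per individual attribute; empty lists become NaN and are dropped by value_counts.
attr_counts = toy.explode("attrs")["attrs"].value_counts()
print(attr_counts)
```

Applied to df_train, the same two steps would give one count per attribute (pass, cross, openplay, and so on) instead of one count per attribute list.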
df_train['video_id'].value_counts()
1606b0e6_1    1249
35bd9041_0    1075
3c993bd2_0    1042
1606b0e6_0    1000
ecf251d4_0     980
3c993bd2_1     966
35bd9041_1     933
407c5a9e_1     858
cfbe2e94_0     823
4ffd5986_0     792
cfbe2e94_1     763
9a97dae4_1     737
Name: video_id, dtype: int64
df_train[df_train['video_id']=='1606b0e6_0'].head(10)
video_id time event event_attributes
0 1606b0e6_0 200.265822 start NaN
1 1606b0e6_0 201.150000 challenge ['ball_action_forced']
2 1606b0e6_0 202.765822 end NaN
3 1606b0e6_0 210.124111 start NaN
4 1606b0e6_0 210.870000 challenge ['opponent_dispossessed']
5 1606b0e6_0 212.624111 end NaN
6 1606b0e6_0 217.850213 start NaN
7 1606b0e6_0 219.230000 throwin ['pass']
8 1606b0e6_0 220.350213 end NaN
9 1606b0e6_0 223.930850 start NaN
df_train[df_train['video_id']=='1606b0e6_0']['event'].value_counts()
play         319
start        302
end          302
challenge     56
throwin       21
Name: event, dtype: int64
df_train.sort_values(by=['video_id', 'time'], ascending=[True, True], inplace=True)
#per-video running row number, starting at 1
df_train['seq'] = df_train.groupby('video_id').cumcount() + 1
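#take copies so that renaming columns and adding columns below does not touch slices of df_train
start_df = df_train[df_train['event']=='start'].copy()
end_df = df_train[df_train['event']=='end'].copy()
event_df = df_train[~df_train['event'].isin(['start', 'end'])].copy()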
print(start_df.shape, end_df.shape, event_df.shape)

start_df.columns = [f's_{c}' for c in start_df.columns]
end_df.columns = [f'e_{c}' for c in end_df.columns]
display(start_df.head(2))
display(end_df.head(2))

event_df['s_seq'] = event_df['seq'] - 1
event_df['e_seq'] = event_df['seq'] + 1

display(event_df.head(2))
(3418, 5) (3418, 5) (4382, 5)
s_video_id s_time s_event s_event_attributes s_seq
0 1606b0e6_0 200.265822 start NaN 1
3 1606b0e6_0 210.124111 start NaN 4
e_video_id e_time e_event e_event_attributes e_seq
2 1606b0e6_0 202.765822 end NaN 3
5 1606b0e6_0 212.624111 end NaN 6
video_id time event event_attributes seq s_seq e_seq
1 1606b0e6_0 201.15 challenge ['ball_action_forced'] 2 1 3
4 1606b0e6_0 210.87 challenge ['opponent_dispossessed'] 5 4 6
print(event_df.shape)
display(event_df.head(2))
event_df = event_df.merge(start_df, left_on=['video_id', 's_seq'], right_on=['s_video_id', 's_seq'], how='left')
print(event_df.shape)
display(event_df.head(2))
event_df = event_df.merge(end_df, left_on=['video_id', 'e_seq'], right_on=['e_video_id', 'e_seq'], how='left')
print(event_df.shape)
display(event_df.head(2))
(4382, 7)
video_id time event event_attributes seq s_seq e_seq
1 1606b0e6_0 201.15 challenge ['ball_action_forced'] 2 1 3
4 1606b0e6_0 210.87 challenge ['opponent_dispossessed'] 5 4 6
(4382, 11)
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes
0 1606b0e6_0 201.15 challenge ['ball_action_forced'] 2 1 3 1606b0e6_0 200.265822 start NaN
1 1606b0e6_0 210.87 challenge ['opponent_dispossessed'] 5 4 6 1606b0e6_0 210.124111 start NaN
(4382, 15)
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes e_video_id e_time e_event e_event_attributes
0 1606b0e6_0 201.15 challenge ['ball_action_forced'] 2 1 3 1606b0e6_0 200.265822 start NaN 1606b0e6_0 202.765822 end NaN
1 1606b0e6_0 210.87 challenge ['opponent_dispossessed'] 5 4 6 1606b0e6_0 210.124111 start NaN 1606b0e6_0 212.624111 end NaN
#validate data: 
#start timestamp should be no larger than the event's timestamp
#end timestamp should be no smaller than the event's timestamp
event_df[(event_df['s_time']>event_df['time'])|(event_df['e_time']<event_df['time'])]
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes e_video_id e_time e_event e_event_attributes
#events without start timestamp
#(this output was produced after the gap_* columns further below were added, hence 18 columns)
print(event_df[~event_df['s_time'].isna()].shape)
event_df[event_df['s_time'].isna()]
(3418, 18)
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes e_video_id e_time e_event e_event_attributes gap_event_start gap_event_end gap_start_end
6 1606b0e6_0 239.350 play ['pass', 'openplay'] 18 17 19 NaN NaN NaN NaN 1606b0e6_0 240.401851 end NaN NaN 1.051851 NaN
8 1606b0e6_0 244.590 play ['pass', 'openplay'] 22 21 23 NaN NaN NaN NaN 1606b0e6_0 246.030453 end NaN NaN 1.440453 NaN
11 1606b0e6_0 253.470 play ['pass', 'openplay'] 29 28 30 NaN NaN NaN NaN 1606b0e6_0 253.990761 end NaN NaN 0.520761 NaN
14 1606b0e6_0 261.310 play ['pass', 'openplay'] 36 35 37 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 1606b0e6_0 263.150 play ['pass', 'openplay'] 37 36 38 NaN NaN NaN NaN 1606b0e6_0 265.019283 end NaN NaN 1.869283 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4360 ecf251d4_0 2958.587 play ['pass', 'openplay'] 924 923 925 NaN NaN NaN NaN ecf251d4_0 2959.156345 end NaN NaN 0.569345 NaN
4363 ecf251d4_0 2968.147 play ['pass', 'openplay'] 931 930 932 NaN NaN NaN NaN ecf251d4_0 2969.319076 end NaN NaN 1.172076 NaN
4367 ecf251d4_0 2997.827 play ['pass', 'openplay'] 941 940 942 NaN NaN NaN NaN ecf251d4_0 2998.283227 end NaN NaN 0.456227 NaN
4375 ecf251d4_0 3029.707 play ['pass', 'openplay'] 963 962 964 NaN NaN NaN NaN ecf251d4_0 3030.127462 end NaN NaN 0.420462 NaN
4379 ecf251d4_0 3053.067 play ['pass', 'openplay'] 973 972 974 NaN NaN NaN NaN ecf251d4_0 3053.744023 end NaN NaN 0.677023 NaN

964 rows × 18 columns

#events without end timestamp
print(event_df[~event_df['e_time'].isna()].shape)
event_df[event_df['e_time'].isna()]
(3418, 18)
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes e_video_id e_time e_event e_event_attributes gap_event_start gap_event_end gap_start_end
5 1606b0e6_0 236.710 play ['pass', 'openplay'] 17 16 18 1606b0e6_0 236.248227 start NaN NaN NaN NaN NaN 0.461773 NaN NaN
7 1606b0e6_0 242.390 play ['pass', 'openplay'] 21 20 22 1606b0e6_0 241.635933 start NaN NaN NaN NaN NaN 0.754067 NaN NaN
10 1606b0e6_0 250.750 play ['pass', 'openplay'] 28 27 29 1606b0e6_0 250.223514 start NaN NaN NaN NaN NaN 0.526486 NaN NaN
13 1606b0e6_0 258.830 play ['pass', 'openplay'] 35 34 36 1606b0e6_0 258.273235 start NaN NaN NaN NaN NaN 0.556765 NaN NaN
14 1606b0e6_0 261.310 play ['pass', 'openplay'] 36 35 37 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4359 ecf251d4_0 2955.027 play ['pass', 'openplay'] 923 922 924 ecf251d4_0 2954.506795 start NaN NaN NaN NaN NaN 0.520205 NaN NaN
4362 ecf251d4_0 2964.747 play ['pass', 'openplay'] 930 929 931 ecf251d4_0 2964.347000 start NaN NaN NaN NaN NaN 0.400000 NaN NaN
4366 ecf251d4_0 2994.987 play ['pass', 'openplay'] 940 939 941 ecf251d4_0 2993.931590 start NaN NaN NaN NaN NaN 1.055410 NaN NaN
4374 ecf251d4_0 3026.987 play ['pass', 'openplay'] 962 961 963 ecf251d4_0 3025.405235 start NaN NaN NaN NaN NaN 1.581765 NaN NaN
4378 ecf251d4_0 3050.347 play ['pass', 'openplay'] 972 971 973 ecf251d4_0 3049.497881 start NaN NaN NaN NaN NaN 0.849119 NaN NaN

964 rows × 18 columns

#events without start and end timestamp
event_df[(event_df['e_time'].isna()) & (event_df['s_time'].isna())]
video_id time event event_attributes seq s_seq e_seq s_video_id s_time s_event s_event_attributes e_video_id e_time e_event e_event_attributes
14 1606b0e6_0 261.310 play ['pass', 'openplay'] 36 35 37 NaN NaN NaN NaN NaN NaN NaN NaN
25 1606b0e6_0 298.790 play ['pass', 'openplay'] 59 58 60 NaN NaN NaN NaN NaN NaN NaN NaN
48 1606b0e6_0 454.670 play ['pass', 'openplay'] 116 115 117 NaN NaN NaN NaN NaN NaN NaN NaN
52 1606b0e6_0 480.830 play ['pass', 'openplay'] 124 123 125 NaN NaN NaN NaN NaN NaN NaN NaN
168 1606b0e6_0 1222.510 play ['pass', 'openplay'] 424 423 425 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4182 ecf251d4_0 1448.227 play ['pass', 'openplay'] 452 451 453 NaN NaN NaN NaN NaN NaN NaN NaN
4289 ecf251d4_0 2234.627 play ['pass', 'openplay'] 737 736 738 NaN NaN NaN NaN NaN NaN NaN NaN
4354 ecf251d4_0 2939.387 play ['pass', 'openplay'] 914 913 915 NaN NaN NaN NaN NaN NaN NaN NaN
4355 ecf251d4_0 2942.347 play ['pass', 'openplay'] 915 914 916 NaN NaN NaN NaN NaN NaN NaN NaN
4356 ecf251d4_0 2944.227 play ['pass', 'openplay'] 916 915 917 NaN NaN NaN NaN NaN NaN NaN NaN

168 rows × 15 columns

event_df[(event_df['e_time'].isna()) & (event_df['s_time'].isna())]['event'].value_counts()
play         151
challenge     17
Name: event, dtype: int64
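
The missing start and end values above appear to come from the seq - 1 / seq + 1 lookup, which only finds a marker that sits immediately next to the event. When one interval holds several events (for example start, play, play, end), the first play has no adjacent end and the second has no adjacent start. A minimal sketch of an alternative pairing, using pd.merge_asof to take the nearest earlier start and the nearest later end within the same video, is shown below. The function name attach_interval_bounds and the toy data are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

def attach_interval_bounds(df):
    """Attach the closest preceding start (s_time) and following end (e_time) to each event."""
    df = df.sort_values("time")  # merge_asof requires both sides sorted by the key
    starts = df.loc[df["event"] == "start", ["video_id", "time"]].rename(columns={"time": "s_time"})
    ends = df.loc[df["event"] == "end", ["video_id", "time"]].rename(columns={"time": "e_time"})
    events = df[~df["event"].isin(["start", "end"])]
    out = pd.merge_asof(events, starts, left_on="time", right_on="s_time",
                        by="video_id", direction="backward")
    out = pd.merge_asof(out, ends, left_on="time", right_on="e_time",
                        by="video_id", direction="forward")
    return out.sort_values(["video_id", "time"]).reset_index(drop=True)

# Toy data: the first interval contains two events, the second contains one.
toy = pd.DataFrame({
    "video_id": ["a"] * 7,
    "time": [0.0, 1.0, 1.5, 2.5, 10.0, 11.0, 12.5],
    "event": ["start", "play", "play", "end", "start", "challenge", "end"],
})
matched = attach_interval_bounds(toy)
print(matched[["time", "event", "s_time", "e_time"]])
```

One caveat of this sketch: an event lying outside every interval would still be matched to some earlier start and later end, so checking that e_time - s_time equals 2.5 is a sensible sanity test before trusting the pairing.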
#the gap between start and event, event and end, and start and end
event_df['gap_event_start'] = event_df['time']- event_df['s_time']
event_df['gap_event_end'] = event_df['e_time']- event_df['time']
event_df['gap_start_end'] = event_df['e_time']- event_df['s_time']
event_df[['gap_event_start', 'gap_event_end', 'gap_start_end']].describe()
gap_event_start gap_event_end gap_start_end
count 3418.000000 3418.000000 2.622000e+03
mean 1.122575 1.278801 2.500000e+00
std 0.496239 0.502026 2.275300e-14
min 0.329385 0.400928 2.500000e+00
25% 0.667994 0.851162 2.500000e+00
50% 1.066239 1.278637 2.500000e+00
75% 1.525130 1.713679 2.500000e+00
max 2.099072 2.163005 2.500000e+00
event_df[['gap_event_start', 'gap_event_end', 'gap_start_end']].hist(bins=50)
array([[<AxesSubplot:title={'center':'gap_event_start'}>,
        <AxesSubplot:title={'center':'gap_event_end'}>],
       [<AxesSubplot:title={'center':'gap_start_end'}>, <AxesSubplot:>]],
      dtype=object)

[Figure: histograms of gap_event_start, gap_event_end, and gap_start_end, 50 bins each]

a = event_df['event'].value_counts()
a.name='cnt'
b = event_df['event'].value_counts()/event_df.shape[0]
b.name='pct'
pd.concat([a, b], axis=1)
cnt pct
play 3586 0.818348
challenge 624 0.142401
throwin 172 0.039251
event_df['gap_start_end'].unique()
array([2.5, 2.5, nan, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5])

Explore the video data

%%time
video_path = '/kaggle/input/dfl-bundesliga-data-shootout/train/1606b0e6_0.mp4'
vid = imageio.get_reader(video_path,  'ffmpeg')
fps = vid.get_meta_data()['fps']#frames per second (FPS)
print(f'frames per second (FPS): {fps}')
print('meta data of the video')
print(vid.get_meta_data())
n_frames = vid.count_frames()
print(f'number of frames: {n_frames}')
frames per second (FPS): 25.0
meta data of the video
{'plugin': 'ffmpeg', 'nframes': inf, 'ffmpeg_version': '5.1 built with gcc 10.3.0 (conda-forge gcc 10.3.0-16)', 'codec': 'h264', 'pix_fmt': 'yuv420p(progressive)', 'fps': 25.0, 'source_size': (1920, 1080), 'size': (1920, 1080), 'rotate': 0, 'duration': 3436.6}
number of frames: 85915
CPU times: user 10.2 ms, sys: 27.2 ms, total: 37.5 ms
Wall time: 1.23 s
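
The metadata reports nframes as inf, which is why the cell above calls count_frames(). As a cross-check, the frame count can also be estimated from duration and fps. The sketch below uses the two values printed above; the helper name estimate_frame_count is hypothetical, and for other videos the estimate may differ slightly from the counted value.

```python
# Values copied from the metadata printed above for 1606b0e6_0.mp4.
meta = {"fps": 25.0, "duration": 3436.6}

def estimate_frame_count(duration_s, fps):
    """Estimate the number of frames as duration times frame rate, rounded to the nearest integer."""
    return int(round(duration_s * fps))

estimated = estimate_frame_count(meta["duration"], meta["fps"])
print(f"estimated number of frames: {estimated}")
```

For this video the estimate agrees with the 85915 frames reported by count_frames().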
%%time
#display a few frames from the video

nums = [5006, 287, 5028, 5069]
for num in nums:
    image = vid.get_data(num)
    
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    timestamp = float(num)/ fps
    plt.title(f'image #{num}, timestamp={timestamp}', fontsize=20)
    plt.show()

[Figures: frames #5006, #287, #5028, and #5069 of the video, each titled with its frame number and timestamp]

CPU times: user 2.6 s, sys: 1.18 s, total: 3.78 s
Wall time: 5.92 s
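
The cell above converts a frame number to a timestamp with num / fps. The inverse conversion is useful for jumping from an annotated event time to a frame. Three of the frame numbers used above (5006, 5028, 5069) appear to correspond to the first start, challenge, and end rows of train.csv for this video at 25 FPS, rounded down. The sketch below assumes that rounding-down convention; the helper names time_to_frame and frame_to_time are hypothetical.

```python
import math

def time_to_frame(time_s, fps):
    """Index of the frame being shown at time_s (rounds down to the frame that has already started)."""
    return math.floor(time_s * fps)

def frame_to_time(frame_idx, fps):
    """Timestamp in seconds at which the given frame starts."""
    return frame_idx / fps

fps = 25.0
# First annotated interval of video 1606b0e6_0 in train.csv: start, challenge, end.
for t in (200.265822, 201.15, 202.765822):
    idx = time_to_frame(t, fps)
    print(f"time {t} s -> frame {idx} (frame starts at {frame_to_time(idx, fps)} s)")
```

The resulting index can be passed to vid.get_data(...) in the same way as the hard-coded numbers above.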
for i, img in enumerate(vid):
    print('Mean of frame %i is %1.1f.' % (i, img.mean()))
    print(f'shape of the frame is {img.shape}')
    if i>5:
        break
Mean of frame 0 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 1 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 2 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 3 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 4 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 5 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 6 is 76.6.
shape of the frame is (1080, 1920, 3)
#show a short 10-second clip from the video
#(214.23 s to 224.23 s, i.e. 5 seconds either side of the throwin annotated at 219.23 s)
tmp_file = "0.mp4"
ffmpeg_extract_subclip(
    video_path, 214.23, 224.23,
    targetname=tmp_file
)
    
Video(tmp_file, width=800)
Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
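
The clip above spans 214.23 s to 224.23 s, which is 5 seconds either side of the throwin annotated at 219.23 s in train.csv. A small helper can compute such a window for any event while keeping it inside the video. This is a sketch under assumptions: the name clip_window and its parameters are hypothetical, and the duration value is the one printed in the metadata earlier.

```python
def clip_window(event_time, before, after, duration):
    """Return (start, end) in seconds around event_time, clamped to [0, duration]."""
    start = max(0.0, event_time - before)
    end = min(duration, event_time + after)
    return start, end

# Throwin at 219.23 s in video 1606b0e6_0; duration taken from the metadata printed earlier.
start, end = clip_window(219.23, before=5.0, after=5.0, duration=3436.6)
print(f"clip from {start:.2f} s to {end:.2f} s")

# The pair can then be used as the start and end arguments of ffmpeg_extract_subclip,
# in place of the hard-coded 214.23 and 224.23.
```

Clamping matters for events near the beginning or end of a video, where a fixed offset would otherwise produce a negative start time or an end time past the last frame.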
vid.close()
del vid
gc.collect()
13002