DFL Exploratory Data Analysis and Video Processing

Part I: install packages for video data processing and analysis

packages for video data analysis: imageio and moviepy
imageio package
- imageio itself does not require additional installation. Kaggle notebooks currently ship with imageio==2.19.3
- however, imageio-ffmpeg, a required dependency for video data processing, needs to be installed
- follow the steps below to install imageio-ffmpeg offline
  - click the + Add data button in the upper right corner of your notebook
  - click on Notebook Output Files
  - enter imageio-ffmpeg in the search box and click search
  - the imageio-ffmpeg notebook appears at the top of the list; click Add
  - copy !conda install /kaggle/input/imageio-ffmpeg/*.tar.bz2 to a cell and run it
moviepy package
- as with imageio-ffmpeg, additional effort is required to install the package offline
- follow the steps below to install it
  - click the + Add data button in the upper right corner of your notebook
  - click on Notebook Output Files
  - enter moviepy in the search box and click search
  - the moviepy notebook appears at the top of the list; click Add
  - copy !conda install /kaggle/input/moviepy/*.tar.bz2 to a cell and run it
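
After running the install cells, it can be useful to confirm that the packages are actually importable before moving on. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration, and `imageio_ffmpeg` is the import name of the imageio-ffmpeg distribution.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names of the packages this notebook relies on for video processing.
required = ["imageio", "imageio_ffmpeg", "moviepy"]
print("missing packages:", missing_packages(required))
```

An empty list means all three packages were found; any name that is printed still needs to be installed.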

Part II: exploratory data analysis for video data

Understand the train data

task description: detect three kinds of player events, both the time of occurrence and the type, within these videos.
three kinds of player events:

Plays: A Play describes a player's attempt to switch ball control to another member of his team. A play event may be executed as a Pass or as a Cross.

Throw-Ins: A Throw-In refers to a situation where the game is restarted after the ball went out of play over the sideline following the touch of the opposite team. The ball must be thrown with hands, from behind and over the head of executing player.

Challenge: A Challenge is a player action during which two players of opposing teams are physically capable of either gaining or receiving ball control and attempt to do so. A Challenge requires one of the two players to touch the ball or to foul the opposing player.

Training Data
- train/ - Folder containing videos to be used as training data, comprising video recordings from eight games.
- train.csv - Event annotations for videos in the train/ folder.
  - video_id - Identifies which video the event occurred in.
  - event - The type of event occurrence, one of challenge, play, or throwin. Also present are labels start and end indicating the scoring intervals of the video.
  - event_attributes - Additional descriptive attributes for the event.
  - time - The time, in seconds, the event occurred within the video.
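
The event_attributes values look like Python lists, but when train.csv is read with pandas they most likely arrive as plain strings (and as NaN for start and end rows). A small sketch of how they could be turned into real lists, assuming that string format; `parse_attributes` is a hypothetical helper name, not part of the competition code.

```python
import ast

def parse_attributes(raw):
    """Turn "['pass', 'openplay']" into ['pass', 'openplay']; missing values become []."""
    if not isinstance(raw, str):
        # pandas represents empty cells as float NaN, which is not a str
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval
    return list(ast.literal_eval(raw))

print(parse_attributes("['pass', 'openplay']"))
print(parse_attributes(float("nan")))
```

With a DataFrame, the same function could be applied column-wise, for example with df_train['event_attributes'].apply(parse_attributes).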
Findings from the train data
- the train data contains 4382 event samples (play, throwin, challenge), plus 3418 start and 3418 end timestamps that mark the scoring intervals.
- among the 4382 event samples, ~82% are play events, ~14% are challenge events, and only ~4% are throwin events
- the gap between an event and the start timestamp directly before it ranges from about 0.3 to 2.1 seconds
- the gap between an event and the end timestamp directly after it ranges from about 0.4 to 2.2 seconds
- when an event has both a start timestamp directly before it and an end timestamp directly after it (2622 events), the gap between start and end is 2.5 seconds, up to floating-point noise
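
The gap arithmetic described above can be checked by hand on a single scoring interval. The sketch below uses the first start, challenge, and end rows of video 1606b0e6_0 from train.csv; the function name `interval_gaps` is only illustrative.

```python
def interval_gaps(start, event, end):
    """Return the three gaps (in seconds) for one start/event/end triple."""
    return {
        "gap_event_start": event - start,  # event time minus interval start
        "gap_event_end": end - event,      # interval end minus event time
        "gap_start_end": end - start,      # length of the scoring interval
    }

# First scoring interval of video 1606b0e6_0 (a challenge event).
gaps = interval_gaps(start=200.265822, event=201.150000, end=202.765822)
for name, value in gaps.items():
    print(f"{name}: {value:.6f}")
```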

Explore the video data

references:

https://imageio.readthedocs.io/en/stable/examples.html#iterate-over-frames-in-a-movie
https://imageio.readthedocs.io/en/latest/reference/userapi.html?highlight=get_data#imageio.core.format.Reader.get_meta_data
https://stackoverflow.com/questions/72773615/how-to-seek-a-frame-in-video-using-imageio
https://stackoverflow.com/questions/52257731/extract-part-of-a-video-using-ffmpeg-extract-subclip-black-frames
https://stackoverflow.com/questions/29718238/how-to-read-mp4-video-to-be-processed-by-scikit-image
https://zulko.github.io/moviepy/ref/ffmpeg.html?highlight=ffmpeg_extract_subclip#moviepy.video.io.ffmpeg_tools.ffmpeg_extract_subclip

install packages for video data processing and analysis

!conda install /kaggle/input/imageio-ffmpeg/*.tar.bz2
  

Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
  

!conda install /kaggle/input/moviepy/*.tar.bz2
  

Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
  

load packages

#basic libs

import pandas as pd
import numpy as np
import os
from pathlib import Path

from datetime import datetime, timedelta
import time
from dateutil.relativedelta import relativedelta

import gc
import copy


#additional data processing


from sklearn.preprocessing import StandardScaler, MinMaxScaler


#visualization
import seaborn as sns
import matplotlib.pyplot as plt

#load images
import matplotlib.image as mpimg
import PIL
from PIL import Image

#for loading videos
import imageio
import imageio.v2 as iio
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip  #to extract a sub clip from a video
from IPython.display import Video #to play video in notebook


#settings

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

Image.MAX_IMAGE_PIXELS = None

import warnings
warnings.filterwarnings("ignore")

import pytorch_lightning as pl
random_seed=1234
pl.seed_everything(random_seed)


  

Load and understand train data

df_train = pd.read_csv("/kaggle/input/dfl-bundesliga-data-shootout/train.csv")
df_train.shape
  

(11218, 4)

df_train.head(2)
  

	video_id	time	event	event_attributes
0	1606b0e6_0	200.265822	start	NaN
1	1606b0e6_0	201.150000	challenge	['ball_action_forced']

df_train['event'].value_counts()
  

play         3586
start        3418
end          3418
challenge     624
throwin       172
Name: event, dtype: int64
  

df_train['event_attributes'].value_counts()
  

['pass', 'openplay']                  3337
['ball_action_forced']                 239
['pass']                               154
['opponent_dispossessed']              138
['pass', 'freekick']                   127
['fouled']                             111
['cross', 'openplay']                   80
['challenge_during_ball_transfer']      53
['possession_retained']                 44
['opponent_rounded']                    39
['cross', 'corner']                     33
['cross']                               18
['cross', 'freekick']                    5
['pass', 'corner']                       4
Name: event_attributes, dtype: int64
  

df_train['video_id'].value_counts()
  

1606b0e6_1    1249
35bd9041_0    1075
3c993bd2_0    1042
1606b0e6_0    1000
ecf251d4_0     980
3c993bd2_1     966
35bd9041_1     933
407c5a9e_1     858
cfbe2e94_0     823
4ffd5986_0     792
cfbe2e94_1     763
9a97dae4_1     737
Name: video_id, dtype: int64
  

df_train[df_train['video_id']=='1606b0e6_0'].head(10)
  

	video_id	time	event	event_attributes
0	1606b0e6_0	200.265822	start	NaN
1	1606b0e6_0	201.150000	challenge	['ball_action_forced']
2	1606b0e6_0	202.765822	end	NaN
3	1606b0e6_0	210.124111	start	NaN
4	1606b0e6_0	210.870000	challenge	['opponent_dispossessed']
5	1606b0e6_0	212.624111	end	NaN
6	1606b0e6_0	217.850213	start	NaN
7	1606b0e6_0	219.230000	throwin	['pass']
8	1606b0e6_0	220.350213	end	NaN
9	1606b0e6_0	223.930850	start	NaN

df_train[df_train['video_id']=='1606b0e6_0']['event'].value_counts()
  

play         319
start        302
end          302
challenge     56
throwin       21
Name: event, dtype: int64
  

#number the rows within each video in time order (1, 2, 3, ...)
df_train.sort_values(by=['video_id', 'time'], ascending=[True, True], inplace=True)
df_train['seq'] = df_train.groupby('video_id').cumcount() + 1
  

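#split the rows into start markers, end markers, and actual events
#.copy() avoids pandas' SettingWithCopyWarning when columns are renamed or added below
start_df = df_train[df_train['event']=='start'].copy()
end_df = df_train[df_train['event']=='end'].copy()
event_df = df_train[~df_train['event'].isin(['start', 'end'])].copy()
print(start_df.shape, end_df.shape, event_df.shape)

#prefix the columns so they stay distinguishable after the merges
start_df.columns = [f's_{c}' for c in start_df.columns]
end_df.columns = [f'e_{c}' for c in end_df.columns]
display(start_df.head(2))
display(end_df.head(2))

#an event's start marker, if any, is the row right before it; its end marker is the row right after it
event_df['s_seq'] = event_df['seq'] - 1
event_df['e_seq'] = event_df['seq'] + 1

display(event_df.head(2))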
  

(3418, 5) (3418, 5) (4382, 5)

	s_video_id	s_time	s_event	s_event_attributes	s_seq
0	1606b0e6_0	200.265822	start	NaN	1
3	1606b0e6_0	210.124111	start	NaN	4

	e_video_id	e_time	e_event	e_event_attributes	e_seq
2	1606b0e6_0	202.765822	end	NaN	3
5	1606b0e6_0	212.624111	end	NaN	6

	video_id	time	event	event_attributes	seq	s_seq	e_seq
1	1606b0e6_0	201.15	challenge	['ball_action_forced']	2	1	3
4	1606b0e6_0	210.87	challenge	['opponent_dispossessed']	5	4	6

print(event_df.shape)
display(event_df.head(2))
event_df = event_df.merge(start_df, left_on=['video_id', 's_seq'], right_on=['s_video_id', 's_seq'], how='left')
print(event_df.shape)
display(event_df.head(2))
event_df = event_df.merge(end_df, left_on=['video_id', 'e_seq'], right_on=['e_video_id', 'e_seq'], how='left')
print(event_df.shape)
display(event_df.head(2))
  

(4382, 7)

	video_id	time	event	event_attributes	seq	s_seq	e_seq
1	1606b0e6_0	201.15	challenge	['ball_action_forced']	2	1	3
4	1606b0e6_0	210.87	challenge	['opponent_dispossessed']	5	4	6

(4382, 11)

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes
0	1606b0e6_0	201.15	challenge	['ball_action_forced']	2	1	3	1606b0e6_0	200.265822	start	NaN
1	1606b0e6_0	210.87	challenge	['opponent_dispossessed']	5	4	6	1606b0e6_0	210.124111	start	NaN

(4382, 15)

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes	e_video_id	e_time	e_event	e_event_attributes
0	1606b0e6_0	201.15	challenge	['ball_action_forced']	2	1	3	1606b0e6_0	200.265822	start	NaN	1606b0e6_0	202.765822	end	NaN
1	1606b0e6_0	210.87	challenge	['opponent_dispossessed']	5	4	6	1606b0e6_0	210.124111	start	NaN	1606b0e6_0	212.624111	end	NaN

#validate data: 
#start timestamp should be no larger than the event's timestamp
#end timestamp should be no smaller than the event's timestamp
event_df[(event_df['s_time']>event_df['time'])|(event_df['e_time']<event_df['time'])]
  

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes	e_video_id	e_time	e_event	e_event_attributes

#print the number of events that have a start timestamp, then show the events that lack one
print(event_df[~event_df['s_time'].isna()].shape)
event_df[event_df['s_time'].isna()]
  

(3418, 18)

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes	e_video_id	e_time	e_event	e_event_attributes	gap_event_start	gap_event_end	gap_start_end
6	1606b0e6_0	239.350	play	['pass', 'openplay']	18	17	19	NaN	NaN	NaN	NaN	1606b0e6_0	240.401851	end	NaN	NaN	1.051851	NaN
8	1606b0e6_0	244.590	play	['pass', 'openplay']	22	21	23	NaN	NaN	NaN	NaN	1606b0e6_0	246.030453	end	NaN	NaN	1.440453	NaN
11	1606b0e6_0	253.470	play	['pass', 'openplay']	29	28	30	NaN	NaN	NaN	NaN	1606b0e6_0	253.990761	end	NaN	NaN	0.520761	NaN
14	1606b0e6_0	261.310	play	['pass', 'openplay']	36	35	37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	1606b0e6_0	263.150	play	['pass', 'openplay']	37	36	38	NaN	NaN	NaN	NaN	1606b0e6_0	265.019283	end	NaN	NaN	1.869283	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4360	ecf251d4_0	2958.587	play	['pass', 'openplay']	924	923	925	NaN	NaN	NaN	NaN	ecf251d4_0	2959.156345	end	NaN	NaN	0.569345	NaN
4363	ecf251d4_0	2968.147	play	['pass', 'openplay']	931	930	932	NaN	NaN	NaN	NaN	ecf251d4_0	2969.319076	end	NaN	NaN	1.172076	NaN
4367	ecf251d4_0	2997.827	play	['pass', 'openplay']	941	940	942	NaN	NaN	NaN	NaN	ecf251d4_0	2998.283227	end	NaN	NaN	0.456227	NaN
4375	ecf251d4_0	3029.707	play	['pass', 'openplay']	963	962	964	NaN	NaN	NaN	NaN	ecf251d4_0	3030.127462	end	NaN	NaN	0.420462	NaN
4379	ecf251d4_0	3053.067	play	['pass', 'openplay']	973	972	974	NaN	NaN	NaN	NaN	ecf251d4_0	3053.744023	end	NaN	NaN	0.677023	NaN

964 rows × 18 columns

#print the number of events that have an end timestamp, then show the events that lack one
print(event_df[~event_df['e_time'].isna()].shape)
event_df[event_df['e_time'].isna()]
  

(3418, 18)

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes	e_video_id	e_time	e_event	e_event_attributes	gap_event_start	gap_event_end	gap_start_end
5	1606b0e6_0	236.710	play	['pass', 'openplay']	17	16	18	1606b0e6_0	236.248227	start	NaN	NaN	NaN	NaN	NaN	0.461773	NaN	NaN
7	1606b0e6_0	242.390	play	['pass', 'openplay']	21	20	22	1606b0e6_0	241.635933	start	NaN	NaN	NaN	NaN	NaN	0.754067	NaN	NaN
10	1606b0e6_0	250.750	play	['pass', 'openplay']	28	27	29	1606b0e6_0	250.223514	start	NaN	NaN	NaN	NaN	NaN	0.526486	NaN	NaN
13	1606b0e6_0	258.830	play	['pass', 'openplay']	35	34	36	1606b0e6_0	258.273235	start	NaN	NaN	NaN	NaN	NaN	0.556765	NaN	NaN
14	1606b0e6_0	261.310	play	['pass', 'openplay']	36	35	37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4359	ecf251d4_0	2955.027	play	['pass', 'openplay']	923	922	924	ecf251d4_0	2954.506795	start	NaN	NaN	NaN	NaN	NaN	0.520205	NaN	NaN
4362	ecf251d4_0	2964.747	play	['pass', 'openplay']	930	929	931	ecf251d4_0	2964.347000	start	NaN	NaN	NaN	NaN	NaN	0.400000	NaN	NaN
4366	ecf251d4_0	2994.987	play	['pass', 'openplay']	940	939	941	ecf251d4_0	2993.931590	start	NaN	NaN	NaN	NaN	NaN	1.055410	NaN	NaN
4374	ecf251d4_0	3026.987	play	['pass', 'openplay']	962	961	963	ecf251d4_0	3025.405235	start	NaN	NaN	NaN	NaN	NaN	1.581765	NaN	NaN
4378	ecf251d4_0	3050.347	play	['pass', 'openplay']	972	971	973	ecf251d4_0	3049.497881	start	NaN	NaN	NaN	NaN	NaN	0.849119	NaN	NaN

964 rows × 18 columns

#events with neither a start nor an end timestamp
event_df[(event_df['e_time'].isna()) & (event_df['s_time'].isna())]
  

	video_id	time	event	event_attributes	seq	s_seq	e_seq	s_video_id	s_time	s_event	s_event_attributes	e_video_id	e_time	e_event	e_event_attributes
14	1606b0e6_0	261.310	play	['pass', 'openplay']	36	35	37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25	1606b0e6_0	298.790	play	['pass', 'openplay']	59	58	60	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
48	1606b0e6_0	454.670	play	['pass', 'openplay']	116	115	117	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
52	1606b0e6_0	480.830	play	['pass', 'openplay']	124	123	125	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
168	1606b0e6_0	1222.510	play	['pass', 'openplay']	424	423	425	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4182	ecf251d4_0	1448.227	play	['pass', 'openplay']	452	451	453	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4289	ecf251d4_0	2234.627	play	['pass', 'openplay']	737	736	738	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4354	ecf251d4_0	2939.387	play	['pass', 'openplay']	914	913	915	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4355	ecf251d4_0	2942.347	play	['pass', 'openplay']	915	914	916	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4356	ecf251d4_0	2944.227	play	['pass', 'openplay']	916	915	917	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

168 rows × 15 columns

event_df[(event_df['e_time'].isna()) & (event_df['s_time'].isna())]['event'].value_counts()
  

play         151
challenge     17
Name: event, dtype: int64
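
One reading of the missing values above is that a scoring interval can contain more than one event, so the row right before or after an event is sometimes another event rather than a start or end marker. A minimal sketch, assuming that reading, walks the time-sorted rows of one video and attaches every event to the interval that encloses it. The function name `attach_intervals` is hypothetical; the sample rows are the ones with seq 16 to 19 of video 1606b0e6_0.

```python
def attach_intervals(rows):
    """rows: (time, event) tuples for one video, sorted by time.

    Returns (time, event, interval_start, interval_end) for every non-marker row.
    Events outside any interval get None for both bounds.
    """
    result = []
    open_start = None   # start time of the interval currently open, if any
    pending = []        # events seen since the last start marker
    for time, event in rows:
        if event == "start":
            # flush events of an interval that never got an end marker
            result.extend((t, e, open_start, None) for t, e in pending)
            open_start, pending = time, []
        elif event == "end":
            result.extend((t, e, open_start, time) for t, e in pending)
            open_start, pending = None, []
        elif open_start is not None:
            pending.append((time, event))
        else:
            result.append((time, event, None, None))
    # events after the last start marker with no closing end marker
    result.extend((t, e, open_start, None) for t, e in pending)
    return result

rows = [
    (236.248227, "start"),
    (236.710, "play"),
    (239.350, "play"),
    (240.401851, "end"),
]
for item in attach_intervals(rows):
    print(item)
```

Under this approach both play events share the interval from 236.248227 to 240.401851, whereas the neighbour-based merge gives the first play only a start and the second only an end.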
  

#the gap between start and event, event and end, and start and end
event_df['gap_event_start'] = event_df['time']- event_df['s_time']
event_df['gap_event_end'] = event_df['e_time']- event_df['time']
event_df['gap_start_end'] = event_df['e_time']- event_df['s_time']
  

event_df[['gap_event_start', 'gap_event_end', 'gap_start_end']].describe()
  

	gap_event_start	gap_event_end	gap_start_end
count	3418.000000	3418.000000	2.622000e+03
mean	1.122575	1.278801	2.500000e+00
std	0.496239	0.502026	2.275300e-14
min	0.329385	0.400928	2.500000e+00
25%	0.667994	0.851162	2.500000e+00
50%	1.066239	1.278637	2.500000e+00
75%	1.525130	1.713679	2.500000e+00
max	2.099072	2.163005	2.500000e+00

event_df[['gap_event_start', 'gap_event_end', 'gap_start_end']].hist(bins=50)
  

[Figure: histograms (50 bins) of gap_event_start, gap_event_end, and gap_start_end]

a = event_df['event'].value_counts()
a.name='cnt'
b = event_df['event'].value_counts()/event_df.shape[0]
b.name='pct'
pd.concat([a, b], axis=1)
  

	cnt	pct
play	3586	0.818348
challenge	624	0.142401
throwin	172	0.039251

event_df['gap_start_end'].unique()
  

array([2.5, 2.5, nan, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5])

Explore the video data

%%time
video_path = '/kaggle/input/dfl-bundesliga-data-shootout/train/1606b0e6_0.mp4'
vid = imageio.get_reader(video_path,  'ffmpeg')
fps = vid.get_meta_data()['fps']#frames per second (FPS)
print(f'frames per second (FPS): {fps}')
print('meta data of the video')
print(vid.get_meta_data())
n_frames = vid.count_frames()
print(f'number of frames: {n_frames}')
  

frames per second (FPS): 25.0
meta data of the video
{'plugin': 'ffmpeg', 'nframes': inf, 'ffmpeg_version': '5.1 built with gcc 10.3.0 (conda-forge gcc 10.3.0-16)', 'codec': 'h264', 'pix_fmt': 'yuv420p(progressive)', 'fps': 25.0, 'source_size': (1920, 1080), 'size': (1920, 1080), 'rotate': 0, 'duration': 3436.6}
number of frames: 85915
CPU times: user 10.2 ms, sys: 27.2 ms, total: 37.5 ms
Wall time: 1.23 s
  

%%time
#display a few frames from the video

nums = [5006, 287, 5028, 5069]
for num in nums:
    image = vid.get_data(num)
    
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    timestamp = float(num)/ fps
    plt.title(f'image #{num}, timestamp={timestamp}', fontsize=20)
    plt.show()
  

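[Figure: frames 5006, 287, 5028, and 5069 of the video, each titled with its frame number and timestamp]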

CPU times: user 2.6 s, sys: 1.18 s, total: 3.78 s
Wall time: 5.92 s
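
Frame numbers and timestamps are related through the frame rate. The sketch below converts between the two at 25 FPS; flooring the product is an assumption, chosen because it reproduces frames 5006, 5028, and 5069 from the start, event, and end timestamps of the first challenge in train.csv. The helper names are illustrative only.

```python
import math

def time_to_frame(seconds, fps):
    """Index of the frame being shown at `seconds` (assumes flooring)."""
    return math.floor(seconds * fps)

def frame_to_time(frame, fps):
    """Timestamp in seconds at which `frame` starts."""
    return frame / fps

fps = 25.0  # value reported by the video metadata above
for t in (200.265822, 201.15, 202.765822):
    frame = time_to_frame(t, fps)
    print(f"{t} s -> frame {frame} -> {frame_to_time(frame, fps)} s")
```

The same relation explains the frame count reported earlier: a duration of 3436.6 seconds at 25 FPS gives 85915 frames.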
  

for i, img in enumerate(vid):
    print('Mean of frame %i is %1.1f.' % (i, img.mean()))
    print(f'shape of the frame is {img.shape}')
    if i>5:
        break
  

Mean of frame 0 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 1 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 2 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 3 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 4 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 5 is 76.6.
shape of the frame is (1080, 1920, 3)
Mean of frame 6 is 76.6.
shape of the frame is (1080, 1920, 3)
  

#show a short clip (seconds 214.23 to 224.23) from the video
tmp_file = "0.mp4"
ffmpeg_extract_subclip(
    video_path, 214.23, 224.23,
    targetname=tmp_file
)
    
Video(tmp_file, width=800)
  

Moviepy - Running:
>>> "+ " ".join(cmd)
Moviepy - Command successful
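
The clip boundaries used above match a window of 5 seconds on either side of the throwin at 219.23 seconds in train.csv. A small sketch of how such a window could be computed for any event, clamped to the video's duration; `clip_window` and its parameters are hypothetical names for this illustration.

```python
def clip_window(event_time, before, after, duration):
    """Return (start, end) in seconds around an event, kept inside [0, duration]."""
    start = max(0.0, event_time - before)
    end = min(duration, event_time + after)
    return start, end

duration = 3436.6  # seconds, from the video metadata
start, end = clip_window(219.23, before=5.0, after=5.0, duration=duration)
print(f"clip from {start:.2f} s to {end:.2f} s")
```

The resulting start and end values can then be passed to ffmpeg_extract_subclip in place of the hard-coded numbers.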
  

vid.close()
del vid
gc.collect()