1. download data from Yahoo Finance
In this notebook, I will use Pfizer Inc. (PFE) and GlaxoSmithKline plc (GSK) as examples.
This section shows how to download historical stock price data from Yahoo Finance. The two downloaded files are named `PFE.csv` and `GSK.csv` by default.
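The notebook assumes the two CSVs were downloaded manually from the Yahoo Finance website. As an alternative, the snippet below is a minimal sketch using the `yfinance` package (an assumption on my part: `yfinance` is not used in the original, and depending on its version you may need `auto_adjust=False` to keep a separate `Adj Close` column):

import yfinance as yf

for ticker in ['PFE', 'GSK']:
    hist = yf.download(ticker, start='2009-01-01', end='2020-12-31',
                       auto_adjust=False)
    hist.to_csv(f'data/{ticker}.csv')  # same layout as the manual download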
2. load raw csv files
The raw data downloaded from Yahoo Finance has the layout shown below. I will combine `PFE.csv` and `GSK.csv` into one pandas DataFrame.
import pandas as pd

df1 = pd.read_csv('data/PFE.csv')
df1['id'] = 'PFE'
df1.head(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
0 | 2009-01-02 | 16.963947 | 17.362429 | 16.793169 | 17.333965 | 10.938112 | 30274000 | PFE |
1 | 2009-01-05 | 17.457306 | 17.533207 | 16.963947 | 17.229601 | 10.872255 | 44439200 | PFE |
2 | 2009-01-06 | 17.381405 | 17.495256 | 16.802656 | 16.888046 | 10.656727 | 53984400 | PFE |
df2 = pd.read_csv('data/GSK.csv')
df2['id'] = 'GSK'
df2.head(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
0 | 2009-01-02 | 36.520000 | 37.099998 | 36.439999 | 36.970001 | 19.927216 | 905400 | GSK |
1 | 2009-01-05 | 36.110001 | 36.529999 | 35.779999 | 36.380001 | 19.609205 | 1563200 | GSK |
2 | 2009-01-06 | 36.880001 | 38.000000 | 36.270000 | 37.759998 | 20.353031 | 2270600 | GSK |
df = pd.concat([df1, df2])
df.head(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
0 | 2009-01-02 | 16.963947 | 17.362429 | 16.793169 | 17.333965 | 10.938112 | 30274000 | PFE |
1 | 2009-01-05 | 17.457306 | 17.533207 | 16.963947 | 17.229601 | 10.872255 | 44439200 | PFE |
2 | 2009-01-06 | 17.381405 | 17.495256 | 16.802656 | 16.888046 | 10.656727 | 53984400 | PFE |
df.tail(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
3017 | 2020-12-28 | 36.790001 | 36.790001 | 36.169998 | 36.290001 | 36.290001 | 2876300 | GSK |
3018 | 2020-12-29 | 37.209999 | 37.380001 | 36.860001 | 36.980000 | 36.980000 | 4666300 | GSK |
3019 | 2020-12-30 | 37.169998 | 37.240002 | 36.900002 | 37.040001 | 37.040001 | 3071400 | GSK |
print(df1.shape, df2.shape, df.shape)
(3020, 8) (3020, 8) (6040, 8)
df.sort_values(by=['id', 'Date'], ascending=[True, True], inplace=True)
df.head(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
0 | 2009-01-02 | 36.520000 | 37.099998 | 36.439999 | 36.970001 | 19.927216 | 905400 | GSK |
1 | 2009-01-05 | 36.110001 | 36.529999 | 35.779999 | 36.380001 | 19.609205 | 1563200 | GSK |
2 | 2009-01-06 | 36.880001 | 38.000000 | 36.270000 | 37.759998 | 20.353031 | 2270600 | GSK |
df.tail(3)
| | Date | Open | High | Low | Close | Adj Close | Volume | id |
|---|---|---|---|---|---|---|---|---|
3017 | 2020-12-28 | 37.360001 | 37.580002 | 36.680000 | 36.820000 | 36.820000 | 26993700 | PFE |
3018 | 2020-12-29 | 36.900002 | 37.200001 | 36.790001 | 37.049999 | 37.049999 | 23152100 | PFE |
3019 | 2020-12-30 | 37.029999 | 37.240002 | 36.700001 | 36.740002 | 36.740002 | 24837900 | PFE |
sel_cols = ['id', 'Date', 'Adj Close', 'Volume']
df = df[sel_cols].copy(deep=True)
df.head(3)
| | id | Date | Adj Close | Volume |
|---|---|---|---|---|
0 | GSK | 2009-01-02 | 19.927216 | 905400 |
1 | GSK | 2009-01-05 | 19.609205 | 1563200 |
2 | GSK | 2009-01-06 | 20.353031 | 2270600 |
3. prepare data by `roll_time_series` and create features via `extract_features`
Use the `roll_time_series` function to prepare the data: it converts the data into the format shown below so that it can be consumed by the `extract_features` function.
Key parameters of `roll_time_series`:
- `column_id`: use the `id` column.
- `column_sort`: use the `Date` column.
- `max_timeshift`: set to 21 (the function will roll windows of 22 rows). This parameter can be any integer between 1 and the number of rows in the raw DataFrame. I set it to 21 to get features that capture the patterns of the past 30 calendar days (~22 business days).
- `min_timeshift`: set to 21. This parameter must be no larger than `max_timeshift`. When it is smaller than `max_timeshift`, `roll_time_series` will also roll the initial few rows of each `id`, starting from the first row that has `min_timeshift` rows to roll. I set it equal to `max_timeshift` so that every window has the full length.
- `rolling_direction`: the default value is `1`; if set to `-1`, the rolling runs in reverse order.
`roll_time_series` transforms the data into the format below. It duplicates rows, increasing the row count to (original_rows_count − 21 × number_of_unique_ids) × 22 (i.e. (6040 − 21 × 2) × 22 = 131956). `extract_features` then collapses the DataFrame back to original_rows_count − 21 × number_of_unique_ids = 5998 rows.
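Before running it on the stock data, here is a minimal sketch of `roll_time_series` on toy data (my own example, not the notebook's). With 4 rows, `max_timeshift=2` and `min_timeshift=2`, the formula above gives (4 − 2 × 1) × 3 = 6 rows:

import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series

toy = pd.DataFrame({'id': ['A'] * 4,
                    'Date': pd.date_range('2021-01-01', periods=4),
                    'value': [1.0, 2.0, 3.0, 4.0]})
rolled = roll_time_series(toy, column_id='id', column_sort='Date',
                          max_timeshift=2, min_timeshift=2,
                          disable_progressbar=True)
print(rolled.shape)  # (6, 3): two windows of 3 rows each, keyed by (id, window end date)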
from tsfresh.utilities.dataframe_functions import roll_time_series
from tsfresh import extract_features
roll_rows = 21
df_rolled = roll_time_series(df, column_id="id", column_sort="Date", max_timeshift=roll_rows, min_timeshift=roll_rows, disable_progressbar=True)
df_features = extract_features(df_rolled, column_id="id", column_sort="Date")
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [03:04<00:00, 9.24s/it]
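Extraction with the default `ComprehensiveFCParameters` takes about three minutes here. If that is too slow, two common speed-ups are sketched below (the notebook itself sticks with the defaults):

from tsfresh.feature_extraction import EfficientFCParameters

# EfficientFCParameters skips the most expensive features;
# n_jobs sets the number of worker processes
df_features_fast = extract_features(df_rolled, column_id="id", column_sort="Date",
                                    default_fc_parameters=EfficientFCParameters(),
                                    n_jobs=4)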
print(df.shape, df_rolled.shape, df_features.shape, df_features.shape[0]-df.shape[0])
(6040, 4) (131956, 4) (5998, 1558) -42
Note that 1558 = 779 features × 2 value columns (`Adj Close` and `Volume`), and 5998 = 6040 − 21 × 2, matching the formula above.

df_rolled.head(3)
| | id | Date | Adj Close | Volume |
|---|---|---|---|---|
0 | (GSK, 2009-02-03) | 2009-01-02 | 19.927216 | 905400 |
1 | (GSK, 2009-02-03) | 2009-01-05 | 19.609205 | 1563200 |
2 | (GSK, 2009-02-03) | 2009-01-06 | 20.353031 | 2270600 |
df_rolled[df_rolled['Date']=='2009-02-03']
| | id | Date | Adj Close | Volume |
|---|---|---|---|---|
21 | (GSK, 2009-02-03) | 2009-02-03 | 19.593031 | 1350600 |
64 | (GSK, 2009-02-04) | 2009-02-03 | 19.593031 | 1350600 |
107 | (GSK, 2009-02-05) | 2009-02-03 | 19.593031 | 1350600 |
150 | (GSK, 2009-02-06) | 2009-02-03 | 19.593031 | 1350600 |
193 | (GSK, 2009-02-09) | 2009-02-03 | 19.593031 | 1350600 |
236 | (GSK, 2009-02-10) | 2009-02-03 | 19.593031 | 1350600 |
279 | (GSK, 2009-02-11) | 2009-02-03 | 19.593031 | 1350600 |
322 | (GSK, 2009-02-12) | 2009-02-03 | 19.593031 | 1350600 |
365 | (GSK, 2009-02-13) | 2009-02-03 | 19.593031 | 1350600 |
408 | (GSK, 2009-02-17) | 2009-02-03 | 19.593031 | 1350600 |
451 | (GSK, 2009-02-18) | 2009-02-03 | 19.593031 | 1350600 |
494 | (GSK, 2009-02-19) | 2009-02-03 | 19.593031 | 1350600 |
537 | (GSK, 2009-02-20) | 2009-02-03 | 19.593031 | 1350600 |
580 | (GSK, 2009-02-23) | 2009-02-03 | 19.593031 | 1350600 |
623 | (GSK, 2009-02-24) | 2009-02-03 | 19.593031 | 1350600 |
666 | (GSK, 2009-02-25) | 2009-02-03 | 19.593031 | 1350600 |
709 | (GSK, 2009-02-26) | 2009-02-03 | 19.593031 | 1350600 |
752 | (GSK, 2009-02-27) | 2009-02-03 | 19.593031 | 1350600 |
795 | (GSK, 2009-03-02) | 2009-02-03 | 19.593031 | 1350600 |
838 | (GSK, 2009-03-03) | 2009-02-03 | 19.593031 | 1350600 |
881 | (GSK, 2009-03-04) | 2009-02-03 | 19.593031 | 1350600 |
924 | (GSK, 2009-03-05) | 2009-02-03 | 19.593031 | 1350600 |
43 | (PFE, 2009-02-03) | 2009-02-03 | 9.100128 | 66814700 |
86 | (PFE, 2009-02-04) | 2009-02-03 | 9.100128 | 66814700 |
129 | (PFE, 2009-02-05) | 2009-02-03 | 9.100128 | 66814700 |
172 | (PFE, 2009-02-06) | 2009-02-03 | 9.100128 | 66814700 |
215 | (PFE, 2009-02-09) | 2009-02-03 | 9.100128 | 66814700 |
258 | (PFE, 2009-02-10) | 2009-02-03 | 9.100128 | 66814700 |
301 | (PFE, 2009-02-11) | 2009-02-03 | 9.100128 | 66814700 |
344 | (PFE, 2009-02-12) | 2009-02-03 | 9.100128 | 66814700 |
387 | (PFE, 2009-02-13) | 2009-02-03 | 9.100128 | 66814700 |
430 | (PFE, 2009-02-17) | 2009-02-03 | 9.100128 | 66814700 |
473 | (PFE, 2009-02-18) | 2009-02-03 | 9.100128 | 66814700 |
516 | (PFE, 2009-02-19) | 2009-02-03 | 9.100128 | 66814700 |
559 | (PFE, 2009-02-20) | 2009-02-03 | 9.100128 | 66814700 |
602 | (PFE, 2009-02-23) | 2009-02-03 | 9.100128 | 66814700 |
645 | (PFE, 2009-02-24) | 2009-02-03 | 9.100128 | 66814700 |
688 | (PFE, 2009-02-25) | 2009-02-03 | 9.100128 | 66814700 |
731 | (PFE, 2009-02-26) | 2009-02-03 | 9.100128 | 66814700 |
774 | (PFE, 2009-02-27) | 2009-02-03 | 9.100128 | 66814700 |
817 | (PFE, 2009-03-02) | 2009-02-03 | 9.100128 | 66814700 |
860 | (PFE, 2009-03-03) | 2009-02-03 | 9.100128 | 66814700 |
903 | (PFE, 2009-03-04) | 2009-02-03 | 9.100128 | 66814700 |
946 | (PFE, 2009-03-05) | 2009-02-03 | 9.100128 | 66814700 |
4. `impute` function and dealing with missing values
I will first remove features (i.e. columns) with more than 10% of their values missing, and then impute the remaining features.
The `impute` function replaces all `NaN`s and `inf`s in place. Note that it only accepts numeric features; an error is raised if the DataFrame passed to it contains non-numeric columns.
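A minimal sketch of its behavior on toy data (my own example, not the notebook's): `NaN` becomes the column median, `+inf` the column maximum, and `-inf` the column minimum.

import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import impute

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.inf, 2.0, -np.inf]})
impute(toy)
print(toy)  # a -> [1.0, 2.0, 3.0]; b -> [2.0, 2.0, 2.0]

The extracted feature matrix, before any columns are dropped, looks like this: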
| | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | Adj Close__has_duplicate | Adj Close__sum_values | Adj Close__abs_energy | Adj Close__mean_abs_change | Adj Close__mean_change | Adj Close__mean_second_derivative_central | Adj Close__median | ... | Volume__fourier_entropy__bins_2 | Volume__fourier_entropy__bins_3 | Volume__fourier_entropy__bins_5 | Volume__fourier_entropy__bins_10 | Volume__fourier_entropy__bins_100 | Volume__permutation_entropy__dimension_3__tau_1 | Volume__permutation_entropy__dimension_4__tau_1 | Volume__permutation_entropy__dimension_5__tau_1 | Volume__permutation_entropy__dimension_6__tau_1 | Volume__permutation_entropy__dimension_7__tau_1 |
GSK | 2009-02-03 | 0.0 | 0.0 | 0.0 | 0.0 | 431.952223 | 8494.084498 | 0.359854 | -0.015914 | 0.024660 | 19.601118 | ... | 0.286836 | 0.286836 | 0.823959 | 1.424130 | 2.369382 | 1.692281 | 2.260234 | 2.707270 | 2.833213 | 2.772589 |
2009-02-04 | 0.0 | 0.0 | 0.0 | 0.0 | 431.763572 | 8486.601509 | 0.351641 | 0.006160 | -0.014957 | 19.601118 | ... | 0.286836 | 0.721464 | 0.983088 | 1.539654 | 2.369382 | 1.748067 | 2.378620 | 2.813355 | 2.833213 | 2.772589 |
2 rows × 1558 columns
na_cnt = df_features.isna().sum().sort_values(ascending=False)
na_cnt.head()
Adj Close__fft_coefficient__attr_"abs"__coeff_47 5998
Volume__fft_coefficient__attr_"real"__coeff_17 5998
Volume__fft_coefficient__attr_"real"__coeff_26 5998
Volume__fft_coefficient__attr_"real"__coeff_25 5998
Volume__fft_coefficient__attr_"real"__coeff_24 5998
dtype: int64
These columns are missing in all 5998 rows because a 22-point window only supports FFT coefficients up to index 11, so the higher-order `fft_coefficient` features can never be computed.
ratio = 0.1
print(ratio*df_features.shape[0])
drop_cols = na_cnt[na_cnt>ratio*df_features.shape[0]].index.tolist()
print('before dropping columns: ', df_features.shape)
df_features.drop(columns=drop_cols, inplace=True)
print('after dropping columns: ', df_features.shape)
599.8000000000001
before dropping columns: (5998, 1558)
after dropping columns: (5998, 815)
na_cnt = df_features.isna().sum().sort_values(ascending=False)
na_cnt.head()
Volume__sample_entropy 245
Adj Close__sample_entropy 180
Volume__max_langevin_fixed_point__m_3__r_30 18
Volume__friedrich_coefficients__coeff_3__m_3__r_30 18
Volume__friedrich_coefficients__coeff_2__m_3__r_30 18
dtype: int64
from tsfresh.utilities.dataframe_functions import impute
impute(df_features)
na_cnt = df_features.isna().sum().sort_values(ascending=False)
na_cnt.head()
Volume__permutation_entropy__dimension_7__tau_1 0
Adj Close__fft_coefficient__attr_"real"__coeff_4 0
Adj Close__fft_coefficient__attr_"imag"__coeff_2 0
Adj Close__fft_coefficient__attr_"imag"__coeff_1 0
Adj Close__fft_coefficient__attr_"imag"__coeff_0 0
dtype: int64
df_features.reset_index(inplace=True)
df_features.head(2)
| level_0 | level_1 | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | Adj Close__has_duplicate | Adj Close__sum_values | Adj Close__abs_energy | Adj Close__mean_abs_change | Adj Close__mean_change | ... | Volume__fourier_entropy__bins_2 | Volume__fourier_entropy__bins_3 | Volume__fourier_entropy__bins_5 | Volume__fourier_entropy__bins_10 | Volume__fourier_entropy__bins_100 | Volume__permutation_entropy__dimension_3__tau_1 | Volume__permutation_entropy__dimension_4__tau_1 | Volume__permutation_entropy__dimension_5__tau_1 | Volume__permutation_entropy__dimension_6__tau_1 | Volume__permutation_entropy__dimension_7__tau_1 |
0 | GSK | 2009-02-03 | 0.0 | 0.0 | 0.0 | 0.0 | 431.952223 | 8494.084498 | 0.359854 | -0.015914 | ... | 0.286836 | 0.286836 | 0.823959 | 1.424130 | 2.369382 | 1.692281 | 2.260234 | 2.707270 | 2.833213 | 2.772589 |
1 | GSK | 2009-02-04 | 0.0 | 0.0 | 0.0 | 0.0 | 431.763572 | 8486.601509 | 0.351641 | 0.006160 | ... | 0.286836 | 0.721464 | 0.983088 | 1.539654 | 2.369382 | 1.748067 | 2.378620 | 2.813355 | 2.833213 | 2.772589 |
2 rows × 817 columns
df_features.rename(columns={'level_0':'id', 'level_1':'Date'}, inplace=True)
df_features.head(2)
| id | Date | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | Adj Close__has_duplicate | Adj Close__sum_values | Adj Close__abs_energy | Adj Close__mean_abs_change | Adj Close__mean_change | ... | Volume__fourier_entropy__bins_2 | Volume__fourier_entropy__bins_3 | Volume__fourier_entropy__bins_5 | Volume__fourier_entropy__bins_10 | Volume__fourier_entropy__bins_100 | Volume__permutation_entropy__dimension_3__tau_1 | Volume__permutation_entropy__dimension_4__tau_1 | Volume__permutation_entropy__dimension_5__tau_1 | Volume__permutation_entropy__dimension_6__tau_1 | Volume__permutation_entropy__dimension_7__tau_1 |
0 | GSK | 2009-02-03 | 0.0 | 0.0 | 0.0 | 0.0 | 431.952223 | 8494.084498 | 0.359854 | -0.015914 | ... | 0.286836 | 0.286836 | 0.823959 | 1.424130 | 2.369382 | 1.692281 | 2.260234 | 2.707270 | 2.833213 | 2.772589 |
1 | GSK | 2009-02-04 | 0.0 | 0.0 | 0.0 | 0.0 | 431.763572 | 8486.601509 | 0.351641 | 0.006160 | ... | 0.286836 | 0.721464 | 0.983088 | 1.539654 | 2.369382 | 1.748067 | 2.378620 | 2.813355 | 2.833213 | 2.772589 |
2 rows × 817 columns
5. create y (the target variable)
Target variable: a binary variable indicating whether the maximum `Adj Close` in the next 30 business days is at least 5% higher than the current date's `Adj Close`.
Steps to engineer the target variable:
- calculate the maximum `Adj Close` in the next 30 rows (data sorted by `Date` in ascending order) and denote it as `max_adj_close_30`
- denote the current date's `Adj Close` as `current_close`
- calculate the % change of the maximum relative to the current date's value and denote it as `delta_pct`: `delta_pct = (max_adj_close_30 - current_close)/current_close*100`
- convert the % change to a binary value, denoted `target`: if `delta_pct >= 5`, then `target=1`, else `target=0`
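The notebook computes `max_adj_close_30` with `roll_time_series` and `extract_features` below. As a cross-check, here is a pandas-only sketch of the same definition (my own illustration; its alignment at the series edges may differ from the tsfresh-based version by a row):

def forward_max(s, window=30):
    # max over the next `window` rows, excluding the current row
    return s[::-1].rolling(window, min_periods=window).max()[::-1].shift(-1)

tmp = df.sort_values(['id', 'Date']).copy()
tmp['max_adj_close_30'] = tmp.groupby('id')['Adj Close'].transform(forward_max)
tmp['delta_pct'] = (tmp['max_adj_close_30'] - tmp['Adj Close']) / tmp['Adj Close'] * 100
tmp['target'] = (tmp['delta_pct'] >= 5).astype(int)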
df.head()
| | id | Date | Adj Close | Volume |
|---|---|---|---|---|
0 | GSK | 2009-01-02 | 19.927216 | 905400 |
1 | GSK | 2009-01-05 | 19.609205 | 1563200 |
2 | GSK | 2009-01-06 | 20.353031 | 2270600 |
3 | GSK | 2009-01-07 | 20.778858 | 1621000 |
4 | GSK | 2009-01-08 | 21.150766 | 1837100 |
roll_rows = 30
df_rolled2 = roll_time_series(df, column_id="id", column_sort="Date",
                              max_timeshift=roll_rows, min_timeshift=roll_rows,
                              rolling_direction=-1, disable_progressbar=True)
df_rolled2.head()
| | id | Date | Adj Close | Volume |
|---|---|---|---|---|
0 | (GSK, 2009-01-02) | 2009-01-02 | 19.927216 | 905400 |
1 | (GSK, 2009-01-02) | 2009-01-05 | 19.609205 | 1563200 |
2 | (GSK, 2009-01-02) | 2009-01-06 | 20.353031 | 2270600 |
3 | (GSK, 2009-01-02) | 2009-01-07 | 20.778858 | 1621000 |
4 | (GSK, 2009-01-02) | 2009-01-08 | 21.150766 | 1837100 |
df_features2 = extract_features(df_rolled2, column_id="id", column_sort="Date", column_value='Adj Close')
Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 20/20 [01:41<00:00, 5.09s/it]
| | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | Adj Close__has_duplicate | Adj Close__sum_values | Adj Close__abs_energy | Adj Close__mean_abs_change | Adj Close__mean_change | Adj Close__mean_second_derivative_central | Adj Close__median | ... | Adj Close__fourier_entropy__bins_2 | Adj Close__fourier_entropy__bins_3 | Adj Close__fourier_entropy__bins_5 | Adj Close__fourier_entropy__bins_10 | Adj Close__fourier_entropy__bins_100 | Adj Close__permutation_entropy__dimension_3__tau_1 | Adj Close__permutation_entropy__dimension_4__tau_1 | Adj Close__permutation_entropy__dimension_5__tau_1 | Adj Close__permutation_entropy__dimension_6__tau_1 | Adj Close__permutation_entropy__dimension_7__tau_1 |
GSK | 2009-01-02 | 0.0 | 0.0 | 0.0 | 1.0 | 607.464175 | 11919.455113 | 0.332559 | -0.055507 | -0.009214 | 19.609205 | ... | 0.376770 | 0.463414 | 0.822265 | 1.037392 | 2.014036 | 1.523876 | 2.224925 | 2.617929 | 2.991501 | 3.163424 |
2009-01-05 | 0.0 | 0.0 | 0.0 | 1.0 | 605.941029 | 11861.070968 | 0.326694 | -0.040171 | -0.010375 | 19.593031 | ... | 0.376770 | 0.376770 | 0.831403 | 1.127483 | 2.100679 | 1.523876 | 2.243613 | 2.637309 | 2.991501 | 3.163424 |
2009-01-06 | 0.0 | 0.0 | 0.0 | 0.0 | 604.539174 | 11808.057642 | 0.308457 | -0.071523 | -0.010734 | 19.474447 | ... | 0.376770 | 0.376770 | 0.831403 | 1.276720 | 2.133382 | 1.551250 | 2.262300 | 2.637309 | 2.938182 | 3.107972 |
2009-01-07 | 0.0 | 0.0 | 0.0 | 0.0 | 601.972734 | 11710.174590 | 0.308289 | -0.099742 | -0.013667 | 19.398584 | ... | 0.376770 | 0.463414 | 0.822265 | 1.160186 | 2.393312 | 1.513303 | 2.262300 | 2.585965 | 2.884863 | 3.052521 |
2009-01-08 | 0.0 | 0.0 | 0.0 | 0.0 | 598.630745 | 11582.458051 | 0.307549 | -0.123797 | -0.002870 | 19.305689 | ... | 0.233792 | 0.463414 | 0.702919 | 1.160186 | 2.166085 | 1.452462 | 2.142622 | 2.534620 | 2.831544 | 2.997069 |
5 rows × 779 columns
sel_cols = ['Adj Close__maximum']
df_target = df_features2[sel_cols].copy(deep=True)
df_target.head(10)
| id | Date | Adj Close__maximum |
|---|---|---|
| GSK | 2009-01-02 | 21.150766 |
| GSK | 2009-01-05 | 21.150766 |
| GSK | 2009-01-06 | 21.150766 |
| GSK | 2009-01-07 | 21.150766 |
| GSK | 2009-01-08 | 21.150766 |
| GSK | 2009-01-09 | 20.967508 |
| GSK | 2009-01-12 | 20.730349 |
| GSK | 2009-01-13 | 20.185947 |
| GSK | 2009-01-14 | 20.137426 |
| GSK | 2009-01-15 | 20.137426 |
df_target['max_adj_close_30'] = df_target['Adj Close__maximum'].shift(-1)
df_target.head(10)
| id | Date | Adj Close__maximum | max_adj_close_30 |
|---|---|---|---|
| GSK | 2009-01-02 | 21.150766 | 21.150766 |
| GSK | 2009-01-05 | 21.150766 | 21.150766 |
| GSK | 2009-01-06 | 21.150766 | 21.150766 |
| GSK | 2009-01-07 | 21.150766 | 21.150766 |
| GSK | 2009-01-08 | 21.150766 | 20.967508 |
| GSK | 2009-01-09 | 20.967508 | 20.730349 |
| GSK | 2009-01-12 | 20.730349 | 20.185947 |
| GSK | 2009-01-13 | 20.185947 | 20.137426 |
| GSK | 2009-01-14 | 20.137426 | 20.137426 |
| GSK | 2009-01-15 | 20.137426 | 20.137426 |
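One caveat (my note, not from the original): a plain `shift(-1)` runs over the whole frame, so the last GSK row silently picks up the first PFE window's maximum. Shifting within each id avoids that leak; a sketch:

# shift within each id (level 0 of the MultiIndex) instead of across the whole frame
df_target['max_adj_close_30'] = (
    df_target.groupby(level=0)['Adj Close__maximum'].shift(-1)
)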
# sanity check: compute the max of 'Adj Close' over 30-row windows directly from df
for i in range(10):
    print(df.iloc[i, 1], df.iloc[i:30+i]['Adj Close'].max())
2009-01-02 21.150766
2009-01-05 21.150766
2009-01-06 21.150766
2009-01-07 21.150766
2009-01-08 21.150766
2009-01-09 20.967508
2009-01-12 20.730349
2009-01-13 20.185947
2009-01-14 20.137426
2009-01-15 20.137426
df_target.reset_index(inplace=True)
df_target.rename(columns={'level_0':'id', 'level_1':'Date'}, inplace=True)
df_target.head(2)
| | id | Date | Adj Close__maximum | max_adj_close_30 |
|---|---|---|---|---|
0 | GSK | 2009-01-02 | 21.150766 | 21.150766 |
1 | GSK | 2009-01-05 | 21.150766 | 21.150766 |
print(df_target.shape, df.shape)
df_target = df_target.merge(df, on=['id', 'Date'], how='inner')
print(df_target.shape, df.shape)
df_target.head(2)
(5980, 4) (6040, 4)
(5980, 6) (6040, 4)
| | id | Date | Adj Close__maximum | max_adj_close_30 | Adj Close | Volume |
|---|---|---|---|---|---|---|
0 | GSK | 2009-01-02 | 21.150766 | 21.150766 | 19.927216 | 905400 |
1 | GSK | 2009-01-05 | 21.150766 | 21.150766 | 19.609205 | 1563200 |
df_target['delta_pct'] = (df_target['max_adj_close_30'] - df_target['Adj Close']) / df_target['Adj Close'] * 100
df_target.head(2)
| | id | Date | Adj Close__maximum | max_adj_close_30 | Adj Close | Volume | delta_pct |
|---|---|---|---|---|---|---|---|
0 | GSK | 2009-01-02 | 21.150766 | 21.150766 | 19.927216 | 905400 | 6.140095 |
1 | GSK | 2009-01-05 | 21.150766 | 21.150766 | 19.609205 | 1563200 | 7.861415 |
df_target['target'] = 0
df_target.loc[df_target['delta_pct']>=5, 'target'] = 1
df_target['target'].value_counts()
0 3506
1 2474
Name: target, dtype: int64
df_target.head()
| | id | Date | Adj Close__maximum | max_adj_close_30 | Adj Close | Volume | delta_pct | target |
|---|---|---|---|---|---|---|---|---|
0 | GSK | 2009-01-02 | 21.150766 | 21.150766 | 19.927216 | 905400 | 6.140095 | 1 |
1 | GSK | 2009-01-05 | 21.150766 | 21.150766 | 19.609205 | 1563200 | 7.861415 | 1 |
2 | GSK | 2009-01-06 | 21.150766 | 21.150766 | 20.353031 | 2270600 | 3.919490 | 0 |
3 | GSK | 2009-01-07 | 21.150766 | 21.150766 | 20.778858 | 1621000 | 1.789838 | 0 |
4 | GSK | 2009-01-08 | 21.150766 | 20.967508 | 21.150766 | 1837100 | -0.866437 | 0 |
6. select features with the `select_features` function
The full signature of `select_features` is shown below. With a binary target and real-valued features each feature is tested with the Mann-Whitney U test; with a real-valued target, the Kendall rank test. `fdr_level` caps the expected fraction of irrelevant features among those selected.
select_features(X, y, test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False, n_jobs=1, show_warnings=False, chunksize=None, ml_task='auto', multiclass=False, n_significant=1)
from tsfresh.feature_selection.selection import select_features
df_target.columns
Index(['id', 'Date', 'Adj Close__maximum', 'max_adj_close_30', 'Adj Close',
'Volume', 'delta_pct', 'target'],
dtype='object')
sel_cols = ['id', 'Date', 'target', 'max_adj_close_30', 'delta_pct', 'Adj Close','Volume']
#combine features with target variable
print(df_target.shape, df_features.shape)
df_final = df_target[sel_cols].merge(df_features, on = ['id', 'Date'], how='inner')
print(df_final.shape, df_target.shape, df_features.shape)
(5980, 8) (5998, 817)
(5938, 822) (5980, 8) (5998, 817)
The inner merge keeps only dates that have both a full 22-day history and a full 30-day future: 6040 − 2 × (21 + 30) = 5938 rows.
df_final.head()
| id | Date | target | max_adj_close_30 | delta_pct | Adj Close | Volume | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | ... | Volume__fourier_entropy__bins_2 | Volume__fourier_entropy__bins_3 | Volume__fourier_entropy__bins_5 | Volume__fourier_entropy__bins_10 | Volume__fourier_entropy__bins_100 | Volume__permutation_entropy__dimension_3__tau_1 | Volume__permutation_entropy__dimension_4__tau_1 | Volume__permutation_entropy__dimension_5__tau_1 | Volume__permutation_entropy__dimension_6__tau_1 | Volume__permutation_entropy__dimension_7__tau_1 |
0 | GSK | 2009-02-03 | 0 | 20.137426 | 2.778513 | 19.593031 | 1350600 | 0.0 | 0.0 | 0.0 | ... | 0.286836 | 0.286836 | 0.823959 | 1.424130 | 2.369382 | 1.692281 | 2.260234 | 2.707270 | 2.833213 | 2.772589 |
1 | GSK | 2009-02-04 | 0 | 20.137426 | 2.020719 | 19.738565 | 2432700 | 0.0 | 0.0 | 0.0 | ... | 0.286836 | 0.721464 | 0.983088 | 1.539654 | 2.369382 | 1.748067 | 2.378620 | 2.813355 | 2.833213 | 2.772589 |
2 | GSK | 2009-02-05 | 0 | 20.008068 | -0.642376 | 20.137426 | 3323300 | 0.0 | 0.0 | 0.0 | ... | 0.450561 | 0.721464 | 1.098612 | 1.539654 | 2.484907 | 1.735434 | 2.406160 | 2.813355 | 2.833213 | 2.772589 |
3 | GSK | 2009-02-06 | 0 | 19.937990 | -0.350249 | 20.008068 | 2266700 | 0.0 | 0.0 | 0.0 | ... | 0.450561 | 0.721464 | 1.118743 | 1.632631 | 2.484907 | 1.752424 | 2.479122 | 2.813355 | 2.833213 | 2.772589 |
4 | GSK | 2009-02-09 | 0 | 19.609205 | -1.649038 | 19.937990 | 1254900 | 0.0 | 0.0 | 0.0 | ... | 0.286836 | 0.721464 | 0.983088 | 1.473502 | 2.253858 | 1.752424 | 2.506662 | 2.813355 | 2.833213 | 2.772589 |
5 rows × 822 columns
df_final.isna().sum().sort_values(ascending=False).head(5)
max_adj_close_30 1
delta_pct 1
Volume__permutation_entropy__dimension_7__tau_1 0
Adj Close__fft_coefficient__attr_"real"__coeff_0 0
Adj Close__fft_coefficient__attr_"real"__coeff_10 0
dtype: int64
print(df_final.shape)
df_final.dropna(how='any', inplace=True)
print(df_final.shape)
df_final.isna().sum().sort_values(ascending=False).head(5)
(5938, 822)
(5937, 822)
Volume__permutation_entropy__dimension_7__tau_1 0
Adj Close__change_quantiles__f_agg_"var"__isabs_True__qh_1.0__ql_0.8 0
Adj Close__fft_coefficient__attr_"real"__coeff_9 0
Adj Close__fft_coefficient__attr_"real"__coeff_8 0
Adj Close__fft_coefficient__attr_"real"__coeff_7 0
dtype: int64
X = df_final.iloc[:, 5:].copy(deep=True)
y = df_final['target']
print(y[:3])
X.head(3)
0 0
1 0
2 0
Name: target, dtype: int64
| Adj Close | Volume | Adj Close__variance_larger_than_standard_deviation | Adj Close__has_duplicate_max | Adj Close__has_duplicate_min | Adj Close__has_duplicate | Adj Close__sum_values | Adj Close__abs_energy | Adj Close__mean_abs_change | Adj Close__mean_change | ... | Volume__fourier_entropy__bins_2 | Volume__fourier_entropy__bins_3 | Volume__fourier_entropy__bins_5 | Volume__fourier_entropy__bins_10 | Volume__fourier_entropy__bins_100 | Volume__permutation_entropy__dimension_3__tau_1 | Volume__permutation_entropy__dimension_4__tau_1 | Volume__permutation_entropy__dimension_5__tau_1 | Volume__permutation_entropy__dimension_6__tau_1 | Volume__permutation_entropy__dimension_7__tau_1 |
0 | 19.593031 | 1350600 | 0.0 | 0.0 | 0.0 | 0.0 | 431.952223 | 8494.084498 | 0.359854 | -0.015914 | ... | 0.286836 | 0.286836 | 0.823959 | 1.424130 | 2.369382 | 1.692281 | 2.260234 | 2.707270 | 2.833213 | 2.772589 |
1 | 19.738565 | 2432700 | 0.0 | 0.0 | 0.0 | 0.0 | 431.763572 | 8486.601509 | 0.351641 | 0.006160 | ... | 0.286836 | 0.721464 | 0.983088 | 1.539654 | 2.369382 | 1.748067 | 2.378620 | 2.813355 | 2.833213 | 2.772589 |
2 | 20.137426 | 3323300 | 0.0 | 0.0 | 0.0 | 0.0 | 432.291793 | 8507.596514 | 0.335214 | -0.010267 | ... | 0.450561 | 0.721464 | 1.098612 | 1.539654 | 2.484907 | 1.735434 | 2.406160 | 2.813355 | 2.833213 | 2.772589 |
3 rows × 817 columns
y = df_final['target']
print(X.shape)
for fdr_level in [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005, 0.000001, 0.0000005, 0.0000001]:
    df_filtered_binary = select_features(X, y, fdr_level=fdr_level, ml_task='classification')
    print(fdr_level, df_filtered_binary.shape)
(5937, 817)
0.05 (5937, 416)
0.01 (5937, 371)
0.005 (5937, 361)
0.001 (5937, 341)
0.0005 (5937, 331)
0.0001 (5937, 313)
5e-05 (5937, 311)
1e-05 (5937, 296)
5e-06 (5937, 292)
1e-06 (5937, 286)
5e-07 (5937, 282)
1e-07 (5937, 277)
y = df_final['delta_pct']
for fdr_level in [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005, 0.000001, 0.0000005, 0.0000001]:
    df_filtered_reg = select_features(X, y, fdr_level=fdr_level, ml_task='regression')
    print(fdr_level, df_filtered_reg.shape)
0.05 (5937, 399)
0.01 (5937, 372)
0.005 (5937, 361)
0.001 (5937, 339)
0.0005 (5937, 329)
0.0001 (5937, 317)
5e-05 (5937, 311)
1e-05 (5937, 300)
5e-06 (5937, 296)
1e-06 (5937, 284)
5e-07 (5937, 282)
1e-07 (5937, 275)
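To inspect the per-feature p-values behind these selections, tsfresh also exposes the underlying relevance table; a sketch (not part of the original notebook):

from tsfresh.feature_selection.relevance import calculate_relevance_table

rt = calculate_relevance_table(X, df_final['target'], ml_task='classification')
# the table has 'feature', 'type', 'p_value' and a boolean 'relevant' column
print(rt[rt['relevant']].sort_values('p_value')[['feature', 'p_value']].head())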
binary_cols = df_filtered_binary.columns.tolist()
reg_cols = df_filtered_reg.columns.tolist()
both_cols = list(set(binary_cols+reg_cols))
print(len(binary_cols), len(reg_cols), len(both_cols))
'Adj Close' in both_cols, 'Volume' in both_cols
y_cols = ['id', 'Date', 'target', 'delta_pct']
df_final[y_cols + both_cols].head(3)
| id | Date | target | delta_pct | Volume__cwt_coefficients__coeff_11__w_20__widths_(2, 5, 10, 20) | Volume__change_quantiles__f_agg_"mean"__isabs_True__qh_1.0__ql_0.8 | Adj Close__change_quantiles__f_agg_"mean"__isabs_False__qh_0.6__ql_0.2 | Volume__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20) | Adj Close__sum_values | Adj Close__cwt_coefficients__coeff_13__w_20__widths_(2, 5, 10, 20) | ... | Adj Close__cwt_coefficients__coeff_1__w_20__widths_(2, 5, 10, 20) | Adj Close__cwt_coefficients__coeff_2__w_10__widths_(2, 5, 10, 20) | Adj Close__minimum | Adj Close__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20) | Volume__fft_coefficient__attr_"abs"__coeff_0 | Volume__cwt_coefficients__coeff_1__w_20__widths_(2, 5, 10, 20) | Adj Close__cwt_coefficients__coeff_1__w_10__widths_(2, 5, 10, 20) | Volume__cwt_coefficients__coeff_5__w_5__widths_(2, 5, 10, 20) | Volume__mean | Adj Close__linear_trend__attr_"slope" |
0 | GSK | 2009-02-03 | 0 | 2.778513 | 6.827091e+06 | 320000.0 | -0.007189 | 3.020586e+06 | 431.952223 | 66.731853 | ... | 41.004967 | 44.522909 | 18.374868 | 33.597595 | 40175100.0 | 3.762778e+06 | 39.161548 | 3.142434e+06 | 1.826141e+06 | -0.084799 |
1 | GSK | 2009-02-04 | 0 | 2.020719 | 6.985642e+06 | 320000.0 | -0.007189 | 3.318212e+06 | 431.763572 | 66.577759 | ... | 40.883390 | 44.542326 | 18.374868 | 33.716539 | 41702400.0 | 4.073276e+06 | 39.240166 | 3.307973e+06 | 1.895564e+06 | -0.079756 |
2 | GSK | 2009-02-05 | 0 | -0.642376 | 7.174840e+06 | 320000.0 | -0.007189 | 3.409041e+06 | 432.291793 | 66.449007 | ... | 40.775041 | 44.568357 | 18.374868 | 33.882053 | 43462500.0 | 4.166172e+06 | 39.343889 | 3.367270e+06 | 1.975568e+06 | -0.073900 |
3 rows × 292 columns
file_name = 'PFE_GSK_final.csv'
# note: with compression='gzip' the file on disk is gzip-compressed
# even though its name ends in .csv
df_final[y_cols + both_cols].to_csv('data/' + file_name, sep='|', compression='gzip')
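To read the file back, match the separator and compression used above; a usage sketch:

df_check = pd.read_csv('data/' + file_name, sep='|', compression='gzip', index_col=0)
print(df_check.shape)  # expected (5937, 292)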