Date Extractor Tutorial
The date_extractor_mds package offers helper functions for extracting individual components from datetime strings that are formatted according to the ISO 8601 standard. On this page, you’ll find complete documentation for the functions in the package, including real-world examples employed by Renee, a data engineer working on a machine learning project.
Renee’s Journey
Extracting Time Features for a Time Series Model
Renee, a data engineer at a telecommunications company, is working with a large dataset of timestamps for customer usage patterns. To build a time series model that can predict customer behavior, she needs to extract specific time-related features from strings that contain both dates and times (a.k.a. “datetime strings”). These features include the year, month, day, and time (composed of hour, minute, and second). Each of these components can help on its own to identify trends, seasonality, or daily usage patterns.
Renee starts by preparing her dataset, where each entry contains a datetime string in the ISO 8601 format (e.g., 2023-07-16T12:34:56). She needs to break these datetime strings down into more manageable features to feed into her machine learning model.
Here’s how Renee can use Python and the helper functions from the date_extractor_mds package in her time series analysis project.
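To make the target features concrete, here is a minimal sketch that uses only Python's standard library (datetime.fromisoformat) to split one ISO 8601 string into the same components. It illustrates the idea and is not the package's implementation; split_iso_datetime is a hypothetical helper name.

```python
from datetime import datetime


def split_iso_datetime(value: str) -> dict:
    """Split an ISO 8601 datetime string into its components.

    Hypothetical helper for illustration; not part of date_extractor_mds.
    """
    parsed = datetime.fromisoformat(value)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "time": parsed.time(),  # a datetime.time object
    }


features = split_iso_datetime("2023-07-16T12:34:56")
print(features)
```

Each value in the resulting dictionary corresponds to one of the features the package functions below extract.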
Extracting a Year
The function extract_year() allows users to extract the year from datetime strings formatted according to ISO 8601. It takes one argument, which can be either a single string or a pandas.Series of strings. The function returns the year as either an integer or a pandas.Series of integers, depending on the input data type.
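One way a function could accept either a string or a pandas.Series is to branch on the input type. The sketch below shows that pattern under the assumption that pandas does the parsing; extract_year_sketch is a hypothetical stand-in, and the real extract_year() may be implemented differently.

```python
import pandas as pd


def extract_year_sketch(value):
    """Return the year from an ISO 8601 string or a Series of such strings.

    Hypothetical stand-in for illustration only.
    """
    if isinstance(value, pd.Series):
        # Vectorised path: parse the whole column, read the year, and cast
        # to int64 so the dtype is the same across pandas versions.
        return pd.to_datetime(value).dt.year.astype("int64")
    if isinstance(value, str):
        # Scalar path: a single string yields a plain Python int.
        return pd.Timestamp(value).year
    raise TypeError("expected a str or a pandas.Series of str")


single_year = extract_year_sketch("2023-07-16T12:34:56")
series_years = extract_year_sketch(
    pd.Series(["2023-07-16T12:34:56", "2024-03-25T08:15:30"])
)
print(single_year)
print(series_years.tolist())
```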
Getting Years From a String
Renee verifies that she gets the output she expects by extracting the year from a single example string that she manually defines.
from date_extractor_mds.date_extractor_mds import extract_year
my_datetime = "2023-07-16T12:34:56"
extracted_year = extract_year(my_datetime)
extracted_year
2023
Renee gets the correct year out as expected. She next verifies that the output data type is also as expected:
type(extracted_year)
int
As expected, the type is int.
Getting Years From a pandas.Series
Next, Renee tests extract_year() on a pandas.Series of datetime strings. In a data analytics context, the typical use case of this functionality would be to pass in a column from a pandas.DataFrame, since each column is itself stored as a pandas.Series.
This means Renee can subscript her DataFrame by column name, which returns a Series, and pass this to extract_year(). She can then use the output to either modify an existing column in place or create a new column.
First, she sets up a test DataFrame containing a date column with two datetime strings.
import pandas as pd
# Set up the DataFrame
data = {'date': ["2023-07-16T12:34:56", "2024-03-25T08:15:30"]}
my_dataframe = pd.DataFrame(data)
my_dataframe
| | date |
|---|---|
| 0 | 2023-07-16T12:34:56 |
| 1 | 2024-03-25T08:15:30 |
Above, she can see the test DataFrame.
Renee decides to create a new column in the DataFrame called year and populate it with just the extracted years as integers.
my_dataframe['year'] = extract_year(my_dataframe['date'])
my_dataframe
| | date | year |
|---|---|---|
| 0 | 2023-07-16T12:34:56 | 2023 |
| 1 | 2024-03-25T08:15:30 | 2024 |
Now Renee can see the DataFrame has an additional column, year, which contains the correct year for each row. Finally, she verifies that the column contains the expected type (integers):
my_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2 non-null object
1 year 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes
As expected, the column year has the data type int64. Looks good!
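Reading the info() output works for a quick look, but the same check can also be written as an assertion. The sketch below uses pandas.api.types.is_integer_dtype on a stand-in DataFrame whose year values are typed by hand, so it runs without date_extractor_mds installed.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Stand-in frame with the same shape as the tutorial's; the year values are
# typed by hand here instead of coming from extract_year().
frame = pd.DataFrame(
    {
        "date": ["2023-07-16T12:34:56", "2024-03-25T08:15:30"],
        "year": [2023, 2024],
    }
)

year_is_integer = is_integer_dtype(frame["year"])
date_is_integer = is_integer_dtype(frame["date"])
print(year_is_integer, date_is_integer)
```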
Extracting a Month
Like extract_year(), extract_month() accepts a single argument: an ISO 8601 datetime string or a pandas.Series of such strings. It returns the month as an integer or a pandas.Series of integers.
Getting Months From a String
Renee again starts by extracting from a single string:
from date_extractor_mds.date_extractor_mds import extract_month
my_datetime = "2023-07-16T12:34:56"
extracted_month = extract_month(my_datetime)
extracted_month
7
Looks good. The data type is also correct:
type(extracted_month)
int
Getting Months From a pandas.Series
Renee performs another test on her DataFrame, adding a month column this time:
my_dataframe['month'] = extract_month(my_dataframe['date'])
my_dataframe
| | date | year | month |
|---|---|---|---|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 |
Finally, she again confirms the new column is of the correct type:
my_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2 non-null object
1 year 2 non-null int64
2 month 2 non-null int64
dtypes: int64(2), object(1)
memory usage: 176.0+ bytes
Extracting a Day
Like the last two functions Renee has tested, extract_day() returns the day as an integer.
Getting Days From a String
Once again, she first tests the extract_day() function on a single string:
from date_extractor_mds.date_extractor_mds import extract_day
my_datetime = "2023-07-16T12:34:56"
extracted_day = extract_day(my_datetime)
extracted_day
16
Looks good, and the datatype is still as expected:
type(extracted_day)
int
Getting Days From a pandas.Series
Let’s make sure extract_day() works properly on a pandas.Series like the previous functions did:
my_dataframe['day'] = extract_day(my_dataframe['date'])
my_dataframe
| | date | year | month | day |
|---|---|---|---|---|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 | 16 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 | 25 |
Great! And she notes the new column is also int64, as expected:
my_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2 non-null object
1 year 2 non-null int64
2 month 2 non-null int64
3 day 2 non-null int64
dtypes: int64(3), object(1)
memory usage: 192.0+ bytes
Extracting a Time
Like the other extraction functions Renee has tested, extract_time() accepts a single argument: an ISO 8601 datetime string or a pandas.Series of such strings. Unlike them, it returns a datetime.time object, or a pandas.Series of datetime.time objects, rather than an integer. The function returns datetime.time objects because they are easier to work with than time strings, which would require additional parsing by the user.
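The following standard-library sketch illustrates why a datetime.time object can be more convenient than a time string. The values are typed by hand rather than produced by extract_time().

```python
from datetime import time

morning = time(8, 15, 30)
midday = time(12, 34, 56)

# time objects compare chronologically and expose numeric fields directly.
print(morning < midday)
seconds_since_midnight = midday.hour * 3600 + midday.minute * 60 + midday.second
print(seconds_since_midnight)

# A plain string has to be split and converted before any arithmetic.
hour_text, minute_text, second_text = "12:34:56".split(":")
print(int(hour_text))
```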
Getting a Time From a String
Renee tests extracting the time from a single string:
from date_extractor_mds.date_extractor_mds import extract_time
my_datetime = "2023-07-16T12:34:56"
extracted_time = extract_time(my_datetime)
extracted_time
datetime.time(12, 34, 56)
print(f"Hours: {extracted_time.hour}")
print(f"Minutes: {extracted_time.minute}")
print(f"Seconds: {extracted_time.second}")
Hours: 12
Minutes: 34
Seconds: 56
It works! Let’s now check that the type is as expected (datetime.time):
type(extracted_time)
datetime.time
Getting Times From a pandas.Series
Now, let’s make sure extract_time() works properly on a pandas.Series like the previous functions did:
my_dataframe['time'] = extract_time(my_dataframe['date'])
my_dataframe
| | date | year | month | day | time |
|---|---|---|---|---|---|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 | 16 | 12:34:56 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 | 25 | 08:15:30 |
It works! And she notes the new column is also an object column, as expected:
my_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2 non-null object
1 year 2 non-null int64
2 month 2 non-null int64
3 day 2 non-null int64
4 time 2 non-null object
dtypes: int64(3), object(2)
memory usage: 208.0+ bytes
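Because the time column has the object dtype, a model that expects numeric inputs would likely need one more step. As a sketch of one possible follow-up (not something the package does for you), the hour can be pulled out of each datetime.time value with Series.apply. The Series below is built by hand so the example is self-contained.

```python
from datetime import time

import pandas as pd

# Hand-built stand-in for the 'time' column shown above.
times = pd.Series([time(12, 34, 56), time(8, 15, 30)], name="time")

# Read the .hour attribute of every datetime.time object in the column.
hours = times.apply(lambda t: t.hour)
print(hours.tolist())
```

The same pattern works for the minute and second attributes.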
Final Remarks
We hope you had fun walking through these examples and seeing how Renee used these functions. There is more documentation inside the function definitions if you would like to read further. If you have any concerns, please raise an issue in our GitHub repository, and we’ll respond to you there!