{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Date Extractor Tutorial\n", "\n", "The `date_extractor_mds` package offers helper functions for extracting individual components from datetime strings that are formatted according to the ISO 8601 standard. On this page, you'll find complete documentation for the functions in the package, including real-world examples employed by Renee, a data engineer working on a machine learning project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rennee's Journey\n", "\n", "### Extracting Time Features for a Time Series Model\n", "\n", "Rennee, a data engineer at a telecommunications company, is working with a large dataset of timestamps for customer usage patterns. To build a time series model that can predict customer behavior, she needs to extract specific time-related features from strings that contain both dates and times (a.k.a. \"datetime strings\"). These features include the year, month, day, and time (composed of hour, minute, and second). These individual components of a datetime string could be crucial on their own for identifying trends, seasonality, or daily usage patterns.\n", "\n", "Rennee starts by preparing her dataset, where each entry contains a datetime string in the ISO 8601 format (e.g., `2023-07-16T12:34:56`). She needs to break these datetime strings down into more manageable features to feed into her machine learning model.\n", "\n", "Here's how Rennee can use Python and the helper functions from the `date_extractor_mds` package in her time series analysis project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a Year\n", "\n", "The function `extract_year()` allows users to extract the year from datetime strings formatted according to ISO 8601. It takes one argument, which can be either a single string or a `pandas.Series` of strings. The function returns the year as either an integer or a `pandas.Series` of integers, depending on the input data type.\n", "\n", "#### Getting Years From a String\n", "Rennee verifies that she gets the output she expects by extracting the year from a single example string that she manually defines." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2023" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from date_extractor_mds.date_extractor_mds import extract_year\n", "\n", "my_datetime = \"2023-07-16T12:34:56\"\n", "extracted_year = extract_year(my_datetime)\n", "\n", "extracted_year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Renee gets the correct year out as expected. She next verifies that the output data type is also as expected:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(extracted_year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the type is `int`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting Years From a `pandas.Series`\n", "\n", "Next, Rennee tests `extract_year()` on a `pandas.Series` of datetime strings. In a data analytics context, the typical use case of this functionality would be to pass in the contents of a column from a `pandas.DataFrame`, which is itself stored as a `pandas.Series`.\n", "\n", "This means Rennee can subscript her DataFrame by column name, which returns a series, and pass this to `extract_year()`. She can then use the output to either modify an existing column in place or create a new column.\n", "\n", "First, she sets up a test DataFrame containing a `date` column with two datetime strings." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date
02023-07-16T12:34:56
12024-03-25T08:15:30
\n", "
" ], "text/plain": [ " date\n", "0 2023-07-16T12:34:56\n", "1 2024-03-25T08:15:30" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Set up the DataFrame\n", "data = {'date': [\"2023-07-16T12:34:56\", \"2024-03-25T08:15:30\"]}\n", "my_dataframe = pd.DataFrame(data)\n", "\n", "my_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, she can see the test DataFrame.\n", "\n", "Rennee decides to create a new column in the DataFrame called `year` and populate it with just the extracted years as integers." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateyear
02023-07-16T12:34:562023
12024-03-25T08:15:302024
\n", "
" ], "text/plain": [ " date year\n", "0 2023-07-16T12:34:56 2023\n", "1 2024-03-25T08:15:30 2024" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dataframe['year'] = extract_year(my_dataframe['date'])\n", "\n", "my_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now Renee can see the DataFrame has an additional column, `year`, which contains the correct year for each row. Finally, she verifies that the column contains the expected type (integers):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2 entries, 0 to 1\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 2 non-null object\n", " 1 year 2 non-null int64 \n", "dtypes: int64(1), object(1)\n", "memory usage: 160.0+ bytes\n" ] } ], "source": [ "my_dataframe.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the column `year` has the data type `int64`. Looks good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a Month\n", "\n", "Like `extract_year()`, `extract_month()` also accepts only a single argument, an ISO 8601 date string or a `pandas.Series` of such strings.\n", "\n", "#### Getting Months From a String\n", "\n", "Rennee again tests out to extracting from a string:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from date_extractor_mds.date_extractor_mds import extract_month\n", "\n", "my_datetime = \"2023-07-16T12:34:56\"\n", "extracted_month = extract_month(my_datetime)\n", "\n", "extracted_month\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good. The data type is also correct:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(extracted_month)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting Months From a `pandas.Series`\n", "\n", "Renee performs another test on her DataFrame, adding a `month` column this time:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateyearmonth
02023-07-16T12:34:5620237
12024-03-25T08:15:3020243
\n", "
" ], "text/plain": [ " date year month\n", "0 2023-07-16T12:34:56 2023 7\n", "1 2024-03-25T08:15:30 2024 3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dataframe['month'] = extract_month(my_dataframe['date'])\n", "\n", "my_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, she again confirms the new column is of the correct type:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2 entries, 0 to 1\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 2 non-null object\n", " 1 year 2 non-null int64 \n", " 2 month 2 non-null int64 \n", "dtypes: int64(2), object(1)\n", "memory usage: 176.0+ bytes\n" ] } ], "source": [ "my_dataframe.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a Day\n", "\n", "Like the last two functions Renee has tested, `extract_day()` returns the day as an integer.\n", "\n", "#### Getting Years From a String\n", "\n", "Once again, she first tests the `extract_day()` function on a single string:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from date_extractor_mds.date_extractor_mds import extract_day\n", "\n", "my_datetime = \"2023-07-16T12:34:56\"\n", "extracted_day = extract_day(my_datetime)\n", "\n", "extracted_day" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good, and the datatype is still as expected:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(extracted_day)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting Days From a `pandas.Series`\n", "\n", "Let's make sure `extract_day()` works properly on a `pandas.Series` like the previous functions did:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateyearmonthday
02023-07-16T12:34:562023716
12024-03-25T08:15:302024325
\n", "
" ], "text/plain": [ " date year month day\n", "0 2023-07-16T12:34:56 2023 7 16\n", "1 2024-03-25T08:15:30 2024 3 25" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dataframe['day'] = extract_day(my_dataframe['date'])\n", "\n", "my_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! And she notes the new column is also `int64`, as expected:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2 entries, 0 to 1\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 2 non-null object\n", " 1 year 2 non-null int64 \n", " 2 month 2 non-null int64 \n", " 3 day 2 non-null int64 \n", "dtypes: int64(3), object(1)\n", "memory usage: 192.0+ bytes\n" ] } ], "source": [ "my_dataframe.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a Time\n", "\n", "Like the other extract functions Renee has tested, `extract_time` also accepts only a single argument, an ISO 8601 date string or a `pandas.Series` of such strings. However, now the `extract_time` function will return either a datetime.time object, or a series of datetime.time object. This change is because datetime.time objects are easier to work with than time strings, which might require additional processing for the user.\n", "\n", "#### Getting a Time From a String\n", "\n", "Rennee tests out to extracting time from a string:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.time(12, 34, 56)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from date_extractor_mds.date_extractor_mds import extract_time\n", "\n", "my_datetime = \"2023-07-16T12:34:56\"\n", "extracted_time = extract_time(my_datetime)\n", "\n", "extracted_time" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hours: 12\n", "Minutes: 34\n", "Seconds: 56\n" ] } ], "source": [ "print(f\"Hours: {extracted_time.hour}\")\n", "print(f\"Minutes: {extracted_time.minute}\")\n", "print(f\"Seconds: {extracted_time.second}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works! Lets now check the type is as expected (datetime.time):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.time" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(extracted_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting Times From a `pandas.Series`\n", "\n", "Now, let's make sure `extract_time()` works properly on a `pandas.Series` like the previous functions did:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateyearmonthdaytime
02023-07-16T12:34:56202371612:34:56
12024-03-25T08:15:30202432508:15:30
\n", "
" ], "text/plain": [ " date year month day time\n", "0 2023-07-16T12:34:56 2023 7 16 12:34:56\n", "1 2024-03-25T08:15:30 2024 3 25 08:15:30" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_dataframe['time'] = extract_time(my_dataframe['date'])\n", "\n", "my_dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works! And she notes the new column is also an object column, as expected:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2 entries, 0 to 1\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date 2 non-null object\n", " 1 year 2 non-null int64 \n", " 2 month 2 non-null int64 \n", " 3 day 2 non-null int64 \n", " 4 time 2 non-null object\n", "dtypes: int64(3), object(2)\n", "memory usage: 208.0+ bytes\n" ] } ], "source": [ "my_dataframe.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Remarks\n", "\n", "We hoped you had fun walking through these examples, and seeing how Renee used these functions. We have more documentation inside the function definitions if you would like to read more. Also, if you see any concerns please raise an issue in our GitHub repository, we'll be able to respond to you there!" ] } ], "metadata": { "kernelspec": { "display_name": "524", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 4 }