Pandas makes it easy to quickly load, manipulate, align, merge, and even visualize data tables directly in Python.

When it comes to working with data in tabular form, most people reach for a spreadsheet. That's not a bad choice: Microsoft Excel and similar programs are familiar and loaded with functionality for massaging tables of data. But what if you want more control, precision, and power than Excel alone delivers? In that case, the open source Pandas library for Python might be what you're looking for. It outfits Python with new data types for loading data fast from tabular sources, and for manipulating, aligning, merging, and otherwise processing that data at scale.

Your first Pandas data set

Pandas is not part of the Python standard library. It's a third-party project, so you'll need to install it into your Python runtime with pip install pandas. Once installed, you can import it into Python with import pandas.

Pandas gives you two new data types: Series and DataFrame. A DataFrame represents your entire spreadsheet or rectangular data, whereas a Series is a single column of a DataFrame. You can also think of a Pandas DataFrame as a dictionary or collection of Series objects. You'll find later that you can use dictionary- and list-like methods for finding elements in a DataFrame.

You typically work with Pandas by importing data in some other format. A common external tabular data format is CSV, a text file with values separated by commas. If you have a CSV handy, you can use it. For this article, we'll be using an excerpt from the Gapminder data set prepared by Jennifer Bryan from the University of British Columbia.

To begin using Pandas, we first import the library.
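Before loading real data, the relationship between the two types can be sketched directly. Here is a minimal example; the column names are invented for illustration (modeled on the Gapminder columns used later):

```python
import pandas as pd

# A DataFrame behaves like a dictionary of equal-length Series;
# each dictionary key becomes a column name
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "year": [1952, 1952, 1952],
})

col = df["country"]        # extracting one column yields a Series
print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
```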
Note that it's a common practice to alias the Pandas library as pd to save some typing:

```
import pandas as pd
```

To start working with the sample data, we can load it in as a dataframe using the pd.read_csv function (which handles any delimited text, not just commas):

```
df = pd.read_csv("./gapminder/inst/extdata/gapminder.tsv", sep='\t')
```

The sep parameter lets us specify that this particular file is tab-delimited rather than comma-delimited. Once the data's been loaded, you can peek at its formatting to make sure it's loaded correctly by using the .head() method on the dataframe:

```
print(df.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
```

Dataframe objects have a shape attribute that reports the number of rows and columns in the dataframe:

```
print(df.shape)

(1704, 6)  # rows, cols
```

To list the names of the columns themselves, use .columns:

```
print(df.columns)

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
```

Dataframes in Pandas work much the same way as dataframes in other languages, such as Julia and R. Each column, or Series, must be of a single type, whereas each row can contain mixed types. For instance, in the current example, the country column will always be a string, and the year column will always be an integer.
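Since pd.read_csv accepts any file-like object, not just a path, the tab-separated loading above can be tried even without the Gapminder file on disk. This self-contained sketch reads a tiny invented table from an in-memory string:

```python
import io
import pandas as pd

# A two-row, tab-separated stand-in for the Gapminder file
tsv = "country\tyear\nAfghanistan\t1952\nAlbania\t1952\n"
demo_df = pd.read_csv(io.StringIO(tsv), sep='\t')

print(demo_df.shape)               # (2, 2)
print(list(demo_df.columns))       # ['country', 'year']
print(str(demo_df["year"].dtype))  # int64: read_csv infers column types
```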
We can verify this by using .dtypes to list the data type of each column:

```
print(df.dtypes)

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
```

For an even more explicit breakdown of your dataframe's types, you can use .info():

```
df.info()  # information is written to the console, so no print() is required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
```

Each Pandas data type maps to a native Python data type:

- object is handled as a Python str type.
- int64 is handled as a Python int. Note that not all Python ints can be converted to int64 types; anything larger than (2 ** 63) - 1 will not convert to int64.
- float64 is handled as a Python float (which is natively a 64-bit float).
- datetime64 is handled as a Python datetime.datetime object. Note that Pandas does not automatically try to convert values that look like dates into date values; you must tell Pandas you want to do this for a specific column.

Pandas columns, rows, and cells

Now that you're able to load a simple data file, you want to be able to inspect its contents. You could print the contents of the dataframe, but most dataframes will be too big to inspect by printing. A better approach is to look at subsets of the data, as we did with df.head(), but with more control. Pandas lets you make excerpts from dataframes, using Python's existing syntax for indexing and creating slices.

Extracting Pandas columns

To examine columns in a Pandas dataframe, you can extract them by their names, positions, or ranges.
For instance, if you want a specific column from your data, you can request it by name using square brackets:

```
# extract the column "country" into its own Series
country_df = df["country"]

# show the first five rows
print(country_df.head())

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

# show the last five rows
print(country_df.tail())

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object
```

If you want to extract multiple columns, pass a list of the column names:

```
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]

print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972

print(subset.tail())

       country continent  year
1699  Zimbabwe    Africa  1987
1700  Zimbabwe    Africa  1992
1701  Zimbabwe    Africa  1997
1702  Zimbabwe    Africa  2002
1703  Zimbabwe    Africa  2007
```

Subsetting rows

If you want to extract rows from a dataframe, you can use one of two methods.

.iloc[] is the simplest method. It extracts rows based on their position, starting at 0. To fetch the first row in the above dataframe, you'd use df.iloc[0]. If you want to fetch a range of rows, you can use .iloc[] with Python's slicing syntax. For instance, for the first 10 rows, you'd use df.iloc[0:10]. And if you wanted all the rows in reverse order, you'd use df.iloc[::-1]. If you want to extract specific rows, you can use a list of the row positions; for example, df.iloc[[0,1,2,5,7,10,12]]. (Note the double brackets: they mean you're providing a list as the first argument.)

Another way to extract rows is with .loc[]. This extracts a subset based on labels for rows. By default, rows are labeled with an incrementing integer value starting with 0. But data can also be labeled manually by setting the dataframe's .index property.
For instance, if we wanted to re-index the above dataframe so that each row had an index using multiples of 100, we could use df.index = range(0, len(df)*100, 100). Then, if we used df.loc[100], we'd get the second row.

Subsetting columns

If you want to retrieve only a certain subset of columns along with your row slices, you do this by passing a list of columns as a second argument:

```
df.loc[[rows], [columns]]
```

For instance, with the above dataset, if we want to get only the country and year columns for all rows, we'd do this:

```
df.loc[:, ["country", "year"]]
```

The : in the first position means "all rows" (it's Python's slicing syntax). The list of columns follows after the comma.

You can also specify columns by position when using .iloc:

```
df.iloc[:, [0,2]]
```

Or, to get just the first three columns:

```
df.iloc[:, 0:3]
```

All of these approaches can be combined, as long as you remember that .loc is used for labels and column names, and .iloc is used for numeric indexes. The following tells Pandas to extract the rows labeled 0 through 100 (note that .loc slices include the endpoint), and then from that result to extract the first three columns by their indexes:

```
df.loc[0:100].iloc[:, 0:3]
```

It's generally least confusing to use actual column names when subsetting data. It makes the code easier to read, and you don't have to refer back to the dataset to figure out which column corresponds to what index. It also protects you from mistakes if columns are re-ordered.

Grouped and aggregated calculations

Spreadsheets and number-crunching libraries all come with methods for generating statistics about data.
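Pandas is no exception. Its .describe() method, for example, computes the count, mean, quartiles, and more for a numeric column in one call; here is a quick sketch on invented numbers:

```python
import pandas as pd

# Invented values standing in for a real numeric column
demo = pd.DataFrame({"lifeExp": [30.0, 40.0, 50.0, 60.0]})

# describe() returns count, mean, std, min, quartiles, and max
stats = demo["lifeExp"].describe()
print(stats["count"])  # 4.0
print(stats["mean"])   # 45.0
print(stats["max"])    # 60.0
```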
Consider the Gapminder data again:

```
print(df.head(n=10))

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
5  Afghanistan      Asia  1977   38.438  14880372  786.113360
6  Afghanistan      Asia  1982   39.854  12881816  978.011439
7  Afghanistan      Asia  1987   40.822  13867957  852.395945
8  Afghanistan      Asia  1992   41.674  16317921  649.341395
9  Afghanistan      Asia  1997   41.763  22227415  635.341351
```

Here are some examples of questions we could ask about this data:

- What's the average life expectancy for each year in this data?
- What if I want averages across the years and the continents?
- How do I count how many countries in this data are in each continent?

The way to answer these questions with Pandas is to perform a grouped or aggregated calculation. We can split the data along certain lines, apply some calculation to each split segment, and then re-combine the results into a new dataframe.

Grouped means counts

The first method we'd use for this is Pandas's df.groupby() operation. We provide a column we want to split the data by:

```
df.groupby("year")
```

This allows us to treat all rows with the same year value together, as a distinct object from the dataframe itself. From there, we can use the life-expectancy column and calculate its per-year mean:

```
print(df.groupby('year')['lifeExp'].mean())

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
```

This gives us the mean life expectancy for all populations, by year. We could perform the same kinds of calculations for population and GDP by year:

```
print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())
```

So far, so good.
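The split-apply-combine pattern behind these calls can be seen in miniature on a few invented rows:

```python
import pandas as pd

# Two observations for each of two years (values invented)
demo = pd.DataFrame({
    "year":    [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 32.0, 54.0],
})

# Split the rows by year, average lifeExp within each group,
# then recombine into a Series indexed by year
means = demo.groupby("year")["lifeExp"].mean()
print(means[1952])  # 40.0
print(means[1957])  # 43.0
```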
But what if we want to group our data by more than one column? We can do this by passing the columns in a list:

```
print(df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean())

                 lifeExp     gdpPercap
year continent
1952 Africa    39.135500   1252.572466
     Americas  53.279840   4079.062552
     Asia      46.314394   5195.484004
     Europe    64.408500   5661.057435
     Oceania   69.255000  10298.085650
1957 Africa    41.266346   1385.236062
     Americas  55.960280   4616.043733
     Asia      49.318544   5787.732940
     Europe    66.703067   6963.012816
     Oceania   70.295000  11598.522455
1962 Africa    43.319442   1598.078825
     Americas  58.398760   4901.541870
     Asia      51.563223   5729.369625
     Europe    68.539233   8365.486814
     Oceania   71.085000  12696.452430
```

This .groupby() operation takes our data and groups it first by year, and then by continent. It then generates mean values from the life-expectancy and GDP columns. This way, you can create groups in your data and control the order in which they are presented and calculated.

If you want to "flatten" the results into a single, incrementally indexed frame, you can use the .reset_index() method on the results:

```
gb = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index()
print(flat.head())

   year continent    lifeExp     gdpPercap
0  1952    Africa  39.135500   1252.572466
1  1952  Americas  53.279840   4079.062552
2  1952      Asia  46.314394   5195.484004
3  1952    Europe  64.408500   5661.057435
4  1952   Oceania  69.255000  10298.085650
```

Grouped frequency counts

Another thing we often do with data is compute frequencies. The nunique and value_counts methods can be used to count the unique values in a series and their frequencies, respectively. For instance, here's how to find out how many countries we have in each continent:

```
print(df.groupby('continent')['country'].nunique())

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
```

Basic plotting with Pandas and Matplotlib

Most of the time, when you want to visualize data, you'll use another library such as Matplotlib to generate those graphics.
However, you can also generate visualizations directly from within Pandas, which uses Matplotlib (along with some other plotting libraries) under the hood. To use the Matplotlib integration, first make sure you've installed Matplotlib with pip install matplotlib.

Now let's look at the yearly life expectancies for the world population again:

```
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
```

To create a basic plot from this, use:

```
import matplotlib.pyplot as plt

global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
c = global_yearly_life_expectancy.plot().get_figure()
plt.savefig("output.png")
```

The plot will be saved to a file in the current working directory as output.png. The axes and other labeling on the plot can all be set manually, but for quick exports this method works fine.

Conclusion

Python and Pandas offer many features you can't get from spreadsheets alone. For one, they let you automate your work with data and make the results reproducible. Rather than write spreadsheet macros, which are clunky and limited, you can use Pandas to analyze, segment, and transform data, and use Python's expressive power and package ecosystem (for instance, for graphing or rendering data to other formats) to do even more than you could with Pandas alone.