Inspect a dataset

We have seen how to create and transform a Dataset instance. This section guides you through inspecting your data to ensure its integrity and the success of your transformations.

Inspect data in a Dataset

The df attribute of a Dataset instance is key to interacting with and inspecting the stored data.

By accessing the df attribute, you can apply any method that you would typically use on a PySpark DataFrame. See the PySpark documentation for more information.

View data samples

# Inspect the first 10 rows of the data
summary_stats.df.show(10)

This method displays the first 10 rows of your dataset, giving you a snapshot of your data's structure and content.

Filter data

import pyspark.sql.functions as f

# Filter summary statistics to only include associations in chromosome 22
filtered = summary_stats.filter(condition=f.col("chromosome") == "22")

This method filters the data on a condition, such as the value of a column. Applying a filter creates a new Dataset instance containing only the matching rows; the original Dataset is left unchanged.

Understand the schema

# Get the Spark schema of any `Dataset` as a `StructType` object
schema = summary_stats.get_schema()


Write a Dataset to disk

# Write the data to disk in parquet format
summary_stats.df.write.parquet("path/to/summary/stats")

# Write the data to disk in csv format
summary_stats.df.write.csv("path/to/summary/stats")

Consider the format's compatibility with your downstream tools and, for large datasets, a partitioning strategy that optimizes read performance.