Inspect a dataset
We have seen how to create and transform a Dataset instance. This section guides you through inspecting your data to verify its integrity and confirm that your transformations succeeded.
Inspect data in a Dataset
The df attribute of a Dataset instance is the main entry point for interacting with and inspecting the stored data. Through the df attribute you can apply any method that you would typically use on a PySpark DataFrame; see the PySpark documentation for more information.
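Because df is a standard PySpark DataFrame, every DataFrame method works on it directly. A minimal sketch, assuming summary_stats is a Dataset created earlier:

# Count the rows and list the column names of the underlying DataFrame
n_rows = summary_stats.df.count()
column_names = summary_stats.df.columns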
View data samples
# Inspect the first 10 rows of the data
summary_stats.df.show(10)
Calling show(10) displays the first 10 rows of your dataset, giving you a quick snapshot of its structure and content.
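If you would rather see a random snapshot than the first rows, you can sample the DataFrame first. A minimal sketch; the sampling fraction is an arbitrary choice:

# Show 10 rows drawn from a roughly 1% random sample of the data
summary_stats.df.sample(fraction=0.01).show(10)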
Filter data
import pyspark.sql.functions as f
# Filter summary statistics to only include associations in chromosome 22
filtered = summary_stats.filter(condition=f.col("chromosome") == "22")
This method lets you filter your data on arbitrary conditions, such as the value of a column. Applying a filter creates a new Dataset instance containing only the filtered data; the original Dataset is left unchanged.
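Conditions can also be combined using PySpark's boolean column operators, wrapping each side in parentheses. A sketch, assuming a position column exists in your schema (illustrative only; substitute any column you have):

import pyspark.sql.functions as f

# Keep chromosome 22 associations beyond a given position
filtered = summary_stats.filter(
    condition=(f.col("chromosome") == "22") & (f.col("position") > 16_000_000)
)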
Understand the schema
# Get the Spark schema of any `Dataset` as a `StructType` object
schema = summary_stats.get_schema()
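The returned StructType can be inspected programmatically, and the underlying DataFrame can print its schema as a tree. A minimal sketch:

# Print the schema of the underlying DataFrame as an indented tree
summary_stats.df.printSchema()

# Walk the fields of the StructType returned by get_schema()
for field in schema.fields:
    print(field.name, field.dataType)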
Write a Dataset to disk
# Write the data to disk in Parquet format
summary_stats.df.write.parquet("path/to/summary/stats")

# Write the data to disk in CSV format
summary_stats.df.write.csv("path/to/summary/stats")
When choosing a format, consider its compatibility with your downstream tools; for large datasets, also consider a partitioning strategy to optimize read and write performance.
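For example, partitioning the output by a column you frequently filter on, such as chromosome, lets downstream readers skip irrelevant files. A sketch; the partition column is an illustrative choice:

# Write Parquet output partitioned by chromosome for faster selective reads
summary_stats.df.write.partitionBy("chromosome").parquet("path/to/summary/stats")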