pyspark.pandas.groupby.DataFrameGroupBy.describe
DataFrameGroupBy.describe() → pyspark.pandas.frame.DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Note
Unlike pandas, the percentiles in pandas-on-Spark are based upon approximate percentile computation because computing percentiles across a large dataset is extremely expensive.
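
To illustrate why the quartiles can differ from pandas, here is a minimal standard-library sketch. It is not Spark's algorithm: `rank_percentile` is a hypothetical, simplified stand-in that assumes the approximate method returns an observed value instead of interpolating, which is consistent with the 25%, 50% and 75% values shown in the example on this page. `linear_percentile` mimics linear interpolation, the pandas default.

```python
import math


def linear_percentile(values, q):
    """Exact percentile using linear interpolation between neighbours."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def rank_percentile(values, q):
    """Hypothetical stand-in for an approximate percentile.

    Returns the smallest observed value whose rank covers fraction q
    of the data, so the result is always an element of `values`.
    """
    data = sorted(values)
    rank = max(math.ceil(q * len(data)), 1)
    return data[rank - 1]


# Column 'b' for the group a == 1 in the example on this page.
group_b = [4, 5]
quantiles = (0.25, 0.5, 0.75)

exact = [linear_percentile(group_b, q) for q in quantiles]
approx = [rank_percentile(group_b, q) for q in quantiles]

print(exact)   # interpolated values
print(approx)  # observed values only
```

On a two-element group the interpolated median falls between the two observations, while the rank-based stand-in returns one of them. On large data the real approximate computation may also deviate from the exact rank by a bounded error, which this sketch does not model.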
Returns
    DataFrame
        Summary statistics of the DataFrame provided.
See also
DataFrame.count
DataFrame.max
DataFrame.min
DataFrame.mean
DataFrame.std
Examples
>>> df = ps.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  1  5  8
2  3  6  9
Describing a DataFrame. By default only numeric fields are returned.

>>> described = df.groupby('a').describe()
>>> described.sort_index()
      b                                             c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
a
1   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
3   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
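
The count, mean and std values for column b above can be checked by hand. The sketch below uses only the standard library and assumes std is the sample standard deviation (n - 1 denominator). That assumption is inferred from the 0.707107 shown for the group [4, 5] and from the NaN for the single-row group. The `groups` dict and `sample_std` helper are illustrative names, not part of the pandas-on-Spark API.

```python
import math
import statistics

# Column 'b' of the example DataFrame, grouped by 'a' (written out by hand).
groups = {1: [4, 5], 3: [6]}


def sample_std(values):
    """Sample standard deviation; undefined (NaN) for fewer than two values."""
    if len(values) < 2:
        return math.nan
    return statistics.stdev(values)


summary = {
    key: {
        "count": float(len(values)),
        "mean": float(statistics.mean(values)),
        "std": sample_std(values),
    }
    for key, values in groups.items()
}

for key, stats in sorted(summary.items()):
    print(key, stats)
```

With a single observation the n - 1 denominator is zero, so the standard deviation is undefined, which is why the row for a == 3 reports NaN.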