Building a Dashboard for a Python Package
I made a simple dashboard for visualizing marshmallow PyPI downloads.
Why
While maintaining marshmallow, I found myself running one-off queries against PyPI's BigQuery dataset to answer some recurring questions:
- How many marshmallow users are using Python 2 vs Python 3?
- Can we drop support for a minor Python version without pissing off too many users?
- Are users migrating to marshmallow 3?
- How long should we support old marshmallow versions?
Executing and modifying BigQuery queries by hand quickly became tedious.
How
I run a daily scheduled query on BigQuery that subsets the PyPI dataset with the specific data that I need. This keeps my BigQuery usage well within the limits of the free tier.
The query starts with a CTE that selects the date, Python major and minor versions, marshmallow major and minor versions, and platform.
WITH
dls AS (
SELECT
DATE_SUB(DATE(@run_time), INTERVAL 1 DAY) AS date,
file.project AS package,
details.installer.name AS installer,
-- Full python version
details.python AS python_version,
-- Python major version
CAST(SPLIT(details.python, '.')[
OFFSET
(0)] AS string) AS python_major,
-- Python minor version
CAST(CONCAT(SPLIT(details.python, '.')[
OFFSET
(0)],'.',SPLIT(details.python, '.')[
OFFSET
(1)]) AS string) AS python_minor,
-- Full marshmallow version
file.version AS marshmallow_version,
-- marshmallow major version
CAST(SPLIT(file.version, '.')[
OFFSET
(0)] AS string) AS marshmallow_major,
-- Platform
details.system.name AS system
FROM
`the-psf.pypi.downloads*`
WHERE
-- Past day
_TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(DATE(@run_time), INTERVAL 1 DAY))
AND file.project = 'marshmallow'
-- Exclude mirrors
AND details.installer.name NOT IN ("bandersnatch",
"z3c.pypimirror",
"Artifactory",
"devpi")
AND details.python IS NOT NULL )
-- ...
Then the query groups the download counts by category:
- Python major version
- Python minor version
- marshmallow major version
- marshmallow minor version
- Python minor version x marshmallow major version
-- WITH dls AS ...
-- Python 2 vs 3
SELECT
date,
'python_major' AS category_label,
python_major AS category_value,
COUNT(*) AS downloads
FROM
dls
GROUP BY
date,
package,
category_value
UNION ALL
-- Python minor versions
SELECT
date,
'python_minor' AS category_label,
python_minor AS category_value,
COUNT(*) AS downloads
FROM
dls
GROUP BY
date,
package,
category_value
UNION ALL
-- marshmallow major version
SELECT
date,
'marshmallow_major' AS category_label,
marshmallow_major AS category_value,
COUNT(*) AS downloads
FROM
dls
WHERE
marshmallow_major IN ('2', '3')
GROUP BY
date,
package,
category_value
UNION ALL
-- marshmallow minor version
SELECT
date,
'marshmallow_version' AS category_label,
marshmallow_version AS category_value,
COUNT(*) AS downloads
FROM
dls
WHERE marshmallow_major IN ('2', '3')
GROUP BY
date,
package,
category_value
UNION ALL
-- Python minor version x marshmallow major version
SELECT
date,
'combined' AS category_label,
CAST(CONCAT('py', python_minor, '-', 'marshmallow', marshmallow_major) AS string) AS category_value,
COUNT(*) AS downloads
FROM
dls
WHERE
marshmallow_major IN ('2', '3')
GROUP BY
date,
package,
category_value
There is no way to directly control for downloads on CI servers. One solution is to exclude downloads on Linux and assume that downloads on Windows and macOS machines are a representative sample of local development installations.
To achieve this, the query includes duplicates of the above SELECT
s with an additional WHERE system != "Linux"
clause.
-- ... Same as above, excluding Linux downloads
SELECT
date,
'python_major' AS category_label,
CONCAT(python_major, '-', 'no_linux') AS category_value,
COUNT(*) AS downloads
FROM
dls
WHERE
system != "Linux"
GROUP BY
date,
package,
category_value
UNION ALL
-- and so on ...
The full query is here. The resulting dataset looks like this:
After backfilling the dataset, I wrote an app to query the dataset and visualize it with Dash, a Python framework for creating interactive visualizations. With pandas at hand to take care of slicing and transforming the dataset, writing the visualization code in Python turned out to be a breeze.
The source code for the Dash app is here.
Takeaways
The majority of marshmallow users are using Python 3.6 and Python 3.7. Python 3.5 usage is low (even lower than Python 2).
Users upgrade to new versions quickly. marshmallow users install the latest point versions of the supported release lines as they're released.
Don't expect users to use pre-releases. marshmallow 3 was "release-candidate stable" for several months, and we recommended using v3 for new projects. Even so, v3 remained at < 20% until the stable release.
Please send comments by email. I welcome your feedback, advice, and criticism.