Week 15 Update – HF Datasets Docs Contribution
This week, I updated my pull request #7532 to improve the Hugging Face Datasets documentation related to cache management.
What I worked on
- Clarified the role of the HF_DATASETS_CACHE environment variable, which was previously undocumented
- Explained the difference between:
  - HF_DATASETS_CACHE: for the datasets cache (e.g., Arrow files)
  - HF_HUB_CACHE: for model/tokenizer/raw dataset downloads from the Hub
  - HF_HOME: a parent variable that controls both subdirectories
- Provided real-world examples and configurations for environments with shared file systems like NFS
- Integrated feedback from #7480 to ensure the docs reflect actual behavior and user expectations
- Added an inline section directly under the existing HF_HOME docs to preserve structure and clarity
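
To make the relationship between the three variables concrete, here is a minimal sketch of how the cache directories resolve. It is not the library's own code: resolve_cache_dirs is a hypothetical helper, the paths are made up, and the defaults (~/.cache/huggingface, with hub and datasets subdirectories) are my understanding of the documented behavior.

```python
from pathlib import Path


def resolve_cache_dirs(env):
    """Hypothetical helper (not part of any library): mimic how the
    Hugging Face cache directories are resolved from environment variables."""
    # HF_HOME is the parent directory; assumed default is ~/.cache/huggingface.
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    # Each cache falls back to a subdirectory of HF_HOME unless set explicitly.
    hub_cache = Path(env.get("HF_HUB_CACHE", hf_home / "hub")).expanduser()
    datasets_cache = Path(env.get("HF_DATASETS_CACHE", hf_home / "datasets")).expanduser()
    return {"hub": hub_cache, "datasets": datasets_cache}


# Example paths are made up. Only HF_DATASETS_CACHE is overridden here,
# so the Hub cache still lives under HF_HOME.
dirs = resolve_cache_dirs({
    "HF_HOME": "/home/me/hf",
    "HF_DATASETS_CACHE": "/mnt/nfs/hf-datasets",
})
print("datasets cache:", dirs["datasets"])
print("hub cache:", dirs["hub"])
```

In this sketch, overriding HF_DATASETS_CACHE relocates only the processed Arrow files. To move Hub downloads as well, set HF_HUB_CACHE too, or set HF_HOME alone and let both subdirectories follow it.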
Why it matters
Many users (myself included) were confused when setting HF_DATASETS_CACHE didn't affect the Hub cache location. This distinction is important in large-scale or multi-user environments where caching strategy can significantly impact performance, I/O, and disk usage.
Written before or on May 4, 2025