Week 15

Update – HF Datasets Docs Contribution

This week, I updated my pull request #7532 to improve the Hugging Face Datasets documentation related to cache management.

What I worked on

  • Clarified the role of the HF_DATASETS_CACHE environment variable, which was previously undocumented
  • Explained the difference between:

    • HF_DATASETS_CACHE: for datasets cache (e.g., Arrow files)
    • HF_HUB_CACHE: for model/tokenizer/raw dataset downloads from the Hub
    • HF_HOME: a parent variable that controls both subdirectories
  • Provided real-world examples and configurations for environments with shared file systems like NFS
  • Integrated feedback from #7480 to ensure the docs reflect actual behavior and user expectations
  • Added an inline section directly under the existing HF_HOME docs to preserve structure and clarity

Why it matters

Many users (myself included) were confused when setting HF_DATASETS_CACHE didn’t affect the Hub cache location. This distinction is important in large-scale or multi-user environments where caching strategy can significantly impact performance, I/O, and disk usage.

Written before or on May 4, 2025