I am interested in accessing the data of igdb.com for a hobby project. They have an api that is “free”, ie it requires a twitch account with 2fa enabled which means i would need to hand over my phone number to twitch. That is a hard no from my side.

Anyone know of a way to get a dump of their database or a similar source where i could get api access without revealing personal information? For my project i am after only title, release year, some sort of rating and some sort of playtime.

  • BakedCatboy@lemmy.ml
    link
    fedilink
    English
    arrow-up
    0
    ·
    edit-2
    28 days ago

    Is playtime called something else in igdb? I’m not seeing that one in the docs.

    Edit: I guess igdb doesn’t have playtime, but if all you need from igdb is the /games endpoint, here’s a scrape of 363020 entries (it seems there may be duplicate IDs as loading them into a dict yields 362969 entries)

    Edit2: I just realized that igdb has /game_time_to_beats so I guess you’d want that as well

    Updated dump of the main games endpoint and the time to beat endpoint (contains only 8604 entries in comparison):

    https://mega.nz/file/xdJnEJLD#PlblwLr22Yfea4GUERBLwTsuunbwE3pGsq41OBqIaDg

      • BakedCatboy@lemmy.ml
        link
        fedilink
        English
        arrow-up
        0
        ·
        edit-2
        28 days ago

        You’re welcome! It was a fun hyperfixation project. I ended up making the script so easy to use I decided to just scrape every other endpoint too, so if anyone wants it, here’s a full dump of every endpoint, it’s only like 4x bigger:

        https://mega.nz/file/YF4F3bCS#pkS8Ki9QuucMGJF65YwGUE-NQZ78QEWs73fmF71qa18

        And if anyone wants to do their own scraping to get more up to date data later, just pip install:

        python-dotenv==1.2.2
        Requests==2.34.2
        tqdm==4.67.1
        

        Put API keys in .env or export env vars:

        CLIENT_ID=<client_id>
        # Provide to fetch new token
        CLIENT_SECRET=<client_secret>
        # Optional, provide to reuse existing access token, secret will not be used
        ACCESS_TOKEN=<access_token>
        

        And just run python dump.py games or any other endpoint in the api docs like release_dates etc. It outputs the json and a simple log to an output folder wherever you ran it. No error handling or checkpointing so if it fails partway through you don’t get anything, but I didn’t have a single error the whole time.

        usage: dump.py [-h] api_route
        
        IGDB Dump Script
        
        positional arguments:
          api_route   The API route to scrape, eg: games or game_time_to_beats
        
        options:
          -h, --help  show this help message and exit
        

        dump.py:

        import argparse
        import json
        import logging
        import os
        import pathlib
        import time
        from dotenv import dotenv_values
        import requests
        from tqdm import tqdm
        
        API_PAGE_SIZE = 500
        OUT_DIR = "output"
        
        config = {
            **dotenv_values(".env"),
            **os.environ,
        }
        
        # Set up flags / args
        parser = argparse.ArgumentParser(
            prog="dump.py", description="IGDB Dump Script"
        )
        parser.add_argument(
            "api_route",
            help="The API route to scrape, eg: games or game_time_to_beats",
        )
        args = parser.parse_args()
        
        # Create out dir
        pathlib.Path.mkdir(OUT_DIR, parents=False, exist_ok=True)
        
        # Set up logging to the route's file
        tqdmHandler = logging.StreamHandler(tqdm)
        tqdmHandler.terminator = ""
        logging.basicConfig(
            level=logging.INFO,
            format="%(message)s",
            handlers=[
                logging.FileHandler(f"{OUT_DIR}/{args.api_route}.log"),
                tqdmHandler
            ],
        )
        
        # Check for existing json to prevent overwriting existing dumps
        outFile = f"{OUT_DIR}/{args.api_route}.json"
        if pathlib.Path(outFile).exists():
            print(f"Existing json found {outFile}, please move or remove it before proceeding")
            exit(1)
        
        if config['CLIENT_ID'] and config['ACCESS_TOKEN']:
            logging.info("Using CLIENT_ID and existing ACCESS_TOKEN")
        elif config['CLIENT_ID'] and config['CLIENT_SECRET'] and not config['ACCESS_TOKEN']:
            logging.info("Fetching new access token...")
            response = requests.post(
                url="https://id.twitch.tv/oauth2/token",
                params={
                    "client_id": config['CLIENT_ID'],
                    "client_secret": config['CLIENT_SECRET'],
                    "grant_type": "client_credentials"
                },
                timeout=30
            )
            config['ACCESS_TOKEN'] = response.json()['access_token']
        else:
            logging.info("Missing CLIENT_ID and CLIENT_SECRET or ACCESS_TOKEN")
            exit(1)
        
        # Re-check access token in case fetch failed
        if config['CLIENT_ID'] and config['ACCESS_TOKEN']:
            items = []
            offset = 0
            logging.info(f"Fetching batches of {API_PAGE_SIZE} on endpoint {args.api_route}")
            with tqdm() as pbar:
                while True:
                    response = requests.post(
                        url=f"https://api.igdb.com/v4/%7Bargs.api_route%7D",
                        headers={
                            "Client-ID": config['CLIENT_ID'],
                            "Authorization": f"Bearer {config['ACCESS_TOKEN']}"
                        },
                        data=f"fields *; limit {API_PAGE_SIZE}; offset {offset};",
                        timeout=30
                    )
                    newItems = response.json()
                    fetchCount = len(newItems)
                    pbar.update(fetchCount)
                    if fetchCount != API_PAGE_SIZE:
                        logging.info(f"WARN: Requested {API_PAGE_SIZE}, got {fetchCount}")
                    offset += API_PAGE_SIZE
                    items.extend(newItems)
                    if fetchCount < API_PAGE_SIZE:
                        logging.info("Received partial page, ending")
                        break
                    time.sleep(1)
        
            logging.info(f"Total fetched: {len(items)}")
            with open(outFile, "w", encoding="utf-8") as file:
                logging.info("Writing to json...")
                json.dump(items, file, ensure_ascii=False, indent=2)
        
            # Print some stats
            logging.info(f"\nChecking json output: {args.api_route}.json")
        
            entries = []
            with open(outFile, "r", encoding="utf-8") as file:
                entries = json.load(file)
        
            logging.info(f"{len(entries)} entries in json")
        
            entryDict = {}
            for entry in entries:
                entryDict.update({entry['id']: entry})
        
            logging.info(f"{len(entryDict)} unique IDs in json")
        else:
            logging.error("Client ID or Access Token not available")
            exit(1)