A few details on the US EIA (and FERC) Electricity data

I notice on the DBnomics site that you’re pulling in US EIA data. Much of that is data Catalyst is integrating as well. It looks like you’re pulling directly from the API that they publish and getting the multitude of individual time series. Have you looked at pulling their bulk download data as well? Or at providing a similar all-in-one data product? Is that within the scope of your project?

Also, you might be interested to know that there’s a lot of valuable US EIA data which is not contained within the API. The most fine-grained electricity production and fuel consumption data, at the individual generator and boiler level, is not published in a cleanly structured, machine-readable form. For example, for the Comanche coal-fired power plant in Colorado, the EIA’s API gives you plant-level fuel consumption and net generation at monthly resolution. The EIA also has that data split out individually for Comanche units 1, 2, and 3, but only via spreadsheets. Similarly, the most detailed data about oil and gas production is excluded from the API but available via spreadsheets (which are, sadly, apparently the original, authoritative source of the data). Catalyst is integrating that more detailed data and other US energy-related data (like the FERC Form 1, available from FERC only as undocumented binary database files) for open publication, and linking the datasets from different agencies together to make them more useful.

Is that the kind of data you would want to republish? It might be more useful, but it would definitely be less official than what comes directly from the reporting agencies. If it is something you’d be interested in aggregating and republishing, what would be the easiest way to make that happen? If we were to publish it directly as data packages via a platform like https://datahub.io, is that something you might index? You seem focused on time series, but are you also publishing data related to the entities referenced in the time series (e.g. power plants, generating units, utilities that own them, etc.)? That data is also often updated on an annual or quarterly basis – would that qualify as a time series for your purposes?

This is very dated at this point, but I would sincerely like to know the same. Any updates?

Hi there! Sorry for the delay.

I think that those extra EIA datasets would fit in DBnomics. However, due to a lack of time, we have not been able to dive into them.

DBnomics fetchers automate data acquisition in order to minimize human intervention. The most time-consuming parts are reacting to URL changes and data format changes (e.g. column names in Excel files), processing unstructured data, and having to reconstitute the dimensions of a multi-dimensional dataset when the provider does not give them.
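
To make the column-name problem concrete, here is a minimal sketch of one way a fetcher could guard against renamed spreadsheet columns. It is not how DBnomics fetchers are actually written; the alias table and the `normalize_header` function are hypothetical names made up for illustration.

```python
# Hypothetical alias table: every spelling seen so far in a provider's
# spreadsheets, mapped to one canonical column name. Assumed, not from EIA.
ALIASES = {
    "plant id": "plant_id",
    "plant code": "plant_id",
    "unit id": "unit_id",
    "generator id": "unit_id",
    "net generation (mwh)": "net_generation_mwh",
    "netgen": "net_generation_mwh",
}


def normalize_header(header):
    """Map raw column names to canonical ones, failing loudly on unknowns."""
    canonical = []
    for raw in header:
        # Ignore case, underscores, and stray whitespace when matching.
        key = " ".join(raw.strip().lower().replace("_", " ").split())
        if key not in ALIASES:
            # An unknown column usually means the provider changed the format,
            # so stop here rather than silently producing wrong data.
            raise KeyError(f"Unrecognized column name: {raw!r}")
        canonical.append(ALIASES[key])
    return canonical
```

Failing on an unknown header, rather than skipping it, is a deliberate choice in this sketch: a human still has to look at the change, but only when one actually happens.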

If I understand correctly, the Catalyst project, via PUDL, transforms those spreadsheets into more structured data formats?
If so, that would speed up their integration into DBnomics.

I don’t know whether publishing to datahub.io would make the job easier. Downloading from the Datasette instance would be OK.

You seem focused on time series, but are you also publishing data related to the entities referenced in the time series (e.g. power plants, generating units, utilities that own them, etc.)? That data is also often updated on an annual or quarterly basis – would that qualify as a time series for your purposes?

That’s clearly the question to dive into. I can’t answer it myself, as I am a computer scientist, but an economist who works with energy data could.

Basically: what would the datasets be, and what dimensions would they have?
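
As a rough illustration of that question, here is a minimal sketch of how unit-level records could be grouped into time series keyed by their dimensions. The dimension names (`plant`, `unit`, `variable`), the `to_series` function, and all of the values are assumptions made up for this example; they are not taken from the EIA or PUDL data.

```python
from collections import defaultdict

# Hypothetical long-format records; the numbers are invented for illustration.
RECORDS = [
    {"plant": "Comanche", "unit": "1", "variable": "net_generation_mwh",
     "period": "2020-02", "value": 110.0},
    {"plant": "Comanche", "unit": "1", "variable": "net_generation_mwh",
     "period": "2020-01", "value": 100.0},
    {"plant": "Comanche", "unit": "2", "variable": "net_generation_mwh",
     "period": "2020-01", "value": 120.0},
    {"plant": "Comanche", "unit": "3", "variable": "net_generation_mwh",
     "period": "2020-01", "value": 300.0},
]

# Assumed dimensions: each distinct combination identifies one time series.
DIMENSIONS = ("plant", "unit", "variable")


def to_series(records, dimensions=DIMENSIONS):
    """Group records into one time series per combination of dimension values."""
    series = defaultdict(dict)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        series[key][rec["period"]] = rec["value"]
    # Return each series as a list of (period, value) pairs sorted by period.
    return {key: sorted(obs.items()) for key, obs in series.items()}
```

With these records, `to_series(RECORDS)` yields three series, one per generating unit. Entity attributes that change only annually or quarterly (such as the utility that owns a unit) could in principle be handled the same way, with a coarser period.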

Thanks again for your interest in DBnomics. I hope we’ll be able to sort this out!

Hey there,

We’re still publishing this FERC and EIA data, and much more. Our bulk quarterly releases are archived on Zenodo and are also available through the AWS Open Data Registry and as a Kaggle dataset. Data access instructions here.