Research on CSV

Hello Github,

introduction
I’m doing research to support Python pandas, in relation to the ticket “Too much faf getting csv reader to work”.
Research by an Austrian team [1] found that approximately 50% of the CSV files in their 415 GB sample could be parsed according to RFC 4180.
Evidence from data analysis on non-public data suggests that (1) CSV is still a common format and (2) it is a major source of wasted time for engineers who need to analyse the data.

github api
The GitHub search query https://github.com/search?l=CSV&q=.csv&type=Code shows approximately 270k CSV files on GitHub, which would make a suitable large-scale sample for research and testing.
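As a rough sketch, the same count could presumably be reproduced through the GitHub code search API. The endpoint and token handling below are standard, but the exact query qualifiers, authentication requirements and rate limits for code search are assumptions and would need checking against the current API documentation.

```python
# Minimal sketch: count .csv files via the GitHub code search API.
# Assumes a personal access token in the GITHUB_TOKEN environment variable.
# Code search is heavily rate-limited, so a real crawler would need paging,
# back-off and caching on top of this.
import os
import requests

def count_csv_files(token: str) -> int:
    """Return GitHub's total_count for code files matching extension:csv."""
    response = requests.get(
        "https://api.github.com/search/code",
        params={"q": "extension:csv", "per_page": 1},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {token}",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["total_count"]

if __name__ == "__main__":
    print(count_csv_files(os.environ["GITHUB_TOKEN"]))
```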

I would like to test the proposed csv-analyzer on this set of files, but I am reluctant to read the license in each repository by hand and then collect the permitted files through the API for testing (assuming that amount of traffic is even allowed by the GitHub API?).
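One way to keep the collection on the right side of licensing might be to query each repository’s detected license through the API before fetching anything. A minimal sketch, assuming a hypothetical allow-list of permissive license keys (not legal advice):

```python
# Sketch: check a repository's detected license before downloading its CSVs.
# GET /repos/{owner}/{repo}/license returns GitHub's best-effort license
# detection; the PERMISSIVE set below is a hypothetical example allow-list.
import requests

PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "cc0-1.0", "unlicense"}

def is_permissively_licensed(owner: str, repo: str, token: str) -> bool:
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/license",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {token}",
        },
        timeout=30,
    )
    if response.status_code == 404:   # no license detected -> skip the repo
        return False
    response.raise_for_status()
    key = (response.json().get("license") or {}).get("key", "")
    return key in PERMISSIVE
```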

As Raymond Hettinger would say: “There must be a better way!”

  • Do you have the ability to extract the data?
  • Can I? Am I allowed to?
  • Could I, for example, send the algorithm to the data instead of moving the data to the algorithm?
  • Given the size of the datasets in question (~2 TB excluding GitHub data), parallel processing would be helpful; see the sketch after this list.
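As a concrete illustration of “sending the algorithm to the data”, here is a minimal sketch that walks a local directory tree and runs a per-file check in parallel with multiprocessing. The `analyse_csv` function is only a stand-in for whatever the proposed csv-analyzer actually does (here it just sniffs the dialect and counts rows), and the `data` directory is a placeholder.

```python
# Sketch: run a lightweight per-file CSV check in parallel over a large tree.
# analyse_csv is a stand-in for the proposed csv-analyzer; any per-file
# analysis that returns a small summary would parallelise the same way.
import csv
import multiprocessing as mp
from pathlib import Path

def analyse_csv(path: Path) -> dict:
    try:
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            sample = f.read(64 * 1024)                  # sniff on a prefix only
            dialect = csv.Sniffer().sniff(sample)
            f.seek(0)
            rows = sum(1 for _ in csv.reader(f, dialect))
        return {"path": str(path), "delimiter": dialect.delimiter, "rows": rows}
    except csv.Error as exc:                            # e.g. dialect not detectable
        return {"path": str(path), "error": str(exc)}

if __name__ == "__main__":
    files = list(Path("data").rglob("*.csv"))           # "data" is a placeholder root
    with mp.Pool() as pool:                             # one worker per CPU core by default
        for result in pool.imap_unordered(analyse_csv, files, chunksize=16):
            print(result)
```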

Kind regards
Bjorn Madsen

[1] Mitlöhner, J., Neumaier, S., Umbrich, J. and Polleres, A., 2016, August. Characteristics of open data CSV files. In 2016 2nd International Conference on Open and Big Data (OBD) (pp. 72-79). IEEE.