In the Imagenet project, researchers collect millions of photos that are assigned to specific categories. This means that deep-learning systems for image recognition can not only be trained but also compare their quality very well because of the same data. The code hoster Github now obviously wants to adapt this concept for the semantic search in source code and starts the project Code Calculations.
The announcement by Github states that source code search engines are often frustrating and never fully understand what they are asking for. And despite improvements in technology through the use of modern machine learning approaches, there has been a lack of a consistent set of data to evaluate the results. That’s exactly what codes should do.
Github provides the dataset for download in an Amazon S3 bucket. Overall, according to the provider, this involves around six million methods, of which two million have associated documentation. There are also metadata such as the location of the code. The code for the created model is of course also found Github. Further details are described in a scientific paper.
Via | Github