Thursday, October 21, 2021

GitHub wants to improve code search with a competition

Apparently taking its cue from ImageNet, the code host GitHub wants to establish a deep learning competition for semantic code search with CodeSearchNet. The dataset contains around six million methods, some with documentation and metadata.

Kamal Saini
Kamal S. has been Journalist and Writer for Business, Hardware and Gadgets at Revyuh.com since 2018. He deals with B2b, Funding, Blockchain, Law, IT security, privacy, surveillance, digital self-defense and network policy. As part of his studies of political science, sociology and law, he researched the impact of technology on human coexistence. Email: kamal (at) revyuh (dot) com

In the ImageNet project, researchers collect millions of photos that are assigned to specific categories. Deep learning systems for image recognition can therefore not only be trained on this data but also compared with one another reliably, because they are all evaluated on the same data. The code host GitHub now evidently wants to adapt this concept to semantic search in source code and has launched the CodeSearchNet project.

GitHub's announcement states that source code search engines are often frustrating and never fully understand what users are asking for. Despite improvements from modern machine learning approaches, there has been no consistent dataset for evaluating the results. That is exactly the gap CodeSearchNet is meant to fill.

The dataset created by GitHub draws on the code of open source projects hosted on its platform and comprises functions, along with their documentation, in the languages Go, Java, JavaScript, PHP, Python and Ruby. To preprocess the code, GitHub relies on its own parser generator Tree-sitter and its function parser, which is used to generate ASTs and to extract any documentation and metadata about the individual functions.
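The preprocessing step described above, parsing source code into an AST and pulling out each function together with its documentation and location, can be illustrated with a minimal sketch. This is not GitHub's pipeline: it swaps Tree-sitter for Python's standard-library ast module, handles Python source only, and the record fields (name, docstring, code, path, line) are hypothetical names chosen for this example rather than the dataset's actual schema.

```python
import ast


def extract_functions(source: str, path: str = "<unknown>") -> list:
    """Return one record per function found in `source`.

    Illustration only: uses Python's built-in `ast` module, not Tree-sitter,
    and the record layout is an assumption made for this example.
    """
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "name": node.name,
                # None when the function has no docstring
                "docstring": ast.get_docstring(node),
                # Exact source text of the function definition
                "code": ast.get_source_segment(source, node),
                # Metadata: where the function was found
                "path": path,
                "line": node.lineno,
            })
    return records


SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def sub(a, b):
    return a - b
'''

records = extract_functions(SAMPLE, path="example.py")

# Keep only the functions that come with documentation.
documented = [r for r in records if r["docstring"]]

for r in documented:
    print(r["name"], r["path"], r["line"])
```

Keeping undocumented functions in the full list and filtering afterwards mirrors the split the article describes, where only part of the collected methods have associated documentation.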

GitHub provides the dataset for download from an Amazon S3 bucket. According to the company, it contains around six million methods in total, of which two million have associated documentation. It also includes metadata such as the location of the code. The code for the accompanying model is, of course, also available on GitHub. Further details are described in a scientific paper.

Via | Github
