Rich data sources are becoming available for potential machine learning and AI applications. One example is the recently released linked open European patent data from the European Patent Office (EPO). This type of patent data is a good example of labelled training data: data that can be used to train an algorithm to carry out a particular task, so that it performs well even when presented with an example it has not seen before. Because the patent data is linked, it carries more information that can serve as potential "labels".
Suppose the task is to predict whether a patent application will receive a particular type of objection from the patent office. In this case the examples are patent applications and the labels are the types of objections raised against those applications by the patent office. An algorithm could be trained using examples of patent applications, the documents cited against those applications by the patent office, and the information about what types of objections were raised. All of this data is currently available in public databases.
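The training setup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the text does not name a learning algorithm, so a small multinomial Naive Bayes classifier is used here as one simple possibility. The class name, the objection labels and the four training sentences are all invented for illustration and are not real patent data. The sketch also uses only the application text, not the cited documents.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesObjectionClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper class, chosen only to illustrate supervised training.
    """

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)      # number of examples per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, example_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each token.
            score = math.log(example_count / total_examples)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples: each text stands in for a patent application and
# each label for the type of objection raised by the patent office.
TRAIN_TEXTS = [
    "a method of doing business implemented on a generic computer",
    "a scheme for pricing financial products using a computer",
    "a turbine blade with an internal cooling channel",
    "a gearbox housing with a reinforced mounting flange",
]
TRAIN_LABELS = [
    "excluded_subject_matter",
    "excluded_subject_matter",
    "no_objection",
    "no_objection",
]

classifier = NaiveBayesObjectionClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
prediction = classifier.predict("a computer implemented method for pricing insurance")
print(prediction)
```

A realistic system would need far more examples and richer features, for instance a comparison of each application against its cited documents, but the fit-then-predict shape would stay the same.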
Once the algorithm has been trained, it can be used to predict whether a new patent application will receive a certain type of objection from the patent office. To do so, the algorithm would need to search the public database for prior art documents itself. This could be done using a rule-based algorithm which extracts keywords from the patent specification and uses them to search for documents in the database. A tool of this type would give applicants a good idea of whether a given invention is likely to be patentable.
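As a rough illustration of the rule-based keyword step, the sketch below counts the most frequent non-trivial words in a specification and joins them into a search string. The stopword list, the minimum word length, the function names and the "AND" query syntax are all assumptions made for this example. A real patent database would have its own query language.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real tool would use a
# much larger list, including common patent drafting terms).
STOPWORDS = frozenset({
    "the", "and", "with", "that", "which", "from", "said",
    "wherein", "comprising", "along", "made",
})


def extract_keywords(specification, max_keywords=5):
    """Return the most frequent candidate keywords in the specification."""
    tokens = re.findall(r"[a-z]+", specification.lower())
    # Rule: keep words longer than three letters that are not stopwords.
    candidates = [t for t in tokens if len(t) > 3 and t not in STOPWORDS]
    counts = Counter(candidates)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_keywords]]


def build_search_query(keywords):
    """Join keywords into a boolean query string (hypothetical syntax)."""
    return " AND ".join(keywords)


sample_specification = (
    "A turbine blade comprising a cooling channel. The cooling channel "
    "extends along the blade, and the blade is made of a nickel alloy."
)
keywords = extract_keywords(sample_specification, max_keywords=3)
query = build_search_query(keywords)
print(query)
```

Pure frequency counting is the simplest possible rule. Weighting terms from the claims more heavily than terms from the description would be one natural refinement.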
One machine learning classifier that has already been created in this field predicts whether a US patent application will receive an Alice rejection, that is, a subject-matter eligibility rejection under 35 U.S.C. § 101 following Alice Corp. v. CLS Bank.
Finding available sources of labelled training data is not easy. The linked open EP data is a relatively rare source of labelled training data which is readily available in large quantities.
Linked open EP data creates a public web of interlinked patent data from the EPO and other data publishers that can be queried, retrieved and viewed using standardized web technologies such as HTTP, URI, RDF and SPARQL.
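To give a flavour of how such data might be queried, the sketch below builds a SPARQL query and encodes it as an HTTP GET URL, following the standard SPARQL protocol convention of passing the query text in a "query" parameter. The endpoint address, the "ex:" vocabulary and the property names are placeholders invented for this example and are not the EPO's actual endpoint or schema. The request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption, not the real EPO service address).
SPARQL_ENDPOINT = "https://example.org/sparql"

# Hypothetical vocabulary: ex:Publication and ex:title are invented names.
QUERY_TEMPLATE = """\
PREFIX ex: <http://example.org/patent#>
SELECT ?publication ?title
WHERE {{
  ?publication a ex:Publication ;
               ex:title ?title .
}}
LIMIT {limit}
"""


def build_query_url(limit=10):
    """Return a GET URL carrying the SPARQL query in the 'query' parameter."""
    query = QUERY_TEMPLATE.format(limit=int(limit))
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query})


url = build_query_url(limit=5)
print(url)
```

Fetching the resulting URL with an Accept header of "application/sparql-results+json" would, on a conforming endpoint, return the matching rows as JSON.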