AGI -- Artificial General Intelligence

This page was last updated in 2009 and is effectively obsolete. Historical record only.

A messy and incomplete list of open source (and some notable closed-source) Artificial General Intelligence projects, as well as lists of various components and tools that can be used within existing, or in new AGI projects. These components cover everything from NLP and language generation to data clustering and machine-learning algorithms, large data stores, knowledgebases, reasoning engines, program-learning systems, and the like.

A good overview is given by Pei Wang's Artificial General Intelligence: A Gentle Introduction. See also the Wikipedia articles on Artificial Consciousness and Strong AI.

See also a large list of free/open-source "narrow AI" software, at the GNU/Linux AI & Alife HOWTO.

Suggested Education for Future AGI Researchers.

AGI: The Whole Enchilada, with all the trimmings

A list of projects that attempt to put together the right ingredients to cook up AGI in full generality. This includes solving the problems of reasoning, learning, planning, acting, and internally modeling the external world (including complex external objects, such as people and their emotional state). These commonly include work on language comprehension and speech, and subsystems for sensory input and motor control. Some of these are open source, some are not. Some are academic projects, some are commercial attempts.

The most advanced open-source general cognition/reasoning system. Includes an NLP subsystem, reasoning, learning, 3d virtual avatar interfaces, robotics interfaces. Open-source, GPL license.

Critique: It's an experimental research platform. That is, it consists of a collection of parts that can be assembled, with some considerable difficulty, into working systems, which can then be used in practical applications, or to perform experiments.

OpenCog has many warts and serious architectural failings. However, it does more things, more correctly, than any other system that I know of. In fact, it is the only system that I know of that correctly unifies logic and (Bayesian) probability, anchoring it on a solid theoretical foundation of model theory (term algebras and relational algebras), category theory (pushouts, functors, colimits) and type theory.

Nominally associated with Artificial General Intelligence Research Institute, SIAI and Novamente.

Demo: AI Virtual Pet Answering Simple Questions

IBM Watson


Pei Wang's NARS project

NARS, the Non-Axiomatic Reasoning System, aims to explain a large variety of cognitive phenomena with a unified theory, in particular reasoning, learning, and planning. Site holds a number of white-papers. Was inspiration for OpenCog. (OpenCog claims to overcome certain limitations in NARS.) OpenNARS is Pei Wang's implementation. Released under GPLv2.

Stan Franklin's LIDA

An intelligent agent, communicating by email. Built for the US Navy. Based on Baars' Global Workspace Theory. Answers only one question: "What do I do next?". See Tutorial.


General framework for running cognitive experiments(?). Java source code available under unspecified license.


Aims to couple common-sense knowledge-base systems to natural language text processing. Open source project.

John Weng's SAIL architecture

Seems primarily aimed at robots.


Cognitive architecture research platform, aimed at simulating and understanding human cognition.

Nick Cassimatis's PolyScheme

Polyscheme is a cognitive framework intended to achieve human-level artificial intelligence and to explain the power of human intelligence. Variety of research papers published, no source code available.

Jeff Hawkins Numenta

Commercialized "Hierarchical Temporal Memory".


SNePS is a knowledge representation, reasoning, and acting (KRRA) system. See also the Wikipedia page. See also a paper by Shapiro, part of the SNePS group.

General Intelligence Research Group
Yan King Yin (YKY)'s project, an attempt to build a semi-open, semi-closed source AGI project.

Semantic systems

Systems which handle the natural-language and semantic aspects of AGI, without striving for full-fledged "consciousness", 3D embodiment, natural language output (speech), planning of actions, activities, awareness and modelling of complex external systems (i.e. awareness of external human actors), self-awareness, etc. Note that some of the systems listed above also fail to handle e.g. embodiment, but nonetheless seem to have a more expansive vision and set of goals. By contrast, the systems below really try to stick to the "straight and narrow", without making overly broad claims.

Primarily an implementation of Markov Logic Networks (MLN). MLN are remarkable because they unify, in a single conceptual framework, both statistical and logical (reasoning, first-order-logic) approaches to AI. This seems to endow the theory with a particularly strong set of powers, and in particular, the ability to learn, without supervision, some of the harder NLP tasks, such as dependency grammars, automatic thesaurus/synonym-set learning, entity extraction, reasoning, textual entailment, etc.

the Beast

Primarily an implementation of Markov Logic Networks, for Statistical Relational Learning, including dependency parsing, semantic role labelling, etc. Perhaps more NLP focused than Alchemy.


YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.


FreeHAL is a ... ?? chatbot and stuff ... ?? TODO -- figure this one out. Hard to tell if this is "real" or a hack.

Nutcracker and Boxer

Nutcracker performs textual entailment using a first-order-logic (FOL) theorem prover, and an FOL model builder. Built on top of Boxer, which takes the output of a combinatory categorial grammar (CCG) parser, and converts this to first-order logic based on Hans Kamp's "Discourse Representation Theory".

Written in Prolog. Non-free license, bars commercial use.


Question-Answering system. Probably works well, but my biggest criticism is that it's hand-crafted, rather than trying to actually learn anything. Viz, no attempt to learn grammars, no attempt to learn how to normalize a question. GPL license.


The MultiNet paradigm - Knowledge Representation with Multilayered Extended Semantic Networks by Hermann Helbig. Wires up NLP processing to hard-wired upper ontology, and adds reasoning. No source code available.

Project Halo

Developed by Vulcan Inc. in association with SRI International, Cyc Corp. and the UTexas/Austin CS/AI labs, aims to provide reasoning and question-answering over large data sets. All knowledge entry is done manually, by experts. Some research results are available publicly.

OntoSem - Ontological Semantics

Developed by Hakia Labs, proprietary, commercial software for taking NLP input and generating ontological frames/expressions from it. See also

Powers Hakia search.

Reasoning engines/Inference engines

There are two primary ways in which reasoning is being approached these days: through crisp logic (using boolean true/false truth values) and different approaches to fuzzy logic.

Below is a list of reasoning and/or inference engines only, without accompanying ontologies/datasets.

What am I (personally) looking for? I am looking for a system that represents a logical expression as a (hyper-)graph. Why? First, because the natural setting for logic is model theory. The natural setting for algebraic structure is a term algebra. The universal term algebra or free theory is the free term algebra. The natural way to express an equivalence relation, production rule, or a re-write rule for a term algebra is as a hypergraph. Thus, if one wants to apply machine learning technology to learning new equivalence relations or reduction rules, one must be able to represent one's systems as hypergraphs. Unfortunately, very few have made this leap or connection. The only system that I know of that represents both logical relations and re-write rules as hypergraphs is OpenCog.

Probabilistic Reasoning engines/Inference engines

There appear to be five primary approaches: PLN, NARS, MLN, CRF, HMI. One of the primary difficulties in probabilistic reasoning is inference control: since nothing is ever strictly true or false, there is a huge combinatorial explosion during reasoning, and effective strategies must be found to control this. Another important problem is loss of precision: after a few inference steps, the uncertainties can compound in such a way that all confidence in the resulting truth value is lost. Thus, inference control sometimes focuses on maximizing the certainty or confidence of a deduction, rather than maximizing its truth.

Ben Goertzel's PLN Probabilistic Logic Network

Implements a probabilistic analog of first-order logic. Ideal for uncertain inference. Beta available now. In the process of being ported to OpenCog. First-order logic statements are expressed in terms of hypergraphs. The nodes and edges of the hypergraphs can hold various different "truth value" structures. A set of basic types define how truth values are to be combined, resulting in the primitives needed for uncertain reasoning. These are described in Ben Goertzel's book of the same name. A specific claim is that the rules are explicitly founded on probability theory.

Truth values are probability distributions, usually represented as compound objects, e.g. having not only a probability, but also upper and lower bounds on the uncertainty of the probability estimate.

Actual implementation works primarily by applying typed pattern matching to hypergraphs, to implement a backward-chainer. That is, PLN defines a typed first-order logic; it does not (yet?) define a typed functional programming language (although it comes close to doing so). Inference control is through various algorithms, including "economic attention allocation" and Hebbian activation nets.

GNU Affero GPLv3 license.

Pei Wang's NARS Non-Axiomatic Reasoning System

Similar to PLN in various ways, but uses a different set of formulas for inference. Truth values are represented with a pair of real numbers: strength and confidence.
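
To make the strength/confidence pair concrete, here is a minimal sketch of how confidence erodes over a chain of deductions (the loss-of-precision problem mentioned above). It assumes the deduction truth function as I recall it from the NAL literature, f = f1*f2 and c = f1*f2*c1*c2; the names TruthValue, deduction and chain are made up for this example and are not taken from OpenNARS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TruthValue:
    strength: float    # fraction of evidence in favour, in [0, 1]
    confidence: float  # how much evidence backs the estimate, in [0, 1)


def deduction(a: TruthValue, b: TruthValue) -> TruthValue:
    """Assumed NAL-style deduction: from A->B and B->C, conclude A->C."""
    f = a.strength * b.strength
    c = a.strength * b.strength * a.confidence * b.confidence
    return TruthValue(f, c)


def chain(premises):
    """Fold a list of premises into one conclusion, left to right."""
    result = premises[0]
    for premise in premises[1:]:
        result = deduction(result, premise)
    return result


if __name__ == "__main__":
    step = TruthValue(0.9, 0.9)
    for n in (1, 2, 3, 5):
        tv = chain([step] * n)
        print(n, "premises:", round(tv.strength, 3), round(tv.confidence, 3))
```

Even with fairly solid premises, the confidence of the conclusion drops much faster than its strength, which is why inference control may prefer short, high-confidence derivations.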

Open source, written in Lisp.

Pedro Domingos' MLN Markov Logic Networks

An extension of Markov networks to first-order logic. Ungrounded first-order logic expressions are hooked together into a graph. Each expression may have a variety of different groundings. The "most likely grounding" is obtained by applying maximum entropy principles aka Boltzmann statistics computed from a partition function that describes the network. One important stumbling block is that computing the partition function can be intractable. Thus, sometimes a data representation is used such that certain probabilities are solvable in closed form, and the hard (combinatorial) problems are pushed off to clustering algorithms. (See e.g. Hoifung Poon).

MLNs stick to a very simple "truth value" -- a real number, ranging from 0.0 to 1.0 -- indicating the probability of an expression being true. Normally, no attempt is made to bound the uncertainty of this truth value, except possibly by analogy to physics (e.g. second-order derivatives expressing permeability, susceptibility, etc. or strong order when far from the Curie temperature, etc.) That is, maximum entropy principles are used to maximize the number of "true" formulas that fit the maximum amount of the (contradictory) input data. However, it is unclear how confident one should be of a given deduction.
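
A toy illustration of the partition function, as a sketch only: one weighted ground formula over two ground atoms, with the probability of each world taken as exp(sum of weights of satisfied formulas) / Z. The atoms, the weight 1.5 and all function names are invented for the example; this is not Alchemy's input format or API. The brute-force enumeration over all 2^n worlds is exactly the part that becomes intractable on real problems.

```python
import itertools
import math

ATOMS = ("Smokes(A)", "Cancer(A)")

# (weight, formula) pairs; a formula maps a world (dict: atom -> bool) to bool.
FORMULAS = [
    # Smokes(A) => Cancer(A), with an arbitrary illustrative weight.
    (1.5, lambda w: (not w["Smokes(A)"]) or w["Cancer(A)"]),
]


def worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def unnormalized(world):
    """Boltzmann factor: exp of the total weight of the satisfied formulas."""
    return math.exp(sum(weight for weight, formula in FORMULAS if formula(world)))


def probability(query, evidence=lambda w: True):
    """P(query | evidence), computed by summing over all worlds."""
    num = sum(unnormalized(w) for w in worlds() if evidence(w) and query(w))
    den = sum(unnormalized(w) for w in worlds() if evidence(w))
    return num / den


if __name__ == "__main__":
    z = sum(unnormalized(w) for w in worlds())
    print("partition function Z =", round(z, 4))
    p = probability(lambda w: w["Cancer(A)"], lambda w: w["Smokes(A)"])
    print("P(Cancer(A) | Smokes(A)) =", round(p, 4))
```

Note that the answer is a single number; nothing in the computation says how much data stands behind it.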

Several implementations, including "Alchemy" listed below.

CRF Conditional Random Fields

Similar to MLN, but avoids making certain assumptions about Bayesian priors. Rarely applied to logic/reasoning directly. Uses a single real number to represent the probability.

HMI Hierarchical Mutual Information

Similar to MLN, but abandons maximum entropy for clustering/classification based on mutual information. That is, datasets are searched empirically for small patterns that have a high value of mutual information. These are then clustered together as appropriate, and then the search is repeated on patterns based on the clusters.
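
The first step of such a search can be sketched as follows: score every adjacent word pair in a corpus by its pointwise mutual information, log2( p(a,b) / (p(a,*) p(*,b)) ). This is only the scoring step, under the assumption that "small patterns" means ordered word pairs; the clustering and the repeated search over clusters are not shown, and the function name and corpus are made up.

```python
import math
from collections import Counter


def pair_mi(sentences):
    """Pointwise mutual information, in bits, of each adjacent word pair."""
    pairs = Counter()   # counts of (left word, right word)
    left = Counter()    # how often a word appears on the left of a pair
    right = Counter()   # how often a word appears on the right of a pair
    for sentence in sentences:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            pairs[(a, b)] += 1
            left[a] += 1
            right[b] += 1
    total = sum(pairs.values())
    return {
        (a, b): math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    }


if __name__ == "__main__":
    corpus = ["new york is big", "new york is old", "the city is big"]
    scores = pair_mi(corpus)
    for pair, mi in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(mi, 2))
```

On a corpus this small the scores are dominated by rare pairs; real use needs large counts and usually some frequency cutoff.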

For a theoretical foundation, be sure to examine Jeff Paris's work on uncertain deduction. Note also the interplay between paraconsistent logic and intuitionistic logic when moving to probabilities. So for example: if there is an 80% likelihood that P, then there is a 20% likelihood that not P. For 'crisp logic semantics', viz. axiomatic set theory, P and 'not P' do not intersect, and it's never the case that (P and not P), so the need for paraconsistent logic can be temporarily avoided. However, for intuitionistic logic, we want to abandon the law of the excluded middle, and be able to say: "there is an 80% chance that P, but we know nothing about not P". This is solved by introducing "confidence", a la Goertzel et al. or Pei Wang. This remains unsolved by MLN and CRF.

Probabilistic Programming Languages

The probabilistic programming website is devoted to probabilistic programming languages. A list of probabilistic programming languages is also given on the wikipedia page Probabilistic relational programming language.

What am I looking for? I am looking for a system that represents a probabilistic program operation with a very simple syntax, so that a machine learning system can learn new probabilistic programs. The only such system that I know of, that is capable of doing this, is the OpenCog system. Although, at the current time, OpenCog rather sucks for programming.

Inductive Logic Programming

Logic programming is the act of specifying programs as logical statements/assertions. Examples of logic programming languages include prolog, datalog. Inductive logic programming is the act of automatically learning new logic programming rules.
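
For readers who have not seen logic programming, here is a rough sketch of the "programs as logical statements" idea: two Datalog-style rules for ancestor, evaluated by naive bottom-up forward chaining until a fixed point is reached. It illustrates plain logic programming only, not the inductive step of learning new rules; the facts, rule functions and names are all made up.

```python
def forward_chain(facts, rules):
    """Naive bottom-up evaluation: apply every rule until nothing new appears."""
    known = set(facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(known) - known
        if not new:
            return known
        known |= new


# Facts are tuples of the form (predicate, arg1, arg2).
FACTS = {
    ("parent", "ann", "bob"),
    ("parent", "bob", "cid"),
    ("parent", "cid", "dee"),
}


def rule_base(known):
    # ancestor(X, Y) :- parent(X, Y).
    return {("ancestor", x, y) for (p, x, y) in known if p == "parent"}


def rule_step(known):
    # ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    return {
        ("ancestor", x, z)
        for (p, x, y) in known if p == "parent"
        for (q, y2, z) in known if q == "ancestor" and y2 == y
    }


if __name__ == "__main__":
    closure = forward_chain(FACTS, [rule_base, rule_step])
    for fact in sorted(f for f in closure if f[0] == "ancestor"):
        print(fact)
```

An inductive logic programming system would be handed the parent facts plus some example ancestor facts, and asked to come up with the two rules itself.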


Crisp-logic reasoning engines/Inference engines

Reasoning engines that employ crisp logic -- i.e. boolean true/false truth values only. In general, crisp logic reasoning is a lot simpler than uncertain reasoning, since the combinatoric explosion is far far smaller, and loss of precision is not a concern.

Logic engine for guile. Has Kanren-like interfaces... prolog-like interfaces. See docs for a short overview/introduction of what it is.

Prolog engine, open source. Supports tabling/memoing, well-founded negation. This is one of the fastest inference engines out there, per the Madrid 2009 Semantic Web OpenRuleBench results. Personally, I suspect that this is because of a strong grounding in inference and language design theory on the part of the developers.


Prolog engine. For performance, adds "demand-driven indexing". This is one of the fastest inference engines out there, per the Madrid 2009 Semantic Web OpenRuleBench results. Personally, I suspect that this is because of a strong grounding in inference and language design theory on the part of the developers.


Inference engine, bottom-up. Implements the datalog query system. Has "Magic Set" optimization. Implemented in Java. Immature? LGPL license.


PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF). It uses a natural deduction inference engine that combines forward and backward chaining to derive what logically follows from the facts and rules asserted in the knowledge base. Has interfaces to common-lisp, C++ and Java. GPL license.

CLIPS - A Tool for Building Expert Systems

Among the first expert system/rule engines ever. Originally from NASA, now public domain. C language. Designed for embedding expert systems into devices, etc. See also Wikipedia page. Extensive number of features.


Inference engine, specifically tailored to work well with Python. Features:

The Scone Knowledge-Base Project

Sigma Knowledge Engineering Environment

Primarily an inference engine coupled to an ontology. GPL license.


Drools is a business rule management system (BRMS) and an enhanced Rules Engine implementation, ReteOO, based on Charles Forgy's Rete algorithm tailored for the Java language. Despite using Rete, this is possibly one of the slowest inference engines out there, as well as the least stable (per WWW Madrid 2009 Semantic Web OpenRuleBench results).


Function symbols. Meant for event processing, not data processing ...

Boolean SAT, SMT Propositional logic solvers

Use Boolean SAT for traditional propositional logic solvers, use SMT for solvers that include arithmetic expressions.
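
To show what a propositional solver actually does, here is a minimal backtracking sketch in the spirit of DPLL (it simplifies clauses under the current assignment and prefers short clauses, but has none of the clause learning or heuristics of real solvers). Clauses are lists of nonzero integers in the usual DIMACS convention, where 3 means x3 and -3 means not x3. The function name solve is made up; SMT is not covered here.

```python
def solve(clauses, assignment=None):
    """Return a satisfying assignment {variable: bool}, or None if there is none."""
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue  # clause already satisfied by the current assignment
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return None  # every literal is false: conflict, backtrack
        simplified.append(rest)
    if not simplified:
        return assignment  # all clauses satisfied
    # Branch on a literal from the shortest clause (a unit clause if one exists).
    literal = min(simplified, key=len)[0]
    for value in (literal > 0, literal < 0):
        result = solve(simplified, {**assignment, abs(literal): value})
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    print(solve([[1, 2], [-1, 3], [-2, -3]]))
    # x1 and not x1: unsatisfiable
    print(solve([[1], [-1]]))
```

Variables that end up unconstrained are simply left out of the returned assignment.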

Algernon - Rule-Based Programming

Java, on sourceforge. Recommended for small-to-medium systems. A frame-slot type system.

Theorem provers

Theorem provers are primarily meant for formal verification of systems, commonly of hardware designs, but also of mathematical statements, etc.

The E Equational Theorem Prover

Theorem prover.

HOL Higher Order Logic

Theorem prover. Usually used for formal verification. BSD license.

Prover9 is a theorem prover for first-order and equational logic. Mace4 searches for finite models and counterexamples.

SPASS Automated Theorem Prover for First-Order Logic

Theorem prover.

PVS Specification and Verification System

With integrated theorem prover. CMU Lisp. GPL license.

Graph Re-writing systems

It seems that one of the best ways to represent knowledge is as a graph or hypergraph. Doing something with that knowledge requires taking that graph, and transforming it in various ways. Thus, one needs to have a graph re-writing system.

Some of the logic and reasoning systems above make explicit use of a graph re-writing system. Most do not. The RelEx language system explicitly makes use of one to perform dependency parsing.

What am I looking for? I want a graph rewriting system that expresses the re-write rules themselves as graphs. The rules should also be expressible as strings, and should have a very simple syntax, so that a machine-learning system could learn new rules. Ideally, the graphs would actually be hypergraphs, as it is difficult and cumbersome to implement certain constructs with ordinary graphs. In particular, it is difficult to implement certain dependency relations in natural languages with ordinary graphs. It is also difficult to specify functors as ordinary graphs (since the arguments to a functor are typed, and the type itself is usually a graph. Thus, one needs to allow the nodes of a graph to be graphs themselves, i.e. to be hypergraphs.) Put another way: in model theory, the universal algebra is the free term algebra: given a fixed signature, terms may be freely composed in any way; there are no reductions or relations. A free term algebra is most easily represented as a directed tree graph. Any equivalence relation or re-write rule is then a hypergraph! (In fact, re-write rules that replace functors by other functors are functors themselves; this leads to the concept of a 2-category) Currently, there is only one such system that I know of: it is the pattern matcher in OpenCog.
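
A small sketch of the "rules are themselves data" idea, restricted to ordinary terms (trees) rather than hypergraphs: terms are nested tuples, variables are strings starting with "?", and a rule is just another term of the form ("rewrite", lhs, rhs). Because rules are terms, the same rewriter can be pointed at the rules themselves. This is an illustration with made-up names, not the OpenCog pattern matcher.

```python
def match(pattern, term, bindings):
    """Match pattern against term, extending bindings; return None on failure."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term)):
        for p, t in zip(pattern, term):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None


def substitute(template, bindings):
    """Replace every bound variable in template by its value."""
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return bindings.get(template, template)


def rewrite(term, rules):
    """Rewrite sub-terms first, then the term itself, until no rule applies.

    Termination is not guaranteed for arbitrary rule sets.
    """
    if isinstance(term, tuple):
        term = tuple(rewrite(t, rules) for t in term)
    for _, lhs, rhs in rules:
        bindings = match(lhs, term, {})
        if bindings is not None:
            return rewrite(substitute(rhs, bindings), rules)
    return term


RULES = [
    ("rewrite", ("add", "?x", 0), "?x"),
    ("rewrite", ("mul", "?x", 1), "?x"),
    ("rewrite", ("mul", "?x", 0), 0),
]

if __name__ == "__main__":
    print(rewrite(("add", ("mul", "a", 1), ("mul", "b", 0)), RULES))
    # A meta-rule applied to a rule: rename the functor "add" to "plus".
    print(rewrite(RULES[0], [("rewrite", "add", "plus")]))
```

The simple tuple syntax is the point: a learning system only has to generate tuples to propose new rules, or new rules about rules.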

A list of graph rewriting systems can be found on the Wikipedia Graph rewriting page. These include:

Java. A graph processor for java gremlin graphical database. Stuff like neo4j, etc. The java side of things here is booming.
Python. Easy-to-use graph database.
GMTE: Graph Matching and Transformation Engine

Works on graphs with labelled edges and nodes (i.e. category-theoretic). Written in C++. From CNRS. License: free of charge, but proprietary.

AGG: Attributed Graph Grammar

Java. The graph transformation rules themselves must be written in Java. Category-theoretic approach, single push-out. Meant to be embeddable in other projects. Source available, license unclear.

GROOVE: GRaphs for Object-Oriented VErification

Written in Java. Not obviously extensible, scalable, or usable as a component within another system. (??)


All very nice but useless if you cannot see.

A "sketch understanding system". For example, you can give it Kinect data, and it can tell you about the motions. Built on top of QSRlib, below.


QSRlib: a software library for online acquisition of Qualitative Spatial Relations from Video. Awesome!


It's painfully clear that creating AI requires a subtle interplay between querying, pattern matching, and imperative, algorithmic processing. For example, rule engines require one to write rules, whose first part, the predicate, is meant to be a pattern-match against the output of other rules. More generally, a lot of the effort in the "semantic web" and SPARQL, etc. is about creating queries (such as in SQL) -- but the result of a pattern match is a large glob of data. Once one has this data, one then has to apply some algorithm to it. And then .. lather, rinse, repeat. One is thus faced with an infrastructure problem: what infrastructure is best for doing all of the above? One of the most interesting, new approaches to this is Barry Jay's "Pattern Calculus" and the Bondi programming language, which promises to provide a foundation on which all of the above can be built, correctly, this time. Other "hot" programming languages that attempt to solve many of the irritating, horrid problems experienced in older, more popular languages:

Fast! Small executables! ML/OCaml-like type system. Supports several programming styles: functional programming (both lazy and eager evaluation), imperative programming (including safety via theorem proving), concurrent programming (multi-core GC). Weakness: very very new, current version is 0.1.6.


Purely functional programming, good concurrency support, good FFI, good compiler. Lazy evaluation. Weakness: difficult for programmer to predict time/space performance.


Concurrent, functional, fault-tolerant programming. See also wikipedia article.


Fast! Unifies functional, imperative, and object-oriented programming styles. Provides a strong type and type-inference system derived from ML. Weakness: no multi-core/concurrent support. Type system can be subtle. Poor FFI, modules system. See also Wikipedia page.


Object-oriented, functional programming. Focus on scalability. Targets JVM. Good Java integration. Weakness: no tail recursion in JVM!! which means mutually recursive procedures are icky/slow.


Modern Lisp dialect, targeted at JVM. Good Java integration. Weakness: no tail recursion in JVM!!

NLP - Natural Language Processing

A wiki containing an extensive listing of software and other things is at ACLWeb, and in particular, at the Tools and Software page. A small list is at the NLP Resources wiki page. A general overview of the state of the art is at the AAAI Natural Language page.

A particularly important theory is Dick Hudson's Word Grammar.

Other NLP resources include:

Morphology software
A wiki of morphology s/w
A set of semantic-like verb frames.
A set of semantic-like frames. Free for personal use, but has commercial license.
Dictionary of synonyms, antonyms, etc.
Dictionary of synonyms, antonyms, etc.

See also

General NLP Tool Sets

Grammatical Framework

Grammatical Framework is actually a programming language for writing grammars. It's built on the categorial grammar formalism. As a programming language, it's a functional language with type support. Code is GPL, libraries are LGPL and BSD.


CRF++ is an implementation of Conditional Random Fields. Has pre-existing modules for text chunking, named entity recognition, information extraction. Open source, written in C++.


Includes a shallow parser, a sentence splitter, entity detection, sense annotation (using wordnet senses), etc. Strong Spanish/Latin language support.


OpenNLP is more of a directory of other NLP projects. Includes some good maximum-entropy implementations.

NLTK -- Natural Language Toolkit

Has a book, multiple articles. Integration into WordNet. Written in python. Not clear whether it has an actual parser. Seems to do some sort of entity extraction, esp. for biomedical terms.


The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

GATE - General Architecture for Text Engineering

Java, GPL'ed. Big. Also in use for Dialogue processing and Natural Language Generation.


Chatbots, discourse systems, task planning.

There are many chatbots. There are few discourse planners; i.e. systems that attempt to get something accomplished by talking. So one idea is to couple planners with dialogue trees. By "planning", think of something like a hierarchical task network.

Here are some non-standard ones:


Disco - Collaborative Discourse Manager. Writing dialog trees is tedious, time-consuming, error-prone. Writing AIML lacks discourse control. The idea here is to use a planning system (in this case, industry standard ANSI/CEA-2018 for hierarchically specifying tasks to be accomplished) to automatically generate dialog -- the actual dialog is guided by the planner.

Put another way: conversations become "collaborative", although they are goal-driven (by the planner).

Github, Java, MIT license.
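
The planner-drives-the-dialog idea can be sketched in a few lines: a hierarchical task model is decomposed depth-first into primitive steps, and each primitive step carries an utterance. The task names, the METHODS and UTTERANCES tables and both functions are hypothetical; this is neither the ANSI/CEA-2018 format nor Disco's API, and a real planner would also handle preconditions, alternative decompositions and user replies.

```python
# Hypothetical task model: each compound task maps to an ordered list of subtasks.
METHODS = {
    "book_trip": ["get_destination", "get_dates", "confirm"],
    "get_dates": ["ask_departure", "ask_return"],
}

# Primitive tasks are the ones that have an utterance attached.
UTTERANCES = {
    "get_destination": "Where would you like to go?",
    "ask_departure": "When do you want to leave?",
    "ask_return": "When do you want to come back?",
    "confirm": "Shall I book it?",
}


def plan(task):
    """Depth-first decomposition of a task into primitive steps."""
    if task in METHODS:
        steps = []
        for subtask in METHODS[task]:
            steps.extend(plan(subtask))
        return steps
    return [task]


def dialog(task):
    """Turn the plan for a task into the system's side of the conversation."""
    return [UTTERANCES[step] for step in plan(task)]


if __name__ == "__main__":
    for line in dialog("book_trip"):
        print("SYSTEM:", line)
```

Nobody wrote a dialog tree here; the order of the questions falls out of the task hierarchy.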

NLP Parsers

Another kind of useful linguistic resource is the NLP parser. Below, they are classified into two types: rule-based parsers and unsupervised parsers. All parsers are driven by a lexicon or dictionary that contains information about the grammatical structure of the language. In most cases, the lexicon is either manually created by human experts, or obtained by training on text manually marked up by human experts. A very interesting exception is the class of parsers that induce grammar (i.e. induce a lexicon) without human intervention: these are the unsupervised parsers.

The only unsupervised grammar inducers that I know of are:

DAGEEM 1.0 - Dependency and Grammar Estimation with Expectation-Maximization

Unsupervised grammar induction refers to the task of learning a grammar with the input being only sentences in natural language. Upside: this is interesting because it learns without any supervision. Downside: low accuracy, weak dependency grammar (DMV, the Dependency Model with Valence). Accuracy has been measured to be about 50% on 6 different languages, which is currently state-of-the-art. The DMV grammar is a bit lacking: only valence-2 links allowed (thus, no indirect objects, adjectival modifiers) and the number of word-classes (parts of speech) is far too small. Alas. Promising start, though. GPLv3 license.

The below all require either supervised training, or use manually constructed lexicons/dictionaries.

Link Grammar Parser

From Carnegie-Mellon. A parser for English, Russian, Arabic, Persian, and German, based on "link grammar", a novel theory of natural language syntax. Written in C, with a BSD license. English dictionary includes 90K words. Actively maintained. The most accurate parser out there; I don't know of any that are more accurate, free or commercial. (Accuracy is in the 97-99% range.) Fast, too.

RelEx Dependency Grammar and Semantic Relationship Extractor

Built on top of the Carnegie Mellon link parser. Extracts dependency relations from link data. Creates FrameNet-like semantic frames from the dependency graphs. Includes ability to handle multi-sentence corpus, entity detection, and perform anaphora (pronoun) resolution via Hobbs algo. Apache v2 license. Written in Java. Actively developed/maintained.

Now includes not one, but two! natural language generation facilities: NLGen/SegSim and NLGen2.


Rule-driven dependency parser. English, Spanish, Galician, French, and Portuguese. Parser-compiler in Ruby; parser is in Perl. GPL license.

Stanford Parser

Dependency parser, generating output similar to RelEx. Statistical parser. Trains on treebank data, has been applied to half-a-dozen different languages. Slow, RelEx+linkgrammar is 3x to 4x faster. Java, GPL v2 license.


Trainable, fast, accurate dependency parser. Has four different training methods. Uses a fast shift-reduce algorithm for single-pass parsing. Reads CoNLL. C++. unclear license? Unclear accuracy?
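
For the curious, the shift-reduce idea can be sketched as follows, using the textbook arc-standard transition set (an assumption; I have not checked which transition system this particular parser uses). A stack and a buffer are manipulated by SHIFT, LEFT and RIGHT actions; in a trained parser a classifier picks each action, whereas here the action sequence is simply supplied by hand.

```python
def parse(words, transitions):
    """Apply arc-standard transitions; return a list of (head, dependent) pairs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in transitions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # Second-from-top becomes a dependent of the top of the stack.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":
            # Top of the stack becomes a dependent of the word below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return [(words[h], words[d]) for h, d in arcs]


if __name__ == "__main__":
    sentence = ["the", "dog", "barks"]
    actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "LEFT"]
    for head, dependent in parse(sentence, actions):
        print(f"{head} -> {dependent}")
```

Each word is shifted once and attached once, so the number of actions is linear in the sentence length, which is where the speed of single-pass parsing comes from.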


Maltparser is a system for data-driven dependency parsing, which will learn a parsing model from treebank data, and can then be used to parse new data using the induced model. Java, BSD license. old URL.


Trainable, fast dependency parser. Uses minimum spanning tree methods. Reads CoNLL. Doesn't seem to be very active. Java, CPL license, Apache V2.0 license. download

ISBN Dependency Parser

Incremental Sigmoid Belief Network Dependency Parser. Trainable. GPL license. Unmaintained, last release was in 2008.

Constraint Grammar

Dependency output. Linguist-written rules. GPL license.

Fluid Construction Grammars.

Idea from Luc Steels. There is a LISP implementation at A Java implementation at TexAI.

NLP text generators

Automated NL translation systems typically have generators; however, these are statistical, and cannot be directly controlled. It's nice to have a rule-based NL generator that can "learn" new forms of expression.

There is a large list of NL generators located at the ACLWeb Natural Language Generation Portal.

NLGen, SegSim, NLGen2
Text generation modules compatible with link-grammar/RelEx. See the link-grammar/RelEx references for more details.
Penman sentence generation system
Described above, has a generator system.

Text-to-speech, Speech synth

There are many, most proprietary. OpenSource includes: ...

Coreference Resolution

Includes the problem of Anaphora resolution. Best-known is the Hobbs algorithm for anaphora resolution. RelEx implements this algorithm.
BART, short for "Beautiful Anaphora Resolution Toolkit", uses machine-learning and maximum-entropy statistical techniques to learn entities and identify them. Java, Apache license.

Word Sense Disambiguation

Word sense disambiguation attempts to determine which of multiple possible semantic senses is used in a sentence. A good set of references and code are on Rada Mihalcea's page. Code is under GPL license. See also:

Named Entity Recognition, Entity Extraction

Other NL tasks include named-entity recognition (NER) or entity extraction. Entity extraction refers to the recognition of names, dates, places in a body of text. Related is the recognition of technical terms.

NER is commonly done in one of several ways:

A gazetteer list or a shallow parser can be created in several ways: A large list of open-source tools is listed in the Wikipedia NER article.
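
The gazetteer approach is simple enough to sketch directly: keep a table of known names, and scan the text for the longest token sequence found in the table. The entries, labels and function name below are made up for illustration; real gazetteers hold many thousands of entries and still miss anything not listed, which is why they are usually combined with a parser or a statistical tagger.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity labels.
GAZETTEER = {
    ("new", "york"): "PLACE",
    ("carnegie", "mellon"): "ORG",
    ("paris",): "PLACE",
}
MAX_LEN = max(len(key) for key in GAZETTEER)
PUNCTUATION = ".,;:!?"


def tag_entities(text):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    tokens = text.split()
    lowered = [t.strip(PUNCTUATION).lower() for t in tokens]
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label:
                surface = " ".join(tokens[i:i + n]).strip(PUNCTUATION)
                found.append((surface, label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


if __name__ == "__main__":
    print(tag_entities("She moved from Paris to New York."))
```
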
SOFIE A Self-Organizing Framework for Information Extraction

A powerful system for extracting entities and entity relations from free text. See the YAGO-NAGA listing above.

GATE - General Architecture for Text Engineering

Java, GPL'ed. Big. GATE is supplied with an Information Extraction system called ANNIE, which seems to be focused on "entity extraction".

CRF++ is an open-source tool for conditional random fields. It does named-entity recognition among other things.
Online, commercial service. Free for limited volumes.


Other tools of interest.
Scraping language content out of web forums.

Program learning

The idea behind "program learning" is to take some dataset, and to describe it in a more compact form as an algorithm. Conceptually, it requires deducing an algorithm, given a sample of the input and expected output. For example, program learning *might* be able to "compress" large (hidden) Markov models and/or Bayesian nets into smaller, faster, more manageable algorithms. Intuitively, this would seem to be a critical feature for AGI -- the ability to take some learned, ad-hoc data (Bayes nets, Markov chains) and convert them into small, effective procedures. One of the most popular program learning algos is genetic programming.
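
As a bare-bones illustration of "deducing an algorithm from input/output samples", the sketch below does a brute-force enumeration of small arithmetic expressions and returns the first one that reproduces every sample. To be clear, this is plain enumerative search, not genetic programming and not MOSES; the expression grammar (x, 1, 2 with +, *, -) and the function names are invented for the example, and eval is used only because the strings are generated by the program itself.

```python
import itertools

LEAVES = ["x", "1", "2"]
OPS = ["+", "*", "-"]


def expressions(depth):
    """Yield every expression string up to the given nesting depth, shallow first."""
    if depth == 0:
        yield from LEAVES
        return
    yield from expressions(depth - 1)
    subs = list(expressions(depth - 1))
    for op, a, b in itertools.product(OPS, subs, subs):
        yield f"({a} {op} {b})"


def learn(samples, max_depth=2):
    """Return the first enumerated expression consistent with all (x, y) samples."""
    for expr in expressions(max_depth):
        if all(eval(expr, {"x": x}) == y for x, y in samples):
            return expr
    return None


if __name__ == "__main__":
    # Samples drawn from y = x*x + 1; the learner is not told the formula.
    samples = [(0, 1), (1, 2), (2, 5), (3, 10)]
    print(learn(samples))
```

The search space grows explosively with depth, which is exactly why smarter methods such as genetic programming and probabilistic model building are used instead of enumeration.
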
MOSES Meta-Optimizing Semantic Evolutionary Search

From the website: "Meta-optimizing semantic evolutionary search (MOSES) is a new approach to program evolution, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as computational biology, sentiment evaluation, and agent control. Results tend to be more accurate, and require less objective function evaluations, in comparison to other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and comprehensible program written in a simple Lisp-like mini-language." For details, see Moshe Looks' PhD thesis.

Apache License.


Performs clustering using genetic programming techniques. (i.e. attempts to find small algorithmic expressions that will cluster the data). Omniclust is an n-ary agglomerative search algorithm. For details, see: Clustering gene expression data via mining ensembles of classification rules evolved using MOSES. Looks M, Goertzel B, de Souza Coelho L, Mudado M, Pennachin C. Genetic and Evolutionary Computation Conference. (GECCO 2007): 407-414. Java codebase.

Machine Learning

Misc. machine learning. See also:
HBC: Hierarchical Bayes Compiler
HBC is a toolkit for implementing hierarchical Bayesian models. Model is described using a special markup language, and then code is generated: C, Java, matlab. (Tool itself is written in Haskell.)

Java. Has been used to build a POS tagger, end of sentence detector, tokenizer, name finder. LGPL/Apache license.

Maximum Entropy Modeling Toolkit for Python and C++

LGPL license

Pebl Python Environment For Bayesian Learning

MIT license

HTK Hidden Markov Model Toolkit

Portable toolkit for building and manipulating hidden Markov models. C source code, non-free-license prohibits redistribution.

Data Clustering, simple Classifiers

Linear classifiers, data dimension reduction, data clustering, PCA principal component analysis, etc. An overview is given in The Impoverished Social Scientist's Guide to Free Statistical Software and Resources.

A particularly interesting subset concerns Compositional data, which is data located on a simplex and/or a projective space.

MCL Markov Clustering
MCL- "a clustering algorithm for graphs", appears to be an excellent clustering algorithm -- it does not require supervision (i.e. does not require the number of clusters to be specified a priori) and seems to be very scalable, of performance O(N k^2). The scalability is particularly important, in light of the note below. MCL is covered under the GPLv3. (I have no personal experience with this yet, but expect to "real soon now").

Caution: All of the systems listed below fail horribly when applied to real-world data sets of any reasonable size -- e.g. datasets with 100K entries. This is typically because they try to compute similarity measures between all 100K x 100K = 10 billion pairs of elements, which is intractable on contemporary single-CPU systems. You can win big by avoiding these systems, and exploiting any sort of pre-existing organization in your data set. Only after breaking your problem down to itty-bitty-sized chunks should you consider any of the below.
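The arithmetic behind this caution, and one way of exploiting pre-existing organization, can be sketched as follows. Items are first grouped into blocks by some cheap key, and only pairs inside the same block are compared. The key used here (the first letter of a word) is a hypothetical stand-in for whatever structure your data already has, and the function names are made up for this example. Counting unordered pairs halves the figure quoted above, but the growth is still quadratic.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def blocked_pairs(items, key):
    """Yield only those pairs of items that share the same block key."""
    blocks = defaultdict(list)
    for item in items:
        blocks[key(item)].append(item)
    for members in blocks.values():
        yield from combinations(members, 2)


words = ["apple", "apricot", "avocado", "banana", "blueberry", "cherry"]
pairs = list(blocked_pairs(words, key=lambda w: w[0]))

print(all_pairs_count(len(words)), "pairs without blocking")
print(len(pairs), "pairs with blocking")
print(all_pairs_count(100_000), "unordered pairs for 100K items")
```

Blocking trades completeness for tractability: similar items that land in different blocks are never compared, so the key has to be chosen with the data in mind.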


From their website: "The VLFeat open source library implements popular computer vision algorithms including SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift. It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout. It supports Windows, Mac OS X, and Linux."

Appears to be aimed at image processing. GPL license.


Assumes data is located on a simplex, and uses that fact in its algos. Includes an algo for PCA analysis, another using a partition clustering algorithm, and an agglomerative hierarchical clustering using the Aitchison distance. Command-line interface. Written in C. (No library interfaces currently defined.) Focused on genetic/bio data. GPL license.


Mfuzz clustering. Aimed at genetic expression time-series data, claimed to be robust against noise. Uses R language. GPLv2 license.


R-based data mining. GPL.

Weka Machine Learning

Data mining, clustering. Java. GPL. From personal experience -- fails totally on any but the very smallest data sets. Dying/dead mailing list.

Learning classifiers

Classifiers that need a distinct "training" or "learning" phase before they can be used on general-purpose data. This includes Support Vector Machines (SVM), and classifier neural nets.

See also:


Fast, decision-tree-based implementation of k-nearest neighbor classification. Implements a half-dozen algos. GPL'ed. (Might not scale well for large problems?) Used in the MaltParser NLP parser, thus has been applied to NLP tasks.


Library that implements Support Vector Machine, which is one of many ways of doing a linear classifier.

Databases, distributed processing

Datamining, statistical learning and common-sense knowledge-bases need infrastructure for storing all that data in a persistent, searchable, structured manner, ideally so that hundreds or thousands of clients can get at it. A popular paradigm at this time is MapReduce, or "Distributed processing using key-value generation and reducing primitives".
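The "key-value generation and reducing primitives" can be illustrated with a single-process word count. This is only a sketch of the idea: a real framework distributes the map and reduce calls across many machines and handles the grouping step itself. The function names below are hypothetical and do not correspond to the API of any of the systems listed here.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (key, value) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    emitted = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = map_reduce(["the cat sat", "the dog sat down"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be run in parallel without shared state, which is what makes the paradigm scale.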
STXXL: Standard Template Library for Extra Large Data Sets

Per website: "STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks."


Clustering, runs in memory, thus much faster than Hadoop. Scala interfaces.


Implementation of MapReduce ideas in C++.


Implementation of MapReduce ideas in Java. Part of the Apache project. Notable things built on Hadoop: Hive, an analysis and query system. HBase, a BigTable-like non-relational database.

Hypergraph DB

Database for storing hypergraphs. Pretty Cool. Java based. Strange BSD-like license, but requires source code! Compatibility of license with GPL is unclear.

Shard databases

Shard overview describes an alternative to centralized, normalized databases.


Judy trees
Judy arrays provide a very fast array/tree structure, primarily because they were designed to avoid cache misses. This is an important low-level technology. C library, LGPL license.

Ontologies, Knowledge Bases and Reasoning Engines

Some ontologies are understood as "stand-alone", while other ontologies are incomplete when considered without a reasoning engine. An example of the latter is the OpenCyc ontology, which, when examined as a dataset, superficially appears to be incomplete, messy and capricious. However, when coupled with its reasoning system, it becomes complete. This is because many important "facts" are only one or two or three deductive steps away from the core dataset. Thus, the ontology provides an "armature", while the reasoning system provides the "clay" to fill in the gaps.
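The point about facts being "one or two or three deductive steps away" can be made concrete with a tiny forward-chaining sketch. The triples, the two rules and the function name are all hypothetical; this is not CycL and has nothing to do with how the OpenCyc engine is actually implemented. It only shows a fact that is absent from the stored data but reachable in two steps.

```python
def forward_chain(facts, max_steps=3):
    """Apply two hand-written rules repeatedly, for at most max_steps rounds."""
    known = set(facts)
    for _ in range(max_steps):
        derived = set()
        for rel1, a, b in known:
            for rel2, c, d in known:
                if b != c:
                    continue
                # Rule 1: "isa" is transitive.
                if rel1 == "isa" and rel2 == "isa":
                    derived.add(("isa", a, d))
                # Rule 2: properties are inherited down "isa" links.
                if rel1 == "isa" and rel2 == "has":
                    derived.add(("has", a, d))
        if derived <= known:
            break  # nothing new was deduced; stop early
        known |= derived
    return known


# The stored "armature": three facts only.
core = {
    ("isa", "Fido", "dog"),
    ("isa", "dog", "mammal"),
    ("has", "mammal", "fur"),
}
closure = forward_chain(core)
print(("has", "Fido", "fur") in core, ("has", "Fido", "fur") in closure)
```

Judged as a dataset, `core` says nothing about Fido having fur; judged together with the rules, it does.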

I've moved the list of ontologies to near the bottom of this page, because I have come to believe that they are useless unless they have been learned natively, by some specific learning system. Thus, for example, an AGI system would use an ontology not by loading one of the below, but by learning one, by reading books. Or reading Wikipedia.

A giant list can be found at Peter Clark's Some Ongoing KBS/Ontology Projects and Groups. Problems with ontologies are reviewed in Ontology Development Pitfalls.

Big ones include


Common-sense knowledgebase. Large. GPL license. Users can edit data online, at

Open Mind Common Sense

Collection of English-language sentences, rather than using a strict upper ontology. This is actually quite convenient, if you have a good NLP input system, as it helps avoid the strictures of pre-designed ontologies; and rather gets you to deal with the structure of your NLP-to-KR layer. From MIT. -- large -- 700K sentences


YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.


Semantic network.

See also: Wordnet::Similarity A perl module implementing various word similarity measures from Wordnet data. i.e. Thesaurus-like.

Historical Thesaurus of English

Licensing is unclear.

SUMO - Suggested Upper Merged Ontology

SUMO WP article. Includes an open source Sigma knowledge engineering environment, which includes a theorem prover. Sigma uses KIF.

"The largest formal public ontology in existence", available under GPL. (although OpenCyc is arguably bigger, and is free.) Has mappings to WordNet.


Large KB under artistic license. Source for engine not available. KB seems messy and capricious. The upper ontology is not clear. See however, remarks above.


Common sense KB, available in CycL. GPL'ed

Conceptual Nets

A knowledge representation system. Conceptual Graph Interchange Format is an ISO standard. See also "Common Logic Interchange Format (CLIF)", which is more lisp-like.


Seems well-engineered. Actual KB is slim. Source not available. Might be a dead project??

GFO - General Formal Ontology

Provides a firm theoretical foundation for representing ontologies; no actual data. OWL version of GFO under a modified BSD license. Examples include the periodic table of elements, amino acids. See also WP article.

DOLCE - Descriptive Ontology for Linguistic and Cognitive Engineering
SENSUS - An extended, re-organized version of WordNet. Does not appear to be publicly available or maintained any more?
PSL - Process Specification Language
BFO - Basic Formal Ontology
SOAR expert system
Obsoleted by OWL
KIF - Knowledge Interchange Format
Obsoleted by SUO-KIF (used in SUMO)

Unstructured data

Datasets that do not rely on an ontology, or do so only weakly.
Named Entity Recognition (NER). Commercial service, free for low volumes.
Mizar is a markup language for expressing mathematical statements in a machine readable format. There are thousands of theorems written in Mizar. Unfortunately, Mizar is hard to comprehend, and the theorem prover is proprietary. The Mizar2KIF project aims to create a tool to export KIF from Mizar input.

Narrow AI

Misc entries
Scikit-learn: easy-to-use and general-purpose machine learning in Python. Scikit-learn integrates machine learning algorithms in the tightly-knit scientific Python world, building upon numpy, scipy, and matplotlib.
RapidMiner (YALE) Java data mining
OntoWiki and Powl
Semantic web development. Screenshots show business-type apps: addressbook, calendar, etc. Powl seems to be a classes and GUI designer. GPL license
Java interface for the W3C Web Ontology Language OWL. LGPL license.
Siafu: an Open Source Context Simulator
Simulate individual agents
Jamocha - one engine for all your rules.
Rule engine

Test Datasets

Datasets that can be used to evaluate narrow AI algorithms.
TechTC - Technion Repository of Text Categorization Datasets
Text categorization datasets. The overall idea is to read a block of text, and decide if it belongs to category A or to B. For algorithms such as SVM, it is common to simply do a word count, and classify based on that. Thus, the predigested datasets are of the form +1 or -1 (to indicate A or B), and a list of word counts (e.g. word 6578 occurs 2 times; word 6579 occurs 0 times ... etc.)
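A minimal sketch of how such a predigested example might be produced from raw text is given below. The vocabulary numbering, the function names and the sparse dictionary layout are assumptions made for this illustration; they are not the actual TechTC file format. Words with a count of zero are simply omitted rather than listed.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer index to every distinct word, in order of first use."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_example(label, text, vocab):
    """Return (label, {word_index: count}) for one block of text."""
    counts = Counter(vocab[w] for w in text.lower().split() if w in vocab)
    return label, dict(sorted(counts.items()))


corpus = [
    (+1, "the robot moved the arm"),  # category A
    (-1, "the stock price fell"),     # category B
]
vocab = build_vocabulary(text for _, text in corpus)
examples = [to_example(label, text, vocab) for label, text in corpus]

for label, features in examples:
    print(label, features)
```

Note that word order is thrown away entirely; the classifier sees only which words occur and how often.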

Embodiment, Avatars, Robotics

The Hanson Robotics heads are quite -- interesting.
Robot control and sensor processing. GPL.
The Mobility Open Architecture Simulation and Tools (MOAST) framework aids in the development of autonomous robots. It includes an architecture, control modules, interface specs, and data sets and is fully integrated with the USARSim simulation system.
Robotics messaging. Military standard.
Study of emotional agents. Simple virtual robotic agents that roam a 3D world and interact in various psychologically motivated (needs & wants) ways. Humboldt University of Berlin. Java/Eclipse infrastructure.
GPL. AGISim is a framework for the creation of virtual worlds for artificial intelligence research, allowing AI and human controlled agents to interact in realtime within sensory-rich contexts. AGISim is built on the Crystal Space 3D game engine. Some parts of AGISim are closely related to OpenCog. Possibly/probably not maintained any more, I think the development team moved to OpenCog.


Most of these use very weak/narrow AI techniques. A large list can be found at
Chatterbot, AIML. AIML is a stimulus-response system: a bunch of English sentence patterns are hard-coded, and a bunch of replies to these are hard-coded as well.
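The stimulus-response idea can be sketched in a few lines. This is not AIML itself (which is an XML format) and not the code of any particular chatterbot; the rule table, the "*" wildcard handling and the function names are hypothetical, chosen only to show hard-coded patterns paired with hard-coded replies.

```python
import re

# Hypothetical hard-coded stimulus/response table; "*" matches any words.
RULES = [
    ("HELLO *", "Hi there!"),
    ("WHAT IS YOUR NAME", "My name is Bot."),
    ("I LIKE *", "Why do you like {0}?"),
]


def normalize(sentence):
    """Upper-case the input and strip punctuation before matching."""
    return re.sub(r"[^A-Z0-9 ]", "", sentence.upper()).strip()


def respond(sentence, rules=RULES, default="I do not understand."):
    """Return the reply of the first rule whose pattern matches the sentence."""
    text = normalize(sentence)
    for pattern, reply in rules:
        # Turn each "*" into a capturing group; escape everything else.
        regex = "(.+)".join(re.escape(part) for part in pattern.split("*"))
        match = re.fullmatch(regex, text)
        if match:
            return reply.format(*(g.lower() for g in match.groups()))
    return default


print(respond("I like robots!"))
print(respond("Tell me about quantum gravity."))
```

Anything not anticipated by a rule falls through to the default reply, which is why such systems are considered very weak AI.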

Big computers

National Science Foundation: Google+IBM: Cluster Exploratory -- grants for large cluster science.

Journals, Societies

AGI Society
Journal, conferences, events.
Journal of Cognitive Science
Issues from 1980-2004 are online, free.
Symposium on Advances in Cognitive Architectures (2003)
Speakers, Abstracts, and Slides.
Cognitive Science Society
Promotes scientific interchange among the fields of Cognitive Science, Artificial Intelligence, Linguistics, Anthropology, Psychology, Neuroscience, Philosophy, and Education.
Artificial General Intelligence Research Institute. Publisher of the Journal of Artificial General Intelligence.
The Singularity Institute for Artificial Intelligence. Runs the Singularity Summit, a seminar program aimed at explicating AGI concepts to business executives.
Lifeboat Foundation
Countering existential risks, including atomic war, meteors, bioterrorism, grey goo, and singularity/AGI issues.

Misc links

Beautiful Soup
Library for screen-scraping content from HTML pages. Python, Python Software License.
Application framework for writing spiders that screen-scrape content from HTML pages. Python, BSD License. see especially "magnum opus"
Semantics of Business Vocabulary and Business Rules
An attempt to mix natural language and first-order logic to describe business relationships.

This page is maintained by Linas Vepstas and was last updated in a substantial way in 2009 (minor updates in 2012, 2013, 2017).