Resume Parsing (also known as CV Parsing, Resume Extraction, or CV Extraction) is the conversion of a free-form resume into structured information suitable for storage, reporting and manipulation by a computer.

Resumes in MS Word format are still the medium of choice for people all over the world to describe the skills, qualifications and experience that make them a suitable candidate for a particular job. These are easy for a human to read and understand, but to a computer they are just a long sequence of letters, numbers and punctuation. A parser is a computer program that tries to analyze this sequence and extract from it elements of what the person who wrote it actually meant to say.

This is a surprisingly difficult task for a computer to do. Although modern computers can add up millions of numbers in the blink of an eye, or win the world championship at chess, understanding language in all its generality even as well as a five-year-old child remains but a pipe dream.

Part of the reason for this is that language can be almost infinitely varied. There are tens if not hundreds of ways to write down a date, for example, and countless millions of ways to write down what you did at your last job. All these different ways of writing the same thing have to be captured by the complex rules and statistical algorithms that make up a parser, and this requires lots of effort and persistence to encode.
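To give a sense of this variety, the sketch below normalises a handful of ways of writing the same month and year. The format list and the `parse_month_year` helper are illustrative assumptions, not taken from any real parser, and the list is deliberately incomplete: each new way of writing a date needs another entry.

```python
from datetime import datetime
from typing import Optional

# Hypothetical, deliberately incomplete list of ways to write a month and year.
FORMATS = ["%B %Y", "%b %Y", "%m/%Y", "%m-%Y", "%Y-%m", "%b. %Y"]

def parse_month_year(text: str) -> Optional[tuple]:
    """Return (year, month) if text matches one of FORMATS, else None."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not fit, try the next one
        return (parsed.year, parsed.month)
    return None

samples = ["March 2015", "Mar 2015", "03/2015", "2015-03", "Mar. 2015", "Spring 2015"]
results = [parse_month_year(s) for s in samples]
```

The last sample, "Spring 2015", is not covered by any format in the list and comes back as `None`, which is exactly the kind of gap that has to be closed one case at a time.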

But although the sheer variety of ways of saying the same thing is a challenge for the parser-writer, an even bigger problem is ambiguity, where the same word or phrase can mean different things in different contexts. For example, "Director" can be a job title in some contexts, or a software package in others. A 4-digit number can be part of a telephone number, a Swiss postal code, a year, a version of a software package, or many other things depending on the words around it. Seeing "Project Manager" in a resume may indicate that the person was indeed a project manager, but not in the context "I reported to the Project Manager". "Merrill Lynch" may be someone's name, but is more likely to refer to a company. All of these ambiguities have to be resolved by the parser by looking at the context in which they are used.
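As a rough illustration of resolving ambiguity from context, the sketch below labels a 4-digit number by looking only at the word immediately before it. The cue words and the `classify_four_digits` function are assumptions made up for this example; a real parser would need far richer context than a single neighbouring word.

```python
import re

# Hypothetical cue words that hint at what a following 4-digit number means.
CUES = {
    "phone": {"tel", "phone", "mobile"},
    "year": {"since", "in", "from", "until", "graduated"},
    "version": {"version", "v", "release"},
}

def classify_four_digits(text: str) -> list:
    """Label each standalone 4-digit number using the word just before it."""
    labels = []
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\d{4}", tok):
            prev = tokens[i - 1] if i > 0 else ""
            label = next(
                (name for name, words in CUES.items() if prev in words),
                "unknown",  # no cue word matched
            )
            labels.append((tok, label))
    return labels

labels = classify_four_digits(
    "Graduated in 2009. Used Director version 2004. Lives at 8001 Zurich."
)
```

Here "2009" is labelled a year and "2004" a version, while "8001" stays "unknown" because nothing in the cue list covers a postal code preceded by "at".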

How a parser works

In general, there are three types of parser: keyword-based parsers, grammar-based parsers, and statistical parsers.

Keyword-based parsers are the simplest and the least accurate. They work by identifying words, phrases and simple patterns in the text of the resume and then applying simple heuristic algorithms to the text they find around these words. For example, they may look for something that looks like a postal code in the resume and then try to interpret the surrounding words as an address. Or they may look for patterns that look like date ranges and assume that the surrounding text is an employment timeline. These parsers are the least accurate because they can't extract information that does not surround one of their keywords, and if their keywords are ambiguous (e.g., the skill "Director") then they will frequently make the wrong guess about its interpretation. In general, it is hard to get beyond 70% accuracy with a keyword-based parser.
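The date-range heuristic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the `DATE_RANGE` pattern, the `extract_timeline` function and the sample resume are all invented for the example.

```python
import re

# Assumed pattern: two 4-digit years joined by a dash or "to", e.g. "2010 - 2014".
DATE_RANGE = re.compile(r"\b((?:19|20)\d{2})\s*(?:-|to)\s*((?:19|20)\d{2})\b")

def extract_timeline(resume_text: str) -> list:
    """Treat each line containing a date range as one employment entry."""
    entries = []
    for line in resume_text.splitlines():
        match = DATE_RANGE.search(line)
        if match:
            # Heuristic: whatever else is on the line describes the job.
            rest = (line[:match.start()] + line[match.end():]).strip(" ,:-")
            entries.append({
                "start": int(match.group(1)),
                "end": int(match.group(2)),
                "text": rest,
            })
    return entries

resume = """Jane Doe
2010 - 2014 Project Manager, Acme Ltd
2014 to 2018 Reported to the Director, Example Corp
Skills: Director, Python"""
timeline = extract_timeline(resume)
```

The second entry shows the weakness: the heuristic keeps "Reported to the Director, Example Corp" as the job text, with no way of knowing that the candidate was not the Director.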

Grammar-based parsers, by contrast, contain an enormous number of grammatical rules that seek to understand the context of occurrence of every word in the resume. These same grammars also combine words and phrases together to make complex structures that capture the meaning of every sentence in the resume. These parsers are much more complicated than keyword-based parsers, but generally capture much more detail and are also capable of distinguishing between the different meanings that one word or phrase might have in different contexts. Using grammar-based rules, it is possible to build highly accurate parsers with accuracy rates well above 90% (human accuracy is rarely greater than 96%). The downside is that they require a lot of manual encoding by skilled language engineers, and a lot of testing to make sure that improvements in one area do not degrade performance in another.
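A real grammar is far larger than anything that fits here, but the idea of hand-written, context-sensitive rules can be hinted at with a toy example. The three rules and the `interpret` function below are hypothetical; they only show how the same title can receive different readings depending on the words around it.

```python
import re

TITLE = r"(?:Project Manager|Director)"

# Hypothetical hand-written rules as (pattern, interpretation) pairs.
# The first matching rule wins, so more specific contexts come first.
RULES = [
    (re.compile(r"reported to (?:the )?" + TITLE, re.I), "supervisor_title"),
    (re.compile(r"(?:worked|employed|served) as (?:a |an |the )?" + TITLE, re.I),
     "candidate_title"),
    (re.compile(r"(?:skills|software|tools)\s*:.*\bDirector\b", re.I),
     "software_skill"),
]

def interpret(sentence: str):
    """Return the interpretation of the first rule that matches, else None."""
    for pattern, meaning in RULES:
        if pattern.search(sentence):
            return meaning
    return None

readings = [
    interpret("I reported to the Project Manager"),
    interpret("Worked as a Director at Acme"),
    interpret("Software: Flash, Director, Photoshop"),
    interpret("Director"),
]
```

The bare word "Director" matches no rule and yields `None`: without context, the toy grammar declines to guess. Every additional context needs another rule, which is where the manual encoding effort comes from.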

Statistical parsers attempt to apply numerical models of text to identify structure in resumes. Like grammar-based parsers, they can distinguish between different contexts of the same word or phrase and can also capture a wide variety of structures such as addresses, timelines, and the like. To be most accurate, they require as input a vast number of resumes that are manually marked up with all the information that is required to be extracted. Pure statistical parsers generally perform better than keyword-based parsers, but not so well as grammar-based parsers on data that the parser has not been trained on. Statistical parsers can, however, achieve very high accuracies on data on which they are trained, but this is not usually very useful since this data is by definition old data that will not be seen again.
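The gap between accuracy on training data and accuracy on unseen data can be seen even in a toy model. The sketch below uses a naive Bayes word model with add-one smoothing, chosen here only because it is short; it is an assumption for illustration and not a description of how any particular parser works. The training and unseen examples are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label from manually labelled (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Pick the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(examples, model):
    hits = sum(1 for text, label in examples if predict(text, *model) == label)
    return hits / len(examples)

# Invented examples: is "director" a job title or a software skill?
TRAINING = [
    ("worked as director of sales", "job_title"),
    ("promoted to director in 2012", "job_title"),
    ("skills include director and flash", "skill"),
    ("software director photoshop flash", "skill"),
]
UNSEEN = [
    ("director of marketing", "job_title"),
    ("flash and director", "skill"),
    ("expert in director", "skill"),
    ("promoted to senior director", "job_title"),
]

model = train(TRAINING)
seen_accuracy = accuracy(TRAINING, model)
unseen_accuracy = accuracy(UNSEEN, model)
```

On the four training examples the model is right every time, but on the unseen examples it mislabels "expert in director", because the only evidence it has for "in" comes from a job-title example.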

Daxtra's parser is a hybrid statistical-grammar parser that combines the advantages of both of the more advanced classes of parser. Large, language-independent statistical models are applied to achieve accuracy rates of around 85% (which is the best that pure statistical models can achieve), and grammar rules are applied which increase this accuracy to as high as 95%. We at Daxtra are continually improving our parser, striving to keep it as the most accurate parser in the world today.

Distinguishing between different parsers

Different parsers make different claims as to how good they are, and what they are good for. The two key measurements you should look for in a parser are (a) coverage and (b) accuracy.

Coverage - describes what a parser actually tries to extract. All parsers try to extract contact information for the candidates, and most extract skills, work histories and qualifications. Some parsers (including Daxtra's) extract referees, hobbies, candidate summary, desired salary, desired location, nationality, visa status, and various other fields. All of this information is required to create a full record for the candidate, and in general the more information a parser extracts, the better.

Accuracy - describes how good a parser is at identifying information from a resume. Accuracy measures how often the parser is actually right. For example, an accuracy of 95% on identifying names means that the parser correctly extracts the name of the candidate in 95% of all incoming resumes. This measure is important because the lower the accuracy, the more it costs you to correct the errors that the parser makes. Although the difference between 89% and 95% may not seem huge, it means 11 errors per 100 resumes instead of 5, which is more than a doubling in the rate of errors that will need to be corrected, and hence a doubling of associated costs. In general, if a parser is less than about 90% accurate, the number of errors will be too large to permit it to load data into a database without extensive human supervision.
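The arithmetic behind the 89% versus 95% comparison can be checked directly. The batch size of 1,000 resumes and the `expected_errors` helper are assumptions chosen for the example, and the count covers a single field per resume.

```python
def expected_errors(accuracy_percent: int, num_resumes: int) -> int:
    """Number of resumes on which one field is expected to be wrong."""
    return (100 - accuracy_percent) * num_resumes // 100

BATCH = 1000  # assumed batch size, for illustration only

errors_at_89 = expected_errors(89, BATCH)   # 110 corrections needed
errors_at_95 = expected_errors(95, BATCH)   # 50 corrections needed
ratio = errors_at_89 / errors_at_95         # 2.2 times as many errors
```

A six-point difference in accuracy therefore means 2.2 times as many records to correct by hand.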

Debunking some claims

Different parsers will perform with different accuracies on different sets of data, so if accurate parsing is important to you, the only way to find out which is the best parser is to test it on a sample of your data. We always invite prospective customers to test the accuracy of our parser and when they do, they usually find it to be the most accurate among all that they test.
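A test of this kind needs little more than a sample of your own resumes with hand-checked answers. The sketch below is one way to set it up, assuming each parser can be called as a function that returns a dictionary of fields; `parser_a`, `parser_b` and the sample are hypothetical stand-ins, not real products or real data.

```python
def field_accuracy(parse, sample, field):
    """Fraction of resumes for which parse() gets the given field exactly right."""
    correct = sum(
        1 for text, truth in sample if parse(text).get(field) == truth[field]
    )
    return correct / len(sample)

# Hypothetical stand-ins for two parsers under evaluation.
def parser_a(text):
    # Assumes the candidate's name is always the first line.
    return {"name": text.splitlines()[0].strip()}

def parser_b(text):
    # Assumes the name follows a "Name:" label; otherwise gives up.
    for line in text.splitlines():
        if line.lower().startswith("name:"):
            return {"name": line.split(":", 1)[1].strip()}
    return {}

# Invented sample: (resume text, hand-checked fields).
SAMPLE = [
    ("Jane Doe\nProject Manager", {"name": "Jane Doe"}),
    ("Curriculum Vitae\nName: John Smith", {"name": "John Smith"}),
    ("Name: Mary Jones\nDirector", {"name": "Mary Jones"}),
    ("Resume\nName: Ali Khan", {"name": "Ali Khan"}),
]

scores = {
    p.__name__: field_accuracy(p, SAMPLE, "name") for p in (parser_a, parser_b)
}
```

Which stand-in scores higher depends entirely on how the sample resumes are laid out, which is why the sample should be drawn from the data you actually process.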

A lot of claims are made about parsers, some of which are more true than others. Here we comment on some of the most common claims:

"We have the best/most accurate parser in the world". Although this is indeed a claim we have been known to make, you should be aware that it raises the question "on what data?". Resumes vary greatly in their format depending on who wrote them, what type of career the writer has had, their incoming document format, what language they are written in, and so on. For example, a single parser is likely to have very different accuracies for UK or Irish sourced resumes than for resumes from the US or Australia. And it will certainly have very different accuracies on resumes written in different languages. The only real way to find out whether a parser is accurate enough for you is to test it on your data, or ask someone who processes very similar data. If accuracy is your concern (and it should be), we at Daxtra are always happy to help you evaluate our parser on your data, because we have found that when people evaluate us, we are nearly always measured to be the most accurate.

"You can train our parser automatically to accurately parse any resumes irrespective of location/language/specialism". This half-truth, propagated by statistical parser companies, is often used to explain why they were evaluated as second- or third-best. The truth is that it is indeed always possible to increase accuracy somewhat by training a statistical system on new data, but in general the amount of effort needed to reach sufficient accuracy is likely to be very large. The statistical algorithms used by these parsers will have already been trained on huge amounts of data, much of which will have looked like your data, so the relevant question to ask is why, if the parser has been around for so long, doesn't it work well "out of the box"? If you send the data that you are evaluating on for "training", you should not be surprised if the parser suddenly performs well on this particular data. That is what training is all about, after all. But it is then important to re-evaluate the parser on similar but different data to see how much the training has improved accuracy on previously unseen data (which is how it will be used for real). Often you will find that the "training" has little impact on performance on unseen data. At Daxtra we continually train and improve our parser to make sure that everyone gets the most accurate parser possible out of the box. Now why don't other companies do that?