KeyError exception caused by a Corporation label returned by probablepeople

I get a KeyError exception while calculating the threshold in dedupe. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the `Person Name` FieldType.)

The following exception is raised:
```
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'
```

Either parseratorvariable does not handle the case where probablepeople returns an unexpected label, or probablepeople does not raise an error when the label doesn't correspond to the type 'person'.
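To make the failure mode concrete, here is a minimal, self-contained sketch that does not import parseratorvariable. The `variable_types` dict and both lookup helpers are hypothetical stand-ins for illustration; the real mapping inside the library may have different keys and values. It only shows why a plain dict lookup raises on an unknown label, and how a guarded lookup leaves room for a fallback.

```python
# Hypothetical stand-in for the mapping that the comparator indexes into;
# the real keys and values in parseratorvariable may differ.
variable_types = {"Person": {"offset": 0}, "Household": {"offset": 5}}


def lookup_unguarded(label):
    # Mirrors the failing line in the traceback: a plain dict lookup.
    return variable_types[label]


def lookup_guarded(label):
    # Returns None for labels the mapping does not know, so the caller
    # can fall back to a plain string comparison instead of crashing.
    return variable_types.get(label)


try:
    lookup_unguarded("Corporation")
except KeyError as error:
    print("unguarded lookup failed:", error)

print("guarded lookup returned:", lookup_guarded("Corporation"))
```

With the guarded variant, a label such as 'Corporation' can be treated like any other pair that cannot be compared part by part.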

For more, see https://github.com/datamade/probablepeople/issues/74

For anyone who wants to patch this with a workaround: I replaced the `comparator` method in `parseratorvariable/__init__` (line 54).
Add these lines wherever you use the dedupe library:
```
import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError

def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
        if self.log_file:
            import csv
            with open(self.log_file, 'a') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string.encode('utf8')])
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    if 'Ambiguous' in (variable_type_1, variable_type_2):
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 == variable_type_2:
        distances[i:3] = [0, 1]

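    # Workaround: the tagger returned a label (e.g. 'Corporation') that this
    # variable does not know about, so fall back to a plain string comparison
    # instead of raising a KeyError on the lookup below.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances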
    
    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances

parseratorvariable.ParseratorType.comparator = comparator
```
After that, you can use dedupe as usual.
