Extract a numeric attribute from partially unstructured text for each word of a vocabulary
Given a vocabulary
v = {'sales', 'units', 'parts', 'operators', 'revenue'}
and strings such as
s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
s2 = 'part 7710 (repaired), units are 1138, revenue 1212, operators variation is -14, salles increment +34 (588 total)'
I have to associate each key of v with the corresponding number from s1 and s2. For sales and operators I need the variation (the number with a leading sign), i.e. +34 and -14. For s1 the expected result is:
| key | attribute |
|---|---|
| sales | +34 |
| units | 1138 |
| parts | 7710 |
| operators | -14 |
| revenue | none |
For s2 the table is the same, except that revenue is 1212 instead of none.
Notice that:
- There is some structure in the text data: each string contains commas `,` dividing it into parts, and each part contains one word of the vocabulary and one number (two in the case of `sales` and `operators`).
- Keys in the strings may be misspelled since they are typed manually: in `s1` there is `operator` instead of `operators`; in `s2` there are `part` instead of `parts` and `salles` instead of `sales`.
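The two observations above suggest a simple rule-based baseline: split on commas, fuzzy-match one word of each part against the vocabulary, then pick the number. The following is only a minimal sketch under those assumptions, not the actual script mentioned in this question; the function name `extract`, the `SIGNED_KEYS` set and the `cutoff=0.8` similarity threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import get_close_matches

VOCAB = ['sales', 'units', 'parts', 'operators', 'revenue']
# Assumption: for these keys the wanted attribute is the signed variation.
SIGNED_KEYS = {'sales', 'operators'}

S1 = ('total of 1138 units, repaired 7710 parts, '
      'sales increased 588 (+34), decreasing of operator 413 (-14)')
S2 = ('part 7710 (repaired), units are 1138, revenue 1212, '
      'operators variation is -14, salles increment +34 (588 total)')


def extract(text, vocab=VOCAB, cutoff=0.8):
    """Return {key: number as string, or None} for every key in vocab."""
    result = {key: None for key in vocab}
    for chunk in text.split(','):
        # Find the first word that is close enough to a vocabulary key.
        key = None
        for word in re.findall(r'[A-Za-z]+', chunk.lower()):
            match = get_close_matches(word, vocab, n=1, cutoff=cutoff)
            if match:
                key = match[0]
                break
        if key is None:
            continue
        if key in SIGNED_KEYS:
            # Numbers with an explicit leading sign, e.g. +34 or -14.
            numbers = re.findall(r'[+-]\d+', chunk)
        else:
            # Unsigned numbers only (not preceded by a sign or a digit).
            numbers = re.findall(r'(?<![-+\d])\d+', chunk)
        if numbers:
            result[key] = numbers[0]
    return result


print(extract(S1))
print(extract(S2))
```

With the two example strings this yields the tables shown earlier (`revenue` is `None` for `S1` and `'1212'` for `S2`). The fuzzy matching from `difflib` is what absorbs `operator`, `part` and `salles`; a stricter or looser `cutoff` trades missed keys against false matches.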
I wrote a simple Python script (mainly using regex) that does the job in most cases, and now I'd like to try a machine learning algorithm, both to learn how it works and to compare the results. I have many manually labelled strings (i.e. string + table) which I could use to train a neural network, but since I'm a novice I don't know where to start.
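To make the labelled data concrete, here is a sketch of how one (string, table) pair could be turned into per-token labels. It assumes a token-classification framing (the setup used for NER), which is only one possible way to pose the problem; the helper `token_labels` and the `'O'` tag for "no key" are hypothetical names for illustration.

```python
import re


def token_labels(text, table):
    """Tag each token with the key whose labelled value it equals, else 'O'."""
    value_to_key = {value: key for key, value in table.items()
                    if value is not None}
    # Tokens are optionally signed integers or alphabetic words.
    tokens = re.findall(r'[+-]?\d+|[A-Za-z]+', text)
    return [(tok, value_to_key.get(tok, 'O')) for tok in tokens]


example_text = ('total of 1138 units, repaired 7710 parts, '
                'sales increased 588 (+34), decreasing of operator 413 (-14)')
example_table = {'sales': '+34', 'units': '1138', 'parts': '7710',
                 'operators': '-14', 'revenue': None}

labels = token_labels(example_text, example_table)
for token, tag in labels:
    print(token, tag)
```

Each number that appears in the table gets its key as the tag, and every other token gets `'O'`. Matching by value is ambiguous if the same number occurs twice in a string, so real labelled data would need a check for that case.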
Which model is most suitable for this task? NER? BERT? I searched for examples on the Keras site, here and on Google to see if somebody had already tackled this kind of task, but didn't find any, and I think it's because I don't know which terms to use in the search query. I tried queries like "data mining unstructured data", "example of supervised NER" and "keras example text extraction", but the results were all rather vague.
Is this a multiclass classification problem? Text mining? Feature generation? I think it is not NLP, since the algorithm doesn't have to learn the meaning of a sentence.
Topic text-mining neural-network
Category Data Science