Miscellaneous

Probability of duplicates in UUIDs

Randomly generated UUIDs have 122 random bits. Out of a total of 128 bits, four bits are used for the version ('Randomly generated UUID') and two bits for the variant ('Leach-Salz'). With random UUIDs, the chance of two having the same value can be calculated using probability theory (the birthday paradox). Using the approximation

$$p(n) \approx 1 - e^{-n^2 / (2x)}$$

these are the probabilities of an accidental clash after generating n UUIDs, with x = 2^122:

n                                probability
68,719,476,736 = 2^36            0.0000000000000004 (4 × 10^−16)
2,199,023,255,552 = 2^41         0.0000000000004 (4 × 10^−13)
70,368,744,177,664 = 2^46        0.0000000004 (4 × 10^−10)


To put these numbers into perspective, the annual risk of someone being hit by a meteorite is estimated to be one chance in 17 billion,[37] which means the probability is about 0.00000000006 (6 × 10^−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years would the probability of creating just one duplicate be about 50%. Equivalently, the probability of one duplicate would be about 50% if every person on Earth owned 600 million UUIDs.

However, these probabilities only hold when the UUIDs are generated using sufficient entropy. Otherwise, the probability of duplicates could be significantly higher, since the statistical dispersion might be lower.
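
The figures above can be checked with a few lines of Python. This is a minimal sketch that assumes the usual birthday-problem approximation p(n) ≈ 1 − exp(−n² / (2x)) with x = 2^122; the function names `collision_probability` and `count_for_probability` are made up for this note and are not part of any library.

```python
import math

RANDOM_BITS = 122          # random bits in a randomly generated UUID
X = 2 ** RANDOM_BITS       # number of distinct random UUIDs


def collision_probability(n: int, x: int = X) -> float:
    """Approximate chance of at least one duplicate among n random values."""
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is tiny.
    return -math.expm1(-(n * n) / (2 * x))


def count_for_probability(p: float, x: int = X) -> float:
    """Invert the approximation: how many values give collision chance p."""
    return math.sqrt(2 * x * math.log(1 / (1 - p)))


if __name__ == "__main__":
    for exponent in (36, 41, 46):
        p = collision_probability(2 ** exponent)
        print(f"n = 2^{exponent}: p ~ {p:.1e}")
    print(f"n for p = 0.5: {count_for_probability(0.5):.2e}")
```

The 50% threshold comes out at roughly 2.7 × 10^18 UUIDs, which is the same order of magnitude as 1 billion UUIDs per second sustained for about a century.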

Proper name parsing



http://www.nltk.org/

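from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from nltk.tree import Tree

sentence = "Michael Jackson likes to eat at McDonalds"  # example input

tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(sentence)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)

# Join the words of each named-entity subtree into a single string
nes = [' '.join(word for word, tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, Tree)]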

or

import nltk

def get_entities(qry="who is Mahatma Gandhi"):
    tokens = nltk.tokenize.word_tokenize(qry)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    print(sentt)
    person = []
    # NLTK 3 uses t.label(); older releases exposed this as t.node
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf)
    return person

or

I don't think you need WordNet to find proper nouns; I suggest using the part-of-speech tagger pos_tag.

To find Proper Nouns, look for the NNP tag:

from nltk.tag import pos_tag

sentence = "Michael Jackson likes to eat at McDonalds"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'), ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]

propernouns = [word for word,pos in tagged_sent if pos == 'NNP']
# ['Michael','Jackson', 'McDonalds']

You may not be very satisfied, since Michael and Jackson are split into 2 tokens; in that case you might need something more complex, such as a named-entity tagger.
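
Short of a full named-entity tagger, one cheap workaround for the split tokens is to join each run of adjacent NNP tags. The sketch below is only a heuristic, not named-entity recognition: it reuses the tagged output shown in the comment above, so it runs without NLTK, and `merge_proper_nouns` is a hypothetical helper name.

```python
from itertools import groupby

# Tagged output copied from the pos_tag example, so NLTK is not needed here.
tagged_sent = [('Michael', 'NNP'), ('Jackson', 'NNP'), ('likes', 'VBZ'),
               ('to', 'TO'), ('eat', 'VB'), ('at', 'IN'), ('McDonalds', 'NNP')]


def merge_proper_nouns(tagged):
    """Join each run of adjacent NNP tokens into one multi-word name."""
    names = []
    for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            names.append(' '.join(word for word, _ in run))
    return names


print(merge_proper_nouns(tagged_sent))
# ['Michael Jackson', 'McDonalds']
```

This will wrongly glue together two different names that happen to sit next to each other, which is where a real named-entity tagger earns its keep.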

In principle, as documented in the Penn Treebank tagset (http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html), you can find possessive nouns by simply looking for the POS tag. In practice, though, the tagger often doesn't emit POS when the possessive is attached to an NNP.

To find Possessive Nouns, look for str.endswith("'s") or str.endswith("s'"):

from nltk.tag import pos_tag

sentence = "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
tagged_sent = pos_tag(sentence.split())
# [('Michael', 'NNP'), ('Jackson', 'NNP'), ('took', 'VBD'), ('Daniel', 'NNP'), ("Jackson's", 'NNP'), ('hamburger', 'NN'), ('and', 'CC'), ("Agnes'", 'NNP'), ('fries', 'NNS')]

# Iterate over the words (sentence.split()), not over the characters of the string
possessives = [word for word in sentence.split() if word.endswith("'s") or word.endswith("s'")]
# ["Jackson's", "Agnes'"]
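
If the owner's name is wanted rather than the possessive form, the suffix can be stripped in the same pass. This is a rough sketch using plain string checks only; `split_possessive` is a hypothetical helper, and the `'s` test will also match contractions such as "it's".

```python
def split_possessive(word):
    """Return (base, True) if word ends in a possessive suffix, else (word, False)."""
    if word.endswith("'s"):
        return word[:-2], True      # Jackson's -> Jackson
    if word.endswith("s'"):
        return word[:-1], True      # Agnes' -> Agnes
    return word, False


sentence = "Michael Jackson took Daniel Jackson's hamburger and Agnes' fries"
owners = [base for base, is_poss in map(split_possessive, sentence.split()) if is_poss]
print(owners)
# ['Jackson', 'Agnes']
```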

Alternatively, you can use NLTK's ne_chunk, but it doesn't seem to add much unless you care about what kind of proper noun you get from the sentence:

>>> from itertools import chain
>>> from nltk.tree import Tree; from nltk.chunk import ne_chunk
>>> [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), Tree('PERSON', [('Daniel', 'NNP')])]
>>> [i[0] for i in list(chain(*[chunk.leaves() for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]))]
['Michael', 'Jackson', 'Daniel']

Using ne_chunk is a little verbose and it doesn't get you the possessives.

GraphViz

Dot and GraphViz are great at automated graphing but not so good for manual graphs.
You can edit the file in a decent GUI, for example DotEditor-0.3.1-linux.
You can add drop shadows with XSLT post-processing:

dot -Tsvg hello.dot >hello.svg
xsltproc notugly.xsl hello.svg >hello-notugly.svg

You can include images and reference SVGs for the shape of a node, and you can get halfway good-looking diagrams, but it still feels very constraining when trying to draw up something one-off.

http://www.graphviz.org/content/attrs#dimage
http://www.graphviz.org/content/node-shapes
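
As a rough illustration of the image-in-a-node idea, the sketch below writes a small hello.dot that the dot command shown earlier could render. It assumes only the Graphviz `image` and `shape=none` node attributes; `logo.png` is a placeholder file name and `write_dot` is a hypothetical helper, not part of Graphviz.

```python
from pathlib import Path

# logo.png is a placeholder: point it at any image that exists on disk.
DOT_SOURCE = """\
digraph hello {
    node [fontname="Helvetica"];
    logo [label="", shape=none, image="logo.png"];
    hello -> world;
    world -> logo;
}
"""


def write_dot(path="hello.dot", source=DOT_SOURCE):
    """Write the DOT text to disk and return the path written."""
    target = Path(path)
    target.write_text(source, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_dot())
```

Generating the file from a script at least keeps the one-off tweaks (labels, image paths) in one place, even if the layout itself stays in Graphviz's hands.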

https://en.wikipedia.org/wiki/Diagramming_software