@stormwatch
Last active June 13, 2020 08:55
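-module(index).
-export([create/1]).

%% Read the file Name and return a map from each upper-cased word to the
%% line numbers on which it occurs, most recent line first.
create(Name) ->
    {ok, File} = file:open(Name, [read]),
    create(File, 1, #{}).

create(File, Line_number, Map) ->
    case io:get_line(File, "") of
        eof ->
            file:close(File),
            Map;
        Line ->
            %% re:split/3 can return empty binaries between adjacent
            %% separators (e.g. ", "), so drop them before indexing.
            Words = [W || W <- re:split(Line, "\s *|[[:^alnum:]]", [trim]),
                          W =/= <<>>],
            create(File, Line_number + 1,
                   lists:foldl(
                     fun(Word, New_map) ->
                             maps:update_with(
                               string:uppercase(Word),
                               fun(Lines) -> [Line_number | Lines] end,
                               [Line_number],
                               New_map)
                     end,
                     Map,
                     Words))
    end.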

stormwatch commented May 31, 2020

Please excuse the delay. I couldn't get to work on the assignment during the last week. As it stands, this is a minimum viable product (“make it work”) with a lot of room for improvement.

I renamed get_file_contents/1 and get_all_lines/2 to create/1 and create/3 and modified them in order to read the file and compute the index in one go.

Questions

Is it possible to replace the foldl with a foreach in line 14?
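
For reference, a minimal sketch of the difference between the two calls: `lists:foreach/2` is meant for side effects, discards whatever the fun returns and always evaluates to `ok`, so it has no way to hand the updated map from one word to the next, while `lists:foldl/3` threads an accumulator. The module and function names (`fold_demo`, `add_words/3`, `foreach_result/1`) are made up for illustration.

```erlang
-module(fold_demo).
-export([add_words/3, foreach_result/1]).

%% Thread the map through every word, the way the fold in create/3 does.
add_words(Words, Line_number, Map) ->
    lists:foldl(
      fun(Word, Acc) ->
              maps:update_with(Word,
                               fun(Lines) -> [Line_number | Lines] end,
                               [Line_number],
                               Acc)
      end,
      Map,
      Words).

%% foreach throws away the maps built inside the fun; the call itself
%% evaluates to the atom ok.
foreach_result(Words) ->
    lists:foreach(fun(Word) -> #{Word => [1]} end, Words).
```

So a foreach version would probably need some mutable store (an ETS table or the process dictionary) to keep the index in, which seems like a step away from the functional style of the rest of the module.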

Some quick notes on the work remaining ahead:

TODO

Removing

all short words (e.g. words of length less than 3)

modify the regexp string to also filter short words.

common words (you’ll have to think about how to define these).

provide means to define an arbitrary collection of words to ignore when indexing
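
Both removals could also be done after splitting rather than inside the regexp. The following is only a sketch under that assumption: `index_filter`, `keep/3`, the minimum length and the ignore list are hypothetical names and parameters, and the words are assumed to be the upper-cased binaries that the index already uses as keys.

```erlang
-module(index_filter).
-export([keep/3]).

%% Words:      list of binaries, e.g. upper-cased results of re:split/3.
%% Min_length: shortest word to keep (hypothetical option, e.g. 3).
%% Ignore:     list of binaries to leave out (hypothetical stop-word list).
keep(Words, Min_length, Ignore) ->
    Ignored = sets:from_list(Ignore),
    [W || W <- Words,
          string:length(W) >= Min_length,
          not sets:is_element(W, Ignored)].
```

For example, `index_filter:keep([<<"AN">>, <<"THE">>, <<"ERLANG">>], 3, [<<"THE">>])` keeps only `<<"ERLANG">>`. Filtering after the split keeps the regexp simple, at the cost of one extra pass over each line's words.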

Sorting the output so that the words occur in lexicographic order

Having the mapping between characters and integers already provided will come in handy; but further normalization will be required if I take accents, other diacritical marks, or even related alphabets into account.
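
A minimal sketch of the sorting step, assuming the index is the map built by create/1 with upper-cased binaries as keys: Erlang compares binaries byte by byte, so sorting on the key gives lexicographic order for plain ASCII words. The names `index_sort` and `sorted/1` are hypothetical, and nothing here handles accents or other alphabets.

```erlang
-module(index_sort).
-export([sorted/1]).

%% Turn the index map into a list of {Word, Lines} pairs ordered by Word.
%% lists:usort/1 puts each word's line numbers in ascending order and
%% drops repeats (a word occurring twice on the same line).
sorted(Index) ->
    lists:keysort(1, [{Word, lists:usort(Lines)}
                      || {Word, Lines} <- maps:to_list(Index)]).
```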

Normalising the words so that capitalised ("Foo") and non capitalised versions ("foo") of a word are identified.

I am already uppercasing everything so you might consider this done.

Normalising so that common endings, plurals etc. are identified.

This would be nice! I'd have to read about the rules for English plurals. We'd need to be able to distinguish, for example, “kiss” and “dress” as singular words and not as plurals of “kis” and “dres”.
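
As a starting point, here is a deliberately naive sketch of one such rule: strip a single trailing "S" unless the word ends in "SS". It assumes upper-cased ASCII binaries, and the names `index_plural` and `singular/1` are hypothetical. It gets "KISS" and "DRESS" right but is wrong for many other words (for instance "THIS" or "BOXES"), so real rules would need more cases or a dictionary.

```erlang
-module(index_plural).
-export([singular/1]).

%% Strip one trailing "S" from words longer than three bytes, unless the
%% letter before it is also "S". Anything else is returned unchanged.
singular(Word) when byte_size(Word) > 3 ->
    Stem_size = byte_size(Word) - 1,
    case Word of
        <<Stem:Stem_size/binary, $S>> ->
            case binary:last(Stem) of
                $S -> Word;   % "KISS", "DRESS": treated as singular
                _  -> Stem    % "WORDS" becomes "WORD"
            end;
        _ ->
            Word
    end;
singular(Word) ->
    Word.
```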

(Harder) Thinking how you could make the data representation more efficient than the one you first chose. This might be efficient for lookup only, or for both creation and lookup.

This is starting to look more and more like a compression problem.
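
One representation that fits the compression framing is to store runs of consecutive line numbers as `{First, Last}` ranges instead of listing every line. This is only a sketch of that idea, not a measured improvement; `index_ranges` and `ranges/1` are hypothetical names.

```erlang
-module(index_ranges).
-export([ranges/1]).

%% Collapse line numbers into {First, Last} ranges,
%% so that [1,2,3,5] becomes [{1,3},{5,5}].
ranges(Lines) ->
    lists:reverse(lists:foldl(fun add/2, [], lists:usort(Lines))).

%% Extend the newest range when N directly follows it, else open a new one.
add(N, [{First, Last} | Rest]) when N =:= Last + 1 ->
    [{First, N} | Rest];
add(N, Ranges) ->
    [{N, N} | Ranges].
```

A word that appears on every line of a long passage then costs one tuple rather than one list element per line, while a word whose occurrences are all isolated costs slightly more than before.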

Can you think of other ways that you might extend your solution?

  • Refactor to make it more modular and configurable. E.g. accept config options like [{ignore, my_dictionary}, {min_word, 3}, distinguish_plurals, ignore_case, {order, order_fun}], etc.
  • Make it work with multiline strings, not only files.
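
A rough sketch of how the two extensions might combine, assuming ASCII or Latin-1 input: the text is split into lines with `string:split/3`, and an option list is read with `proplists:get_value/3`. Only the `{min_word, N}` option is honoured here; the module name `index_string` and the shape of the options are assumptions, not a settled design.

```erlang
-module(index_string).
-export([create/2]).

%% Index a multiline string (or binary) instead of a file.
%% Options is a hypothetical proplist; only {min_word, N} is read.
create(Text, Options) ->
    Min = proplists:get_value(min_word, Options, 1),
    Lines = string:split(Text, "\n", all),
    {_, Index} = lists:foldl(
                   fun(Line, {N, Map}) ->
                           {N + 1, add_line(Line, N, Min, Map)}
                   end,
                   {1, #{}},
                   Lines),
    Index.

%% Add every sufficiently long word of one line to the map.
add_line(Line, N, Min, Map) ->
    Words = [string:uppercase(W)
             || W <- re:split(Line, "[^[:alnum:]]+"),
                W =/= <<>>,
                byte_size(W) >= Min],
    lists:foldl(
      fun(Word, Acc) ->
              maps:update_with(Word, fun(Ls) -> [N | Ls] end, [N], Acc)
      end,
      Map,
      Words).
```

With this, `index_string:create("foo bar\nFoo, an", [{min_word, 3}])` maps `<<"FOO">>` to `[2,1]` and `<<"BAR">>` to `[1]`, and leaves out "an".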
