Discreet Cosine Transform

A few thoughts on ruby, video and other things

Python, Ruby and Dart Part 3: CSV Data

Next I thought I would tackle parsing CSV data in all three languages. What could be more exciting, right? Once again, this was born out of actual need - I was recently crunching some CSV data at work. But I like it as an example (despite both the boring subject matter and the “just look in the standard library” nature of the question) exactly because it’s very real-world. I envy the developer who has never been called on to write ETL code, but I bet a lot of you have. It is the kind of annoying task that comes up again and again, at least in my world!

Ruby

So admittedly in Ruby, this is as easy as reaching into the standard library. Way, way back in the day there were gems that offered more features and faster parsing for CSVs than the code in the stdlib, but the Ruby maintainers smartly just integrated that code directly into std.

The documentation is straightforward, and you can see the functionality is quite versatile: it allows reading and writing, from files, from file-like IO objects, and from strings.

Perhaps most importantly, it correctly handles the first and most troublesome issue you always run into with CSV data - some field contains a comma in the data, rather than as markup, and your parsing trips on it. For example:

require 'csv'
list_data = %Q["red","blue","green"\r\n"cyan","blue","magenta, purple"\r\n"1","2","3"]
CSV.parse(list_data) do |row|
  puts row.inspect
end

Will output:

["red", "blue", "green"]
["cyan", "blue", "magenta, purple"]
["1", "2", "3"]

Note how the string "magenta, purple" remains a single string and doesn’t get parsed into a row with 4 fields. Also note we threw it Windows-style line endings and it correctly dealt with that without us having to change the line termination field.

Python

It is very similar in Python: you can just reach into the stdlib to parse CSV data. At first glance the Python library is a bit more feature-rich than the Ruby one, offering things like sniffing out the format of the CSV file and reading directly into a dictionary instead of just arrays.
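As a rough illustration of the dictionary-reading feature, here is a minimal sketch using csv.DictReader. The sample rows and the column names name and color are made up for this example, and the print call assumes Python 3 syntax; csv.DictReader itself accepts any iterable of strings and takes its keys from the first row.

```python
import csv

# Hypothetical sample data: a header row followed by two records.
lines = [
    '"name","color"',
    '"sky","blue"',
    '"mix","magenta, purple"',
]

# DictReader uses the first row as the keys for every following row.
rows = list(csv.DictReader(lines))

for row in rows:
    print(row["name"], "->", row["color"])
```

Each row comes back keyed by column name, and the embedded comma in the last field is still handled correctly.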

Where I got a little stumped, though, is that the 2.7.9 version of the library doesn’t support operating directly on strings. They give an example of how to achieve this by wrapping the string in a one-item list, but this doesn’t seem to work with line endings embedded in the string. So you have to split the string into lines first, unlike in Ruby, then parse each line you find:

import csv

list_data = '"red","blue","green"\r\n"cyan","blue","magenta, purple"\r\n"1","2","3"'

for line in list_data.splitlines():
    for row in csv.reader([line]):
        print row

Once you get through that, though, you once again get the correct data; that is, magenta, purple comes out right. Of course you wouldn’t need such gymnastics if you really were reading from a file, and like Ruby, the library also supports parsing one line at a time instead of having to read all the data into memory first.
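One way to skip the manual line splitting is to wrap the string in a file-like object. This is only a sketch and it assumes Python 3 rather than the 2.7.9 used above: io.StringIO is part of the standard library there, and passing newline="" keeps the \r\n endings intact for the csv module to deal with.

```python
import csv
import io

list_data = '"red","blue","green"\r\n"cyan","blue","magenta, purple"\r\n"1","2","3"'

# newline="" leaves the \r\n endings untouched so the csv module handles them itself.
buffer = io.StringIO(list_data, newline="")
rows = list(csv.reader(buffer))

for row in rows:
    print(row)
```

Because the reader sees the data as one stream rather than as pre-split lines, this form should also cope with quoted fields that contain line breaks, which the split-first approach cannot.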

Dart

Trying this in Dart is an interesting look at the maturity of the community surrounding Dart. Dart doesn’t have a CSV parser in its standard library. That is not unexpected, as I keep going back to, given its client-side focus. So, we turn to pub.dartlang.org which is Dart’s packaging and publishing system.

There are a few options for CSV parsing, so this part of my trial and research really became a “do they work?” review. Note that with pub.dartlang.org, you don’t have the tools you do in Ruby or Python to gauge the maturity of a library, such as download counts or a site like ruby-toolbox.

Several of the libraries I tried did indeed work, but you have to watch out for the output of print fooling you into thinking that it failed the test on magenta, purple.

Here is an example using csv:

import 'package:csv/csv.dart';

void main() {
  final String listData = '"red","blue","green"\r\n"cyan","blue","magenta, purple"\r\n"1","2","3"';
  final decoder = new CsvToListConverter();
  print(decoder.convert(listData)); //Note here the toString on the output makes it look like the test failed, but:
  print(decoder.convert(listData)[1][2]); //shows it actually is a discrete value of 'magenta, purple'
}

This will output:

[[red, blue, green], [cyan, blue, magenta, purple], [1, 2, 3]]
magenta, purple

Here is a complete example using csv_sheet:

import 'package:csv_sheet/csv_sheet.dart';

void main() {
  final String listData = '"red","blue","green"\r\n"cyan","blue","magenta, purple"\r\n"1","2","3"';

  var sheet = new CsvSheet(listData);
  sheet.forEachRow((row) {
      print(row); //the toString method here makes it look like the magenta, purple test failed, but:
      print ("Third item is: " + row[3]);

  });
}

This will output:

[red, blue, green]
Third item is: green
[cyan, blue, magenta, purple]
Third item is: magenta, purple
[1, 2, 3]
Third item is: 3

Note though it would appear this library has no way to discover the length of a row, so you would have to already know that information in your code. That seems like a shortcoming.

Conclusion

All three languages have options to help you parse CSV data - if they didn’t in this day and age, I guess we would be a little worried. Ruby and Python obviously have some maturity in this area that Dart lacks, but that doesn’t mean you don’t have options in Dart that work well. We can also safely conclude that parsing CSV data is a terrible use of your time and skills, and here is hoping you don’t have to do it often!

Python, Ruby and Dart Part 2: Find All Subclasses

Continuing on from my previous post, I have found it useful sometimes to be able to programmatically discover all the classes that inherit from a specific class. I have used this as a light form of Inversion of Control where I autodiscover all the objects that implement a specific interface. It’s particularly useful if you don’t care what order you then call each object in: for instance running a bunch of diagnostic tests on a system (and it doesn’t matter the order the tests are run in), or loading up a bunch of filters for log lines that can be tested against a line in any order.

Ruby

This is easily done in Ruby with the ObjectSpace module. The example below finds every class that inherits directly from a class named Job, then instantiates an object from it and adds it to an array:

all_jobs = []

ObjectSpace.each_object(Class) do |possible_job|
  if possible_job.superclass == Job
    all_jobs << possible_job.new
  end
end

A Note on Namespaces

If the classes you are working with are inside modules, when using modules as namespaces in Ruby, you would indeed need to look for them via the Module::Class syntax (which is obvious, or else what good would the namespace do you, but I felt the need to mention it here since Python also has some particulars when it comes to namespacing).

A Note on Later Generations

This example would not find generations beyond the first. You could fairly easily adapt a recursive form of it that would. For my example use cases above, that is just not important.

Python

I am using the following form in Python, but I am not 100% sure yet this is what everyone would agree is the preferred form. I gather that some built-in methods like __subclasses__ are sometimes not preferred over other forms added to the language later.

Anyway, this works, basically the same example as above:

all_jobs = []

for cls in globals()['Job'].__subclasses__():
    all_jobs.append(cls())

However, depending on whether you are doing this inside or outside of a module, you might want a different lookup function than globals(). There are a few options that look like they differ based on your current scope and intent.
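As with the Ruby version, __subclasses__() only returns the first generation. A recursive form is short enough to sketch here. The helper name all_subclasses and the CheckSsd class are made up for this illustration, and the class definitions are included only so the snippet runs on its own.

```python
class Job(object):
    pass


class CheckRam(Job):
    pass


class CheckDrive(Job):
    pass


class CheckSsd(CheckDrive):  # hypothetical second-generation subclass
    pass


def all_subclasses(cls):
    """Return every descendant of cls by walking __subclasses__() recursively."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


all_jobs = [cls() for cls in all_subclasses(Job)]
print([type(job).__name__ for job in all_jobs])
```

With multiple inheritance the same class could be reached by more than one path, so a real version might want to collect into a set instead of a list.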

Dart

Dart has a complete reflection API in dart:mirrors, and after a little reading about it (and some help from Stack Overflow of course, though it looks like a few parts of this answer have since changed in Dart), I was able to piece together the code below. Note this example is a bit more complete (it shows the class declarations) because I wanted something you could actually run in the dart interpreter (again, with no CLI REPL, it’s a little harder to just try things out).

import 'dart:mirrors';

class Job{}
class CheckRam extends Job{}
class CheckDrive extends Job{}

main() {
  final ms = currentMirrorSystem();
  var all_jobs = [];

  ms.isolate.rootLibrary.declarations.forEach((symbol, declarationMirror) {
      if (declarationMirror is ClassMirror) {
          final parentClassName = MirrorSystem.getName(declarationMirror.superclass.simpleName);
          if (parentClassName == 'Job') {
              all_jobs.add(declarationMirror.newInstance(const Symbol(''), []).reflectee);
          }
      }
  });

  print(all_jobs);


}

There was quite a bit I was not familiar with myself here. One was working with Dart’s Symbol class, which is familiar from Ruby but unfortunately is implemented as a class rather than a type so you need some extra syntax to work with them. Another was that newInstance returns an ObjectMirror, not the object itself (though that is solved with the reflectee property).

This example will look in the root library, as the Ruby and Python examples above do. If you intend to search only a specific library, it would take some modification to do so.

Getting Things Done in Python, Ruby or Dart

There are many, many posts out there about how to accomplish similar steps in Python or Ruby. However, I have found that few cover the exact topics I was looking for. There is a certain set of small routines I find myself doing very often in Ruby (the language I know best), and I wanted to practice and commit to muscle-memory how to do those in Python (which I have been learning for my current job).

I have also been (trying to) learn Dart, mostly for client-side script (you can also use it server-side). Admittedly, when working with it for client-side, not every use case I am thinking of applies. But, I found working with Python this way has really helped me pick up the syntax and common practices that differ between the two languages, so I am hoping there are enough that are relevant to Dart to help there too.

First One - Single-line Assignment Plus Test

Ruby

Ruby supports assignment and test in a single line, like:

if test = true || false
  puts test #prints true
end

puts test #still prints true, since if does not open a new scope

Note that the variable test is not scoped inside the if block: Ruby’s if does not create a new scope, so test remains set after the end.

That can still be useful for the compact syntax. Another similar style, though not exactly assignment plus test, is Ruby’s =~ comparison operator. It is not the same as assignment plus test because nothing is assigned to a variable of your own - the match data is read back through last_match, a class method that is available even once you leave the block that started with the original comparison. But it looks and feels much the same, and to me has the same elegance in terms of syntax and readability:

some_string = "Matches this expression"

if some_string =~ /matches this (\w+)/i
  puts Regexp.last_match(1)
end

Python

This is for Python 2.7.8, the version I have been learning. I understand much is different and improved in Python 3, so it’s very possible the following answer is not applicable there!

Python does not have single line assignment and test. There are some alternatives but none are very readable or elegant in my opinion (none that I have seen yet).

Specifically for the regular expression use case, the core Python regex methods return either a match object or None, so you can use an if statement to move on only if a match was made. That makes your code look like:

import re

some_string = "Matches this expression"

match = re.search(r"matches this (\w+)",some_string,re.I)

if match:
    print match.group(1)

And while again there is no special scope control here, from a garbage collection standpoint it doesn’t seem bad: if the regex matched nothing, your variable is only None rather than a more complex (but useless) match object.
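For what it is worth, this is one place where later Python versions differ from the 2.7.8 covered here: Python 3.8 added assignment expressions (the := operator), which allow assignment and test in one line. A minimal sketch, assuming Python 3.8 or newer; the variable name captured is only there for illustration:

```python
import re

some_string = "Matches this expression"

# := assigns the match object to `match` and tests it in the same expression.
if match := re.search(r"matches this (\w+)", some_string, re.I):
    captured = match.group(1)
    print(captured)
```

As with the two-line form, match is not scoped to the if block and is still visible afterwards.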

Dart

The first thing we have to talk about with Dart is that there is no cli REPL. This makes some sense, again if you think of Dart as mostly a client-side, in-browser language. Of course, you can write server-side Dart so a REPL for that use-case makes sense, and it seems like there is some vote for that within the Dart community.

There is a REPL built into Dartium, but to invoke it you need to load a page that includes Dart code, and then use that REPL in the context of that code (it has access to only the libraries imported on that page, for instance).

So perhaps the easiest way to try simple examples like this is to make a very small cli script and execute it each time with the dart interpreter.

I also tried all this against Dart 1.7.2, so like above, I am not sure if it is exactly the same in other versions.

So now to single-line assignment and test - looks like Dart doesn’t support it either (I am the least sure of this of the three though, as Dart seems new enough and me new enough to it that maybe I am just missing some way to twist the syntax into this behavior).

Similar to Python, your assignment then test code ends up like:

var test = true || false;
if(test) {
  print(test);
}

Dart has a RegExp library as part of dart:core (so you don’t need to import anything to get to it). The behavior of methods like firstMatch is to return null if no match is made. However, unlike Python, Dart is extremely strict about truthiness, so you need to be explicit in your check for the match:

RegExp exp = new RegExp(r"matches this (\w+)", caseSensitive: false);

Match match = exp.firstMatch("Matches this expression");

if(match != null) {
  print(match.group(1));
}

Conclusion

Ruby’s assignment-and-test-like syntax for regular expression testing, and then using parts of the match, is very nice. Python and Dart don’t have similar constructs, but the total complexity to execute a possible match, test results, then work with the match is still fairly low. Also, while Ruby supports true assignment-and-test, Python and Dart do not. Again, not the end of the world, just good to know when hopping between the languages.

Mounting a GlusterFS Mountpoint on Bootup in Ubuntu 14.04

I ran into an issue where despite the presence of the _netdev option in my fstab file, my GlusterFS mount point was failing to mount on system start.

There is a log in /var/log/glusterfs/ that pretty clearly showed it was trying to start before networking had started, as it kept failing on name resolution for whatever node I used in /etc/fstab.

Poking around a bit I found someone with the same problem on Server Fault. He had discovered a further log, in /var/log/upstart/ that showed an error message generated from the upstart script /etc/init/mounting-glusterfs.conf.

A little more reading showed that mounting-glusterfs.conf relies on another upstart job, /etc/init/wait-for-state.conf. This was the last clue I needed - as wait-for-state.conf expects a job name passed in as WAIT_FOR, not an event as mounting-glusterfs.conf tries to use.

So I changed WAIT_FOR in mounting-glusterfs.conf to networking instead of static-network-up and that resolved the issue.

Perhaps the wait-for-state.conf script has an updated version, as it seems some people are saying that mounting-glusterfs.conf works on other distros. So that might be a solution as well. But the change above is working for me on Ubuntu 14.04.

New Home for the Streamingmedia.com Advanced List

Streamingmedia.com decided to no longer continue the Advanced listserv they have maintained for the last decade. However, the members of the list decided to keep that community going. Come join if you are interested in discussing online video delivery and other topics related to video streaming!

Riak-client and :symbolize_keys

Also in the vein of “maybe this saves you a little time”, be careful with riak-client version 1.4.3 or earlier when setting

:symbolize_keys => true

in MultiJson. If you set it for all load operations, you will break some of riak-client’s methods such as #keys, #buckets and even map-reduce.

For anyone of the “you found it, you fix it” mindset out there: I did report the issue, but admittedly have not created a pull request with a fix yet. I agree, shame on me!

Dartium’s Fineprint

I am just starting out with Dart and ran into a little hiccup using Dartium that took me longer than it should have to realize what was going on.

Buried on the download page is this line:

The Dartium binary expires after 12 weeks.

What that line doesn’t say is that Chromium will still launch; it will just silently stop executing Dart scripts.

To fix this, download the latest version of Dartium from the download link above.

Understanding Riak-client’s WalkSpec

I had a little trouble initially grokking riak-client’s options for link walking. There is a wiki page on it on their GitHub and some API docs on it, but I still couldn’t get my mind around the options. Here is a mini walkthrough that might be helpful if you are having the same problem.

Setup

Fire up IRB or your REPL of choice and load riak-client, then connect to your local riak server or wherever you want:

require 'riak'
client = Riak::Client.new(:protocol => "pbc", :http_backend => :Excon)

I created four objects to toy with, a through d. Note that, at least when using protocol buffers, riak-client doesn’t let you leave data blank, so I set it to the simplest JSON doc I could. All we need is the Riak links:

bucket = client.bucket('test')
a = bucket.get_or_new('a.json')
b = bucket.get_or_new('b.json')
c = bucket.get_or_new('c.json')
d = bucket.get_or_new('d.json')

a.data = '{}'
b.data = '{}'
c.data = '{}'
d.data = '{}'

Then I chained them together with some simple links. This part is pretty easy, thanks to the #to_link method on riak-client’s RObject:

a.links << b.to_link('foo')
b.links << c.to_link('foo')
c.links << d.to_link('bar')

So you can get to c from a through the tag “foo” but you can’t get all the way to d, as that is tagged with “bar”.

Last part of setup, store all your objects:

a.store
b.store
c.store
d.store

Walking

There is a shortcut #walk method on RObject, though you can also call #walk from the client itself. I like the shortcut, and at least one use of it is reasonably clear:

a.walk('test','_',true)

returns:

=> [[#<Riak::RObject {test,b.json} [#<Riak::RContent [application/json; charset=UTF-8]:(4 bytes)>]>]]

So this means walk the links on RObject a, keeping all results to the bucket test, but following all tags (the _ character is the wildcard in riak’s scheme). The true means to return the results of this walk leg, which was one of the parts that was confusing for me at first but makes more sense later.

The results make sense: it returns b but not c, which is one more link away. It only walks the first link, in other words. But why are the results an array of an array? It makes it seem like the return could be two-dimensional, but how to achieve that wasn’t really obvious at first.

Note that the api docs do make clear there is an alternate syntax here, using a hash:

a.walk(bucket: 'test', tag: '_', keep: true)

Which is nice because it’s more explicit.

Seeing that syntax and reading the source for the #normalize method on WalkSpec had things making more sense to me. To walk two links out from your current node, just pass it two hashes, both of which will be turned into WalkSpecs:

a.walk({bucket: 'test', tag: '_', keep: true},{bucket: 'test', tag: '_', keep: true})
=> [[#<Riak::RObject {test,b.json} [#<Riak::RContent [application/json; charset=UTF-8]:(4 bytes)>]>], [#<Riak::RObject {test,c.json} [#<Riak::RContent [application/json; charset=UTF-8]:(4 bytes)>]>]]

Now the keep: true adds up, as does the double array. If the first walk had returned more than 1 result, then the second level of the walk would now be branching, and two dimensions of results would be returned. If we wanted to get to c from a but not return b:

a.walk({bucket: 'test', tag: '_', keep: false},{bucket: 'test', tag: '_', keep: true})
=> [[#<Riak::RObject {test,c.json} [#<Riak::RContent [application/json; charset=UTF-8]:(4 bytes)>]>]]

And if we want to get to d:

a.walk({bucket: 'test', tag: '_', keep: false},{bucket: 'test', tag: '_', keep: false},{bucket: 'test', tag: '_', keep: true})
=> [[#<Riak::RObject {test,d.json} [#<Riak::RContent [application/json; charset=UTF-8]:(4 bytes)>]>]]

But that only works because we have allowed for any tag. Remember we tagged the link to d with bar, so if we try:

a.walk({bucket: 'test', tag: '_', keep: false},{bucket: 'test', tag: '_', keep: false},{bucket: 'test', tag: 'foo', keep: true})
=> [[]]
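To make the keep logic concrete, here is a toy in-memory model of the same four objects and their links. This is not riak-client code and does not talk to Riak at all; the LINKS table and the walk function are made up for this sketch, which only tries to mirror the results shown above (one inner list per kept step, with _ as the wildcard).

```python
# Toy model of the link graph from this walkthrough; not riak-client code.
# Each key maps to a list of (bucket, target_key, tag) links.
LINKS = {
    "a.json": [("test", "b.json", "foo")],
    "b.json": [("test", "c.json", "foo")],
    "c.json": [("test", "d.json", "bar")],
    "d.json": [],
}


def walk(start, *specs):
    """Follow one (bucket, tag, keep) spec per step; '_' matches anything."""
    current = [start]
    results = []
    for bucket, tag, keep in specs:
        found = []
        for key in current:
            for link_bucket, target, link_tag in LINKS[key]:
                if bucket in ("_", link_bucket) and tag in ("_", link_tag):
                    found.append(target)
        if keep:
            results.append(found)  # one inner list per kept step
        current = found
    return results


print(walk("a.json", ("test", "_", True)))
print(walk("a.json", ("test", "_", True), ("test", "_", True)))
print(walk("a.json", ("test", "_", False), ("test", "_", False), ("test", "foo", True)))
```

How real Riak groups results when a step branches into several objects may differ from this simplification, so treat it as a way to reason about keep and tags rather than as a description of the server.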

I might write a larger test script and test data set to really play with multiple levels of walking and those nested results, but overall the logic of #walk is a lot clearer to me now.

Marshal Ruby Objects to Riak and Back

Thought I would share a little bit of work I have been messing around with to marshal Ruby objects to the Riak NoSQL store and back. There is Ripple, a library for exactly this purpose, but it implements the ActiveRecord model and that is not exactly what I wanted.

riak-client

riak-client supports automatic serialization based on what you set as an object’s content-type. So for example, the following code would automatically store this Ruby hash as JSON in Riak:

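require 'riak'

#example hash for illustration; any Hash of JSON-friendly values would do
one = {'place'=>'woods','time'=>'tomorrow'}

client = Riak::Client.new(:protocol => "pbc")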
bucket  = client.bucket("my_bucket")
new_one = Riak::RObject.new(bucket, "hash_object.json")
new_one.content_type = "application/json"
new_one.data = one
new_one.store

The difference here is in the use of #data and not #raw_data. There is a way to define your own serializers for different content-types, but that seems to be undocumented by the riak-client team for now. By default it supports both YAML and JSON.

That will work fine for the typical JSON data types such as Hash and Array. But if you want to store a more complex object, and get it back, you will need a little more.

Using the JSON Gem

riak-client uses multi_json as you might expect. So, if you have the JSON gem installed or loaded, you can use their method of storing a ruby object and marshaling it back from JSON.

class ComplexObject
  attr_accessor :place,:time,:characters

  def initialize(stuff={})
    @place = stuff[:place]
    @time = stuff[:time]
    @characters = stuff[:characters]

  end

  #we have to define our own #to_json and self.json_create methods
  def to_json(*a)
    {'json_class'=>self.class.name,'place'=>@place,'time'=>@time,'characters'=>@characters}.to_json(*a)
  end

  def self.json_create(data)
    new(:place=>data['place'],:time=>data['time'],:characters=>data['characters'])
  end
end


one = ComplexObject.new(:place=>'woods',:time=>'tomorrow',:characters=>'burt')


client = Riak::Client.new(:protocol => "pbc")
bucket = client.bucket("my_bucket")


#in case I have other JSON libraries loaded that MultiJson would favor
MultiJson.use :json_gem

#this is a key, as it does not default to true 
#and without it the JSON gem won't call json_create
MultiJson.load_options = {:create_additions=>true}


new_one = Riak::RObject.new(bucket, "complex_object.json")
new_one.content_type = "application/json"
new_one.data = one
new_one.store

#voilà, you get back a new object, not just a Hash
puts client['my_bucket']['complex_object.json'].data
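Since the earlier posts compare languages, here is roughly the same round-trip idea sketched with Python’s standard json module, with no Riak involved. The encode and decode helper names are made up for this sketch; json.dumps’s default argument and json.loads’s object_hook argument are the standard hooks that play the part of to_json and json_create.

```python
import json


class ComplexObject:
    def __init__(self, place=None, time=None, characters=None):
        self.place = place
        self.time = time
        self.characters = characters


def encode(obj):
    """Called by json.dumps for objects it cannot serialize natively."""
    if isinstance(obj, ComplexObject):
        return {"json_class": "ComplexObject", "place": obj.place,
                "time": obj.time, "characters": obj.characters}
    raise TypeError("cannot serialize %r" % (obj,))


def decode(data):
    """Called by json.loads for every decoded JSON object (dict)."""
    if data.get("json_class") == "ComplexObject":
        return ComplexObject(data["place"], data["time"], data["characters"])
    return data


one = ComplexObject("woods", "tomorrow", "burt")
text = json.dumps(one, default=encode)
back = json.loads(text, object_hook=decode)
print(type(back).__name__, back.place)
```

The json_class marker does the same job here as in the Ruby version: it tells the decoder which class to rebuild.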

Using Oj

It appears Oj has even more innate support for serializing and deserializing Ruby objects to JSON, but I can’t say I am deeply familiar with Oj yet, so take that with a grain of salt.

Here is the example above using Oj:

class ComplexObject
  attr_accessor :place,:time,:characters

  def initialize(stuff={})
    @place = stuff[:place]
    @time = stuff[:time]
    @characters = stuff[:characters]

  end

  #with oj, no need for our own serialize and deserialize methods

end

one = ComplexObject.new(:place=>'woods',:time=>'tomorrow',:characters=>'burt')

client = Riak::Client.new(:protocol => "pbc")
bucket = client.bucket("my_bucket")

#MultiJson would default to oj if present, just being explicit here
MultiJson.use :oj

#we want oj's object mode to do its automatic serialization
MultiJson.load_options = {:mode=>:object}
MultiJson.dump_options = {:mode=>:object}

#store the ComplexObject as JSON in Riak

new_one = Riak::RObject.new(bucket, "complex_object.json")
new_one.content_type = "application/json"
new_one.data = one
new_one.store

#voilà, again, the full ruby object
puts client['my_bucket']['complex_object.json'].data